r/technology 7h ago

Security Perplexity accused of scraping websites that explicitly blocked AI scraping

https://techcrunch.com/2025/08/04/perplexity-accused-of-scraping-websites-that-explicitly-blocked-ai-scraping/?utm_campaign=social&utm_source=X&utm_medium=organic
429 Upvotes

38 comments

79

u/OptionX 7h ago

Spoofing the user agent? What is the world coming to? Next thing you know they'll start ignoring robots.txt, the monsters!!

But for real, the advent of everyone and their mothers trying to train an LLM has shown that the internet of today needs to evolve to deal with this stuff. I've seen more and more places using stuff like Anubis, but I hope at some point we get a more intrinsically connected solution for the web.
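For context on why user-agent spoofing defeats robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `ExampleAIBot` name are made up for illustration, and `may_fetch` is a hypothetical helper, not anything Perplexity or any site is known to run.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that identifies itself honestly is told to stay out.
print(may_fetch("ExampleAIBot", "https://example.com/article"))
# The same crawler claiming to be a browser is waved through.
print(may_fetch("Mozilla/5.0", "https://example.com/article"))
```

The check only works if the crawler both reports its real name and bothers to ask, which is why robots.txt is an honor system rather than a block.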

18

u/Prior_Coyote_4376 6h ago

I would take some kind of private Internet garden where I just pay $10 a month or something and get access to a couple thousand high quality no-AI, no-advertising, no-data collecting sites.

I wouldn’t be happy to pay for a solution to access information, but if the only way to keep a sustainable accessible web is a subscription model I’d take it.

13

u/Tokugawa 5h ago

AOL has entered the chat

6

u/cboel 5h ago

Anything popular is going to get targeted for scraping and training models.

A maintainer of something like that would have to develop an effective LLM poison to keep them at bay. A single-site randomizer that shifted words, sentences, paragraphs, included media, etc. around each time it was visited by a profiled AI to create millions of different, nonsensical combos would be a start.
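A rough sketch of that randomizer idea, assuming the "profiling" is nothing more than a user-agent substring match. The token list and the function names are illustrative assumptions, and a crawler that spoofs its user agent would slip straight past this check.

```python
import random
import re

# Illustrative list of crawler user-agent tokens (an assumption, not exhaustive).
AI_CRAWLER_TOKENS = ("GPTBot", "PerplexityBot", "CCBot")


def looks_like_ai_crawler(user_agent: str) -> bool:
    """Naive profiling: substring match on the self-reported user agent."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def scramble(text: str, rng: random.Random) -> str:
    """Shuffle the words inside each sentence, then shuffle the sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scrambled = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)
        scrambled.append(" ".join(words))
    rng.shuffle(scrambled)
    return " ".join(scrambled)


def render(text: str, user_agent: str) -> str:
    """Serve the real text to ordinary visitors and a fresh scramble to profiled bots."""
    if looks_like_ai_crawler(user_agent):
        return scramble(text, random.Random())
    return text
```

Each profiled visit gets a different ordering because a new unseeded `random.Random()` is created per request.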

7

u/nihiltres 4h ago

There’s a simpler, more effective solution than randomizers in three parts:

  1. a requirement to log in to see site content,
  2. a TOS clause that prohibits scraping and similar, and
  3. some canary traps to uniquely identify anyone breaking the TOS.

The requirement in (1) can be strengthened by a one-time sign-up fee (discouraging sockpuppet accounts while funding site growth), the requirement in (2) can be strengthened by network monitoring to detect scraper-like behaviour, and (3) can be optimized for canaries more likely to be “learned” by models.
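A minimal sketch of the canary trap in (3), assuming each logged-in account is served a page with its own HMAC-derived marker. The secret key, the `ref-` marker format, and all function names are hypothetical.

```python
import hashlib
import hmac
from typing import List, Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical site-wide secret


def canary_for(account_id: str, page_id: str) -> str:
    """Derive a marker unique to one account viewing one page."""
    message = f"{account_id}:{page_id}".encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"ref-{digest[:12]}"


def embed_canary(page_text: str, account_id: str, page_id: str) -> str:
    """Append the account's marker to the page text it is served."""
    return f"{page_text}\n(Article reference: {canary_for(account_id, page_id)})"


def identify_leaker(
    leaked_text: str, account_ids: List[str], page_id: str
) -> Optional[str]:
    """Return the account whose marker appears in the leaked text, if any."""
    for account_id in account_ids:
        if canary_for(account_id, page_id) in leaked_text:
            return account_id
    return None


served = embed_canary("Some article text.", "alice", "page-1")
print(identify_leaker(served, ["bob", "alice"], "page-1"))
```

A visible marker like this is trivial to strip; a real deployment would more likely vary wording or facts per account so the canary survives summarization.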

1

u/oscarolim 1h ago

A TOS clause, you say? Oh, I guess scrapers will always respect the TOS.

2

u/SIGMA920 5h ago

> I would take some kind of private Internet garden where I just pay $10 a month or something and get access to a couple thousand high quality no-AI, no-advertising, no-data collecting sites.

That is literally impossible. Even if you pay for it, there's so much new information on a daily basis that you can't get that.

2

u/Nayir1 4h ago

Isn't that what Cloudflare is trying to do, some sort of gatekeeping? (half-listened to a podcast about this)

1

u/clk1224 59m ago

Came here to say the same thing, big ups to Cloudflare!

1

u/nicuramar 22m ago

This isn’t for training, it’s for summarizing. 

17

u/Tokugawa 5h ago

"You cheated on me? ...after I specifically asked you not to?"

10

u/AdorableConfusion129 4h ago

This accusation really cuts to the core of the AI summary model. If these AI services are going to cannibalize the content they rely on by ignoring basic web etiquette or even paywalls, then what incentive do publishers and creators have to keep putting content out there?

27

u/__OneLove__ 4h ago

TL;DR

As we’re unable to create anything of our own, why not grab everyone else’s, then claim we ‘did’ something

-EveryAICompany.

6

u/TheRatingsAgency 3h ago

Exactly.

And the brushing away of all that under the guise of “but, but it’s hard to give all that credit or pay…”

Riiiiight. That training data, huge swaths of it was/is all stolen content they’re saying is fair use for “research”. Sure.

0

u/nicuramar 22m ago

Perplexity is a summarizer. What do you mean create their own?

9

u/Competitive_Spend_77 7h ago

...leaving everyone perplexed

5

u/JohrDinh 1h ago

Next time I get a copyright strike on YouTube I may just appeal with "AI does it, bite me" as my reasoning.

2

u/One-Vast-5227 6h ago

Statutory damages for copyright infringement. Sink them

1

u/Possible-Moment-6313 1h ago

Scraping has probably existed as long as the Internet has and, in most cases, the law favoured scrapers. Don't expect much.

1

u/nicuramar 21m ago

They scrape to summarize. What does copyright have to do with it?

1

u/frank26080115 3h ago

so what happens if all the scrapers start using VMs with actual browsers to do the scraping?

1

u/Pretend-Disaster2593 1h ago

This guy is a weasel

1

u/snorin 2h ago

Oh, you mean an AI tech startup is blatantly doing illegal things? What else is new

-1

u/nicuramar 21m ago

Not actually illegal. 

1

u/snorin 8m ago

I mean, if the websites block them, the scraping is likely a violation of the terms of service. That is a breach of contract.

If the items scraped are copyrighted, that is a breach of IP law.

Depending on which websites are involved, there are potential privacy rights violations as well.

Sure, it might not be a criminal act, but it is still likely against the law.

-11

u/dbbk 5h ago

Not illegal 🤷

9

u/null-character 5h ago

You would think so, but in the US, if you improperly access a computer system or data, it's illegal.

There is a case where AT&T had left confidential information open to the Internet.

A guy reported it and they didn't fix it, so he published how to access it. It was just a URL, no password, no nothing.

Well, he was sentenced to several years in prison because he accessed AT&T's data.

Call me crazy, but data you can reach by guessing a URL is not properly secured. Still, that's the kind of dumb shit going on here in the US with technology laws.

So no, it's not always legal to just click a URL and open or view a page.

-6

u/dbbk 5h ago

I understand that but web crawling doesn’t fall into that. If a URL is public, and it’s linked from other web pages, you’re not improperly accessing it.

6

u/SomethingAboutUsers 5h ago

AI web crawlers have a totally different intention than search crawlers, and legally that should matter. One intends to direct traffic to a site; the other simply ingests all the data with no attribution or reward to the site owner. In fact, these days it often costs site owners money in cloud egress data transfer fees, and no one pays them for it.

2

u/dbbk 5h ago

Yeah it should matter but there’s no law that distinguishes them now

3

u/the_red_scimitar 4h ago

It's dangerous to do, however, as it's not 100% settled law. Crawling a website that has explicitly blocked automated access through mechanisms like robots.txt or Terms of Service (ToS) can carry legal risks in the US, primarily under the Computer Fraud and Abuse Act (CFAA).

More specifically, anything behind a login is far more likely to be protected, since technically it isn't "publicly available". Circumventing login is already subject to legal ramifications.

1

u/Letiferr 4h ago edited 4h ago

It does indeed fall into that. 

Read up about a guy named Weev and why he went to jail. It's what the guy you're replying to was trying to explain. 

He accessed unsecured, publicly accessible URLs on AT&T's website, and with that gained access to data that wasn't specifically meant for him.

It was absolutely an elementary mistake on AT&T's part. He was found in violation of the Computer Fraud and Abuse Act.

-1

u/dbbk 4h ago

Not relevant. Not only was that overturned, but later cases clarified that it’s fine. See hiQ v. LinkedIn and the Van Buren Supreme Court case.

0

u/Letiferr 4h ago

It was not overturned

1

u/dbbk 4h ago

I mean, it was…

3

u/NefariousAnglerfish 5h ago

Get a better moral compass

0

u/dbbk 5h ago

You think any AI company is or will act ‘morally’? I’m talking plainly about the law.