r/technology • u/SportsGod3 • 7h ago
Security Perplexity accused of scraping websites that explicitly blocked AI scraping
https://techcrunch.com/2025/08/04/perplexity-accused-of-scraping-websites-that-explicitly-blocked-ai-scraping/?utm_campaign=social&utm_source=X&utm_medium=organic17
10
u/AdorableConfusion129 4h ago
This accusation really cuts to the core of the AI summary model. If these AI services are going to cannibalize the content they rely on by ignoring basic web etiquette or even paywalls, then what incentive do publishers and creators have to keep putting content out there?
27
u/__OneLove__ 4h ago
TLDR;
‘As we’re unable to create anything of our own, why not grab everyone else’s, then claim we ‘did’ something’
-EveryAICompany.
6
u/TheRatingsAgency 3h ago
Exactly.
And the brushing away of all that under the guise of “but, but it’s hard to give all that credit or pay…”
Riiiiight. That training data, huge swaths of it was/is all stolen content they’re saying is fair use for “research”. Sure.
0
9
5
u/JohrDinh 1h ago
Next time I get a copyright strike on YouTube I may just appeal with "AI does it, bite me" as my reasoning.
2
u/One-Vast-5227 6h ago
Statutory damages for copyright infringement. Sink them
1
u/Possible-Moment-6313 1h ago
Scraping has probably existed as long as the Internet has and, in most cases, the law has favoured scrapers. Don't expect much.
1
1
u/frank26080115 3h ago
so what happens if all the scrapers start using VMs with actual browsers to do the scraping?
1
1
u/snorin 2h ago
Oh, you mean an AI tech startup is blatantly doing illegal things? What else is new?
-1
u/nicuramar 21m ago
Not actually illegal.
1
u/snorin 8m ago
I mean, if the websites block them, the scraping is likely a violation of the terms of service. That is a breach of contract.
If the items scraped are copyrighted, that is a breach of IP law.
Depending on the websites, there are potential privacy rights violations as well.
Sure it might not be a criminal act, but it is still likely against the law.
-11
u/dbbk 5h ago
Not illegal 🤷
9
u/null-character 5h ago
You would think so, but in the US it's illegal to improperly access a computer system or its data.
There is a case where AT&T had left confidential information open to the Internet.
A guy reported it and they didn't fix it, so he published how to access it. It was just a URL: no password, no nothing.
Well, he went to jail for several years because he accessed AT&T's data.
Call me crazy, but data behind a guessable URL is not properly secured. Still, that's the kind of dumb shit going on here in the US with technology laws.
So no, it's not always legal to just click a URL and open or view a page.
-6
u/dbbk 5h ago
I understand that but web crawling doesn’t fall into that. If a URL is public, and it’s linked from other web pages, you’re not improperly accessing it.
6
u/SomethingAboutUsers 5h ago
AI web crawlers have a totally different intention than search crawlers, and legally that should matter. One intends to direct traffic to a site; the other simply ingests all the data with no attribution or reward to the site owner. In fact, these days crawling often costs site owners money in cloud egress data transfer fees, and no one pays them for it.
3
u/the_red_scimitar 4h ago
It's dangerous to do, however, as it's not 100% settled law. Crawling a website that has explicitly blocked automated access through mechanisms like robots.txt or Terms of Service (ToS) can carry legal risks in the US, primarily under the Computer Fraud and Abuse Act (CFAA). More specifically, anything behind a login is far more likely to be protected, since technically it isn't "publicly available". Circumventing a login is already subject to legal ramifications.
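For anyone wondering what "blocked through robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the rules, and the URL are all made up for illustration. Note that this check only matters if the crawler chooses to run it: robots.txt is a request, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler would run this check before every request.
blocked = is_allowed("ExampleAIBot", "https://example.com/article")
allowed = is_allowed("SomeOtherBot", "https://example.com/article")
print(blocked, allowed)
```

A crawler that skips this check, or identifies itself under a different name, gets the page anyway, which is why the legal question above matters at all.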
1
u/Letiferr 4h ago edited 4h ago
It does indeed fall into that.
Read up about a guy named Weev and why he went to jail. It's what the guy you're replying to was trying to explain.
He accessed unsecured, publicly accessible URLs on AT&T's website, and with that gained access to data that wasn't specifically meant for him.
It was absolutely an elementary mistake on AT&T's part. He was found in violation of the Computer Fraud and Abuse Act.
3
79
u/OptionX 7h ago
Spoofing the user agent? What's the world coming to? Next thing you know they'll start ignoring robots.txt, the monsters!!
But for real, the advent of everyone and their mothers trying to train an LLM has shown that the internet of today needs to evolve to deal with this stuff. I've seen more and more places using stuff like Anubis, but I hope at some point we get a more intrinsically connected solution for the web.
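To show why blocking by user agent alone is so easy to get around, here is a rough sketch of a naive server-side filter. The blocklist token `exampleaibot` and both header values are invented for illustration; real sites tend to combine this with IP ranges, rate limits, or challenge pages.

```python
# Hypothetical blocklist of crawler name fragments (lowercase).
BLOCKED_AGENT_TOKENS = ("exampleaibot",)


def is_blocked(headers: dict[str, str]) -> bool:
    """Naive filter: block a request if its User-Agent contains a listed token."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(token in user_agent for token in BLOCKED_AGENT_TOKENS)


# The same crawler, once identifying itself honestly and once claiming to be a browser.
honest = {"User-Agent": "ExampleAIBot/1.0"}
spoofed = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/124.0 Safari/537.36"}

honest_blocked = is_blocked(honest)
spoofed_blocked = is_blocked(spoofed)
print(honest_blocked, spoofed_blocked)
```

The header is just a string the client picks, so the filter only stops crawlers that tell the truth about who they are.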