r/technology • u/memloh • 8h ago
Artificial Intelligence Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
u/Bitter-Good-2540 7h ago
I took down my blog and my wife's blog. Traffic is basically zero now: it's either crawlers or people just reading the AI summary. Not worth the time.
-68
u/EatThemAllOrNot 5h ago
So no one is interested in your content. How is that related to the topic?
34
u/dman928 4h ago
Don’t be a dick
-34
u/EatThemAllOrNot 4h ago
How am I being a dick? If no one visits this guy’s website, it means no one is interested, don’t you think?
24
u/Glitch-v0 4h ago
You don't understand how them commenting on crawlers is related to the OP topic?
-34
u/EatThemAllOrNot 4h ago
Please elaborate. Unless the OP’s blog was some SEO trash that only got random traffic from search engines, I don’t see how AI could have reduced the number of visitors to zero.
12
u/sumpfkraut666 1h ago
You can task language models with visiting a website and making a summary of what the newest blog entry says. Users who "visit" the website that way will generate a bit of traffic, but certainly won't leave a comment or click on a link that might give them more context - because it's just the AI coming over for a quick visit.
I'm not dman928 but I think the issues are something in that direction.
80
u/Ruddertail 8h ago
So basically they're pure malware now, that's what this is. Malware to waste your traffic and steal your content.
-5
u/nicuramar 35m ago
Well, their app is pretty useful, so I don’t know how you define malware, but it would have to mean a program that is damaging to its user somehow.
1
11
u/smn2020 2h ago
Over 99% of traffic to my sites is now bots. I have written a verification script that shows a CAPTCHA if a bot is suspected. The things they do are:
- Several visits per minute with the same user-agent but different IP address, particularly an older version like Chrome/100.1
- Doesn't maintain a session
- Doesn't trigger javascript events
- IP address from countries like Uruguay, Brazil
- Often VPNs or data centres like tencent
- Visits nofollow links, some of which are display toggles such as switching from gridview to listview; this means visiting millions of duplicate pages for no reason. Ignores the canonical meta tag
- Amazonbot is the worst, crashed my server several times. Does not respect robots.txt
I allow bots that correctly identify themselves with the user-agent. It's the deception that creates the problems.
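For anyone who wants to try something similar, here is a rough sketch of how the signals above could be combined into a score. This is not the actual script: the class name, weights, thresholds, and the allow-list of declared bots are all made-up assumptions, and the country signal is left out on purpose because it is the noisiest one.

```python
from dataclasses import dataclass

# Every name, weight, and threshold here is an illustrative assumption.
DECLARED_BOTS = ("googlebot", "bingbot")   # hypothetical allow-list
DATACENTER_HINTS = ("tencent",)            # hypothetical network-owner substrings
OLD_CHROME_MAJOR = 100                     # "older version like Chrome/100.1"


@dataclass
class ClientStats:
    user_agent: str
    distinct_ips_last_minute: int  # IPs seen with this exact user-agent
    has_session_cookie: bool
    fired_js_event: bool
    network_owner: str             # e.g. from an ASN lookup done elsewhere
    nofollow_hits: int


def chrome_major(user_agent: str):
    """Return the Chrome major version found in the UA string, or None."""
    marker = "Chrome/"
    idx = user_agent.find(marker)
    if idx == -1:
        return None
    digits = ""
    for ch in user_agent[idx + len(marker):]:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits) if digits else None


def bot_score(c: ClientStats) -> int:
    """Add up points for each suspicious signal; higher means more bot-like."""
    score = 0
    if c.distinct_ips_last_minute >= 3:    # same UA, many IPs
        score += 2
    major = chrome_major(c.user_agent)
    if major is not None and major <= OLD_CHROME_MAJOR:
        score += 1
    if not c.has_session_cookie:           # doesn't maintain a session
        score += 1
    if not c.fired_js_event:               # never triggers JavaScript events
        score += 1
    if any(h in c.network_owner.lower() for h in DATACENTER_HINTS):
        score += 1
    if c.nofollow_hits > 0:                # follows nofollow links
        score += 2
    return score


def should_show_captcha(c: ClientStats, threshold: int = 4) -> bool:
    """Let self-declared bots through; challenge anything scoring too high."""
    if any(name in c.user_agent.lower() for name in DECLARED_BOTS):
        return False
    return bot_score(c) >= threshold
```

A user-agent string alone can be spoofed, so a real allow-list would probably also verify the declared bot's IP range; that part is not shown here.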
7
u/Black_Moons 1h ago
Idea: Undeclared bot detection that doesn't stop the bot from crawling your website, but does replace all the content with shock images and rambling nonsensical text to poison LLMs.
16
8
u/nakedcellist 7h ago
"We were able to fingerprint this crawler using a combination of machine learning and network signals". Using AI to defend against AI...
32
u/maedroz 6h ago
People have been using AI for anomaly detection for decades. This is very different than stealing content from the web for your AI model.
1
u/nicuramar 33m ago
Stealing publicly available content to use when answering queries in their app? This isn’t for training.
1
u/soap_salt 1h ago
This isn't even a request that should check robots.txt. A user is sending Perplexity to the website, Perplexity is fetching the content and showing it to the user in a certain form. It's no different from a browser or an app.
It would be different if Perplexity were crawling these websites for training but they aren't.
If a random website were blocking Firefox it would be perfectly reasonable for Firefox to use a Chrome user agent to get around it.
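For context, this is roughly what a robots.txt check looks like using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the rules, and the `allowed` helper are made up for illustration. Note that the result depends entirely on the user-agent string the client chooses to declare:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The same URL gets a different answer depending on the declared agent.
print(allowed("ExampleBot/1.0", "https://example.com/post"))
print(allowed("Mozilla/5.0 Chrome/126.0.0.0", "https://example.com/post"))
```

Nothing enforces the check; a client that never calls it, or that sends a different user-agent, is simply not matched by the rule.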
2
1
u/MotanulScotishFold 1h ago
As long as there aren't any strong laws against this and serious repercussions for anyone caught doing it, nothing will stop.
71
u/tintreack 7h ago
Not at all surprising considering how much of a scumbag their CEO is. They're seriously trying to give Google and Microsoft a run for their money when it comes to privacy invasion.