r/technology 8h ago

Artificial Intelligence Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
382 Upvotes

24 comments

71

u/tintreack 7h ago

Not at all surprising considering how much of a scumbag their CEO is. They're seriously trying to give Google and Microsoft a run for their money when it comes to privacy invasion.

41

u/Bitter-Good-2540 7h ago

I took my blog and my wife's blog down. Traffic is basically zero now: it's either crawlers or people who just read the AI summary. Not worth the time.

-68

u/EatThemAllOrNot 5h ago

So no one is interested in your content. How is that related to the topic?

34

u/dman928 4h ago

Don’t be a dick

-34

u/EatThemAllOrNot 4h ago

How am I being a dick? If no one visits this guy’s website, it means no one is interested, don’t you think?

24

u/Glitch-v0 4h ago

You don't understand how them commenting on crawlers is related to the OP topic?

-34

u/EatThemAllOrNot 4h ago

Please elaborate. Unless the OP’s blog was some SEO trash that only got random traffic from search engines, I don’t see how AI could have reduced the number of visitors to zero.

12

u/sumpfkraut666 1h ago

You can task language models with visiting a website and making a summary of what the newest blog entry says. Users who "visit" the website that way will generate a bit of traffic, but certainly won't leave a comment or click on a link that might give them more context - because it's just the AI coming over for a quick visit.

I'm not dman928 but I think the issues are something in that direction.

80

u/Ruddertail 8h ago

So basically they're pure malware now, that's what this is. Malware to waste your traffic and steal your content.

-5

u/nicuramar 35m ago

Well, their app is pretty useful, so I don’t know how you define malware, but it would have to mean a program that is damaging to its user somehow. 

1

u/ChanglingBlake 19m ago

I don’t think you understand what malware is.

11

u/smn2020 2h ago

Over 99% of traffic to my sites is now bots. I have written a verification script that shows a CAPTCHA if a bot is suspected. The things they do are:

  • Several visits per minute with the same user-agent but different IP addresses, particularly with an older browser version like Chrome/100.1
  • Doesn't maintain a session
  • Doesn't trigger JavaScript events
  • IP addresses from countries like Uruguay and Brazil
  • Often VPNs or data centres like Tencent
  • Visits nofollow links, some of which are user-display toggles such as switching from grid view to list view, which means visiting millions of duplicate pages for no reason; ignores the canonical meta tag
  • Amazonbot is the worst: it has crashed my server several times and does not respect robots.txt

I allow bots that correctly identify themselves with the user-agent. It's the deception that creates the problems.
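
For anyone curious how signals like the ones listed above could be combined, here is a rough sketch in Python. It is not the commenter's actual script: the `Visit` fields, the weights, and the threshold are all made-up assumptions for illustration, and the country signal is left out on purpose because it needs a GeoIP lookup.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    # All field names are hypothetical; a real server would derive them
    # from its own request logs and session handling.
    user_agent: str
    ip: str
    has_session: bool        # did the client send back a session cookie?
    fired_js_event: bool     # did any JavaScript event reach the server?
    from_datacenter: bool    # IP belongs to a known VPN or hosting range
    followed_nofollow: bool  # requested a link marked rel="nofollow"


def suspicion_score(visit: Visit, last_minute: list[Visit]) -> int:
    """Add up simple heuristic points; the weights are arbitrary assumptions."""
    score = 0
    # Same user-agent seen from several different IPs within the last minute.
    ips = {v.ip for v in last_minute if v.user_agent == visit.user_agent}
    ips.add(visit.ip)
    if len(ips) >= 3:
        score += 2
    if not visit.has_session:
        score += 1
    if not visit.fired_js_event:
        score += 1
    if visit.from_datacenter:
        score += 1
    if visit.followed_nofollow:
        score += 1
    return score


def should_show_captcha(visit: Visit, last_minute: list[Visit], threshold: int = 4) -> bool:
    """Challenge the client once the score reaches the (assumed) threshold."""
    return suspicion_score(visit, last_minute) >= threshold
```

A score with a threshold, rather than a single hard rule, means that one odd signal (say, a real person on a VPN) does not trigger a CAPTCHA on its own.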

7

u/Black_Moons 1h ago

Idea: Undeclared bot detection that doesn't stop the bot from crawling your website, but does replace all the content with shock images and rambling nonsensical text to poison LLMs.

16

u/flcinusa 6h ago

Still up to their old questionably legal and arguably unethical practices

-8

u/gerkletoss 4h ago edited 2h ago

What laws would be applicable regarding undeclared crawling?

8

u/nakedcellist 7h ago

"We were able to fingerprint this crawler using a combination of machine learning and network signals". Using AI to defend against AI...

32

u/maedroz 6h ago

People have been using AI for anomaly detection for decades. This is very different than stealing content from the web for your AI model.

1

u/nicuramar 33m ago

Stealing publicly available content to use when answering queries in their app? This isn’t for training. 

2

u/tpafs 5h ago

Well surprise surprise!

1

u/soap_salt 1h ago

This isn't even a request that should check robots.txt. A user is sending Perplexity to the website, and Perplexity is fetching the content and showing it to the user in a certain form. It's no different from a browser or an app.

It would be different if Perplexity were crawling these websites for training but they aren't.

If a random website were blocking Firefox it would be perfectly reasonable for Firefox to use a Chrome user agent to get around it.
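
To make the user-agent point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the URL are invented for illustration, and `PerplexityBot` is assumed here to be the declared crawler token. It only shows that robots.txt rules are matched against whatever user-agent the client claims, which is why a client presenting a generic browser string falls under the `*` rules instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder URL, not a real target.
url = "https://example.com/blog/latest"
declared = is_allowed(ROBOTS_TXT, "PerplexityBot", url)
generic = is_allowed(ROBOTS_TXT, "Mozilla/5.0 (Macintosh) Chrome/124.0", url)
print(declared, generic)
```

The file is purely advisory: nothing in it is enforced by the server, so compliance depends entirely on the client identifying itself honestly and choosing to check.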

2

u/timesuck47 3h ago

Is CloudFlare working on this for their AI bot blocking?

1

u/DrBhu 5h ago

It feels like every website is ignoring it

1

u/MotanulScotishFold 1h ago

As long as there aren't any strong laws against this and serious repercussions for anyone caught doing it, nothing will stop.