What is Diffbot?
Diffbot is a commercial web crawler and data extraction system designed to transform unstructured web content into structured data. It is used to build machine-readable knowledge graphs and APIs for search, e-commerce, competitive intelligence, and AI training. The bot parses websites, extracts entities, and stores structured information, often without requiring explicit markup. Its infrastructure has been active for over a decade and operates at scale.
Who operates Diffbot?
Diffbot is operated by Diffbot Technologies Corp., a company headquartered in California. It provides automated knowledge extraction as a service. Clients include search engine companies, enterprise data platforms, and research labs. Its Knowledge Graph is built entirely from web crawls and powers a variety of downstream products. Public documentation is available at https://www.diffbot.com/bot/.
Why should you be interested in Diffbot?
Diffbot crawls entire websites, not just individual pages, and stores the extracted data in commercial databases. If you publish product listings, company info, or structured content, Diffbot likely collects and monetizes it. This can lead to indirect competition, loss of content control, or untracked reuse. Its crawlers have been flagged for persistent deep scraping across e-commerce and directory-type sites.
How to block Diffbot?
1. robots.txt File:
Add the following rule to your robots.txt file:
# block Diffbot
User-agent: Diffbot
Disallow: /
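Before deploying the rule, it can help to confirm that it parses the way you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `is_allowed` helper and the `example.com` URL are illustrative assumptions, not part of Diffbot's documentation.

```python
from urllib.robotparser import RobotFileParser

# The rule from this section, written one directive per line.
ROBOTS_TXT = """\
# block Diffbot
User-agent: Diffbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper for local testing only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Pass the product token ("Diffbot"), not the full User-Agent header:
# the standard-library parser only looks at the text before the first "/".
print(is_allowed(ROBOTS_TXT, "Diffbot", "https://example.com/products/1"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/products/1"))
```

This only checks how a standards-following parser reads the file. It does not prove that any particular crawler honors it.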
2. Subnet Filtering:
Diffbot publishes IP ranges at https://www.diffbot.com/bot/. Block those at the network level for stronger enforcement.
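If you cannot filter at the firewall, the same check can be approximated in application code. The sketch below uses Python's standard `ipaddress` module. The CIDR blocks are placeholders taken from the reserved documentation ranges, not Diffbot's real addresses, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), NOT Diffbot's real ranges.
# Replace them with the ranges you obtain from the bot's own documentation.
BLOCKED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in BLOCKED_RANGES]


def is_blocked(client_ip: str) -> bool:
    """Return True if `client_ip` falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


print(is_blocked("192.0.2.44"))   # inside the first placeholder range
print(is_blocked("203.0.113.9"))  # outside both placeholder ranges
```

A firewall or reverse-proxy rule is still preferable where available, because it rejects the request before it reaches the application.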
3. Behavior Profiling:
Diffbot issues structured requests at high frequency. Set traps, such as links hidden from human visitors, or monitor server logs to detect unexpected scraping patterns.
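Log monitoring can start with a simple count of requests per client whose user agent mentions Diffbot. The sketch below assumes access logs in the common "combined" format; the regular expression, the function names, the sample lines, and the threshold are illustrative assumptions that you would adapt to your own server.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust for your server.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Made-up sample lines using documentation IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2024:13:55:36 +0000] "GET /products/1 HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Diffbot/3.0; +https://www.diffbot.com/bot/)"',
    '192.0.2.10 - - [10/Oct/2024:13:55:37 +0000] "GET /products/2 HTTP/1.1" '
    '200 498 "-" "Mozilla/5.0 (compatible; Diffbot/3.0; +https://www.diffbot.com/bot/)"',
    '198.51.100.7 - - [10/Oct/2024:13:55:40 +0000] "GET / HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def count_diffbot_hits(lines):
    """Count requests per client IP whose user agent contains 'diffbot'."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and "diffbot" in match.group("agent").lower():
            hits[match.group("ip")] += 1
    return hits


def flag_heavy_clients(hits, threshold):
    """Return the sorted IPs with at least `threshold` matching requests."""
    return sorted(ip for ip, count in hits.items() if count >= threshold)


hits = count_diffbot_hits(SAMPLE_LOG)
print(flag_heavy_clients(hits, threshold=2))
```

Matching on the user agent only catches clients that identify themselves, so it works best alongside the subnet filter rather than in place of it.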
About the bot
Owner: Diffbot Technologies Corp.
Owner URL: diffbot.com
Bot URL: diffbot.com/bot/
Bot User Agent: Mozilla/5.0 (compatible; Diffbot/3.0; +https://www.diffbot.com/bot/)
Respects robots.txt: Yes