img2dataset

What is img2dataset?

img2dataset is not a web crawler in the traditional sense. It is an open-source tool that downloads large image datasets from lists of public URLs, typically supplied as CSV, TSV, or Parquet files. It is often used to build datasets for training computer vision models. When it is deployed at scale, some users configure it with a custom user-agent token such as “img2dataset”, which may then appear in server logs.

Who is operating img2dataset?

There is no centralized operator for img2dataset. It is an open-source project created and maintained by independent developer Romain Beaumont. However, anyone (an individual, a lab, or a company) can use the tool to download images at scale. If you see requests from the user-agent “img2dataset”, someone is using the tool to retrieve images from your site. That someone is not necessarily the developer or any specific organization.

Why should you be interested in img2dataset?

If your site hosts images, img2dataset may be used to download them in bulk, potentially thousands at a time. This creates three risks: server overload, bandwidth consumption, and unauthorized data harvesting for model training. Because usage of the tool is decentralized, there is no single operator to send an opt-out request to, so explicit server-side blocking is the only dependable control. Moreover, many img2dataset deployments do not honor crawl-delay or robots.txt.

How to block img2dataset?

1. robots.txt file:
Although there is no guarantee that the tool respects it, you can still add:

# block img2dataset
User-agent: img2dataset
Disallow: /

2. User-agent filtering:
At the web server level, block any request whose User-Agent header contains the string “img2dataset”.

3. IP monitoring:
Since the tool is run by third parties, source IPs vary. Watch your logs for sudden bursts of image requests to identify and filter abusive clients.
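
The user-agent filtering and IP monitoring steps above can be sketched in Python as follows. This is a minimal illustration, not a production filter. It assumes your server writes the common "combined" access log format. The names is_img2dataset and suspicious_ips, the list of image extensions, and the threshold of 100 image requests are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Token to look for in the User-Agent header (taken from the bot's UA string).
BLOCKED_TOKEN = "img2dataset"

# Hypothetical threshold: image requests from one IP before it is flagged.
IMAGE_REQUEST_THRESHOLD = 100

# Assumed "combined" access log format; adjust to your server's configuration.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical set of extensions treated as image requests.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def is_img2dataset(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the img2dataset token."""
    return BLOCKED_TOKEN in user_agent.lower()


def suspicious_ips(log_lines, threshold=IMAGE_REQUEST_THRESHOLD):
    """Return IPs that announce img2dataset or reach the image request threshold."""
    image_hits = Counter()
    flagged = set()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip = match["ip"]
        if is_img2dataset(match["agent"]):
            flagged.add(ip)
        # Drop any query string before checking the file extension.
        path = match["path"].split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTENSIONS):
            image_hits[ip] += 1
    flagged.update(ip for ip, hits in image_hits.items() if hits >= threshold)
    return flagged
```

The substring check catches only clients that announce themselves. The per-IP count is there for clients that have changed or removed the token, which is why the two checks are combined. The resulting set of IPs could then feed a firewall or web server deny list.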

About the bot

Owner: Open-source / Decentralized
Owner URL: github.com
Bot URL: https://github.com/rom1504/img2dataset
Bot User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; img2dataset; +https://github.com/rom1504/img2dataset)
Respects robots.txt: Perhaps
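
Whether a given client reads robots.txt is outside your control, but you can at least confirm that the rule shown earlier is written correctly. The sketch below uses Python's standard urllib.robotparser to evaluate that rule. The helper name is_allowed and the example.com URL are hypothetical. The check uses the short product token “img2dataset” rather than the full user-agent string, on the assumption that a compliant client would identify itself to robots.txt by that token.

```python
from urllib.robotparser import RobotFileParser

# The rule from the robots.txt step of this page.
ROBOTS_TXT = """\
# block img2dataset
User-agent: img2dataset
Disallow: /
"""


def is_allowed(robots_txt: str, agent_token: str, url: str) -> bool:
    """Return True if robots_txt permits agent_token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    # Hypothetical image URL on your own site.
    url = "https://example.com/images/cat.jpg"
    print("img2dataset allowed:", is_allowed(ROBOTS_TXT, "img2dataset", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

This only validates the rule itself. It says nothing about whether a particular img2dataset deployment will consult the file, so it complements server-side blocking and does not replace it.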
