Training
These bots do one thing: harvest the public web to build large-scale training corpora. They’re not here for classic search indexing; they vacuum up everything that’s publicly accessible, then disappear while engineers turn the raw text into model weights. Take GPTBot from OpenAI, arguably the most aggressive of the lot: if your robots.txt lets it through, your copy can end up in the training mix for the next GPT-class model. Or look at AI2Bot-Dolma from the Allen Institute: it scrapes with the specific goal of feeding the open-source Dolma dataset that researchers repackage into lighter academic models. Bottom line: if your content is proprietary or premium, block them, via robots.txt or an outright 403; if visibility and citation matter more than exclusivity, let them crawl and move on.
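As a sketch, the polite version of that opt-out lives in robots.txt; the user-agent tokens below match the crawlers named above, but verify them against each vendor's current documentation before relying on them:

```
# robots.txt: block training harvesters, allow everyone else
User-agent: GPTBot
Disallow: /

User-agent: AI2Bot-Dolma
Disallow: /

User-agent: *
Allow: /
```

Keep in mind robots.txt is advisory: well-behaved crawlers honor it, but for premium content the hard stop is the server-side 403, typically by matching the User-Agent header in your web server or CDN rules.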
Instant
These agents work at the other end of the pipeline: live-fetching just enough pages to craft an immediate answer for the user. When someone toggles browsing in ChatGPT, the ChatGPT-User agent fires, grabs a handful of URLs, and hands the snippets to the LLM for on-the-fly synthesis, sources included. Perplexity-User behaves similarly but pulls from a broader set of results and attaches explicit citations to build trust. For SEO, the playbook shifts: you need to rank within the first dozen results on Bing or Google and serve a concise, fact-rich paragraph that the model can quote verbatim. Structure, E-E-A-T signals, and a tight TL;DR are your ticket into the answer box.
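Because both families announce themselves in the User-Agent header, your server logs can separate training harvesters from on-demand answer fetchers. A minimal sketch in Python; the token lists are assumptions drawn from the crawlers named above, not an exhaustive registry:

```python
# Hypothetical helper: bucket crawler user-agents so analytics can
# distinguish corpus harvesting from live answer fetching.

TRAINING_BOTS = {"GPTBot", "AI2Bot-Dolma"}          # harvest for training corpora
ON_DEMAND_BOTS = {"ChatGPT-User", "Perplexity-User"}  # live fetch to answer a query

def classify(user_agent: str) -> str:
    """Return 'training', 'on-demand', or 'other' for a raw User-Agent string."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_BOTS):
        return "training"
    if any(token.lower() in ua for token in ON_DEMAND_BOTS):
        return "on-demand"
    return "other"
```

Run over a day of access logs, this split tells you which policy each hit falls under: the "training" bucket is the one your robots.txt or 403 decision applies to, while the "on-demand" bucket is traffic you probably want, since each hit is a user about to see your content cited.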