It's just a text file, similar to robots.txt, but for AI crawlers rather than search engine ones. Probably not very effective as of now, but at least it's a way to make it clear you don't consent to your site being used for AI training, without making the site worse for human users in the process.
Their first site (haveibeentrained.com) offered a way to search through all the training datasets, not realizing they were full of illegal porn, so it was quickly shut down.
Now their main gimmick is a browser extension that lets you see what data on any given site you visit was used for AI training, what has already been marked as "opted out", and a way to add your own stuff to that list.
I don't like that idea either: adding URLs to a list should not require a questionable browser extension, and in general, opting out of all the places that might have your images doesn't seem worth the time if the companies don't even have to respect the request.
If you just want the txt file without the additional nonsense, feel free to take the default one I use here: https://thecanine.ueuo.com/ai.txt and use or edit it to match your needs.
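For a rough sense of what the file looks like: it borrows robots.txt syntax, with User-Agent and Disallow lines. The snippet below is my own hand-written sketch, not the generator's exact output – the specific directives and comment style are assumptions based on how these files typically look:

```
# ai.txt - opt this site out of AI/ML training datasets
User-Agent: *
Disallow: *.jpg
Disallow: *.png
Disallow: /
```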
Why does this generator add tons of `*.ext` rules when it also has a simple `*` to catch them all? I'm not a robots.txt expert, but that feels redundant. If I do not have an ai.txt, is their crawler consulting my robots.txt? I could not find an answer to that – in my opinion – obvious question. I don't want any bots on my site and disallow `*` anyway. :-D Evidence from logs suggests the crawler identifies itself as "Spawning-AI".

Yup, @thecanine, I thought so, too. Reminds me a bit of Google using the least restrictive robots.txt rule when in doubt (at least there you could argue for improved searchability; but it smells a bit fishy).
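The redundancy is easy to demonstrate: assuming straightforward glob semantics (I don't know exactly how their crawler matches patterns), a lone `*` already covers everything the per-extension rules do. A quick sanity check in Python:

```python
from fnmatch import fnmatch

# Paths a crawler might encounter on a typical site.
paths = ["/images/cat.jpg", "/art/dog.png", "/index.html"]

# A lone "*" rule already matches every path...
print(all(fnmatch(p, "*") for p in paths))

# ...while each "*.ext" rule only covers its own file type,
# so listing them alongside "*" adds nothing.
print(fnmatch("/images/cat.jpg", "*.jpg"))
print(fnmatch("/index.html", "*.jpg"))
```

The first two prints are True, the last is False – so every path a `*.ext` rule would catch is already caught by `*`.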
In the logs I see these three 404s in a row from someone claiming to be their bot:
* /.well-known/tdmrep.json
* /ai.txt?t=1704481081.54321
* /.well-known/ai.txt?t=1704481081.54321
I had never heard of the TDM Reservation Protocol before:
> This specification defines a simple and practical Web protocol, capable of
> expressing the reservation of rights relative to text & data mining (TDM)
> applied to lawfully accessible Web content, and to ease the discovery of TDM
> licensing policies associated with such content.
>
> This initiative is a technical answer to the constraints set by the Article 4
> of the new European Directive on copyright and related rights in the Digital
> Single Market.
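From what I can tell, the spec has a site declare its reservation in /.well-known/tdmrep.json as a JSON array of rules, each covering a path prefix. Something along these lines (my own sketch following the spec's examples; the policy URL is a placeholder):

```json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-policy.json"
  }
]
```

A `tdm-reservation` of 1 means text-and-data-mining rights are reserved for the matching paths; 0 means they are not.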