It's just a text file, similar to robots.txt, but for AI crawlers rather than search engine ones. Probably not very effective as of now, but at least it's a way to make it clear you don't consent to your site being used for AI training, without making the site worse for human users in the process.
Their first site (haveibeentrained.com) offered a way to search through all the training datasets, not realizing they were full of illegal porn, so it was quickly shut down.
Now their main gimmick is a browser extension that lets you see what data on any given site you visit was used for AI training, what has already been marked as "opted out", and a way to add your own stuff to that list.
I don't like that idea either: adding URLs to a list should not require a questionable browser extension, and in general, opting out of all the places that might have your images doesn't seem worth the time if the companies don't even have to respect the request.
If you just want the txt file without the additional nonsense, feel free to take the default one I use here: https://thecanine.ueuo.com/ai.txt and use or edit it to match your needs.
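For a rough sense of what the file looks like: it borrows robots.txt syntax, with User-Agent and Disallow lines. The snippet below is my own hand-written sketch, not the generator's exact output – the specific directives and comment style are assumptions based on how these files typically look:

```
# ai.txt - opt this site out of AI/ML training datasets
User-Agent: *
Disallow: *.jpg
Disallow: *.png
Disallow: /
```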
Why does this generator add tons of `*.ext` rules when it also has a simple `*` to catch them all? I'm not a robots.txt expert, but that feels redundant. If I do not have an ai.txt, is their crawler consulting my robots.txt? I could not find an answer to that – in my opinion – obvious question. I don't want any bots on my site and disallow `*` anyway. :-D Evidence from logs suggests the crawler identifies itself as "Spawning-AI".

Yup, @thecanine, I thought so, too. Reminds me a bit of Google using the least restrictive robots.txt rule when in doubt (at least there you could argue for improved searchability; but it smells a bit fishy).
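The redundancy is easy to demonstrate: assuming straightforward glob semantics (I don't know exactly how their crawler matches patterns), a lone `*` already covers everything the per-extension rules do. A quick sanity check in Python:

```python
from fnmatch import fnmatch

# Paths a crawler might encounter on a typical site.
paths = ["/images/cat.jpg", "/art/dog.png", "/index.html"]

# A lone "*" rule already matches every path...
print(all(fnmatch(p, "*") for p in paths))

# ...while each "*.ext" rule only covers its own file type,
# so listing them alongside "*" adds nothing.
print(fnmatch("/images/cat.jpg", "*.jpg"))
print(fnmatch("/index.html", "*.jpg"))
```

The first two prints are True, the last is False – so every path a `*.ext` rule would catch is already caught by `*`.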
In the logs I see these three 404s in a row from someone claiming to be their bot:
* /.well-known/tdmrep.json
* /ai.txt?t=1704481081.54321
* /.well-known/ai.txt?t=1704481081.54321
I had never heard of the TDM Reservation Protocol before:
> This specification defines a simple and practical Web protocol, capable of
> expressing the reservation of rights relative to text & data mining (TDM)
> applied to lawfully accessible Web content, and to ease the discovery of TDM
> licensing policies associated with such content.
>
> This initiative is a technical answer to the constraints set by the Article 4
> of the new European Directive on copyright and related rights in the Digital
> Single Market.
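From what I can tell, the spec has a site declare its reservation in /.well-known/tdmrep.json as a JSON array of rules, each covering a path prefix. Something along these lines (my own sketch following the spec's examples; the policy URL is a placeholder):

```json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-policy.json"
  }
]
```

A `tdm-reservation` of 1 means text-and-data-mining rights are reserved for the matching paths; 0 means they are not.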