The Watcher

This organigram example got me started: https://www.sqlite.org/lang_with.html#controlling_depth_first_versus_breadth_first_search_of_a_tree_using_order_by

But I feel execution times get worse rather quickly the more data I add. Also, caching helps tremendously: executing it for the first time took over 600ms. From then on I'm down to 40ms.

I think it's particularly bad that parents might be missing. Thus, I cannot use an index, because there is no parent to reference. But my database knowledge is fairly limited, so I have to read up on that.
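
A minimal sketch of the missing-parent case, in case it helps. The table and column names (`messages`, `hash`, `parent_hash`) are made up for illustration and are not taken from the actual cache schema. The idea is that a plain index on the parent column does not need a foreign key, so it can exist even when the referenced parent row is absent; messages whose parent is unknown are simply treated as additional roots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        hash        TEXT PRIMARY KEY,
        parent_hash TEXT,            -- plain column, deliberately no FOREIGN KEY
        text        TEXT NOT NULL
    );
    -- An ordinary index works even if the parent row does not exist.
    CREATE INDEX messages_parent ON messages (parent_hash);
""")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("aaa", None, "root"),
        ("bbb", "aaa", "reply to root"),
        ("ccc", "zzz", "reply to a parent that is not in the cache"),
        ("ddd", "ccc", "nested reply"),
    ],
)

TREE_QUERY = """
WITH RECURSIVE tree (hash, text, depth) AS (
    -- Roots: no parent at all, or a parent that is missing from the cache.
    SELECT m.hash, m.text, 0
    FROM messages AS m
    WHERE m.parent_hash IS NULL
       OR NOT EXISTS (SELECT 1 FROM messages AS p WHERE p.hash = m.parent_hash)
    UNION ALL
    SELECT c.hash, c.text, tree.depth + 1
    FROM messages AS c
    JOIN tree ON c.parent_hash = tree.hash
    ORDER BY 3 DESC                  -- depth-first, as in the SQLite docs example
)
SELECT hash, depth FROM tree
"""
rows = conn.execute(TREE_QUERY).fetchall()
for message_hash, depth in rows:
    print("  " * depth + message_hash)
```

Whether the index actually helps depends on the real query; `EXPLAIN QUERY PLAN` on the real database would tell.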

prologic

twtxt.net

23 Sep 24 11:10 UTC

In fact, it depends on how many Twts form part of a thread. If you take a much larger sample of my own feed, for example, it starts to approximate a ~1.5x increase in size:


$ ./compare.sh https://twtxt.net/user/prologic/twtxt.txt 500
Original file size: 126842 bytes
Modified file size: 317029 bytes
Percentage increase in file size: 149.94%
...
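
For reference, the percentage in that output can be reproduced from the two file sizes alone. This is only a sketch of the arithmetic, not the actual `compare.sh`; the function name is made up.

```python
def percentage_increase(original: int, modified: int) -> float:
    """Relative growth of `modified` over `original`, in percent."""
    return (modified - original) / original * 100


original_size = 126842   # bytes, from the output above
modified_size = 317029   # bytes, from the output above

increase = percentage_increase(original_size, modified_size)
factor = modified_size / original_size
print(f"{increase:.2f}% larger, i.e. {factor:.2f}x the original size")
```

Note that a ~150% increase means the modified file is about 2.5 times the size of the original.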

prologic

twtxt.net

23 Sep 24 11:04 UTC

In fact @falsifian you had quite a lot of good feedback, do you mind collecting them in a task list on the doc somewhere so I can get to em? 🤔

prologic

twtxt.net

23 Sep 24 11:00 UTC

Can someone make the edit?

@jo

comam.es

23 Sep 24 13:00 UTC+0200

[47°09′54″S, 126°43′08″W] Transfer 25% complete...

lyse

lyse.isobeef.org

23 Sep 24 13:00 UTC+0200

There you go, @prologic, the SQLite database (with a bit more data now) and the sqlitebrowser project file containing the query: https://lyse.isobeef.org/tmp/tt2cache.tar.bz2 (133.9 KiB)

prologic

twtxt.net

23 Sep 24 10:57 UTC

@movq This was just a representative sample. The real concrete cost here is a ~5x increase in memory consumption for yarnd and/or a ~5x increase in disk storage.

prologic

twtxt.net

23 Sep 24 10:51 UTC

@lyse Mind sharing your schema?

prologic

twtxt.net

23 Sep 24 10:50 UTC

@lyse Not sure, I'll check

prologic

twtxt.net

23 Sep 24 10:49 UTC

@lyse My proposal is three steps:

- Increase the hash length from 7 to 11

Then:

- Add support for changing your feed's location without breaking threads

Then much later:

- Add formal support for edits
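
To make the first step concrete, here is a rough sketch of a content-based twt hash with a configurable length. It assumes the recipe from the Twt Hash extension as I understand it (BLAKE2b-256 over feed URL, timestamp and content joined by newlines, base32 without padding, lowercased, last N characters); the feed URL and content below are made up.

```python
import base64
import hashlib


def twt_hash(feed_url: str, timestamp: str, content: str, length: int = 7) -> str:
    """Content-based twt hash; `length` is the number of trailing characters kept."""
    payload = f"{feed_url}\n{timestamp}\n{content}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    encoded = base64.b32encode(digest).decode("ascii").lower().rstrip("=")
    return encoded[-length:]


# Hypothetical twt, only for illustration.
url = "https://example.com/twtxt.txt"
ts = "2024-09-23T10:49:00Z"
text = "Hello twtxt!"

short_hash = twt_hash(url, ts, text, length=7)
long_hash = twt_hash(url, ts, text, length=11)
print(short_hash, long_hash)
```

With this construction the 11-character hash simply ends in the old 7-character one, so going from 7 to 11 only changes how much of the same digest is kept.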

prologic

twtxt.net

23 Sep 24 10:45 UTC

@lyse No I don't either just say'n 😅

lyse

lyse.isobeef.org

23 Sep 24 12:45 UTC+0200

@falsifian I agree. It's an optional header.

prologic

twtxt.net

23 Sep 24 10:43 UTC

@movq That's what I want to know 🤣

movq

www.uninformativ.de

23 Sep 24 10:33 UTC+0000

@prologic What’s that in absolute numbers? My ~/Mail/twt is currently 26 MB in size. Increase that by 20% and we get 31.2 MB.

I don’t buy the argument with 2025 bytes. This worst case scenario is not relevant in practice.

lyse

lyse.isobeef.org

23 Sep 24 12:30 UTC+0200

@movq Oha! @bender Happy cooling off!

lyse

lyse.isobeef.org

23 Sep 24 12:15 UTC+0200

@prologic Well, mentions are also quite lengthy as they always include the feed URL. I know, that's not a good argument.

I just got a very, very wild idea that I have not put any brain power into, so it might be totally stupid: Since many replies also mention the original feed, maybe a mention and thread identifier could be combined, something like: @<nick url timestamp>. But then we would also need another style if one does not want to mention the original author.

So, scratch that. But I put it out there anyway. Maybe this inspires someone else to come up with something neat.

movq

www.uninformativ.de

23 Sep 24 10:14 UTC+0000

It’s a different story when you just publish a twtxt file, I think. The question here is: When you publish a twt and don’t like it anymore and want to delete it, do you have the *right* to *force* others to delete it? (Not in a technical manner, but by suing them.) What does the GDPR have to say about that? Not a clue. 😂

GopherChat

magical.fish:70

23 Sep 24 04:14 UTC-0600

What gossip, gopherspace?!

movq

www.uninformativ.de

23 Sep 24 10:13 UTC+0000

@xuu I *think* it is more tricky than that.

https://commission.europa.eu/law/law-topic/data-protection/reform/rules-business-and-organisations/application-regulation/who-does-data-protection-law-apply_en

“A company *or entity* …”

Also, as I understand it, “personal or household activity” (as you called it) is rather strict: An example could be you uploading photos to a webspace behind HTTP basic auth and sending that link to a friend. So, yes, a webserver is involved and you process your friend’s data (e.g., when did he access your files), but it’s just between you and him. But if you were to publish these photos publicly on a webserver that anyone can access, then it’s a different story – even though you could say that “this is just my personal hobby, not related to any job or money”.

If you operate a public Yarn pod and *if you accept registrations from other users*, then I’m pretty sure the GDPR applies. 🤔 You process personal data and you don’t really know these people. It’s not a personal/private thing anymore.

lyse

lyse.isobeef.org

23 Sep 24 12:00 UTC+0200

@prologic Not sure how many actually care about a 140 character limit. I don't. Not at all.

lyse

lyse.isobeef.org

23 Sep 24 11:45 UTC+0200

@prologic I'm wondering what exactly you mean by incremental changes, what are the individual ones? What do you have in mind?

lyse

lyse.isobeef.org

23 Sep 24 11:30 UTC+0200

@prologic I find it quite hard to rank the facets. Some go hand in hand or depend on the protocol that a feed is offered. I feel some are only relevant to specific clients. I'm sure, people interpret some of them differently.

I'm curious, is it possible to see each individual poll submission?

lyse

lyse.isobeef.org

23 Sep 24 11:15 UTC+0200

I'm experimenting with SQLite and trees. It's going well so far with only my own 439-message main feed from a few days ago in the cache. Fetching these 632 rows took 20ms:

SQL query to build up the conversation trees in the cache

Now comes the really tricky part: how do I exclude completely read threads?
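
One possible approach, sketched under assumptions: carry the thread root along in the recursive query, then keep only threads where at least one message is still unread. The schema here (`messages`, `parent_hash`, `is_read`) is hypothetical and not the real cache layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        hash        TEXT PRIMARY KEY,
        parent_hash TEXT,
        is_read     INTEGER NOT NULL DEFAULT 0
    );
""")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("a1", None, 1), ("a2", "a1", 1),   # thread a: completely read
        ("b1", None, 1), ("b2", "b1", 0),   # thread b: one unread reply
    ],
)

UNREAD_THREADS = """
WITH RECURSIVE tree (root, hash, is_read) AS (
    -- Each root starts its own thread and is its own root.
    SELECT m.hash, m.hash, m.is_read
    FROM messages AS m
    WHERE m.parent_hash IS NULL
       OR NOT EXISTS (SELECT 1 FROM messages AS p WHERE p.hash = m.parent_hash)
    UNION ALL
    -- Replies inherit the root of their parent.
    SELECT tree.root, c.hash, c.is_read
    FROM messages AS c
    JOIN tree ON c.parent_hash = tree.hash
)
SELECT hash FROM tree
WHERE root IN (SELECT root FROM tree GROUP BY root HAVING MIN(is_read) = 0)
ORDER BY hash
"""
unread = [row[0] for row in conn.execute(UNREAD_THREADS)]
print(unread)
```

Thread a drops out entirely, while thread b is kept whole, including its already read root, so the unread reply still has its context.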

prologic

twtxt.net

23 Sep 24 07:58 UTC

So just to be clear, it's not as bad as the OP in this thread; that is just a worst-case scenario. With some additional analysis I did today, it's closer to around ~5x the memory requirements of my pod, which would roughly go from ~22MB to ~120MB or so, probably a bit more in practice. But this is still a significant increase in memory. The on-disk requirements would also increase by around ~5x on average, going from ~12GB to about ~60GB at current archive size.

@jo

comam.es

23 Sep 24 09:00 UTC+0200

[47°09′20″S, 126°43′13″W] Sample analyzing complete -- starting transfer

prologic

twtxt.net

23 Sep 24 06:46 UTC

Just out of curiosity, I inspected the yarns database (_the search engine/crawler_) to find the average length of a Twtxt URI:


$ inspect-db yarns.db | jq -r '.Value.URL' | awk '{ total += length; count++ } END { if (count > 0) print total / count }'
40.3387

Given that an RFC3339 UTC timestamp has a length of 20 characters with seconds precision, we're talking about a Twt Subject taking up ~63 characters/bytes on average.
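
The ~63 figure can be retraced roughly like this. It is a sketch of the arithmetic only; the 3 bytes of overhead (two parentheses and a space) are an assumption borrowed from an earlier twt in this thread.

```python
from datetime import datetime, timezone

avg_url_length = 40.3387  # average feed URL length from the yarns database above

# An RFC3339 UTC timestamp with seconds precision, e.g. 2024-09-23T06:46:00Z
timestamp = datetime(2024, 9, 23, 6, 46, 0, tzinfo=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%SZ"
)

overhead = 3  # assumption: two parentheses and a space

avg_subject_length = avg_url_length + len(timestamp) + overhead
print(len(timestamp), round(avg_subject_length, 2))
```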

prologic

twtxt.net

23 Sep 24 06:30 UTC

Comparing a few feeds:

- @xuu would see an increase of ~20%
- @falsifian would see an increase of ~8%
- @bender would see an increase of ~20%
- @lyse would see an increase of ~15%
- @aelaraji would see an increase of ~13%
- @sorenpeter would see an increase of ~8%
- @movq would see an increase of ~9%

Just from a scalability standpoint alone, I'm not seeing a switch to location-based Twt ids to support threading as a good idea here. This is what I meant when I said to @david in a recent call that we open up a new can of worms (_or new set of problems_) by drastically changing the approach, rather than incrementally improving the existing approach we have today (_which has served us well for the past 4 years already_).

prologic

twtxt.net

23 Sep 24 06:23 UTC

Reminder to take the Twtxt (_anonymous_) Poll: http://polljunkie.com/poll/xdgjib/twtxt-v2

Apologies, I can't edit the poll once it's live, so the suggestion on feedback for supporting Markdown will have to be discussed at another time.

prologic

twtxt.net

@xuu correct

prologic

twtxt.net

@xuu 🤣🤣🤣

xuu

dev.txt.sour.is

22 Sep 24 23:10 UTC-0600

I demand full 9 digit nano second timestamps and the full TZ identifier as documented in the tz 2024b database! I need to know if there was a change in daylight savings as per the locality in question as of the provided date.

xuu

txt.sour.is

22 Sep 24 23:03 UTC-0600

@falsifian I believe the preserve means to include the original subject hash in the start of the twt such as (#somehash)

@jo

comam.es

23 Sep 24 07:00 UTC+0200

[47°09′47″S, 126°43′17″W] Analyzing samples

prologic

twtxt.net

23 Sep 24 04:57 UTC

So I whipped up a quick shell script to demonstrate what I mean by the increase in feed size on average as well as the expected increase in storage and retrieval requirements.


$ ./compare.sh
Original file size: 28145 bytes
Modified file size: 70672 bytes
Percentage increase in file size: 151.10%
...

prologic

twtxt.net

Thank goodness we relaxed that limit and I've stopped being so Puritan about it, but my overall point is that we would be significantly increasing the human size as well as the machine size of the identity of threads as well as twts.

prologic

twtxt.net

With the original specification's 140-character Twt length recommendation, that only leaves you with about 78 characters' worth of anything remotely useful to say in response.
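
The 78 comes out of the numbers given in the earlier twts of this thread (a 59-byte subject plus 3 bytes of overhead), roughly like so:

```python
TWT_LENGTH_RECOMMENDATION = 140  # characters, from the original twtxt spec

subject_bytes = 59   # feed URL plus timestamp, from the example in this thread
overhead_bytes = 3   # two parentheses and a space

remaining = TWT_LENGTH_RECOMMENDATION - (subject_bytes + overhead_bytes)
print(remaining)
```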

prologic

twtxt.net

Let's say the overhead is always three bytes: two parentheses and a space.

prologic

twtxt.net

So for example, if we use @movq's feed as an example thread ID here, his feed with a particular timestamp, we're already looking at a subject length of 59 bytes, +/- a couple of bytes to denote the subject in the Twt itself.

prologic

twtxt.net

23 Sep 24 04:05 UTC

One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them by hand.

With the proposal to switch to location-based addressing, using a pointer to a feed and a timestamp in that feed, you're looking at subjects of up to roughly 2025 characters, because the HTTP, HTML and even URI specifications do not specify a maximum length for URI(s); AFAIK there are only recommendations.

prologic

twtxt.net

23 Sep 24 03:59 UTC

@bender I can't see myself, personally, increasing the infrastructure and costs to run this pod to support this as we potentially switch over and as things continue to grow in scale. You would never get your infinite search and infinite timeline features that you've always wanted, for example, and I would have to drastically reduce what is visible or even searchable at any given point in time to much less than what it is today.

prologic

twtxt.net

23 Sep 24 03:57 UTC

Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!
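
These numbers check out for key storage alone, counting one byte per character and using decimal megabytes and gigabytes. A quick sketch of the arithmetic:

```python
entries = 3_500_000

short_key_bytes = 7      # current 7-character hash
long_key_bytes = 2025    # worst-case location-based key from this thread

short_total = entries * short_key_bytes
long_total = entries * long_key_bytes

print(f"{short_total / 1e6:.1f} MB")                  # keys today
print(f"{long_total / 1e9:.2f} GB")                   # keys in the worst case
print(f"{(long_total - short_total) / 1e9:.2f} GB")   # difference
```

This is the worst case; plugging in the ~63-byte average subject from elsewhere in this thread instead of 2025 gives a far smaller figure.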

falsifian

www.falsifian.org

23 Sep 24 02:20 UTC

@bender Ha! Maybe I should get on the Markdown train. You're taking away my excuses.

bender

twtxt.net

23 Sep 24 00:57 UTC

@falsifian you can colorise things in Mutt/Neomutt. I have colours for bold, italics, code, and blockquotes. In a way, I can “see” markdown! 😊

prologic

twtxt.net

23 Sep 24 00:56 UTC

@falsifian No worries! Feel free to contribute to the doc directly if you wish 👌

prologic

twtxt.net

23 Sep 24 00:55 UTC

@falsifian Hmmm not sure sorry 🤔

falsifian

www.falsifian.org

23 Sep 24 00:50 UTC

@bender

Sorry, you're right, I should have used numbers!

I don't understand what "preserve the original hash" could mean other than "make sure there's still a twt in the feed with that hash". Maybe the text could be clarified somehow.

I'm also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it's not universal; e.g. as a jenny user I just see the plain text.

prologic

twtxt.net

23 Sep 24 00:45 UTC

@xuu Good to know! 👌 So as long as we remain decentralized and non-commercial (I assume non-profit works too?) we're good?

xuu

dev.txt.sour.is

22 Sep 24 18:15 UTC-0600

@falsifian The GDPR does not apply to the processing of data for a purely personal or household activity that is not connected to a professional or commercial activity.

bender

twtxt.net

23 Sep 24 00:06 UTC

@falsifian it would be easier if instead of a bulleted list you would have used a numbered one. That way it would be easier to refer to the specific miscellaneous comment.

I have little to contribute on this reply. On bullet two, he meant the original hash. On the last bullet, markdown is already part of it (after all, it is plain text). Yarn, being a web client/server, simply renders it.

stats

yarn.meff.me

23 Sep 24 00:00 UTC

🧮 USERS:1 FEEDS:2 TWTS:1101 ARCHIVED:79243 CACHE:2591 FOLLOWERS:17 FOLLOWING:14

falsifian

www.falsifian.org

22 Sep 24 23:37 UTC

@prologic Do you feel the same about published vs. privately stored data?

For me there's a distinction. I feel very strongly that I should be able to retain whatever private information I like. On the other hand, I do have some sympathy for requests not to publish or propagate (though I personally feel it's still morally acceptable to ignore such requests).

falsifian

www.falsifian.org

22 Sep 24 23:32 UTC

@lyse I'd suggest making the whole content-type thing a SHOULD, to accommodate people just using some hosting service they don't have much control over. (The same situation could make detecting followers hard, but IMO "please email me if you follow me" is still legit twtxt, even if inconvenient.)

falsifian

www.falsifian.org

22 Sep 24 23:27 UTC

@prologic Thanks for writing that up!

I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.

I am not sure how I feel about all this being done at once, vs. letting conventions arise.

For example, even today I could reply to twt abc1234 with "(#abc1234) Edit: ..." and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
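
As a rough illustration of how little a client would need to recognise such a convention: the sketch below parses a legacy `(#hash)` subject and checks for a leading "Edit:". The regular expression and the function name are made up for this example and are not part of any spec.

```python
import re

# Legacy subject at the very start of a twt, e.g. "(#abc1234) some text"
SUBJECT_RE = re.compile(r"^\(#(?P<hash>[a-z0-9]+)\)\s*(?P<body>.*)$", re.DOTALL)


def parse_reply(text: str):
    """Return (subject_hash, is_edit, body), or None if there is no subject."""
    match = SUBJECT_RE.match(text)
    if match is None:
        return None
    body = match.group("body")
    is_edit = body.startswith("Edit:")
    if is_edit:
        body = body[len("Edit:"):].strip()
    return match.group("hash"), is_edit, body


print(parse_reply("(#abc1234) Edit: fixed a typo"))
print(parse_reply("(#abc1234) just a normal reply"))
print(parse_reply("no subject here"))
```

A client that does not know the convention would simply show the twt as an ordinary reply, which is the backward-compatible part.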

Similarly we could just start using 11-digit hashes. We should iron out whether it's sha256 or whatever but there's no need to get all the other stuff right at the same time.

I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).

However I recognize that I'm not the one implementing this stuff, and it's less work to just have everything determined up front.

Misc comments (I haven't read the whole thing):

- Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. I'd suggest gaining 11 bits with base64 instead.

- "Clients MUST preserve the original hash" --- do you mean they MUST preserve the original twt?

- Thanks for phrasing the bit about deletions so neutrally.

- I don't like the MUST in "Clients MUST follow the chain of reply-to references...". If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldn't declare the client non-conforming just because they didn't get to all the bells and whistles.

- Similarly I don't like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identity. Also, it raises the bar for a minimal implementation (I'm again thinking of the 40-line shell script).

- For "who follows" lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?

- Why can't feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasn't too bad, but 1.0 would have been slightly simpler.

- Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.

- I'm a little sad about other protocols being not recommended.

- I don't know how I feel about including markdown. I don't mind too much that yarn users emit twts full of markdown, but I'm more of a plain text kind of person. Also it adds to the length. I wonder if putting it in a separate document would make more sense; that would also help with the length.
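
On the first comment about hash encodings, the "11 bits" can be checked with a few lines. This only counts how many bits an 11-character hash can carry per alphabet; it says nothing about which hash function is used.

```python
import math

HASH_LENGTH = 11
ALPHABET_SIZES = {"hex": 16, "base32": 32, "base64": 64}

# Each character carries log2(alphabet size) bits.
bits = {
    name: HASH_LENGTH * int(math.log2(size))
    for name, size in ALPHABET_SIZES.items()
}

for name, total in bits.items():
    print(f"{name}: {total} bits")

print("hex vs base32:", bits["hex"] - bits["base32"])
print("base64 vs base32:", bits["base64"] - bits["base32"])
```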

bender

twtxt.net

22 Sep 24 21:52 UTC

Meanwhile in Florida we are having a very Autumnal Equinox day, with temperatures 10-14° cooler than normal. That, on its own, isn’t normal at all, but I taketh! 😂

@jo

comam.es

22 Sep 24 21:00 UTC+0200

[47°09′40″S, 126°43′18″W] Raw reading: 0x66F06931, offset +/-2

movq

www.uninformativ.de

22 Sep 24 18:01 UTC

@lyse Wet and warm, yeah. 🫤 There were flies everywhere, lots of them, on all windows of the apartment. Never seen anything like that. 😳🪰 Like the building was a dead carcass. 😂

cuaxolotl

sunshinegardens.org

22 Sep 24 11:01 UTC-0700