The Watcher

lyse

lyse.isobeef.org

I'm experimenting with SQLite and trees. It's going well so far with only my own main feed, 439 messages long, from a few days ago in the cache. Fetching these 632 rows took 20ms:

SQL query to build up the conversation trees in the cache

Now comes the really tricky part: how do I exclude completely read threads?

prologic

twtxt.net

23 Sep 24 10:51 UTC

@lyse Mind sharing your schema?

lyse

lyse.isobeef.org

23 Sep 24 13:00 UTC+0200

There you go, @prologic, the SQLite database (with a bit more data now) and the sqlitebrowser project file containing the query: https://lyse.isobeef.org/tmp/tt2cache.tar.bz2 (133.9 KiB)

lyse

lyse.isobeef.org

23 Sep 24 13:15 UTC+0200

This organigram example got me started: https://www.sqlite.org/lang_with.html#controlling_depth_first_versus_breadth_first_search_of_a_tree_using_order_by

But I feel execution times get worse rather quickly the more data I add. Also, caching helps tremendously: executing it for the first time took over 600ms. From then on I'm down to 40ms.

I think it's particularly bad that parents might be missing. Thus, I cannot use an index, because there is no parent to reference. But my database knowledge is fairly limited, so I have to read up on that.
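
A minimal sketch of such a tree query, modeled on the organigram example from the SQLite documentation and wrapped in Python's sqlite3 module so it can be run as is. This is not the query from the archive. Only the column names hash, subject and created_at appear in this thread; the text column, the assumption that subject holds the parent's hash ('' for a root) and the sample data are made up for illustration.

```python
import sqlite3

# Hypothetical schema: only hash, subject and created_at are named in the
# thread. That subject stores the parent's hash ('' for a conversation root)
# is an assumption, as is the text column.
SCHEMA = """
CREATE TABLE messages (
    hash       TEXT PRIMARY KEY,
    subject    TEXT NOT NULL DEFAULT '',
    created_at TEXT NOT NULL,
    text       TEXT NOT NULL
);
"""

TREE_QUERY = """
WITH RECURSIVE
roots(hash, created_at) AS (
    -- Existing roots: messages that reply to nothing.
    SELECT hash, created_at FROM messages WHERE subject = ''
    UNION ALL
    -- Missing roots: parents that are referenced but not stored.
    -- Their timestamp is made up from the oldest direct reply.
    SELECT subject, min(created_at) FROM messages
    WHERE subject > ''
      AND subject NOT IN (SELECT hash FROM messages)
    GROUP BY subject
),
tree(hash, depth, created_at) AS (
    SELECT hash, 0, created_at FROM roots
    UNION ALL
    SELECT m.hash, tree.depth + 1, m.created_at
    FROM messages AS m JOIN tree ON m.subject = tree.hash
    -- Deepest row first gives a depth-first walk; siblings by age.
    ORDER BY 2 DESC, 3
)
SELECT hash, depth FROM tree;
"""

# Invented sample data: one stored root "a" and one parent ("missing")
# that is referenced by two replies but absent from the table.
SAMPLE = [
    ("a", "", "2024-09-23T10:00:00Z", "root"),
    ("b", "a", "2024-09-23T10:05:00Z", "reply to a"),
    ("c", "b", "2024-09-23T10:10:00Z", "reply to b"),
    ("d", "a", "2024-09-23T10:20:00Z", "second reply to a"),
    ("e", "missing", "2024-09-23T10:30:00Z", "late reply to unknown parent"),
    ("f", "missing", "2024-09-23T10:15:00Z", "early reply to unknown parent"),
]


def conversation_tree(conn):
    """Return (hash, depth) rows in depth-first order."""
    return conn.execute(TREE_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", SAMPLE)

for msg_hash, depth in conversation_tree(conn):
    print("  " * depth + msg_hash)
```

The ORDER BY inside the recursive part is what turns the default breadth-first walk into a depth-first one, so replies end up directly below their parent.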

prologic

twtxt.net

23 Sep 24 11:18 UTC

@lyse And your query to construct a tree? Can you share the full query (_screenshot looks scary 🤣_) -- On another note, SQL and relational databases aren't really that conducive to tree-like structures, are they? 🤣

lyse

lyse.isobeef.org

23 Sep 24 14:45 UTC+0200

@prologic Yeah, relational databases are definitely not the perfect fit for trees, but I want to give it a shot anyway. :-)

Using EXPLAIN QUERY PLAN I was able to create two indices, to avoid some table scans:

CREATE INDEX parent ON messages (hash, subject);
CREATE INDEX subject_created_at ON messages (subject, created_at);

Also, since strings are sortable, instead of str_col <> '' I now use str_col > '' to allow the use of an index.
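
To see the effect of that rewrite, one can compare the query plans for both comparisons. The following is a rough sketch using Python's sqlite3 module. The table definition is an assumption (only the two indices are given above), and the SELECT statements are invented for the comparison. The exact wording of the plan differs between SQLite versions, so the script just prints it.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row is (id, parent, notused, detail).
    return "; ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; the real schema is in the linked archive.
    CREATE TABLE messages (
        hash       TEXT PRIMARY KEY,
        subject    TEXT NOT NULL DEFAULT '',
        created_at TEXT NOT NULL
    );
    -- The two indices from the post above.
    CREATE INDEX parent ON messages (hash, subject);
    CREATE INDEX subject_created_at ON messages (subject, created_at);
""")

# "<>" is not a range constraint, so SQLite has to walk all rows.
plan_ne = query_plan(
    conn, "SELECT created_at FROM messages WHERE subject <> ''")
# ">" is a range constraint and can be answered by an index search.
plan_gt = query_plan(
    conn, "SELECT created_at FROM messages WHERE subject > ''")

print("<> :", plan_ne)
print(">  :", plan_gt)
```

Both conditions select the same rows only as long as subject contains nothing but text values, since the empty string sorts before every other string.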

But I just noticed that my output seems to be broken at the end for some reason. :-? Hmm.

The read status still gives me a headache. I think I either have to filter in the application or create more metadata structures in the database.

I'm wondering if anyone here has already used a particular kind of storage for tree data.

prologic

twtxt.net

23 Sep 24 13:00 UTC

@lyse Yeah I _think_ it's one of the reasons why yarnd's cache became so complicated really. I mean it's a bunch of maps and lists that are recalculated every ~5m. I don't know of any better way to do this right now, but maybe one day I'll figure out a better way to represent the same information that is displayed today that works reasonably well.

lyse

lyse.isobeef.org

23 Sep 24 15:30 UTC+0200

Okay, I figured out the cause of the broken output. I had also replaced the first subject = '' for the existing conversation roots with subject > ''. Somehow, my brain must have read subject <> ''. That equality check should not have been touched at all. I just updated the archive for anyone who is interested in following along: https://lyse.isobeef.org/tmp/tt2cache.tar.bz2 (151.1 KiB)

lyse

lyse.isobeef.org

24 Sep 24 17:00 UTC+0200

Three feeds (prologic, movq and mine) and my database is already 1.3 MiB in size. Hmm. I actually got the read filter working. More on that later after polishing it.

prologic

twtxt.net

24 Sep 24 15:21 UTC

@lyse Congrats 🙌

lyse

lyse.isobeef.org

24 Sep 24 18:45 UTC+0200

@prologic Ta! Somehow, my unit tests break, though. Running the same query manually seems to produce a plausible-looking result. I do not understand it.

lyse

lyse.isobeef.org

24 Sep 24 19:45 UTC+0200

Finally! After hours I figured out my problems.

1. The clever Go code to filter out completely read conversations got in the way of the filtering that has now moved into SQL. Yeah, I also did not think that this could ever conflict. But it did. Initializing the completeConversationRead flag to true now got in my way: it caused a conversation to be removed. Simply deleting all the code around that flag solved it.

2. Generation of missing conversation roots in SQL simply used the oldest (smallest) timestamp from any direct reply in the tree. To find the missing roots I grouped by subject and then aggregated using min(created_at). Now that I optimized this to only take unread messages into consideration in the first place, I do not necessarily see the smallest child anymore (when it's already read), so the timestamp is then moved forward to the next oldest unread reply. As I do not care too much about an accurate timestamp for something made up, I just adjusted my test case accordingly. Good enough for me. :-)
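
The timestamp shift from point 2 can be reproduced with a small sketch, again using Python's sqlite3 module. The is_read column, the :unread_only parameter and the sample rows are assumptions for illustration and are not taken from the archive. The sketch only looks at direct replies to a missing root, not at the whole tree.

```python
import sqlite3

# Missing roots: subjects that no stored message has as its hash. The
# timestamp is made up from the oldest direct reply that passes the filter.
MISSING_ROOTS = """
SELECT subject AS hash, min(created_at) AS created_at
FROM messages
WHERE subject > ''
  AND subject NOT IN (SELECT hash FROM messages)
  AND (:unread_only = 0 OR is_read = 0)
GROUP BY subject
ORDER BY subject;
"""


def missing_roots(conn, unread_only):
    """Return (hash, created_at) for every referenced but absent parent."""
    params = {"unread_only": int(unread_only)}
    return conn.execute(MISSING_ROOTS, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; is_read is an invented column name.
    CREATE TABLE messages (
        hash       TEXT PRIMARY KEY,
        subject    TEXT NOT NULL DEFAULT '',
        created_at TEXT NOT NULL,
        is_read    INTEGER NOT NULL DEFAULT 0
    );
""")
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", [
    # Parent "gone" is not stored; its oldest reply is already read.
    ("r1", "gone", "2024-09-24T10:00:00Z", 1),
    ("r2", "gone", "2024-09-24T10:30:00Z", 0),
    ("r3", "gone", "2024-09-24T11:00:00Z", 0),
    # Parent "old" is not stored either; its only reply is read.
    ("o1", "old", "2024-09-20T08:00:00Z", 1),
])

all_roots = missing_roots(conn, unread_only=False)
unread_roots = missing_roots(conn, unread_only=True)
print(all_roots)
print(unread_roots)
```

With the unread filter on, the made-up root for "gone" takes the timestamp of its oldest unread reply instead of its oldest reply, and "old" drops out because nothing unread refers to it.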

It's an interesting experiment with SQLite so far. I certainly did learn a few things along the way. Mission accomplished.