# I am the Watcher. I am your guide through this vast new twtiverse.
#
# Usage:
# https://watcher.sour.is/api/plain/users View list of users and latest twt date.
# https://watcher.sour.is/api/plain/twt View all twts.
# https://watcher.sour.is/api/plain/mentions?uri=:uri View all mentions for uri.
# https://watcher.sour.is/api/plain/conv/:hash View all twts for a conversation subject.
#
# Options:
# uri Filter to show a specific user's twts.
# offset Start index for query.
# limit Count of items to return (going back in time).
#
# twt range = 1 80
# self = https://watcher.sour.is?uri=https://twtxt.stackeffect.de/stackeffect.txt&offset=80
@movq OK, to be more specific: it does, to the point of adding twts to the correct file.
I've not checked actual file rotation. With max_twts_per_rotation
set to 100 and me posting ~ once a week, the first rotation will take place in two years ;-)
@movq
> I feel like README will need a rework soon. There's a lot of options now. Or maybe a manpage instead.
For example, that local_twtxt_dir
MUST end in a path separator should be mentioned somewhere ;-)
@movq Works like a charm!
@movq Great work! I wish we could make all those BIG twtxt writers use it ;-)
I've a problem with local_twtxt_file
not being supported any more. Being forced to use twtxt.txt
as the file name breaks at least my URL.
@movq I always understood it as good practice to catch hardware errors early.
@movq Indeed! I'm sorry for that!
@movq Manpage says
> The user is supposed to run it manually or via a periodic system
> service. The recommended period is a month but could be less.
So me doing it weekly is a bit overcautious. Users often overlook that they are supposed to perform this task regularly.
@movq Don't forget to btrfs scrub
e.g. once a week.
I'm using btrfs scrub -B /dev/xyz
and mail the result to myself.
@prologic Very nice board and figures. Do they actually fit in the drawer?
@movq Thank you very much for implementing this! It's very useful (at least for me)!
@adi What about this one?
SRCFILES = $(wildcard *)
# strip any existing .gz suffix so *.gz files don't yield *.gz.gz targets
CLEANSRC = $(SRCFILES:.gz=)
DSTFILES = $(addsuffix .gz, $(CLEANSRC))
%.gz: %
	gzip -c $< > $@
all: $(DSTFILES)
You must not have subdirectories in that folder, though.
@xuu Well, the point is, things do not work like this.
Actually in nano you would have to ctrl-k ctrl-k ctrl-x y to discard your reply.
@movq I don't buy your example (rebasing behaviour), sorry.
Writing a twt is more similar to writing a commit message. Git does quite some checks to detect that nothing **new** was written and happily discards a commit if you just leave the editor. You don't need any special action, just quit your editor. Git will take care of the rest.
But it's OK as it is. I just didn't expect that I have to select and delete all to discard a twt. So it's C-x h C-w C-x C-c for me.
@movq Yes, this may be enough to check.
I only know this "feature" from my revision control software, where I get "abort: empty commit message" or "Aborting commit due to empty commit message" when I do not change whatever is already in there. That can be quite some text about which files changed and so on.
@movq My workflow is as follows.
I hit "reply" hotkey and my editor comes up.
With or without writing something I close my editor **without saving the content**.
Of course I close it by C-x C-c, not by :q! ;-)
Jenny finds the temp file unchanged, i.e. its content is the same as it was when my editor was started. I would like jenny to discard the reply then.
Autosaving is no problem either. Real editors do this to a temporary (kind of backup) file. Only in case of a crash is that file consulted, and the user is asked if she would like to continue with the stored content.
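The desired discard behaviour could be sketched like this (a minimal sketch with a hypothetical helper name, not jenny's actual code): hash the prefilled template before launching the editor and compare afterwards; if nothing changed, drop the reply.

```python
import hashlib
import os
import subprocess
import tempfile

def edit_and_maybe_discard(template, editor):
    """Write the prefilled reply to a temp file, launch the editor, and
    return None (discard) if the content is unchanged afterwards.
    Hypothetical helper, not jenny's actual implementation."""
    fd, path = tempfile.mkstemp(suffix=".eml")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(template)
        before = hashlib.sha256(template.encode()).hexdigest()
        # editor is an argv prefix, e.g. ["emacs"] or ["nano"]
        subprocess.run([*editor, path])
        with open(path) as f:
            text = f.read()
        after = hashlib.sha256(text.encode()).hexdigest()
        return None if after == before else text
    finally:
        os.unlink(path)
```

Comparing hashes (or just the raw content) sidesteps mtime-based checks, which autosaving editors can trip up.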
@movq Your scenario would produce the observed behaviour, agreed. On the other hand, I'm sure I've set every URL's lasttwts entry > 1630000000.0 (manually, in my editor).
But I can't reproduce any weird behaviour right now. I've tried to "blackhole" twt.nfld.uk temporarily. That does not have any effect.
I've also tried to force twt.nfld.uk to deliver an empty twtxt. That does not have any effect either.
So I guess everything is fine with jenny.
I have wrapped jenny in a shell script to keep ~/.cache/jenny under version control. This way I have better data if anything unexpected shows up again.
@prologic I've deleted eleven and utf8test; https://search.twtxt.net is the only follower. Maybe you can stop it from following those twtxts? They were meant for testing purposes only.
Funny bug in an LG TV: last Saturday I scheduled a film (airing yesterday) for recording. The actual recording yesterday started 1 hour late. Looks like although the TV knows the actual time perfectly well, it was not capable of "translating" the schedule from CEST to CET.
@movq Yes, it was exactly those twts. I don't think I've managed to "match" the downtime while fetching twts. But even if I had, how can this lead to inserting old twts?
@movq Another feature request: sometimes I start writing a twt but then would like to discard it. It would be great if jenny could detect that I did not write (or save) anything and then discard the twt instead of creating an "empty" one.
@movq Today I had unexpected old twts after jenny -f.
I now have jenny's cache under revision control, automatically committing changes after each fetch. Let's see if this helps finding a (possible) bug.
@movq What do you think about this?
diff --git a/jenny b/jenny
index b47c78e..20cf659 100755
--- a/jenny
+++ b/jenny
@@ -278,7 +278,8 @@ def prefill_for(email, reply_to_this, self_mentions):
def process_feed(config, nick, url, content, lasttwt):
nick_address, nick_desc = decide_nick(content, nick)
url_for_hash = decide_url_for_hash(content, url)
- new_lasttwt = parse('1800-01-01T12:00:00+00:00').timestamp()
+ # new_lasttwt = parse('1800-01-01T12:00:00+00:00').timestamp()
+ new_lasttwt = None
for line in twt_lines_from_content(content):
res = twt_line_to_mail(
@@ -296,7 +297,7 @@ def process_feed(config, nick, url, content, lasttwt):
twt_stamp = twt_date.timestamp()
if lasttwt is not None and lasttwt >= twt_stamp:
continue
- if twt_stamp > new_lasttwt:
+ if not new_lasttwt or twt_stamp > new_lasttwt:
new_lasttwt = twt_stamp
mailname_new = join(config['maildir_target'], 'new', twt_hash)
@movq I just observed unexpected old twts coming back.
It looks like lasttwts is reset to -5364619200.0 every time no new content was fetched, for example when if-modified-since did not produce new twts?
@lyse I'm seeing your response as a reply to #p522joq, where it doesn't seem to belong. Did this happen by accident or is there a bug hiding somewhere?
@prologic I'm seeing your response as a reply to #p522joq, where it doesn't seem to belong. Did this happen by accident or is there a bug hiding somewhere?
@movq Ha, but when you control lastmods, lastseen and lasttwts it's easy to test.
Works like a charm!
@movq Not that easy to test when pods honor if-modified-since ;-)
I've almost only timestamps -5364619200.0...
Diff looks good to me!
@movq
I'll test it tomorrow. Thanks for starting this feature!
@prologic
> (#el7d3ja) I believe glob() is an O(n) algorithm
Yes, I see. But don't underestimate OS caching for files and directories!
If you look up files in the same directory many times, the OS may use cached results from earlier lookups.
I'm not totally sure, but I believe this is how things work for both Windows and Linux at least.
@movq
When I look in my twtxt maildir for duplicated messages, they all have F in their name.
I see that in mail_file_exists jenny does not consider flagged messages when testing if a message already exists.
I understand that looking up only 12 combinations is faster than reading huge directories. I'm astonished that globbing would be slower. Learning something new every day...
@movq
I just pulled it, works like a charm (as expected) ;-)
@movq
I'm not a Python programmer, so please bear with me.
The doc about encodings also mentions:
> If you require a different encoding, you can manually set the Response.encoding property
Wouldn't that be a one-liner like (Ruby example)?
'some text'.force_encoding('utf-8')
I understand that you do not want to interfere with requests. On the other hand, we know that received data must be UTF-8 (by the twtxt spec), and it's a burden for "publishers" to somehow add a charset property to the content-type header. But again, I'm not sure what "the right thing to do" (TM) is.
@prologic @movq
Exactly, you see the correct UTF-8 encoded version (even with content-type: text/plain leaving out the charset declaration).
After following the utf8test twtxt myself I now see that jenny does not handle it as UTF-8 when the charset is missing from the HTTP header, just like @quark has observed.
So should jenny always treat twtxt files as UTF-8 encoded? I'm not sure about this.
@lyse
Sorry, I should have mentioned your twt #vjjdara where you already described the same idea.
@prologic
> I believe Yarn assumes utf-8 anyway which is why we don’t see encoding issues
Are you sure? I think in #kj2c5oa @quark mentioned exactly that problem. My logs say "jenny/latest" was fetching my twtxt for quark.
All I did to fix this was adding AddCharset utf-8 .txt to .htaccess. In particular, I did not change the encoding of stackeffect.txt.
@movq
Don't miss step 0 (I should have made this a separate point): a meta header promising that twts are appended with strictly monotonically increasing timestamps.
> (Also, I’d first like to see the pagination thingy implemented.)
In jenny I would like to see "don't process previously fetched twts" AKA "Allow the user to archive/delete old twts" feature implemented ;-)
What about a meta header for setting charset?
I myself stumbled upon .txt files not being delivered with charset: utf-8 by default.
I had to set/modify .htaccess to correct that.
It would have been easier if there had been a charset header entry "overriding" what the HTTP server is delivering.
What do you think?
My thoughts about range requests
Additionally to pagination also range request should be used to reduce traffic.
I understand that there are corner cases making this a complicated matter.
I would like to see a meta header saying that the given twtxt is append only with increasing timestamps so that a simple strategy can detect valid content fetched per range request.
1. read meta part per range request
2. read last fetched twt at expected range (as known from last fetch)
3. if fetched content starts with expected twt then process rest of data
4. if fetched content doesn't start with expected twt discard all and fall back to fetching whole twtxt
Pagination (e.g. archiving old content in a different file) will lead to point 4.
Of course especially pods should support range requests, correct @prologic?
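The four steps above could be sketched like this (hypothetical helpers, assuming the append-only meta header exists; this is not implemented in any client): the interesting part is the pure overlap check for steps 3 and 4.

```python
import requests

def validate_range_chunk(chunk, last_known_line):
    """Steps 3 and 4: trust the partial content only when it starts with
    the last twt line we already processed.  Returns a tuple
    (new_content, need_full_refetch)."""
    if chunk.startswith(last_known_line):
        return chunk[len(last_known_line):], False
    return "", True

def fetch_new_twts(url, byte_offset, last_known_line):
    """Steps 1-4 wired together (hypothetical helper)."""
    resp = requests.get(url, headers={"Range": "bytes=%d-" % byte_offset},
                        timeout=30)
    resp.encoding = "utf-8"
    if resp.status_code != 206:      # server ignored the Range header
        return resp.text             # we already got the whole file
    new, need_refetch = validate_range_chunk(resp.text, last_known_line)
    if not need_refetch:
        return new                   # step 3: only the genuinely new twts
    full = requests.get(url, timeout=30)   # step 4: full fallback
    full.encoding = "utf-8"
    return full.text
```

Archiving or rewriting the file makes the overlap check fail, which triggers exactly the fallback described in point 4.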
My thoughts about pagination (paging)
Following the discussion about pagination (paging) I think that's the right thing to do.
Fetching the same content again and again with only a marginal portion of actually new twts is unbearable and does not scale in any way. It's not only a waste of bandwidth, but with an increasing number of fetchers it will also become a problem for pods to serve all requests.
Because it's so easy to implement and simple to understand, splitting the twtxt file into parts with next and prev pointers seems a really amazing solution.
As in RFC 5005, there should also be a meta header pointing to the **main** URL, e.g. current or baseurl or something like that. This way hashes can be calculated correctly even for archived twts.
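A client-side sketch of this scheme (the field names 'prev' and 'url' are assumptions here, not a finalized spec): parse '# key = value' comment fields from the top of the file and follow prev pointers back through the archive, keeping the main URL for hashing.

```python
def parse_meta(content):
    """Collect '# key = value' comment fields from the top of a twtxt
    file; the meta block ends at the first non-comment line."""
    meta = {}
    for line in content.splitlines():
        if not line.startswith("#"):
            break
        body = line.lstrip("#").strip()
        if "=" in body:
            key, _, value = body.partition("=")
            meta[key.strip()] = value.strip()
    return meta

def walk_archive(fetch, start_url, max_pages=50):
    """Follow 'prev' pointers through archived pages.  Each page is
    returned with the feed's main URL ('url' field) so twt hashes can
    be computed against it even for archived twts."""
    url, pages = start_url, []
    while url and len(pages) < max_pages:
        content = fetch(url)
        meta = parse_meta(content)
        pages.append((meta.get("url", start_url), content))
        url = meta.get("prev")
    return pages
```

The max_pages cap is just a safety net against accidental prev-pointer loops.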
@movq
> I’m curious, what is your use case for deleting twts?
Not just deleting, also sorting into other folders is impossible.
It also doesn't scale in the long term. When I cannot delete twts, I keep a full copy of every twtxt I follow - forever. That's a waste of bandwidth and disk space.
@movq How is deletion supposed to work? In mutt I deleted by D~d>1m and then fetched by !jenny -f. This brings back all deleted twts. Isn't lastmods used to skip older twts?
@prologic
No, it would be sufficient to skip avatar discovery when metadata does contain an avatar.
@prologic
Thank you, that's the correct one.
Still I have this in my logs (first access of "eleven" by yarnd):
ip.ip.ip.ip - - [21/Oct/2021:20:05:36 +0000] "GET /eleven.txt HTTP/2.0" 200 344 "-" "yarnd/0.2.0@46bea3f (Pod: twtxt.net Support: https://twtxt.net/support)"
ip.ip.ip.ip - - [21/Oct/2021:20:05:36 +0000] "HEAD /avatar.png HTTP/2.0" 200 0 "-" "yarnd/0.2.0@46bea3f (Pod: twtxt.net Support: https://twtxt.net/support)"
And I guess without avatar.png sitting there I would have seen even more requests, like /eleven.txt/avatar.png.
I've copied stackeffect.png to avatar.png to make yarnd happy when accessing stackeffect.txt.
So in this setup yarnd fetched eleven.txt along with avatar.png which belongs to another twtxt. This feels buggy.
@quark No client, those were created using date -Is and emacs. Of course everything is UTF-8 encoded, but now Apache also announces content-type: text/plain; charset=utf-8
@prologic Are you sure? The avatar file announced in my twtxt.txt was never fetched. Only non-existing default avatars were fetched.
@movq What I would really like to see is jenny using HTTP range requests to fetch only new content.
E.g. it could refetch only the last twtxt line of the previous request to make sure it starts off at the correct position.
I guess there are twtxt files that only grow; for those this will save a lot of bandwidth over time.
For twtxt files that "forget" older content, this situation would be detected, and as a fallback the whole twtxt file could then be fetched.
@eldersnake Maybe they are just lurking (and learning)?
;-)
@prologic I would like to see "header" lines in twtxt.txt parsed.
Personally I started looking at some twtxt files with curl and saw information about avatar images.
I assumed that to be sort of standard and mentioned my avatar image in my stackeffect.txt. But it was not "avatar.png".
Later I saw in logfiles that the info was totally ignored and instead several "avatar.png" locations were tried by the pulling side.
If the information in the "header" of the twtxt file were respected, one could easily change the avatar file to one with a new filename and there would be no caching problem.
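Respecting the header could look like this (a sketch; the 'avatar' field name and helper are assumptions, since no standard is settled): prefer a declared avatar URL and only fall back to probing the conventional avatar.png location next to the feed.

```python
import re

def avatar_url(twtxt_content, feed_url):
    """Prefer an avatar declared in the feed's comment header; only
    fall back to guessing avatar.png beside the feed file.
    Hypothetical helper, not how yarnd currently works."""
    m = re.search(r"^#\s*avatar\s*=\s*(\S+)", twtxt_content, re.MULTILINE)
    if m:
        return m.group(1)      # no discovery requests needed at all
    # fallback: conventional location next to the feed
    return feed_url.rsplit("/", 1)[0] + "/avatar.png"
```

With the declared URL honored, rotating to a freshly named avatar file busts caches for free, as described above.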