# I am the Watcher. I am your guide through this vast new twtiverse.
# 
# Usage:
#     https://watcher.sour.is/api/plain/users              View list of users and latest twt date.
#     https://watcher.sour.is/api/plain/twt                View all twts.
#     https://watcher.sour.is/api/plain/mentions?uri=:uri  View all mentions for uri.
#     https://watcher.sour.is/api/plain/conv/:hash         View all twts for a conversation subject.
# 
# Options:
#     uri     Filter to show a specific user's twts.
#     offset  Start index for query.
#     limit   Count of items to return (going back in time).
# 
# twt range = 1 10
# self = https://watcher.sour.is/conv/qlgshhq
Hmmmm, I somehow run into an encoding problem where my inserted data end up mangled in the database. But, both SQLite and Go use UTF-8. What's happening here? :-?
@lyse One of them is lying. >:-)

Mangled in what way? What does it look like? 🤔
@movq Non-ASCII characters were broken. Like U+2028, degrees (°), etc.

Turns out I used a silly library to detect the encoding and transform to UTF-8 if needed. When there is no Content-Type header, like for local files, it looks at the first 1024 bytes. Since it only saw ASCII in that region, the damn thing assumed the data to be in Windows-1252 (which for web pages kinda makes sense):

// TODO: change default depending on user's locale?
return charmap.Windows1252, "windows-1252", false

https://cs.opensource.google/go/x/net/+/master:html/charset/charset.go;l=102

This default is hardcoded and cannot be changed.
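
To illustrate the kind of mangling this produces, here is a minimal sketch that decodes UTF-8 bytes one byte per character. It is not the library's code: `misdecodeAsLatin1` is a hypothetical helper, and it uses Latin-1, which only agrees with Windows-1252 for bytes 0xA0 to 0xFF. That is enough to show what happens to the degree sign (UTF-8 bytes 0xC2 0xB0), but not to U+2028.

```go
package main

import "fmt"

// misdecodeAsLatin1 treats every byte of s as one code point, which is how a
// single-byte decoder reads UTF-8 input. Hypothetical helper for illustration;
// it matches Windows-1252 only for bytes in the range 0xA0-0xFF.
func misdecodeAsLatin1(s string) string {
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		runes = append(runes, rune(s[i]))
	}
	return string(runes)
}

func main() {
	original := "21 °C"
	// The two UTF-8 bytes of the degree sign become two separate characters.
	fmt.Printf("%q -> %q\n", original, misdecodeAsLatin1(original))
}
```

ASCII passes through unchanged, which is presumably why a detector that only sees ASCII in the first 1024 bytes has nothing to go on.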

Trying to be smart and adding automatic support for other encodings turned out to be a bad move on my end. At least I can reduce my dependency list again. :-)

I now just reject everything that explicitly specifies a media type other than text/plain or a charset other than utf-8 (ignoring case). Otherwise I assume it's UTF-8 (just like the twtxt file format specification mandates) and hope for the best.
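
A check like that could be sketched with the standard library alone. This is an assumption-laden illustration, not the actual code: `acceptContentType` is a hypothetical name, and rejecting a malformed header is a choice made for the sketch.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// acceptContentType reports whether a feed with the given Content-Type header
// value should be read as UTF-8. Hypothetical helper for illustration.
func acceptContentType(header string) bool {
	// No header at all (e.g. a local file): assume UTF-8 as the spec mandates.
	if strings.TrimSpace(header) == "" {
		return true
	}
	// ParseMediaType lowercases the media type and the parameter names.
	mediaType, params, err := mime.ParseMediaType(header)
	if err != nil {
		return false // assumption: treat a malformed header as a rejection
	}
	if mediaType != "text/plain" {
		return false
	}
	// The charset is optional; if present, it must be utf-8 in any casing.
	charset, ok := params["charset"]
	return !ok || strings.EqualFold(charset, "utf-8")
}

func main() {
	for _, h := range []string{
		"",
		"text/plain",
		"text/plain; charset=UTF-8",
		"text/plain; charset=windows-1252",
		"text/html; charset=utf-8",
	} {
		fmt.Printf("%q accepted: %v\n", h, acceptContentType(h))
	}
}
```

With this, no detection library is needed: anything that passes is handed to the parser as UTF-8 bytes.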
@lyse Ouch. 🥴 Well, jenny always decodes as UTF-8 (because the spec says so) and this hasn’t caused any issues – yet.