The Watcher

	
# I am the Watcher. I am your guide through this vast new twtiverse.
# 
# Usage:
#     https://watcher.sour.is/api/plain/users              View list of users and latest twt date.
#     https://watcher.sour.is/api/plain/twt                View all twts.
#     https://watcher.sour.is/api/plain/mentions?uri=:uri  View all mentions for uri.
#     https://watcher.sour.is/api/plain/conv/:hash         View all twts for a conversation subject.
# 
# Options:
#     uri     Filter to show a specific user's twts.
#     offset  Start index for query.
#     limit   Count of items to return (going back in time).
# 
# twt range = 1 33
# self = https://watcher.sour.is/conv/r7wnzda

movq

www.uninformativ.de

10 Jan 22 16:29 UTC

@prologic The slashdot feed at https://feeds.twtxt.net/slashdot/twtxt.txt appears to be invalid UTF-8 🥴

$ wget -O foo https://feeds.twtxt.net/slashdot/twtxt.txt && python -c 'open("foo", "rb").read().decode("UTF-8")'

2022-01-10 17:26:36 (309 KB/s) - ‘foo’ saved [499169/499169]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 45143-45144: invalid continuation byte
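
The same check can be sketched in Go, the language the feed generator is written in. This is a minimal sketch, not part of the original thread: `firstInvalid` is a hypothetical helper name, and the sample bytes stand in for the downloaded file, which in practice would be read with `os.ReadFile`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid is a hypothetical helper. It returns the byte offset of the
// first invalid UTF-8 sequence in data, or -1 if data is valid UTF-8.
func firstInvalid(data []byte) int {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune signals an invalid encoding with (RuneError, 1).
		// A correctly encoded U+FFFD has size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	good := []byte("slashdot 🥴")
	// Two bytes of a four-byte sequence followed by a newline,
	// similar to a line cut in the middle of an emoji.
	bad := []byte{'a', 'b', 0xF0, 0x9F, '\n'}
	fmt.Println(firstInvalid(good), firstInvalid(bad))
}
```

Like the Python one-liner, this reports where decoding first fails rather than trying to repair the data.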

prologic

twtxt.net

10 Jan 22 22:26 UTC

@movq Really?! 🤦‍♂️ Is this because I don’t set a proper encoding header?

movq

www.uninformativ.de

11 Jan 22 14:54 UTC

@prologic Hmm, nope, it appears to cut strings on a byte-level instead of codepoint-level. So stuff ends in the middle of a multibyte sequence. 🤔

prologic

twtxt.net

11 Jan 22 15:00 UTC

@movq Would you have time to see what I'm doing wrong at all? 🤔

movq

www.uninformativ.de

11 Jan 22 15:32 UTC

@prologic I think this is the problem:

https://git.mills.io/yarnsocial/feeds/src/branch/master/feeds.go#L53

I don’t really know how Go works, but this appears to work on a byte level? The following snippet produces similar results:

package main

import "fmt"

func main() {
	foo := "🐧"
	fmt.Printf("borked: [%s]\n", foo[:2])
}

This prints just two bytes from the multibyte penguin emoji and that’s invalid UTF-8.

No idea how to fix this in Go, though. 🤷
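
To make the byte-versus-codepoint difference visible, here is a small sketch that extends the snippet above. `describe` is a hypothetical helper introduced only for illustration. It uses `len`, which counts bytes, together with `utf8.RuneCountInString` and `utf8.ValidString` from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper. It reports the byte length, the rune
// count, and the UTF-8 validity of s.
func describe(s string) (bytes, runes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	foo := "🐧" // U+1F427, four bytes in UTF-8

	// foo[:2] slices by byte offset, so it keeps only half of the sequence.
	for _, s := range []string{foo, foo[:2]} {
		b, r, ok := describe(s)
		fmt.Printf("bytes=%d runes=%d valid=%v\n", b, r, ok)
	}
}
```

The full penguin is four bytes and one rune. The sliced string is two bytes and no longer valid UTF-8, which is what the decoder in the first post trips over.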

prologic

twtxt.net

11 Jan 22 15:54 UTC

@movq Ahh! Thanks for looking into this! I _think_ I know how to fix this 👌 I'll try to fix it tomorrow!

lyse

lyse.isobeef.org

11 Jan 22 16:55 UTC+0100

@movq Yes, subscripting a string works on byte level. You would have to use its runes to get the Unicode codepoints, e.g. by looping over it. https://stackoverflow.com/a/18130880
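
A rough sketch of the looping approach: ranging over a Go string yields the byte offset at which each rune starts, so a cut can be placed on a rune boundary. `cutAtRune` is a hypothetical helper, not code from the feeds repository.

```go
package main

import "fmt"

// cutAtRune is a hypothetical helper. It returns the prefix of s that holds
// at most max code points, using the byte offsets reported by range.
func cutAtRune(s string, max int) string {
	if max <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset where the current rune starts
		if count == max {
			return s[:i] // cutting here never splits a multibyte sequence
		}
		count++
	}
	return s // s has max runes or fewer
}

func main() {
	for i, r := range "a🐧b" {
		fmt.Printf("byte offset %d: %c\n", i, r)
	}
	fmt.Printf("[%s]\n", cutAtRune("a🐧b", 2))
}
```

The loop visits byte offsets 0, 1 and 5, because the penguin occupies four bytes. Cutting after two runes therefore keeps `a🐧` intact.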

lyse

lyse.isobeef.org

11 Jan 22 17:05 UTC+0100

@movq @prologic Didn't try, but something along those lines should do the trick: string([]rune(markdown)[:max])
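
That one-liner can be wrapped in a length check, since slicing a rune slice past its end is not safe. The following is only a sketch under assumptions: `truncateRunes` is a hypothetical name, and `markdown` and `max` from the twt above are represented by the parameters `s` and `max`. It is not the code that ended up in feeds.go.

```go
package main

import "fmt"

// truncateRunes is a hypothetical helper. It returns at most max Unicode
// code points of s, and returns s unchanged if it is already short enough.
func truncateRunes(s string, max int) string {
	if max < 0 {
		max = 0
	}
	runes := []rune(s)
	if len(runes) <= max {
		return s
	}
	return string(runes[:max])
}

func main() {
	fmt.Printf("[%s]\n", truncateRunes("🐧 penguin", 1))
	fmt.Printf("[%s]\n", truncateRunes("🐧", 5))
}
```

Both calls print `[🐧]`. The important detail is that the comparison and the slice index are both measured in runes.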

prologic

twtxt.net

11 Jan 22 16:46 UTC

@lyse I believe you are right! That would fix this problem indeed 👌

lyse

lyse.isobeef.org

11 Jan 22 18:00 UTC+0100

@movq @prologic I'm trying to fix this right now and getting panics with out of range things in my unit test. I'm on it.

lyse

lyse.isobeef.org

11 Jan 22 18:55 UTC+0100

@movq @prologic Obviously, comparing byte lengths but working with rune indices is asking for trouble… I'm now working with rune lengths rather than byte lengths. https://git.mills.io/yarnsocial/feeds/pulls/21

prologic

twtxt.net

11 Jan 22 21:35 UTC

@lyse Oh thank you very much 🙇‍♂️

lyse

lyse.isobeef.org

12 Jan 22 19:00 UTC+0100

No worries, @prologic! @movq will report whether it's actually fixed or not. ;-)

movq

www.uninformativ.de

12 Jan 22 21:39 UTC

@lyse I’m seeing slashdot twts in my timeline now, so it appears to be solved! 👌 Thank you!

lyse

lyse.isobeef.org

12 Jan 22 23:15 UTC+0100

@movq Excellent, glad to have made the internet a better place now. ;-) You might have ended up with kind of duplicates because of the change, not sure. Actually, I didn't look at any other code, just that single function.

movq

www.uninformativ.de

13 Jan 22 15:48 UTC

@lyse Given that *no* twts from those feeds made it to my inbox, because my client couldn’t parse that broken UTF-8: Nope, no duplicates. 😂

lyse

lyse.isobeef.org

13 Jan 22 18:35 UTC+0100

@movq Ah, excellent. I assumed that you had been subscribed to that feed for some time and it just broke now.