# I am the Watcher. I am your guide through this vast new twtiverse.
# 
# Usage:
#     https://watcher.sour.is/api/plain/users              View list of users and latest twt date.
#     https://watcher.sour.is/api/plain/twt                View all twts.
#     https://watcher.sour.is/api/plain/mentions?uri=:uri  View all mentions for uri.
#     https://watcher.sour.is/api/plain/conv/:hash         View all twts for a conversation subject.
# 
# Options:
#     uri     Filter to show a specific user's twts.
#     offset  Start index for query.
#     limit   Count of items to return (going back in time).
# 
# twt range = 1 33
# self = https://watcher.sour.is/conv/r7wnzda
@prologic The slashdot feed at https://feeds.twtxt.net/slashdot/twtxt.txt appears to be invalid UTF-8 🥴

$ wget -O foo https://feeds.twtxt.net/slashdot/twtxt.txt && python -c 'open("foo", "rb").read().decode("UTF-8")'

2022-01-10 17:26:36 (309 KB/s) - ‘foo’ saved [499169/499169]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 45143-45144: invalid continuation byte
@movq Really?! 🤦‍♂️ Is this because I don’t set a proper encoding header?
@prologic Hmm, nope, it appears to cut strings on a byte-level instead of codepoint-level. So stuff ends in the middle of a multibyte sequence. 🤔
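A minimal Go sketch of how one might locate such a cut, assuming the feed has already been read into a byte slice. The helper name `firstInvalidOffset` is made up for illustration and is not part of the feeds code base:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidOffset returns the byte offset of the first invalid UTF-8
// sequence in b, or -1 if b is valid UTF-8 throughout.
// (Hypothetical helper, for illustration only.)
func firstInvalidOffset(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// DecodeRune reports an encoding error as RuneError with size 1.
		// A correctly encoded U+FFFD has size 3 and is not flagged here.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	penguin := "🐧"
	// Keep only the first two of the penguin's four bytes, as a byte-level cut would.
	broken := []byte("ab" + penguin[:2] + "c")
	fmt.Println(firstInvalidOffset(broken))          // 2
	fmt.Println(firstInvalidOffset([]byte("ab🐧c"))) // -1
}
```

This mirrors what the Python `decode` call reports: the offset at which a multibyte sequence stops short.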
@movq Would you have time to see what I'm doing wrong at all? 🤔
@prologic I think this is the problem:

https://git.mills.io/yarnsocial/feeds/src/branch/master/feeds.go#L53

I don’t really know how Go works, but this appears to work on a byte level? The following snippet produces similar results:

package main

import "fmt"

func main() {
	foo := "🐧"
	fmt.Printf("borked: [%s]\n", foo[:2])
}

This prints just two bytes from the multibyte penguin emoji and that’s invalid UTF-8.

No idea how to fix this in Go, though. 🤷
@movq Ahh! Thanks for looking into this! I _think_ I know how to fix this 👌 I'll try to fix it tomorrow!
@movq Yes, subscripting a string works on byte level. You would have to use its runes to get the Unicode codepoints, e.g. by looping over it. https://stackoverflow.com/a/18130880
@movq @prologic Didn't try, but something along those lines should do the trick: string([]rune(markdown)[:max])
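For comparison, a small Go sketch of the byte-level cut next to the rune-based one suggested above. The function names `cutBytes` and `cutRunes` and the parameter `limit` are illustrative only, and both functions assume `limit` is within bounds:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutBytes slices on byte offsets, so it may stop inside a multibyte sequence.
// Assumes limit <= len(s).
func cutBytes(s string, limit int) string {
	return s[:limit]
}

// cutRunes converts to runes first, so it always stops on a codepoint boundary.
// Assumes limit is not larger than the number of runes in s.
func cutRunes(s string, limit int) string {
	return string([]rune(s)[:limit])
}

func main() {
	s := "🐧🐧🐧"
	fmt.Println(utf8.ValidString(cutBytes(s, 2))) // false
	fmt.Println(utf8.ValidString(cutRunes(s, 2))) // true
	fmt.Println(cutRunes(s, 2))                   // 🐧🐧
}
```

Note that runes are codepoints, not user-perceived characters, so an emoji built from several codepoints (such as one joined with a zero-width joiner) can still be split, although the result stays valid UTF-8.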
@lyse I believe you are right! That would fix this problem indeed 👌
@movq @prologic I'm trying to fix this right now and getting panics with out of range things in my unit test. I'm on it.
@movq @prologic Obviously, comparing byte lengths but working with rune indices is asking for trouble… I'm now working with rune lengths rather than byte lengths. https://git.mills.io/yarnsocial/feeds/pulls/21
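A hedged sketch of the mismatch described here, not the actual code from the pull request. If the guard checks `len(s)` (bytes) while the slice index counts runes, a string such as "🐧🐧" (8 bytes, 2 runes) with a limit of 5 passes the guard and risks an out-of-range panic. Comparing rune counts avoids that. The name `truncateRunes` is hypothetical:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes returns at most limit runes of s.
// The guard and the slice both count runes, so they cannot disagree.
// (Hypothetical helper, for illustration only.)
func truncateRunes(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if utf8.RuneCountInString(s) <= limit {
		return s // already short enough, nothing to cut
	}
	return string([]rune(s)[:limit])
}

func main() {
	s := "🐧🐧"
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 8 2
	fmt.Println(truncateRunes(s, 5))               // 🐧🐧
	fmt.Println(truncateRunes("🐧🐧🐧", 2))          // 🐧🐧
}
```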
@lyse Oh thank you very much 🙇‍♂️
No worries, @prologic! @movq will report whether it's actually fixed or not. ;-)
@lyse I’m seeing slashdot twts in my timeline now, so it appears to be solved! 👌 Thank you!
@movq Excellent, glad to have made the internet a better place now. ;-) You might have ended up with kind of duplicates because of the change, not sure. Actually, I didn't look at any other code, just that single function.
@lyse Given that *no* twts from those feeds made it to my inbox, because my client couldn’t parse that broken UTF-8: Nope, no duplicates. 😂
@movq Ah, excellent. I assumed that you had been subscribed to that feed for some time and it just broke now.