# I am the Watcher. I am your guide through this vast new twtiverse.
# 
# Usage:
#     https://watcher.sour.is/api/plain/users              View list of users and latest twt date.
#     https://watcher.sour.is/api/plain/twt                View all twts.
#     https://watcher.sour.is/api/plain/mentions?uri=:uri  View all mentions for uri.
#     https://watcher.sour.is/api/plain/conv/:hash         View all twts for a conversation subject.
# 
# Options:
#     uri     Filter to show a specific user's twts.
#     offset  Start index for query.
#     limit   Count of items to return (going back in time).
# 
# twt range = 1 57
# self = https://watcher.sour.is/conv/37xr3ra
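A minimal sketch of calling the plain endpoints above from Go; the example feed URL and the assumption that the response is newline-separated plain text are mine, not part of the API docs above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Ask for the last 10 twts of a single feed; uri, offset and limit are
	// the options documented above (the feed URL here is just an example).
	q := url.Values{}
	q.Set("uri", "https://twtxt.net/user/prologic/twtxt.txt")
	q.Set("offset", "0")
	q.Set("limit", "10")

	resp, err := http.Get("https://watcher.sour.is/api/plain/twt?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body)) // plain-text response, presumably one twt per line
}
```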
I just built a poc search engine / crawler for Twtxt. I managed to crawl this pod (twtxt.net) and a couple of others (sorry @etux and @xuu I used your pods in the tests too!). So far so good. I _might_ keep going with this and see what happens 😀
@prologic @etux @xuu This is the result so far in this very quick piece of code:

$ ./twtxt-search-engine
...
All done!
Found 14909 twts in 344 feeds
@prologic @etux @xuu Now I want to remove the "domain" restriction, add a rate-limit and _try_ to crawl as much of the Twtxt wider network as I can and see how deep it goes 🤔
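Later in the thread @prologic mentions the crawler takes a shortcut using colly, so here is a minimal sketch of what dropping the domain restriction and adding a rate limit could look like with colly v2; the delay and parallelism values are arbitrary, and the seed feed is just an example:

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// No AllowedDomains option, so the collector will follow feeds on any host.
	c := colly.NewCollector()

	// Be polite to small self-hosted pods: cap concurrency per host and add
	// a delay between requests (values here are arbitrary).
	_ = c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       500 * time.Millisecond,
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("fetched", r.Request.URL, "-", len(r.Body), "bytes")
		// parse twts here and c.Visit(...) any newly discovered feeds
	})

	// Seed feed to start from; any twtxt feed URL works.
	_ = c.Visit("https://twtxt.net/user/prologic/twtxt.txt")
}
```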
@prologic Cool!
@lyse @prologic very curious... i worked on a very similar track. i built a spider that will trace off any follows, comments and mentions from other users and came up with:

twters:  744
total:  52073
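A minimal sketch of the mention-scanning step described above, assuming yarn-style `@<nick url>` mention syntax; the example twt line and feed URL are made up:

```go
package main

import (
	"fmt"
	"regexp"
)

// Yarn-style twtxt mentions look like "@<nick url>"; capture the URL part so
// a spider can queue newly discovered feeds.
var mentionRe = regexp.MustCompile(`@<(\S+) (\S+)>`)

// feedsMentionedIn returns every feed URL mentioned in a single twt line.
func feedsMentionedIn(twt string) []string {
	var urls []string
	for _, m := range mentionRe.FindAllStringSubmatch(twt, -1) {
		urls = append(urls, m[2])
	}
	return urls
}

func main() {
	// Hypothetical twt line; the feed URL is a placeholder.
	twt := "2021-06-01T00:00:00Z\t@<etux https://example.com/user/etux/twtxt.txt> nice work!"
	fmt.Println(feedsMentionedIn(twt))
}
```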
@lyse @xuu Hmmm very interesting! Let me put my code up somewhere
@prologic It is pretty basic, and depends on some local changes i am still working out on my branch.. https://gist.github.com/JonLundy/dc19028ec81eb4ad6af74c50255e7cee
@xuu I _think_ what I have put together last night is a little different... 🤔 https://gist.github.com/prologic/c64a00affbf14eb3a508ce43ffce1cbb. -- What you've got is a lot more code and looks way more polished 🤗 At a high-level what does yours do?
Ahh I don't think your code actually _crawls_ the Twtxt space right? Just parses URLs given to it and adds them to a database file?
It _might_ be worthwhile combining the two approaches and _actually_ building an honest-to-goodness search engine and crawler for twtxt? 🤔 🤣
@prologic yeah it reads a seed file. I'm using mine. it scans for any mention links and then scans them recursively. it reads from http/s or gopher. i don't have much of a db yet.. it just writes the feed to disk and checks modified dates.. but I will add a db that has hashes/mentions/subjects and such.
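Not the code in @xuu's gist, just a minimal HTTP-only sketch of the approach described above: read a seed file, fetch each feed, count twts, and recurse into any feeds it mentions. The `seeds.txt` path and the mention regexp are assumptions; gopher support and modified-date checks are left out:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
	"strings"
)

// Assumed yarn-style "@<nick url>" mention syntax.
var mentionRe = regexp.MustCompile(`@<\S+ (\S+)>`)

// crawl fetches a feed, counts its twts, and recurses into any feed URLs it
// mentions. "seen" stops loops between feeds that mention each other.
func crawl(feedURL string, seen map[string]bool, twts *int) {
	if seen[feedURL] || !strings.HasPrefix(feedURL, "http") {
		return // already visited, or not http/s (gopher left out here)
	}
	seen[feedURL] = true

	resp, err := http.Get(feedURL)
	if err != nil || resp.StatusCode != http.StatusOK {
		return // dead feed, skip it
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	for _, line := range strings.Split(string(body), "\n") {
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and metadata comments
		}
		*twts++
		for _, m := range mentionRe.FindAllStringSubmatch(line, -1) {
			crawl(m[1], seen, twts)
		}
	}
}

func main() {
	// Seed file: one feed URL per line (path is hypothetical).
	f, err := os.Open("seeds.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	seen := map[string]bool{}
	twts := 0
	s := bufio.NewScanner(f)
	for s.Scan() {
		crawl(strings.TrimSpace(s.Text()), seen, &twts)
	}
	fmt.Printf("feeds: %d\ntwts: %d\n", len(seen), twts)
}
```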
@prologic the add function just scans everything recursively.. but the idea is to just add any new mentions then have a cron to update all known feeds
Wait... So you actually wrote a more elaborate crawler without taking a shortcut like I did using colly (_not that it really helps much_) Hmmm? 🤔 Can we take it a bit further, make a daemon/server out of it, a web interface to search what it crawls using bleve and some tools (_API, Web UI_) to let people add more "feeds" to crawl? 🤔
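A minimal sketch of the bleve side of that idea, using the bleve v2 API; the document shape, IDs and feed URLs below are illustrative only:

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

// Twt is the document we index: the raw text plus the feed it came from.
type Twt struct {
	Feed string `json:"feed"`
	Text string `json:"text"`
}

func main() {
	// On-disk index; the path is arbitrary (New fails if it already exists).
	index, err := bleve.New("twts.bleve", bleve.NewIndexMapping())
	if err != nil {
		panic(err)
	}
	defer index.Close()

	// In the real thing these would come from the crawler; IDs and feed
	// URLs here are made up.
	_ = index.Index("twt-1", Twt{Feed: "https://example.com/user/prologic/twtxt.txt",
		Text: "I just built a poc search engine / crawler for Twtxt."})
	_ = index.Index("twt-2", Twt{Feed: "https://example.com/user/xuu/twtxt.txt",
		Text: "i built a spider that will trace off any follows and mentions"})

	// Full-text query, e.g. what a /search?q=... endpoint would run.
	req := bleve.NewSearchRequest(bleve.NewQueryStringQuery("crawler"))
	res, err := index.Search(req)
	if err != nil {
		panic(err)
	}
	for _, hit := range res.Hits {
		fmt.Println(hit.ID, hit.Score)
	}
}
```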
@prologic sounds about right. I tend to try to build my own before pulling in libs. learn more that way. I was looking at using it as a way to build my twt mirroring idea. and testing the lex parser with a wide ranging corpus to find edge cases. (the pgp signed feeds for one)
@prologic in theory shouldn't need to let users add feeds.. if they get mentioned by a tracked feed they will get added automagically. on a pod it would just need to scan the twtxt feed to know about everyone.
@xuu This is true!
As a quick experiment, I modified my code to remove the domain restrictions and lo and behold:

All done!
Crawled 516 feeds
Found 52464 twts
Found 736 feeds

The Twtxt network is larger than I thought. A significant number of feeds no longer work, obviously, but that's okay, we can prune dead feeds out.
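A minimal sketch of that pruning step: probe each known feed and drop anything that errors out or returns a non-2xx status. The feed list here is a placeholder, and some servers may not answer HEAD requests:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// alive reports whether a feed still answers; anything that errors out or
// returns a non-2xx status counts as dead.
func alive(client *http.Client, feedURL string) bool {
	resp, err := client.Head(feedURL)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// In practice this list would come from the crawler's database; these
	// URLs are placeholders.
	feeds := []string{
		"https://twtxt.net/user/prologic/twtxt.txt",
		"https://example.com/gone/twtxt.txt",
	}

	var live []string
	for _, f := range feeds {
		if alive(client, f) {
			live = append(live, f)
		}
	}
	fmt.Printf("kept %d of %d feeds\n", len(live), len(feeds))
}
```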