# I am the Watcher. I am your guide through this vast new twtiverse.
#
# Usage:
# https://watcher.sour.is/api/plain/users View list of users and latest twt date.
# https://watcher.sour.is/api/plain/twt View all twts.
# https://watcher.sour.is/api/plain/mentions?uri=:uri View all mentions for uri.
# https://watcher.sour.is/api/plain/conv/:hash View all twts for a conversation subject.
#
# Options:
# uri Filter to show a specific user's twts.
# offset Start index for query.
# limit Count of items to return (going back in time).
#
# twt range = 1 50
# self = https://watcher.sour.is/conv/momapxa
Anyone know of a tool that will crawl a website, run JavaScript, and then save the resulting DOM as HTML?
I tried Wpull, but I can't get it to stop crashing on startup and development seems to have stopped.
I'm sure there's a joke to be made about Python here.
@mckinley Wait what?! ... Why?! What are you doing? Are you trying to write a web search engine / crawler? 😅
@prologic I'm trying to make a static local mirror of MDN Web Docs. It's all free information on GitHub, but the whole system is extremely complicated.
<tinfoil-hat>I think it's so they can sell more MDN Plus subscriptions, making people use their terrible MDN Offline system that uses the local storage of your browser.</tinfoil-hat>
At this point, I'm willing to run a local dev server and just save each generated page and its dependencies.
I really only need it to run JavaScript so it can request the browser compatibility JSON. It's https://github.com/mdn/browser-compat-data but the MDN server, annoyingly, transforms it.
Once the BCD data is rendered statically, I should be able to remove the references to the JavaScript. That will solve another issue I'm having where the JavaScript is constantly trying to download `/api/v1/whoami`, which seemingly has no purpose aside from user tracking.
Doing it this way will also solve *another* issue I'm having. You actually can "build" the site and you get almost all the information in static files. However, all the links have capitalization, e.g. `/en-US/docs/Web/CSS/border`, and all the filenames are in lowercase, e.g. `/en-us/docs/web/css/border`.
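A minimal sketch of how that case mismatch could be worked around with a local dev server that folds request paths to lowercase before hitting the filesystem; the `./build` directory and port are assumptions, not details from the thread:

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumption: the lowercase build output of the MDN content sits in ./build.
	fs := http.FileServer(http.Dir("./build"))

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Links in the pages are mixed-case (/en-US/docs/Web/CSS/border) while the
		// files on disk are lowercase (/en-us/docs/web/css/border), so fold the
		// request path before handing it to the file server.
		r.URL.Path = strings.ToLower(r.URL.Path)
		fs.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```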
@mckinley Ahh I see 👌 Well fortunately the tool I linked above looks like it might do the trick for you 🤗
@mckinley I was personally considering writing `man` pages for HTML elements at some point. `man` pages are very cool!
@prologic It's close, but it's just a Web scraping library. I'm looking for something of the command line variety.
@mckinley I could probably write a command-line tool for you that uses that library, if you tell me what basic things you need it to do?
@prologic That's awfully nice of you, but you don't need to do that. I know you're a busy guy.
I'm sure I can find something if I look around some more. I can't be the only one that wants to make a static mirror of a dynamic website.
@mckinley Yes, yes, but if this library will do the trick I'm happy to write said tool for you and, who knows, anyone else in need?
@prologic What I need it to do is crawl a website, executing JavaScript along the way and saving the resulting DOMs to HTML files. It isn't necessary to save the files downloaded via XHR and the like, but I would need it to save page requisites: CSS, JavaScript, favicons, etc.
Something that I'd like to have, but isn't required, is mirroring of content (+ page requisites) in frames. (Example) This would involve spanning hosts, but I only need to span hosts for this specific purpose.
It would also be nice if the program could resolve absolute paths to relative paths (`/en-US/docs/Web/HTML/Global_attributes` -> `../../Global_attributes`), but this isn't required either. I think I'm going to have to have a local Web server running anyway, because just about all the links are to directories with an `index.html`. (i.e. the actual file referenced by `/en-US/docs/Web/HTML/Global_attributes` is `/en-US/docs/Web/HTML/Global_attributes/index.html`.)
Now I've just realized that if `/en-US/docs/Web/HTML/Global_attributes` is saved with that filename, the Web server is probably going to send the wrong MIME type. Wget solves this with `--adjust-extension`.
Man, you really don't have to do this...
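On the absolute-to-relative rewriting, a small sketch of the path math in Go; the linking page used here is hypothetical, since the thread only gives the target path and the expected result:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relativize turns an absolute URL path into one relative to the directory of
// the page that links to it. filepath.Rel works on URL-style paths here
// because both sides use "/" separators (on Unix).
func relativize(fromPage, target string) (string, error) {
	return filepath.Rel(filepath.Dir(fromPage), target)
}

func main() {
	// The linking page below is an assumption for illustration; only the target
	// path appears in the thread.
	rel, err := relativize(
		"/en-US/docs/Web/HTML/Element/input/image",
		"/en-US/docs/Web/HTML/Global_attributes",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(rel) // ../../Global_attributes
}
```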
@mckinley I'll think about it -- It probably isn't that much code to write 🤞
If I can get a proper static copy of MDN, I'll make a torrent and share a magnet link here. I know I'm not the only one who wants something like this. I don't think the file sizes will be so bad. My current "build" of the entire site is sitting at 1.36 GiB. (Only a little more than double the size of `node_modules`!) So, with browser compatibility data and such, I think it'll still be less than 2 GiB.
Aggressively compressed with `bzip2 -9`, it's only 114.29 MiB. A compression ratio of 0.08. That blows my mind.
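That ratio checks out: 1.36 GiB is about 1392.6 MiB, and 114.29 / 1392.6 ≈ 0.082.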
@mckinley I can confirm the library "does what it says on the tin" 👌 I'll put up my little CLI tool for you to play with; it's pretty damn stupid and basic right now, as I'm not yet really sure how to flesh this out. Will need you to guide me on this; there are probably a fair few nuances to writing a decent web mirroring tool. (At least it does the right thing and handles dynamic content rendered with JavaScript -- which I tested by hitting my files.mills.io web app, which has a pure JS frontend using MithrilJS.)
Let's see where this can go... it's a good use of the chromedp library 👌
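For reference, a minimal chromedp sketch of the core trick -- navigate, let the page's JavaScript run, then dump the rendered DOM; the MDN URL is just the example page from earlier in the thread:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp drives a headless Chrome instance.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://developer.mozilla.org/en-US/docs/Web/CSS/border"),
		chromedp.WaitReady("body"),        // wait for the document to render
		chromedp.OuterHTML("html", &html), // grab the post-JavaScript DOM
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(html) // the static snapshot to save to disk
}
```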
Who knows... Maybe I can finally build a proper web crawler? 🤔
Crap, it doesn't support JavaScript 🤣 🤦‍♂️
@mckinley I've made a few more commits to `mirror` -- but sadly it's not currently as good as I'd hoped. Turns out mirroring the structure of websites is rather tricky? Maybe you have some tips to help? 😅 Anyway, give it a whirl, it's very much pre-alpha.
@ocdtrekkie I bet it’s the “typo transport protocol suite” 🥴
@prologic Thank you, I'll give it a try a little later. It looks very promising.