# I am the Watcher. I am your guide through this vast new twtiverse.
#
# Usage:
# https://watcher.sour.is/api/plain/users View list of users and latest twt date.
# https://watcher.sour.is/api/plain/twt View all twts.
# https://watcher.sour.is/api/plain/mentions?uri=:uri View all mentions for uri.
# https://watcher.sour.is/api/plain/conv/:hash View all twts for a conversation subject.
#
# Options:
# uri Filter to show a specific user's twts.
# offset Start index for query.
# limit Count of items to return (going back in time).
#
# twt range = 1 50
# self = https://watcher.sour.is/conv/momapxa
Anyone know of a tool that will crawl a website, run JavaScript, and then save the resulting DOM as HTML?
I tried Wpull, but I can't get it to stop crashing on startup and development seems to have stopped.
I'm sure there's a joke to be made about Python here.
@mckinley Wait what?! ... Why?! What are you doing? Are you trying to write a web search engine / crawler? 😅
@prologic I'm trying to make a static local mirror of MDN Web Docs. It's all free information on GitHub, but the whole system is extremely complicated.
<tinfoil-hat>I think it's so they can sell more MDN Plus subscriptions, making people use their terrible MDN Offline system that uses the local storage of your browser.</tinfoil-hat>
At this point, I'm willing to run a local dev server and just save each generated page and its dependencies.
I really only need it to run JavaScript so it can request the browser compatibility JSON. It's https://github.com/mdn/browser-compat-data but the MDN server, annoyingly, transforms it.
Once the BCD data is rendered statically, I should be able to remove the references to the JavaScript. That will solve another issue I'm having where the JavaScript is constantly trying to download `/api/v1/whoami`, which seemingly has no purpose aside from user tracking.
Doing it this way will also solve *another* issue I'm having. You actually can "build" the site and you get almost all the information in static files. However, all the links have capitalization, e.g. `/en-US/docs/Web/CSS/border`, and all the filenames are in lowercase, e.g. `/en-us/docs/web/css/border`.
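A minimal sketch of how that case mismatch could be worked around with a local dev server that folds request paths to lowercase before hitting the filesystem; the `./build` directory and port are assumptions, not details from the thread:

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumption: the lowercase build output of the MDN content sits in ./build.
	fs := http.FileServer(http.Dir("./build"))

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Links in the pages are mixed-case (/en-US/docs/Web/CSS/border) while the
		// files on disk are lowercase (/en-us/docs/web/css/border), so fold the
		// request path before handing it to the file server.
		r.URL.Path = strings.ToLower(r.URL.Path)
		fs.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```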
@mckinley Ahh I see 👌 Well fortunately the tool I linked above looks like it might do the trick for you 🤗
@mckinley I was personally considering writing `man` pages for HTML elements at some point. `man` pages are very cool!
@prologic It's close, but it's just a Web scraping library. I'm looking for something of the command line variety.
@mckinley I could probably write a command-line tool for you that uses that library, if you tell me what basic things you need it to do?
@prologic That's awfully nice of you, but you don't need to do that. I know you're a busy guy.
I'm sure I can find something if I look around some more. I can't be the only one that wants to make a static mirror of a dynamic website.
@mckinley Yes, yes, but if this library will do the trick I'm happy to write said tool for you and, who knows, anyone else in need?
@prologic What I need it to do is crawl a website, executing JavaScript along the way and saving the resulting DOMs to HTML files. It isn't necessary to save the files downloaded via XHR and the like, but I would need it to save page requisites: CSS, JavaScript, favicons, etc.
Something that I'd like to have, but isn't required, is mirroring of content (+ page requisites) in frames. (Example) This would involve spanning hosts, but I only need to span hosts for this specific purpose.
It would also be nice if the program could resolve absolute paths to relative paths (`/en-US/docs/Web/HTML/Global_attributes` -> `../../Global_attributes`), but this isn't required either. I think I'm going to have to have a local Web server running anyway, because just about all the links are to directories with an `index.html`. (i.e. the actual file referenced by `/en-US/docs/Web/HTML/Global_attributes` is `/en-US/docs/Web/HTML/Global_attributes/index.html`.)
Now I've just realized that if `/en-US/docs/Web/HTML/Global_attributes` is saved with that filename, the Web server is probably going to send the wrong MIME type. Wget solves this with `--adjust-extension`.
Man, you really don't have to do this...
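On the absolute-to-relative rewriting, a small sketch of the path math in Go; the linking page used here is hypothetical, since the thread only gives the target path and the expected result:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relativize turns an absolute URL path into one relative to the directory of
// the page that links to it. filepath.Rel works on URL-style paths here
// because both sides use "/" separators (on Unix).
func relativize(fromPage, target string) (string, error) {
	return filepath.Rel(filepath.Dir(fromPage), target)
}

func main() {
	// The linking page below is an assumption for illustration; only the target
	// path appears in the thread.
	rel, err := relativize(
		"/en-US/docs/Web/HTML/Element/input/image",
		"/en-US/docs/Web/HTML/Global_attributes",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(rel) // ../../Global_attributes
}
```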
@mckinley I'll think about it -- It probably isn't that much code to write 🤞
If I can get a proper static copy of MDN, I'll make a torrent and share a magnet link here. I know I'm not the only one who wants something like this. I don't think the file sizes will be so bad. My current "build" of the entire site is sitting at 1.36 GiB. (Only a little more than double the size of `node_modules`!) So, with browser compatibility data and such, I think it'll still be less than 2 GiB.
Aggressively compressed with `bzip2 -9`, it's only 114.29 MiB. A compression ratio of 0.08. That blows my mind.
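That ratio checks out: 1.36 GiB is about 1392.6 MiB, and 114.29 / 1392.6 ≈ 0.082.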
@mckinley I can confirm the library "does what it says on the tin" 👌 I'll put up my little CLI tool for you to play with; it's pretty damn stupid and basic right now, as I'm not yet really sure how to flesh this out. Will need you to guide me on this; there are probably a fair few nuances to writing a decent web mirroring tool. (At least it does the right thing and handles dynamic content rendered with JavaScript -- which I tested by hitting my files.mills.io web app, which has a pure JS frontend using MithrilJS.)
Let's see where this can go... it's a good use of the chromedp library 👌
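For reference, a minimal chromedp sketch of the core trick -- navigate, let the page's JavaScript run, then dump the rendered DOM; the MDN URL is just the example page from earlier in the thread:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp drives a headless Chrome instance.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://developer.mozilla.org/en-US/docs/Web/CSS/border"),
		chromedp.WaitReady("body"),        // wait for the document to render
		chromedp.OuterHTML("html", &html), // grab the post-JavaScript DOM
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(html) // the static snapshot to save to disk
}
```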
Who knows... Maybe I can finally build a proper web crawler? 🤔
Crap, it doesn't support JavaScript 🤣 🤦‍♂️
@mckinley I've made a few more commits to `mirror` -- but sadly it's not currently as good as I'd hoped. Turns out mirroring the structure of websites is rather tricky? Maybe you have some tips to help? 😅 Anyway, give it a whirl, it's very much pre-alpha.
@ocdtrekkie I bet it’s the “typo transport protocol suite” 🥴
@prologic Thank you, I'll give it a try a little later. It looks very promising.