[HN Gopher] HTTrack Website Copier
___________________________________________________________________
HTTrack Website Copier
Author : iscream26
Score : 118 points
Date : 2024-10-03 18:53 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| xnx wrote:
| Great tool. Does it still work for the "modern" web (i.e. now
| that even simple/content websites have become "apps")?
| alganet wrote:
| Nope. It is for the classic web (the only websites worth saving
| anyway).
| freedomben wrote:
| Even for the classic web, if a site is behind Cloudflare,
| HTTrack no longer works.
|
| It's a sad point to be at. Fortunately, the SingleFile
| extension still works really well for single pages, even when
| they are built dynamically by JavaScript on the client side.
| There isn't a solution for cloning an entire site though, at
| least not one that I know of.
| alganet wrote:
| If it is Cloudflare's human verification, then httrack will
| have an issue. But in the end it's just a cookie: you can use
| a browser with JS to grab the cookie, then feed it to
| httrack's headers.
|
| If Cloudflare's DDoS protection is the issue, you can
| throttle httrack's requests.
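|
| Roughly like this (an untested sketch: it assumes httrack's
| documented support for a Netscape-format cookies.txt in the
| project folder, and the site URL and cookie value are
| placeholders):
|
|     # hypothetical helper script
|     import pathlib
|     import subprocess
|
|     project = pathlib.Path("mirror")
|     project.mkdir(exist_ok=True)
|
|     # cookie file httrack picks up from the project folder;
|     # paste the cf_clearance value from a real browser session
|     (project / "cookies.txt").write_text(
|         ".example.com\tTRUE\t/\tTRUE\t0\t"
|         "cf_clearance\tPASTE_VALUE_HERE\n"
|     )
|
|     subprocess.run([
|         "httrack", "https://example.com/",
|         "-O", str(project),    # output / project directory
|         "-F", "Mozilla/5.0",   # browser-like user agent (placeholder)
|         "-c2",                 # only 2 simultaneous connections
|         "-A25000",             # cap transfer rate (bytes/s)
|     ], check=True)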
| acheong08 wrote:
| > you can use a browser with JS to grab the cookie, then
| feed it to httrack's headers
|
| They also check your user agent, IP, and JA3 fingerprint
| (and ensure they match the ones that got the cookie), so
| it's not as simple as copying some cookies. This might just
| be for paying customers though, since it doesn't do such
| heavy checks for some sites.
| freedomben wrote:
| Seconded. It seems to depend on the site's settings, and
| those in turn depend heavily on the subscription plan the
| site is on.
| knowaveragejoe wrote:
| I'm aware of this tool, but I'm sure there are caveats in
| terms of "totally" cloning a website:
|
| https://github.com/ArchiveTeam/grab-site
| dark-star wrote:
| Oh wow, that brings back memories. I used httrack in the late
| '90s and early 2000s to mirror interesting websites from the
| early internet, over a modem connection (and early DSL).
|
| Good to know it's still around. However, now that the web is
| much more dynamic, I guess it's not as useful anymore as it
| was back then.
| dspillett wrote:
| _> now that the web is much more dynamic I guess it's not as
| useful anymore as it was back then_
|
| It's also less useful because the web is so easy to access
| now. I remember using it back then to pull things down over
| the university link for reference in my room (1st year, no
| network access at all in rooms) or house (with modem access
| billed per minute).
|
| Of course, sites can still vanish easily these days, so
| having a local copy could be a bonus, but they're just as
| likely to go out of date or get replaced, and if not, they're
| usually archived elsewhere already.
| Alifatisk wrote:
| Good ol' days
| corinroyal wrote:
| One time I was trying to create an offline backup of a
| botanical medicine site for my studies. Somehow I turned off
| the link-depth limit and made it follow offsite links. I
| forgot about it. A few days later the machine crashed due to
| a full disk, from trying to cram as much of the WWW as it
| could onto it.
| rkhassen9 wrote:
| That is awesome.
| Felk wrote:
| Funny seeing this here now, as I _just_ finished archiving an
| old MyBB PHP forum. I used `wget`, though, and it took 2
| weeks and 260 GB of uncompressed disk space (12 GB compressed
| with zstd); the process was not interruptible, so I had to
| start over each time my hard drive got full. Maybe I should
| have given HTTrack a shot to see how it compares.
|
| If anyone wants to know the specifics of how I used wget, I
| wrote them down here:
| https://github.com/SpeedcubeDE/speedcube.de-forum-archive
|
| Also, if anyone has experience archiving similar websites
| with HTTrack and knows how it compares to wget for my use
| case, I'd love to hear about it!
| smashed wrote:
| I've tried both in order to archive EOL websites, and I've
| had better luck with wget; it seems to recognize more
| links/resources and do a better job, so it was probably not
| a bad choice.
| begrid wrote:
| wget2 has an option for parallel downloading:
| https://github.com/rockdaboot/wget2
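|
| Something like this, if I remember the flag right (sketch
| only; the URL is a placeholder):
|
|     import subprocess
|
|     # recursive download with several parallel threads
|     subprocess.run([
|         "wget2", "--recursive", "--max-threads=8",
|         "https://example.com/",
|     ], check=True)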
| criddell wrote:
| Is there a friendly way to do this? I'd feel bad burning
| through hundreds of gigabytes of bandwidth for a non-corporate
| site. Would a database snapshot be as useful?
| dbtablesorrows wrote:
| If you want to customize the scraping, there's the Scrapy
| Python framework. You would still need to download the HTML,
| though.
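|
| A rough Scrapy sketch (spider name, start URL, and the
| fields it keeps are made up for illustration):
|
|     import scrapy
|
|     class ForumSpider(scrapy.Spider):
|         name = "forum"
|         allowed_domains = ["forum.example.com"]
|         start_urls = ["https://forum.example.com/"]
|
|         def parse(self, response):
|             # keep whatever you care about from each page
|             yield {
|                 "url": response.url,
|                 "title": response.css("title::text").get(),
|             }
|             # follow links; offsite ones are filtered out
|             # via allowed_domains
|             for href in response.css("a::attr(href)").getall():
|                 yield response.follow(href, callback=self.parse)
|
| Run it with something like `scrapy runspider forum_spider.py
| -o pages.jsonl`.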
| squigz wrote:
| Isn't bandwidth mostly dirt cheap/free these days?
| criddell wrote:
| It's inexpensive, but sometimes not free. For example,
| Google Cloud Hosting is $0.14 / GB so 260 GB would be
| around $36.
| nchmy wrote:
| It's essentially free on non-extortionate hosts. Use Hetzner
| + Cloudflare and you'll essentially never pay for bandwidth.
| z33k wrote:
| MyBB PHP forums have a web interface through which one can
| download the database as a single .sql file. It will most
| likely be a mess, depending on the addons that were installed
| on the forum.
| Felk wrote:
| Downloading a DB dump and crawling locally is possible, but
| it had two gnarly showstoppers for me with wget: first, the
| forum's posts often link to other posts, and those links are
| absolute, so getting wget to crawl them through localhost is
| hardly easy (a local reverse proxy with content rewriting?).
| Second, the forum and its server were really unmaintained; I
| didn't want to spend a lot of time replicating it locally,
| and preferred to archive it as-is while it was still barely
| running.
| codetrotter wrote:
| > it took 2 weeks and 260GB of uncompressed disk space
|
| Is most of that data there because of a zillion different
| views and sortings of the same posts? That's been the main
| difficulty for me when wanting to crawl some sites. There's
| an almost infinite number of URL permutations, because every
| page has a bunch of links with auto-generated URL parameters
| for various things, which results in retrieving the same data
| over and over again throughout an attempted crawl. And
| sometimes the URL parameters are needed and sometimes not, so
| you can't just strip all URL parameters either.
|
| So then you start adding things to your crawler, like
| starting with the shortest URLs first, and then maybe you
| make it so that whenever you pick the next URL to visit, it
| takes the one most different from what you've seen so far.
| And after that you start adding super-specific rules for
| different paths of a specific site.
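|
| A crude sketch of that kind of frontier ordering (purely
| illustrative, not from any particular crawler):
|
|     import heapq
|     from collections import Counter
|     from urllib.parse import urlsplit, parse_qsl
|
|     frontier = []             # min-heap of (count, length, url)
|     shape_counts = Counter()  # how often each "shape" was queued
|
|     def shape(url):
|         # a URL's "shape": its path plus the *names* of its
|         # query parameters, ignoring the values
|         parts = urlsplit(url)
|         return (parts.path,
|                 frozenset(k for k, _ in parse_qsl(parts.query)))
|
|     def enqueue(url):
|         s = shape(url)
|         # URLs whose shape we've queued many times sink down
|         # the queue; among equals, shorter URLs come first
|         heapq.heappush(frontier,
|                        (shape_counts[s], len(url), url))
|         shape_counts[s] += 1
|
|     def next_url():
|         return heapq.heappop(frontier)[2] if frontier else None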
| Felk wrote:
| The slowdown wasn't due to a lot of permutations, but mostly
| because a) wget just takes a considerable amount of time to
| process large HTML files with lots of links, and b) MyBB has
| a "threaded mode", where each post of a thread gets a
| dedicated page with links to all other posts of that thread.
| The largest thread had around 16k posts, so that's 16k^2 URLs
| to parse.
|
| In terms of possible permutations, MyBB is pretty tame,
| thankfully. Only the forums are sortable, and posts only have
| the regular and the aforementioned threaded mode to view
| them. Even the calendar widget only goes from 1901 to 2030,
| otherwise wget might have crawled forever.
|
| I originally considered excluding threaded mode using wget's
| `--reject-regex` and then just adding an nginx rule later to
| redirect any incoming links of that kind to the normal view
| mode. Basically just saying "fuck it, you only get this
| version". That might be worth a try for your case.
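|
| For reference, the shape of that exclusion (untested sketch;
| it assumes threaded-mode links carry a "mode=threaded" query
| parameter, which may differ between MyBB installs, and the
| forum URL is a placeholder):
|
|     import subprocess
|
|     subprocess.run([
|         "wget", "--mirror", "--convert-links",
|         "--adjust-extension", "--page-requisites",
|         # skip the per-post "threaded mode" pages entirely
|         "--reject-regex", "mode=threaded",
|         "https://forum.example.com/",
|     ], check=True)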
| oriettaxx wrote:
| I don't get it: the last release was in 2017, while on GitHub
| I see more releases...
|
| So, did the developer of the GitHub repo take over and keep
| updating/upgrading it? Very good!
| subzero06 wrote:
| I use this to double-check which of my web app's
| folders/files are publicly accessible.
| woutervddn wrote:
| Also known as: a static site generator for any original
| website platform...
| jregmail wrote:
| I also recommend trying https://crawler.siteone.io/ for web
| copying/cloning.
|
| Real copy of the netlify.com website for demonstration:
| https://crawler.siteone.io/examples-exports/netlify.com/
|
| Sample analysis of the netlify.com website, which this tool can
| also provide:
| https://crawler.siteone.io/html/2024-08-23/forever/x2-vuvb0o...
| alberth wrote:
| I always wonder if this gives false positives for people just
| using the same WordPress template.
| zazaulola wrote:
| Can an archive saved by HTTrack Website Copier be opened
| locally in https://replayweb.page, or do they use different
| save formats?
| suriya-ganesh wrote:
| This saved me a ton back in college in rural India without
| internet in 2015. I would download whole websites from a
| nearby library and read them at home.
|
| I've read PY4E, OSTEP, and pg's essays using this.
|
| I am who I am because of httrack. Thank you.
| superjan wrote:
| I tried the Windows version 2 years ago. The site I copied
| was our on-prem issue tracker (FogBugz) that we had replaced.
| HTTrack did not work because of too much JavaScript
| rendering, and I could not figure out how to make it log in.
| What I ended up doing was embedding a browser (WebView2) in a
| C# desktop app. You can intercept all the images/CSS, and
| after the JavaScript rendering is complete, write out the DOM
| content to an HTML file. Also nice is that you can log in by
| hand if needed, and you can generate all the URLs from code.
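|
| For anyone wanting to script the same idea today, here is a
| similar sketch with Playwright in Python (not what I used;
| the URL and output path are placeholders):
|
|     from pathlib import Path
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         # headless=False so you can log in by hand first
|         browser = p.chromium.launch(headless=False)
|         page = browser.new_page()
|         page.goto("https://issues.example.com/case/1234")
|         input("Log in if needed, then press Enter...")
|         page.wait_for_load_state("networkidle")
|         # dump the DOM after JavaScript rendering to a file
|         Path("case-1234.html").write_text(page.content())
|         browser.close()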
| chirau wrote:
| I use it to download sites with layouts that I like and want
| to use for landing pages and static pages for random
| projects. I strip all the copy and such, and keep the
| skeleton to put my own content in. Most recently link.com,
| column.com and increase.com. I have neither the time nor the
| youth to start with all the JavaScript & React stuff.
| j0hnyl wrote:
| Scammers love this tool. I see it used in the wild quite a bit.
___________________________________________________________________
(page generated 2024-10-04 23:02 UTC)