[HN Gopher] HTTrack Website Copier
___________________________________________________________________
HTTrack Website Copier
Author : iscream26
Score : 118 points
Date : 2024-10-03 18:53 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| xnx wrote:
| Great tool. Does it still work for the "modern" web (i.e. now
| that even simple/content websites have become "apps")?
| alganet wrote:
| Nope. It is for the classic web (the only websites worth saving
| anyway).
| freedomben wrote:
| Even for the classic web, if a site is behind Cloudflare,
| HTTrack no longer works.
|
| It's a sad point to be at. Fortunately, the SingleFile
| extension still works really well for single pages, even when
| they are built dynamically by JavaScript on the client side.
| There isn't a solution for cloning an entire site though, at
| least not one that I know of.
| alganet wrote:
| If it is Cloudflare's human verification, then httrack will
| have an issue. But in the end it's just a cookie: you can use
| a browser with JS to grab the cookie, then feed it to
| httrack's headers.
|
| If Cloudflare's DDoS protection is the issue, you can
| throttle httrack's requests.
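|
| Roughly like this (an untested sketch: it assumes httrack's
| documented support for a Netscape-format cookies.txt in the
| project folder, and the site URL and cookie value are
| placeholders):
|
|     # hypothetical helper script
|     import pathlib
|     import subprocess
|
|     project = pathlib.Path("mirror")
|     project.mkdir(exist_ok=True)
|
|     # cookie file httrack picks up from the project folder;
|     # paste the cf_clearance value from a real browser session
|     (project / "cookies.txt").write_text(
|         ".example.com\tTRUE\t/\tTRUE\t0\t"
|         "cf_clearance\tPASTE_VALUE_HERE\n"
|     )
|
|     subprocess.run([
|         "httrack", "https://example.com/",
|         "-O", str(project),    # output / project directory
|         "-F", "Mozilla/5.0",   # browser-like user agent (placeholder)
|         "-c2",                 # only 2 simultaneous connections
|         "-A25000",             # cap transfer rate (bytes/s)
|     ], check=True)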
| acheong08 wrote:
| > you can use a browser with JS to grab the cookie, then
| feed it to httrack's headers
|
| They also check your user agent, IP, and JA3 fingerprint
| (and ensure they match the ones that got the cookie), so
| it's not as simple as copying some cookies. This might just
| be for paying customers though, since it doesn't do such
| heavy checks for some sites.
| freedomben wrote:
| Seconded. It seems to depend on the site's settings, and
| those in turn depend heavily on the subscription plan the
| site is on.
| knowaveragejoe wrote:
| I'm aware of this tool, but I'm sure there are caveats in
| terms of "totally" cloning a website:
|
| https://github.com/ArchiveTeam/grab-site
| dark-star wrote:
| Oh wow, that brings back memories. I used httrack in the late
| '90s and early 2000s to mirror interesting websites from the
| early internet, over a modem connection (and early DSL).
|
| Good to know it's still around. However, now that the web is
| much more dynamic, I guess it's not as useful anymore as it
| was back then.
| dspillett wrote:
| _> now that the web is much more dynamic I guess it's not as
| useful anymore as it was back then_
|
| It's also less useful because the web is so easy to access
| now. I remember using it back then to pull things down over
| the university link for reference in my room (1st year, no
| network access at all in rooms) or house (with modem access
| billed per minute).
|
| Of course, sites can still vanish easily these days, so
| having a local copy could be a bonus, but they're just as
| likely to go out of date or get replaced, and if not, they're
| usually archived elsewhere already.
| Alifatisk wrote:
| Good ol' days
| corinroyal wrote:
| One time I was trying to create an offline backup of a
| botanical medicine site for my studies. Somehow I turned off
| the link-depth limit and made it follow offsite links. I
| forgot about it. A few days later the machine crashed due to
| a full disk, from trying to cram as much of the WWW as it
| could onto it.
| rkhassen9 wrote:
| That is awesome.
| Felk wrote:
| Funny seeing this here now, as I _just_ finished archiving an
| old MyBB PHP forum. I used `wget`, though, and it took 2
| weeks and 260 GB of uncompressed disk space (12 GB compressed
| with zstd); the process was not interruptible, so I had to
| start over each time my hard drive got full. Maybe I should
| have given HTTrack a shot to see how it compares.
|
| If anyone wants to know the specifics of how I used wget, I
| wrote them down here:
| https://github.com/SpeedcubeDE/speedcube.de-forum-archive
|
| Also, if anyone has experience archiving similar websites
| with HTTrack and knows how it compares to wget for my use
| case, I'd love to hear about it!
| smashed wrote:
| I've tried both in order to archive EOL websites, and I've
| had better luck with wget; it seems to recognize more
| links/resources and do a better job, so it was probably not
| a bad choice.
| begrid wrote:
| wget2 has an option for parallel downloading:
| https://github.com/rockdaboot/wget2
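|
| Something like this, if I remember the flag right (sketch
| only; the URL is a placeholder):
|
|     import subprocess
|
|     # recursive download with several parallel threads
|     subprocess.run([
|         "wget2", "--recursive", "--max-threads=8",
|         "https://example.com/",
|     ], check=True)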
| criddell wrote:
| Is there a friendly way to do this? I'd feel bad burning
| through hundreds of gigabytes of bandwidth for a non-corporate
| site. Would a database snapshot be as useful?
| dbtablesorrows wrote:
| If you want to customize the scraping, there's the Scrapy
| Python framework. You would still need to download the HTML,
| though.
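|
| A rough Scrapy sketch (spider name, start URL, and the
| fields it keeps are made up for illustration):
|
|     import scrapy
|
|     class ForumSpider(scrapy.Spider):
|         name = "forum"
|         allowed_domains = ["forum.example.com"]
|         start_urls = ["https://forum.example.com/"]
|
|         def parse(self, response):
|             # keep whatever you care about from each page
|             yield {
|                 "url": response.url,
|                 "title": response.css("title::text").get(),
|             }
|             # follow links; offsite ones are filtered out
|             # via allowed_domains
|             for href in response.css("a::attr(href)").getall():
|                 yield response.follow(href, callback=self.parse)
|
| Run it with something like `scrapy runspider forum_spider.py
| -o pages.jsonl`.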
| squigz wrote:
| Isn't bandwidth mostly dirt cheap/free these days?
| criddell wrote:
| It's inexpensive, but sometimes not free. For example,
| Google Cloud Hosting is $0.14 / GB so 260 GB would be
| around $36.
| nchmy wrote:
| It's essentially free on non-extortionate hosts. Use Hetzner
| + Cloudflare and you'll essentially never pay for bandwidth.
| z33k wrote:
| MyBB PHP forums have a web interface through which one can
| download the database as a single .sql file. It will most
| likely be a mess, depending on the addons that were installed
| on the forum.
| Felk wrote:
| Downloading a DB dump and crawling locally is possible, but
| it had two gnarly showstoppers for me with wget: first, the
| forum's posts often link to other posts, and those links are
| absolute, so getting wget to crawl them through localhost is
| hardly easy (a local reverse proxy with content rewriting?).
| Second, the forum and its server were really unmaintained; I
| didn't want to spend a lot of time replicating it locally,
| and preferred to archive it as-is while it was still barely
| running.
| codetrotter wrote:
| > it took 2 weeks and 260GB of uncompressed disk space
|
| Is most of that data there because of a zillion different
| views and sortings of the same posts? That's been the main
| difficulty for me when wanting to crawl some sites. There's
| an almost infinite number of URL permutations, because every
| page has a bunch of links with auto-generated URL parameters
| for various things, which results in retrieving the same data
| over and over again throughout an attempted crawl. And
| sometimes the URL parameters are needed and sometimes not, so
| you can't just strip all URL parameters either.
|
| So then you start adding things to your crawler, like
| starting with the shortest URLs first, and then maybe you
| make it so that whenever you pick the next URL to visit, it
| takes the one most different from what you've seen so far.
| And after that you start adding super-specific rules for
| different paths of a specific site.
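|
| A crude sketch of that kind of frontier ordering (purely
| illustrative, not from any particular crawler):
|
|     import heapq
|     from collections import Counter
|     from urllib.parse import urlsplit, parse_qsl
|
|     frontier = []             # min-heap of (count, length, url)
|     shape_counts = Counter()  # how often each "shape" was queued
|
|     def shape(url):
|         # a URL's "shape": its path plus the *names* of its
|         # query parameters, ignoring the values
|         parts = urlsplit(url)
|         return (parts.path,
|                 frozenset(k for k, _ in parse_qsl(parts.query)))
|
|     def enqueue(url):
|         s = shape(url)
|         # URLs whose shape we've queued many times sink down
|         # the queue; among equals, shorter URLs come first
|         heapq.heappush(frontier,
|                        (shape_counts[s], len(url), url))
|         shape_counts[s] += 1
|
|     def next_url():
|         return heapq.heappop(frontier)[2] if frontier else None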
| Felk wrote:
| The slowdown wasn't due to a lot of permutations, but mostly
| because a) wget just takes a considerable amount of time to
| process large HTML files with lots of links, and b) MyBB has
| a "threaded mode", where each post of a thread gets a
| dedicated page with links to all other posts of that thread.
| The largest thread had around 16k posts, so that's 16k^2 URLs
| to parse.
|
| In terms of possible permutations, MyBB is pretty tame,
| thankfully. Only the forums are sortable, and posts only have
| the regular and the aforementioned threaded mode to view
| them. Even the calendar widget only goes from 1901 to 2030,
| otherwise wget might have crawled forever.
|
| I originally considered excluding threaded mode using wget's
| `--reject-regex` and then just adding an nginx rule later to
| redirect any incoming links of that kind to the normal view
| mode. Basically just saying "fuck it, you only get this
| version". That might be worth a try for your case.
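|
| For reference, the shape of that exclusion (untested sketch;
| it assumes threaded-mode links carry a "mode=threaded" query
| parameter, which may differ between MyBB installs, and the
| forum URL is a placeholder):
|
|     import subprocess
|
|     subprocess.run([
|         "wget", "--mirror", "--convert-links",
|         "--adjust-extension", "--page-requisites",
|         # skip the per-post "threaded mode" pages entirely
|         "--reject-regex", "mode=threaded",
|         "https://forum.example.com/",
|     ], check=True)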
| oriettaxx wrote:
| I don't get it: the last release was in 2017, while on GitHub
| I see more releases...
|
| So, did the developer of the GitHub repo take over and keep
| updating/upgrading it? Very good!
| subzero06 wrote:
| I use this to double-check which of my web app's
| folders/files are publicly accessible.
| woutervddn wrote:
| Also known as: a static site generator for any original
| website platform...
| jregmail wrote:
| I also recommend trying https://crawler.siteone.io/ for web
| copying/cloning.
|
| Real copy of the netlify.com website for demonstration:
| https://crawler.siteone.io/examples-exports/netlify.com/
|
| Sample analysis of the netlify.com website, which this tool can
| also provide:
| https://crawler.siteone.io/html/2024-08-23/forever/x2-vuvb0o...
| alberth wrote:
| I always wonder if this gives false positives for people just
| using the same WordPress template.
| zazaulola wrote:
| Can an archive saved by HTTrack Website Copier be opened
| locally in https://replayweb.page, or do they use different
| save formats?
| suriya-ganesh wrote:
| This saved me a ton back in college in rural India without
| internet in 2015. I would download whole websites from a
| nearby library and read them at home.
|
| I've read PY4E, OSTEP, and pg's essays using this.
|
| I am who I am because of httrack. Thank you.
| superjan wrote:
| I tried the Windows version 2 years ago. The site I copied
| was our on-prem issue tracker (FogBugz) that we had replaced.
| HTTrack did not work because of too much JavaScript
| rendering, and I could not figure out how to make it log in.
| What I ended up doing was embedding a browser (WebView2) in a
| C# desktop app. You can intercept all the images/CSS, and
| after the JavaScript rendering is complete, write out the DOM
| content to an HTML file. Also nice is that you can log in by
| hand if needed, and you can generate all the URLs from code.
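|
| For anyone wanting to script the same idea today, here is a
| similar sketch with Playwright in Python (not what I used;
| the URL and output path are placeholders):
|
|     from pathlib import Path
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         # headless=False so you can log in by hand first
|         browser = p.chromium.launch(headless=False)
|         page = browser.new_page()
|         page.goto("https://issues.example.com/case/1234")
|         input("Log in if needed, then press Enter...")
|         page.wait_for_load_state("networkidle")
|         # dump the DOM after JavaScript rendering to a file
|         Path("case-1234.html").write_text(page.content())
|         browser.close()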
| chirau wrote:
| I use it to download sites with layouts that I like and want
| to use for landing pages and static pages for random
| projects. I strip all the copy and such, and keep the
| skeleton to put my own content in. Most recently link.com,
| column.com and increase.com. I have neither the time nor the
| youth to start with all the JavaScript & React stuff.
| j0hnyl wrote:
| Scammers love this tool. I see it used in the wild quite a bit.
___________________________________________________________________
(page generated 2024-10-04 23:02 UTC)