[HN Gopher] HTTrack Website Copier
___________________________________________________________________
HTTrack Website Copier
Author : yamrzou
Score : 105 points
Date : 2025-03-18 17:30 UTC (5 hours ago)
(HTM) web link (www.httrack.com)
(TXT) w3m dump (www.httrack.com)
| hosteur wrote:
| Related: https://news.ycombinator.com/item?id=27789910
| shuri wrote:
| Time to add AI mode to this :).
| hombre_fatal wrote:
| I saw this on Twitter, and it came to mind when I read the title:
| https://same.new/
|
| Of course, it does a from-scratch reimplementation of a single
| web page, but it might be related enough to be interesting
| here.
| LinuxBender wrote:
| You jest, but AI could look up the site's IP on Shodan and a
| few other services, then get a shell, elevate to root, and
| just use GNU tar to back up the site, including daemon
| configurations, assuming it's not well hidden behind a CDN
| and is leaking the origin servers.
| shiomiru wrote:
| htcrack.ai is available...
| ksec wrote:
| Not sure of the context for why this is on HN, but it surely
| put a smile on my face. I used to use it during the 56K era,
| when I would just download everything and read it. Basically
| using it as RSS before RSS was a thing.
| pixelesque wrote:
| Used Teleport Pro myself back then.
| fsiefken wrote:
| Oh wow, I had forgotten all about it, but I used it too! That
| and pavuk, with its regular expressions on the command line.
| https://tenmax.wordpress.com/
| tamim17 wrote:
| Yeah, that brings back memories.
| bigiain wrote:
| Interestingly, the most recent commit and release on their
| GitHub are from March 11, 2025, so it's clearly still
| maintained.
|
| I remember using it, it must have been in 2012 or 2013, to
| automatically make static sites out of WordPress. We had a
| bank department as a client with a non-negotiable requirement
| that they be able to use WordPress to manage their site, along
| with an IT policy that absolutely forbade WordPress (or PHP or
| MySQL or even Linux) on public-facing servers. So we had an
| intranet-only WordPress site that got scraped six times a day
| and published as static HTML to an IT-approved public Windows
| web server.
| icameron wrote:
| I've used it a few times to "secure" an old but relevant
| dynamic website. Like a site for a mature project that
| shouldn't disappear from the internet, but where it's not
| worth upgrading five-year-old code that won't pass our "cyber
| security audit" due to unsupported versions of PHP or Rails.
| So we just convert it to a static site and delete the
| database. Everything pretty much works fine on the front end,
| and the CMS functionality is no longer needed. It's great for
| that niche use case.
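| For reference, the basic HTTrack invocation for that kind of
| one-shot conversion (a sketch; the URL and output directory
| are placeholders) looks something like:
|
| httrack https://example.com -O ./mirror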
| cluckindan wrote:
| You could also do that with plain wget.
| NewEntryHN wrote:
| Is this
|
| wget --mirror
|
| ?
| DaSHacka wrote:
| I never really understood the appeal of HTTrack over wget; it
| seems wget can do almost everything, and it's almost always
| already installed.
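| For example, the usual full-mirror incantation (a sketch; the
| URL is a placeholder) is something like:
|
| wget --mirror --convert-links --adjust-extension \
|     --page-requisites --no-parent https://example.com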
| solardev wrote:
| This doesn't really work with most sites anymore, does it? It
| can't run JavaScript (unlike headless browsers driven by
| Playwright or Puppeteer, for example), has limited support for
| more modern protocols, etc.?
|
| Any suggestions for an easy way to mirror modern web content,
| like an HTTrack for the enshittified web?
| vekatimest wrote:
| ArchiveBox works decently with JavaScript and uses a headless
| browser; it can be deployed with Docker.
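| The Docker quickstart (roughly per the ArchiveBox docs; check
| current usage before relying on it) is something like:
|
| docker run -v $PWD:/data -it archivebox/archivebox \
|     add 'https://example.com'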
| n3storm wrote:
| Which JavaScript-dependent content/information sites do you
| actually find, though? I mostly come across marketing-oriented
| sites, or app-style interaction that uses JS so heavily it
| can't be archived... but otherwise...
| superjan wrote:
| A few years ago my workplace got rid of our on-premises
| install of FogBugz. I tried to clone the site with HTTrack,
| but it did not work due to client-side JavaScript and
| authentication issues.
|
| I was familiar with C#/WebView2 and used that: generate the
| URLs, load the pages one by one, wait for the browser to build
| the HTML, and then save the final page. Intercept and save the
| CSS/image requests along the way.
|
| If you have ever integrated a browser view in a desktop or
| mobile app, you already know how to do this.
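| The same load-wait-save loop, sketched here in Python with
| Playwright rather than C#/WebView2 (the URLs and filenames are
| placeholders, not from the original setup):
|
| # Sketch: render each page in headless Chromium, save the
| # final HTML, and keep any CSS/image responses seen en route.
| from pathlib import Path
| from playwright.sync_api import sync_playwright
|
| urls = ["https://fogbugz.example/case/1"]  # placeholder list
| out = Path("mirror")
| out.mkdir(exist_ok=True)
|
| with sync_playwright() as p:
|     page = p.chromium.launch().new_page()
|     seen = []
|     page.on("response", seen.append)  # collect every response
|     for i, url in enumerate(urls):
|         page.goto(url, wait_until="networkidle")  # let JS run
|         (out / f"page{i}.html").write_text(page.content())
|         for r in seen:
|             if r.request.resource_type in ("stylesheet", "image"):
|                 name = r.url.rstrip("/").split("/")[-1] or "asset"
|                 try:
|                     (out / name).write_bytes(r.body())
|                 except Exception:
|                     pass  # body unavailable for aborted requests
|         seen.clear()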
| jmsflknr wrote:
| Never found a great alternative to this for the Mac.
| 42lux wrote:
| wget
| CharlesW wrote:
| Have you seen SiteSucker? https://ricks-apps.com/osx/sitesucker/
| ryoshu wrote:
| https://formulae.brew.sh/formula/httrack - used it a couple
| months ago
| Hard_Space wrote:
| I used this all the time twenty years ago. I tried it out
| again recently, I think at the suggestion of ChatGPT (!), for
| some archiving, and it actually did some damage.
|
| I do wish there were a modern version of this that could embed
| the videos in some of my old blog posts, so I could save them
| locally in their entirety as something other than an HTML
| mystery blob. None of the archive sites preserve video, and
| neither do extensions like SingleFile. If you're lucky,
| they'll embed a link to the original file, but that won't help
| later when the original posts go offline.
| groby_b wrote:
| ArchiveBox is your friend. (When it doesn't hate you :)
|
| It's pretty good at archiving most web pages; it relies on
| SingleFile and other tools to get the job done. It depends on
| how the video was embedded, but in general it works decently
| well.
| tamim17 wrote:
| Back in the day, I used to download entire websites with
| HTTrack and read them later.
| bruh2 wrote:
| I can't recall the details, but this tool had quite a bit of
| friction the last time I tried downloading a site with it: too
| many new definitions to learn, too many knobs it asks you to
| tweak. I opted to use `wget` with the `--recursive` flag,
| which just did what I expected out of the box: crawl all the
| links it can find and download them. No tweaking needed, and
| nothing new to learn.
| smarx007 wrote:
| I think I had a similar experience with HTTrack. However, wget
| also needs some tweaking to do relatively robust crawls, e.g.
| https://stackoverflow.com/a/65442746/464590
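| A robust-crawl invocation in that spirit (a sketch, not taken
| from the linked answer; example.com is a placeholder):
|
| wget --recursive --level=inf --page-requisites --convert-links \
|     --adjust-extension --no-parent --wait=1 --random-wait \
|     https://example.com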
| alanh wrote:
| How does it compare to SiteSucker
| (<https://ricks-apps.com/osx/sitesucker/index.html>)?
___________________________________________________________________
(page generated 2025-03-18 23:00 UTC)