[HN Gopher] HTTrack Website Copier
___________________________________________________________________
HTTrack Website Copier
Author : yamrzou
Score : 105 points
Date : 2025-03-18 17:30 UTC (5 hours ago)
(HTM) web link (www.httrack.com)
(TXT) w3m dump (www.httrack.com)
| hosteur wrote:
| Related: https://news.ycombinator.com/item?id=27789910
| shuri wrote:
| Time to add AI mode to this :).
| hombre_fatal wrote:
| I saw this on Twitter, and it came to mind when I read the title:
| https://same.new/
|
| Of course, it does a from-scratch reimplementation of a single
| web page, but it might be related enough to be interesting
| here.
| LinuxBender wrote:
| You jest, but AI could look up the site's IP on Shodan and a
| few other services, then get a shell, elevate to root, and
| just use GNU tar to back up the site, including daemon
| configurations, assuming it's not well hidden behind a CDN
| and is leaking the origin servers.
| shiomiru wrote:
| htcrack.ai is available...
| ksec wrote:
| Not sure of the context for why this is on HN, but it surely
| put a smile on my face. I used to use it during the 56K era,
| when I would just download everything and read it. Basically
| using it as RSS before RSS was a thing.
| pixelesque wrote:
| Used Teleport Pro myself back then.
| fsiefken wrote:
| Oh wow, I had forgotten all about it, but I used it too! That
| and pavuk, with its regular expressions on the command line.
| https://tenmax.wordpress.com/
| tamim17 wrote:
| Yeah, that brings back memories.
| bigiain wrote:
| Interestingly, the most recent commit and release on their
| GitHub are from March 11, 2025, so it's clearly still
| maintained.
|
| I remember using it, it must have been in 2012 or 2013, to
| automatically make static sites out of WordPress. We had a
| bank department as a client with a non-negotiable requirement
| that they be able to use WordPress to manage their site, along
| with an IT policy that absolutely forbade WordPress (or PHP or
| MySQL or even Linux) on public-facing servers. So we had an
| intranet-only WordPress site that got scraped six times a day
| and published as static HTML to an IT-approved public Windows
| web server.
| icameron wrote:
| I've used it a few times to "secure" an old but relevant
| dynamic website. Like a site for a mature project that
| shouldn't disappear from the internet, but where it's not
| worth upgrading five-year-old code that won't pass our "cyber
| security audit" due to unsupported versions of PHP or Rails.
| So we just convert it to a static site and delete the
| database. Everything pretty much works fine on the front end,
| and the CMS functionality is no longer needed. It's great for
| that niche use case.
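| For reference, the basic HTTrack invocation for that kind of
| one-shot conversion (a sketch; the URL and output directory
| are placeholders) looks something like:
|
| httrack https://example.com -O ./mirror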
| cluckindan wrote:
| You could also do that with plain wget.
| NewEntryHN wrote:
| Is this
|
| wget --mirror
|
| ?
| DaSHacka wrote:
| I never really understood the appeal of HTTrack over wget; it
| seems wget can do almost everything, and it's almost always
| already installed.
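| For example, the usual full-mirror incantation (a sketch; the
| URL is a placeholder) is something like:
|
| wget --mirror --convert-links --adjust-extension \
|     --page-requisites --no-parent https://example.com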
| solardev wrote:
| This doesn't really work with most sites anymore, does it? It
| can't run JavaScript (unlike headless browsers driven by
| Playwright or Puppeteer, for example), has limited support for
| more modern protocols, etc.?
|
| Any suggestions for an easy way to mirror modern web content,
| like an HTTrack for the enshittified web?
| vekatimest wrote:
| ArchiveBox works decently with JavaScript and uses a headless
| browser; it can be deployed with Docker.
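| The Docker quickstart (roughly per the ArchiveBox docs; check
| current usage before relying on it) is something like:
|
| docker run -v $PWD:/data -it archivebox/archivebox \
|     add 'https://example.com'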
| n3storm wrote:
| Which JavaScript-dependent content/information sites do you
| actually find, though? I mostly come across marketing-oriented
| sites, or app-style interaction that uses JS so heavily it
| can't be archived... but otherwise...
| superjan wrote:
| A few years ago my workplace got rid of our on-premises
| install of FogBugz. I tried to clone the site with HTTrack,
| but it did not work due to client-side JavaScript and
| authentication issues.
|
| I was familiar with C#/WebView2 and used that: generate the
| URLs, load the pages one by one, wait for the browser to build
| the HTML, and then save the final page. Intercept and save the
| CSS/image requests along the way.
|
| If you have ever integrated a browser view in a desktop or
| mobile app, you already know how to do this.
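| The same load-wait-save loop, sketched here in Python with
| Playwright rather than C#/WebView2 (the URLs and filenames are
| placeholders, not from the original setup):
|
| # Sketch: render each page in headless Chromium, save the
| # final HTML, and keep any CSS/image responses seen en route.
| from pathlib import Path
| from playwright.sync_api import sync_playwright
|
| urls = ["https://fogbugz.example/case/1"]  # placeholder list
| out = Path("mirror")
| out.mkdir(exist_ok=True)
|
| with sync_playwright() as p:
|     page = p.chromium.launch().new_page()
|     seen = []
|     page.on("response", seen.append)  # collect every response
|     for i, url in enumerate(urls):
|         page.goto(url, wait_until="networkidle")  # let JS run
|         (out / f"page{i}.html").write_text(page.content())
|         for r in seen:
|             if r.request.resource_type in ("stylesheet", "image"):
|                 name = r.url.rstrip("/").split("/")[-1] or "asset"
|                 try:
|                     (out / name).write_bytes(r.body())
|                 except Exception:
|                     pass  # body unavailable for aborted requests
|         seen.clear()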
| jmsflknr wrote:
| Never found a great alternative to this for the Mac.
| 42lux wrote:
| wget
| CharlesW wrote:
| Have you seen SiteSucker? https://ricks-apps.com/osx/sitesucker/
| ryoshu wrote:
| https://formulae.brew.sh/formula/httrack - used it a couple
| months ago
| Hard_Space wrote:
| I used this all the time twenty years ago. I tried it out
| again recently, I think at the suggestion of ChatGPT (!), for
| some archiving, and it actually did some damage.
|
| I do wish there were a modern version of this that could embed
| the videos in some of my old blog posts, so I could save them
| locally in their entirety as something other than an HTML
| mystery blob. None of the archive sites preserve video, and
| neither do extensions like SingleFile. If you're lucky,
| they'll embed a link to the original file, but that won't help
| later when the original posts go offline.
| groby_b wrote:
| ArchiveBox is your friend. (When it doesn't hate you :)
|
| It's pretty good at archiving most web pages; it relies on
| SingleFile and other tools to get the job done. It depends on
| how the video was embedded, but in general it works decently
| well.
| tamim17 wrote:
| Back in the day, I used to download entire websites with
| HTTrack and read them later.
| bruh2 wrote:
| I can't recall the details, but this tool had quite a bit of
| friction the last time I tried downloading a site with it: too
| many new definitions to learn, too many knobs it asks you to
| tweak. I opted to use `wget` with the `--recursive` flag,
| which just did what I expected out of the box: crawl all the
| links it can find and download them. No tweaking needed, and
| nothing new to learn.
| smarx007 wrote:
| I think I had a similar experience with HTTrack. However, wget
| also needs some tweaking to do relatively robust crawls, e.g.
| https://stackoverflow.com/a/65442746/464590
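| A robust-crawl invocation in that spirit (a sketch, not taken
| from the linked answer; example.com is a placeholder):
|
| wget --recursive --level=inf --page-requisites --convert-links \
|     --adjust-extension --no-parent --wait=1 --random-wait \
|     https://example.com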
| alanh wrote:
| How does it compare to SiteSucker
| (<https://ricks-apps.com/osx/sitesucker/index.html>)?
___________________________________________________________________
(page generated 2025-03-18 23:00 UTC)