[HN Gopher] Show HN: Epublifier - scrape pages (books, manuals) ...
       ___________________________________________________________________
        
       Show HN: Epublifier - scrape pages (books, manuals) for offline
       reading
        
       Author : maoserr
       Score  : 229 points
       Date   : 2024-10-21 13:18 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dartharva wrote:
       | Awesome!
        
         | B1FF_PSUVM wrote:
         | It's rather unfair to "first commenters", who got the article
         | up from the pile and left a quick recommendation, to get
         | downvoted by latecomers.
         | 
         | (dartharva's comment was the only thing here when I first
         | looked from the front page)
        
       | mati365 wrote:
       | Is it legal?
        
         | reaperducer wrote:
         | Depends on where you live.
         | 
         | Where I am, it's perfectly legal.
         | 
         | Before cell service was as widespread as it is today, there
         | were programs that would scrape web pages into ePUBs so you
         | could read them later on your Palm Pilot. I used it every day
         | during my commute. And the best part was that they ended. No
         | mind-numbing infinite scroll.
         | 
         | When I switched to a "smart" phone (SonyEricsson m600c), I
         | really missed it.
        
           | thesuitonym wrote:
           | I wouldn't want to go back, because having instant access to
           | anything is pretty amazing, but I do miss those days of
           | offline internet.
        
             | richardlblair wrote:
             | Fully agree. I recently replaced my doomscrolling with a
             | retro handheld and it really makes me happy. It also pushed
             | me to pick up my ereader again.
             | 
             | I spend enough time at a computer that I shouldn't really
             | need a smartphone outside of 'I need to message ___' or 'I
             | need to go ___'
        
           | latchkey wrote:
           | Danger Hiptop had a proxy that reformatted websites for its
           | built-in browser, mostly as a way to reduce data transfer
           | amounts.
           | 
           | https://medium.com/@chrisdesalvo/the-future-that-everyone-
           | fo...
        
           | anthk wrote:
           | If you have a GNU/Linux/Mac/BSD machine with Python:
           | 
           | https://sr.ht/~lioploum/offpunk/
        
         | Tepix wrote:
         | If you can read it on a website, why not on an ebook reader?
         | 
         | If you start selling the resulting files, now that would be a
         | copyright violation. German law has a right to create a
         | "Privatkopie", i.e. a private copy. I guess this is similar
         | to fair use in US law?
        
       | stronglikedan wrote:
       | If this can handle those sites where every section is behind an
       | accordion that must be expanded (and especially where it
       | collapses other sections when you expand one), then this is going
       | to be awesome.
        
         | dotancohen wrote:
         | Can it remove popups for newsletters, subscriptions, logins,
         | or cookie notifications? Can it read pages that require
         | signing in?
        
           | maoserr wrote:
           | It extracts the main content using Readability by default (you
           | can configure it to use something else). Logins would depend
           | on how you're parsing. It has two modes: it either browses to
           | the page inside the window (for non-refreshing pages), or
           | retrieves it in the background using fetch.
        
             | dotancohen wrote:
             | Terrific, thank you.
        
         | maoserr wrote:
         | Works on this site: https://docs.ray.io/en/latest/ for me.
        
       | bloopernova wrote:
       | E-Reader makers, take note. This is a cool feature that should be
       | built in or at least able to be used with an API to get content
       | onto the Kindle/etc. Or even a "send to Kindle" email address
       | that can accept URLs too.
        
         | andai wrote:
         | I wonder if this would have a positive or negative effect on
         | profits.
         | 
         | On the one hand, they'd be adding a massive amount of free
         | content to a platform where they make money because people pay
         | to consume content.
         | 
         | On the other hand, it might actually increase sales simply
         | because I'd spend more time using it, which would presumably
         | result in more book purchases too.
         | 
         | (Also Kindle store is already full of $0 public domain stuff,
         | so they already don't seem too bothered by that possibility.)
        
           | joseda-hg wrote:
           | Huh, didn't know that. I guess I never assumed they would
           | bother with it. I think of a published work on Kindle like a
           | product page on Amazon, so it doesn't make sense to have $0
           | items.
           | 
           | Are they offered by Amazon itself, or do third parties take
           | the time to set that up?
        
             | andai wrote:
             | It's on Amazon, tons of public domain stuff republished for
             | $0 on Kindle. 1 click to "purchase" (free download).
        
         | bryanrasmussen wrote:
         | You have this with the Remarkable sort of -
         | https://remarkable.com/blog/introducing-read-on-remarkable
        
         | 39896880 wrote:
         | Kobo has Pocket integration, is this substantially different?
        
       | solarkraft wrote:
       | Neat!
       | 
       | I once made a simple version of this concept that saves an epub
       | file on the server's file system, which is then synced to my
       | e-book reader:
       | 
       | https://github.com/solarkraft/webpub
       | 
       | The main ingredient is Postlight Parser, which gives a simplified
       | "document" view for a website.
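
For readers wondering what "saving an epub file" involves once the content has been extracted: a minimal sketch, assuming a single chapter of already well-formed XHTML. This is not how webpub or Epublifier build their files; the function name `make_minimal_epub`, the file layout under `OEBPS/`, the placeholder identifier, and the fixed modification date are all illustrative assumptions. The one firm rule shown is that the `mimetype` entry comes first in the archive and is stored uncompressed.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def make_minimal_epub(title: str, body_xhtml: str) -> bytes:
    """Pack one chapter into an EPUB 3 container and return the bytes.

    body_xhtml is assumed to be well-formed XHTML (e.g. "<p>Hello</p>").
    """
    t = escape(title)
    chapter = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{t}</title></head>
<body>{body_xhtml}</body>
</html>
"""
    nav = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol>
<li><a href="chapter1.xhtml">{t}</a></li>
</ol></nav></body>
</html>
"""
    # Placeholder identifier and date: a real tool would generate these.
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>{t}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine><itemref idref="ch1"/></spine>
</package>
"""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must be first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/nav.xhtml", nav)
        zf.writestr("OEBPS/chapter1.xhtml", chapter)
    return buf.getvalue()


if __name__ == "__main__":
    with open("demo.epub", "wb") as f:
        f.write(make_minimal_epub("Demo", "<p>Hello from a scraped page.</p>"))
```

Real scraped HTML is rarely valid XHTML, so a practical tool would sanitize the extracted content before packing it.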
        
       | kemayo wrote:
       | Having written my own one of these, the interesting thing about
       | this one is really the UI for iterating on extracting content
       | from an arbitrary site. Having a full GUI for working through the
       | extraction is much more flexible than the norm.
        
       | ffsm8 wrote:
       | Heh, I'm currently creating something very similar.
       | 
       | A web scraper (for blogs, but mainly web novels) and ePub parser
       | that persists the data to a database along with categories and
       | tags, plus a companion PWA for offline reading that tracks
       | reading progress on various stories and lets me keep multiple
       | versions of the same story (web novel and published epub).
        
       | stuxnet79 wrote:
       | For those interested in a simple to use command line tool that
       | accomplishes the same I've had success with percollate -
       | https://github.com/danburzo/percollate
        
         | Mkengine wrote:
         | Does it support http://fanfiction.net/ ? I never found an easy
         | solution for that one.
        
         | tra3 wrote:
         | This looks great!! I've long been looking for something that
         | leverages readability (or similar).
         | 
         | Edit: Tried it with Reuters and it looks like percollate
         | requires javascript, etc. Back to using "Print as PDF" from the
         | browser.
        
       | anthk wrote:
       | I had that, but for the terminal under Unix, and for web pages,
       | Gopher and Gemini. Offpunk:
       | 
       | https://sr.ht/~lioploum/offpunk/
       | 
       | Instead of EPUB, content gets cached into text files (Gopher),
       | Gemini files (Gemini) and HTML+images (web pages). You can browse
       | the hierarchy under ~/.cache/offpunk or directly from Offpunk.
       | 
       | With the "tour" function, forget about doomscrolling. You'll read
       | all the articles in text mode sequentially until you're done.
        
       | Mkengine wrote:
       | Does it support http://fanfiction.net/ ? I never found an easy
       | solution for that one.
        
         | maoserr wrote:
         | You can import a csv of all the chapter links, looks like it's
         | just incremental numbering in the url
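
A minimal sketch of generating such a CSV, assuming the chapter number is the only part of the URL that changes. The URL template, the story ID, and the `url` column header are hypothetical placeholders, not Epublifier's documented import format; check the extension's UI for the columns it actually expects.

```python
import csv
import io


def chapter_urls(template: str, first: int, last: int) -> list[str]:
    """Expand a URL template containing {n} into one URL per chapter."""
    if first > last:
        raise ValueError("first must not exceed last")
    return [template.format(n=n) for n in range(first, last + 1)]


def urls_to_csv(urls: list[str]) -> str:
    """Serialize URLs as CSV text, one per row, under an assumed header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url"])  # assumed column name, not a documented one
    writer.writerows([u] for u in urls)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical story URL; replace with the real pattern from the site.
    template = "https://example.com/s/12345/{n}/"
    print(urls_to_csv(chapter_urls(template, 1, 3)), end="")
```

The chapter count still has to be looked up by hand (or from the story's index page) before calling `chapter_urls`.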
        
           | t-3 wrote:
           | The issue is most likely Cloudflare blocking most of the best
           | scraping methods. If the site can be pulled down with e.g.
           | wget or curl without a bunch of options that you definitely
           | aren't writing by hand, pandoc can just be used to directly
           | make an epub.
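
The download-then-pandoc route could be sketched like this, assuming the chapters are already saved locally as HTML files and pandoc is installed. The helper names and file names are hypothetical; the pandoc options used (`-f`, `-t`, `-o`, `--metadata`) are standard ones, and pandoc concatenates multiple input files in the order given.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_epub_command(pages: list[Path], output: Path, title: str) -> list[str]:
    """Build a pandoc invocation that merges HTML files into one EPUB."""
    if not pages:
        raise ValueError("need at least one HTML file")
    return [
        "pandoc",
        "-f", "html",
        "-t", "epub",
        "--metadata", f"title={title}",
        "-o", str(output),
        *[str(p) for p in pages],
    ]


def run_pandoc(pages: list[Path], output: Path, title: str) -> None:
    """Run the conversion; raises if pandoc is missing or exits non-zero."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_epub_command(pages, output, title), check=True)


if __name__ == "__main__":
    # Hypothetical file names for pages previously fetched with wget or curl.
    chapters = [Path("ch1.html"), Path("ch2.html")]
    print(" ".join(pandoc_epub_command(chapters, Path("story.epub"), "My Story")))
```

This does nothing about bot blocking; it only covers the conversion step once the HTML is on disk.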
        
         | pasc1878 wrote:
         | I use a calibre add-in
         | https://www.mobileread.com/forums/showthread.php?t=259221
         | 
         | It sort of works, i.e. some stories just work, others only get
         | the first page.
        
         | seridescent wrote:
         | you can export epubs from https://fichub.net/
        
         | kemayo wrote:
         | Fanfiction.net is trivial... apart from it having Cloudflare
         | bot blocking turned up to aggressive levels. I've not seen an
         | approach that works, other than using headless browsers to
         | fetch the content.
        
           | theultdev wrote:
           | headless browsers won't work by default for cloudflare
           | captchas.
           | 
           | open source stealth plugins don't really work now either.
           | 
           | you have to use real browser fingerprints.
        
       | vivzkestrel wrote:
       | Gonna love running this on all the documentation-heavy websites
       | like AWS, VueJS, MDN, w3schools, realpython, betterstack.
        
       | 3abiton wrote:
       | This is an amazing tool! Long gone are the days when I used to
       | force cache many webpages for offline travels.
        
       | KaoruAoiShiho wrote:
       | Calibre supports a massive list of sites.
       | 
       | https://github.com/JimmXinu/FanFicFare
       | 
       | https://github.com/JimmXinu/FanFicFare/wiki/SupportedSites
        
       | noam_compsci wrote:
       | Every so often, I want to get an epub of Paul Graham's essays
       | (e.g. right before a flight). Hopefully I'll remember to use this.
        
       ___________________________________________________________________
       (page generated 2024-10-21 23:00 UTC)