[HN Gopher] Show HN: Epublifier - scrape pages (books, manuals) ...
___________________________________________________________________
Show HN: Epublifier - scrape pages (books, manuals) for offline
reading
Author : maoserr
Score : 229 points
Date : 2024-10-21 13:18 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dartharva wrote:
| Awesome!
| B1FF_PSUVM wrote:
| It's rather unfair to "first commenters", who got the article
| up from the pile and left a quick recommendation, to get
| downvoted by latecomers.
|
| (dartharva's comment was the only thing here when I first
| looked from the front page)
| mati365 wrote:
| Is it legal?
| reaperducer wrote:
| Depends on where you live.
|
| Where I am, it's perfectly legal.
|
| Before cell service was as widespread as it is today, there
| were programs that would scrape web pages into ePUBs so you
| could read them later on your Palm Pilot. I used it every day
| during my commute. And the best part was that they ended. No
| mind-numbing infinite scroll.
|
| When I switched to a "smart" phone (SonyEricsson m600c), I
| really missed it.
| thesuitonym wrote:
| I wouldn't want to go back, because having instant access to
| anything is pretty amazing, but I do miss those days of
| offline internet.
| richardlblair wrote:
| Fully agree. I recently replaced my doomscrolling with a
| retro handheld and it really makes me happy. It also pushed
| me to pick up my ereader again.
|
| I spend enough time at a computer that I shouldn't really
| need a smartphone outside of 'I need to message ___' or 'I
| need to go ___'
| latchkey wrote:
| Danger Hiptop had a proxy that reformatted websites for their
| built-in browser. Mostly as a way to reduce data transfer
| amounts.
|
| https://medium.com/@chrisdesalvo/the-future-that-everyone-
| fo...
| anthk wrote:
| If you have a GNU/Linux/Mac/BSD machine with Python:
|
| https://sr.ht/~lioploum/offpunk/
| Tepix wrote:
| If you can read it on a website, why not on an ebook reader?
|
| If you start selling the resulting files, now that would be a
| copyright violation. German law has a right to create a
| "Privatkopie", i.e. a private copy. I guess this is similar
| to fair use in US law?
| stronglikedan wrote:
| If this can handle those sites where every section is behind an
| accordion that must be expanded (and especially where it
| collapses other sections when you expand one), then this is going
| to be awesome.
| dotancohen wrote:
| Can it remove popups for newsletters, subscriptions, logins, or
| cookie notifications? Can it read pages that require signing
| in?
| maoserr wrote:
| It extracts the main content using Readability by default (you
| can configure it with something else). Logins would depend on
| how you're parsing. It has two modes: it either browses to
| the page inside the window (for non-refreshing pages), or
| retrieves it in the background using fetch.
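
A rough sketch of the background-fetch mode described above, assuming a page that marks its content with <article> or <main>. This is not Readability and not Epublifier's code: the tag lists and the names MainContentParser, extract_main_text and fetch_main_text are hypothetical, and the heuristic is far cruder than Readability's scoring.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MainContentParser(HTMLParser):
    """Collect text inside <article> or <main>, ignoring page chrome.

    A naive stand-in for a Readability-style extractor (assumption:
    the page wraps its content in one of the CONTAINERS tags).
    """

    CONTAINERS = {"article", "main"}
    SKIPPED = {"script", "style", "nav", "aside", "footer", "form"}

    def __init__(self) -> None:
        super().__init__()
        self.container_depth = 0  # >0 while inside a content container
        self.skip_depth = 0       # >0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.container_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.container_depth:
            self.container_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.container_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the kept text chunks, one per line."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_main_text(url: str) -> str:
    """Download a page without rendering it, then extract its main text."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return extract_main_text(resp.read().decode(charset, errors="replace"))
```

A plain fetch like this sends no browser session cookies, which is presumably why logins depend on the mode: pages behind a sign-in would need the in-window mode.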
| dotancohen wrote:
| Terrific, thank you.
| maoserr wrote:
| Works on this site: https://docs.ray.io/en/latest/ for me.
| bloopernova wrote:
| E-Reader makers, take note. This is a cool feature that should be
| built in or at least able to be used with an API to get content
| onto the Kindle/etc. Or even a "send to Kindle" email address
| that can accept URLs too.
| andai wrote:
| I wonder if this would have a positive or negative effect on
| profits.
|
| On the one hand, they'd be adding a massive amount of free
| content to a platform where they make money because people pay
| to consume content.
|
| On the other hand, it might actually increase sales simply
| because I'd spend more time using it, which would presumably
| result in more book purchases too.
|
| (Also Kindle store is already full of $0 public domain stuff,
| so they already don't seem too bothered by that possibility.)
| joseda-hg wrote:
| Huh, didn't know that. I guess I never assumed they would
| bother with it. I think of a published work on Kindle like a
| product page on Amazon, so it doesn't make sense to have $0
| items.
|
| Are they offered by Amazon, or do third parties take the time
| to set that up?
| andai wrote:
| It's on Amazon, tons of public domain stuff republished for
| $0 on Kindle. 1 click to "purchase" (free download).
| bryanrasmussen wrote:
| You have this with the Remarkable sort of -
| https://remarkable.com/blog/introducing-read-on-remarkable
| 39896880 wrote:
| Kobo has Pocket integration, is this substantially different?
| solarkraft wrote:
| Neat!
|
| I once made a simple version of this concept that saves an epub
| file on the server's file system, which is then synced to my
| e-book reader:
|
| https://github.com/solarkraft/webpub
|
| The main ingredient is Postlight Parser, which gives a simplified
| "document" view for a website.
| kemayo wrote:
| Having written my own one of these, the interesting thing about
| this one is really the UI for iterating on extracting content
| from an arbitrary site. Having a full GUI for working through the
| extraction is much more flexible than the norm.
| ffsm8 wrote:
| Heh, I'm currently creating something very similar.
|
| A web scraper for blogs (mainly web novels etc.) and an EPUB
| parser that persists the data to a database along with categories
| and tags, plus a companion PWA for offline reading that tracks
| reading progress on various stories and lets me keep multiple
| versions of the same story (web novel and published EPUB).
| stuxnet79 wrote:
| For those interested in a simple-to-use command-line tool that
| accomplishes the same thing, I've had success with percollate -
| https://github.com/danburzo/percollate
| Mkengine wrote:
| Does it support http://fanfiction.net/ ? I never found an easy
| solution for that one.
| tra3 wrote:
| This looks great!! I've long been looking for something that
| leverages readability (or similar).
|
| Edit: Tried it with Reuters and it looks like percollate
| requires javascript, etc. Back to using "Print as PDF" from the
| browser.
| anthk wrote:
| I had that, but for the terminal under Unix, and for web pages,
| Gopher and Gemini. Offpunk:
|
| https://sr.ht/~lioploum/offpunk/
|
| Instead of EPUB, content gets cached as text files (Gopher),
| Gemini files (Gemini) and HTML+images (web pages). You can browse
| the hierarchy under ~/.cache/offpunk or directly from Offpunk.
|
| With the "tour" function, forget about doomscrolling. You'll read
| all the articles in text mode sequentially until you're done.
| Mkengine wrote:
| Does it support http://fanfiction.net/ ? I never found an easy
| solution for that one.
| maoserr wrote:
| You can import a CSV of all the chapter links; it looks like
| it's just incremental numbering in the URL.
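
A minimal sketch of generating such a CSV, under two assumptions that are not confirmed in the thread: that chapter URLs look like https://www.fanfiction.net/s/<story_id>/<chapter>/, and that the importer accepts "title" and "url" columns (hypothetical names; check what Epublifier actually expects). The story id 12345 is a placeholder.

```python
import csv
import io


def chapter_urls(story_id: int, chapters: int) -> list[str]:
    """Build one URL per chapter, assuming chapters are numbered from 1."""
    # Assumed URL shape: https://www.fanfiction.net/s/<story_id>/<chapter>/
    base = "https://www.fanfiction.net/s/{story}/{chapter}/"
    return [base.format(story=story_id, chapter=n) for n in range(1, chapters + 1)]


def to_csv(urls: list[str]) -> str:
    """Serialize the links as CSV with hypothetical 'title' and 'url' columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "url"])
    for n, url in enumerate(urls, start=1):
        writer.writerow([f"Chapter {n}", url])
    return buf.getvalue()


if __name__ == "__main__":
    # 12345 and 3 are placeholder values, not a real story.
    print(to_csv(chapter_urls(12345, 3)), end="")
```

Redirecting the output to a file (for example chapters.csv) gives something to try with the import feature mentioned above.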
| t-3 wrote:
| The issue is most likely Cloudflare blocking most of the best
| scraping methods. If the site can be pulled down with e.g.
| wget or curl without a bunch of options that you definitely
| aren't writing by hand, pandoc can just be used to directly
| make an epub.
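
A hedged sketch of that pandoc step, assuming the chapters were already saved as .html files in one directory (by wget, curl or anything else) and that pandoc is installed. The helper names pandoc_command and build_epub are made up for illustration; the -f, -o and --metadata options are standard pandoc flags.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(html_files: list[str], output: str, title: str) -> list[str]:
    """Build a pandoc invocation that joins HTML chapters into one EPUB."""
    # pandoc picks the EPUB writer from the .epub extension of the output file.
    return ["pandoc", "-f", "html", "-o", output,
            "--metadata", f"title={title}", *html_files]


def build_epub(html_dir: str, output: str, title: str) -> bool:
    """Run pandoc over every .html file in html_dir, sorted by file name.

    Returns False without running anything if pandoc is not on PATH or
    the directory contains no HTML files.
    """
    files = sorted(str(p) for p in Path(html_dir).glob("*.html"))
    if not files or shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(files, output, title), check=True)
    return True
```

Because the files are sorted by name, zero-padded names such as ch01.html and ch02.html keep the chapters in reading order.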
| pasc1878 wrote:
| I use a calibre add-in
| https://www.mobileread.com/forums/showthread.php?t=259221
|
| It sort of works, i.e. some stories just work; others only get
| the first page.
| seridescent wrote:
| you can export epubs from https://fichub.net/
| kemayo wrote:
| Fanfiction.net is trivial... apart from it having Cloudflare
| bot blocking turned up to aggressive levels. I've not seen an
| approach that works, other than using headless browsers to
| fetch the content.
| theultdev wrote:
| headless browsers won't work by default for cloudflare
| captchas.
|
| open source stealth plugins don't really work now either.
|
| you have to use real browser fingerprints.
| vivzkestrel wrote:
| Gonna love running this on all the documentation-heavy websites
| like AWS, VueJS, MDN, w3schools, realpython, betterstack
| 3abiton wrote:
| This is an amazing tool! Long gone are the days when I used to
| force cache many webpages for offline travels.
| KaoruAoiShiho wrote:
| Calibre supports a massive list of sites.
|
| https://github.com/JimmXinu/FanFicFare
|
| https://github.com/JimmXinu/FanFicFare/wiki/SupportedSites
| noam_compsci wrote:
| Every so often, I want to get an epub of Paul Graham's essays
| (e.g. right before a flight). Hopefully I'll remember to use this.
___________________________________________________________________
(page generated 2024-10-21 23:00 UTC)