[HN Gopher] Show HN: I made a tool to clean and convert any webp...
       ___________________________________________________________________
        
       Show HN: I made a tool to clean and convert any webpage to Markdown
        
       My partner usually writes Substack posts, which I then mirror to our
       website's blog section.  To automate this, I made a simple tool to
       scrape the post and clean it up so that I can drop it into our blog
       easily. This might be useful to others as well.  Oh, and of course
       you can instruct GPT to make any final edits :D
        
       Author : asadalt
       Score  : 203 points
       Date   : 2024-04-14 19:03 UTC (3 hours ago)
        
 (HTM) web link (markdowndown.vercel.app)
 (TXT) w3m dump (markdowndown.vercel.app)
        
       | LeonidBugaev wrote:
       | One of those cases where AI isn't needed. There's a well-proven
       | algorithm for extracting content from pages; one implementation:
       | https://github.com/buriy/python-readability
        
         | asadalt wrote:
         | oh AI is optional here. I do use readability to clean the html
         | before converting to .md.
        
         | haddr wrote:
         | Some years ago I compared those boilerplate removal tools, and I
         | remember that jusText gave me the best results out of the box
         | (I tried readability and a few other libraries too). I wonder
         | what the state of the art is today?
        
         | IanCal wrote:
         | How do you achieve the same things without AI here using that
         | tool?
        
           | chrisweekly wrote:
           | "How do you do it without AI" is a question I (sadly) expect
           | to see more often.
        
             | IanCal wrote:
             | Feel free to answer, then: how do you replicate the
             | functions this does with GPT-3/4, but without AI?
             | 
             | Edit -
             | 
             | This is an excellent use of it: a free-text human input
             | capable of doing things like extracting summaries. It does
             | not seem to be used at all for the basic task of extracting
             | content, only for post-filtering.
        
               | cactusfrog wrote:
               | I think "copy from a PDF" could be improved with AI. It's
               | been 30 years and I still get newlines in the middle of
               | sentences when I try to copy from one.
        
               | IanCal wrote:
               | That's a great use case. You might be able to do this if
               | you've got copy and paste on the command line, with
               | 
               | https://github.com/simonw/llm
               | 
               | in between. An alias like pdfwtf translating to "paste |
               | llm _command_ | copy"
        
               | genewitch wrote:
               | i've long assumed that is a "feature" of PDF akin to DRM.
               | Making copying text from a PDF makes sense from a
               | publisher's standpoint.
        
             | hombre_fatal wrote:
             | Meh, it's just the "how does it work?" question. How
             | content extractors work is interesting, and neither obvious
             | nor trivial.
             | 
             | And even when you see how the Readability parser works, AI
             | handles most of the edge cases that content extractors fail
             | on, so they are genuinely superseded by LLMs.
        
         | foundzen wrote:
         | how does it compare to mozilla/readability?
        
           | asadm wrote:
           | it uses readability but does some additional stuff, like
           | relinking images to local paths etc., which I needed
        
         | fbdab103 wrote:
         | I was honestly expecting it to be mostly black magic, but it
         | looks like the meat of the project is a bunch of (surely hard
         | won) regexes. Nifty.
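For a flavor of the approach, here is a hypothetical, stdlib-only sketch of regex-based HTML-to-Markdown substitution (not the project's actual rules, which are far more extensive):

```python
import re

# Ordered (pattern, replacement) rules; order matters because the
# inline rules must run before the catch-all tag stripper.
RULES = [
    (re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.S),
     lambda m: "#" * int(m.group(1)) + " " + m.group(2).strip() + "\n\n"),
    (re.compile(r"<(?:b|strong)[^>]*>(.*?)</(?:b|strong)>", re.S), r"**\1**"),
    (re.compile(r"<(?:i|em)[^>]*>(.*?)</(?:i|em)>", re.S), r"*\1*"),
    (re.compile(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.S), r"[\2](\1)"),
    (re.compile(r"<p[^>]*>(.*?)</p>", re.S), r"\1\n\n"),
    (re.compile(r"<[^>]+>"), ""),  # strip any remaining tags
]

def html_to_md(html: str) -> str:
    for pattern, repl in RULES:
        html = pattern.sub(repl, html)
    return html.strip()
```

Real pages need many more rules (lists, code blocks, tables, entity decoding), which is why production rule sets are "surely hard won".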
        
         | jot wrote:
         | Last time I tried readability it worked well with articles but
         | struggled with other kinds of pages. Took away far more content
         | than I wanted it to.
        
       | ChadNauseam wrote:
       | This is awesome. I kind of want a browser extension that does
       | this to every page I read and saves them somewhere.
        
         | ElFitz wrote:
         | Wouldn't that describe Pocket, Readwise Reader, Matter, or
         | another one of those many apps?
         | 
         | Edit: read too fast. Didn't notice the automatic and systematic
         | aspects.
        
           | BHSPitMonkey wrote:
           | Pocket saves the address; I don't think they save the
           | content.
        
         | ZunarJ5 wrote:
         | Wallabag + Obsidian + Wallabag Browser Ext. It's a manual
         | trigger, but it's great.
        
         | smartmic wrote:
         | My choice (manual): Markdown clipper
         | 
         | https://github.com/deathau/markdown-clipper
         | 
         | I guess there are dozens of alternative extensions available
         | out there ...
        
           | Terretta wrote:
           | This fork:
           | 
           | https://github.com/deathau/markdownload
           | 
           | With extensions available for Firefox, Google Chrome,
           | Microsoft Edge, and Safari.
        
         | Davidbrcz wrote:
         | Singlefile for Firefox:
         | https://addons.mozilla.org/en-US/firefox/addon/single-file/
        
           | kwhitefoot wrote:
           | Also WebScrapBook: https://github.com/danny0838/webscrapbook
        
         | jwoq9118 wrote:
         | Omnivore. Saves a copy using web archive.
         | 
         | https://omnivore.app/
        
       | cratermoon wrote:
       | I've found htmltidy [1] and pandoc [2] html->markdown sufficiently
       | capable.
       | 
       | [1] http://www.html-tidy.org/
       | 
       | [2] https://pandoc.org/
        
         | fbdab103 wrote:
         | Never heard of tidy; this looks promising.
         | 
         | I am kind of tempted/horrified to run all of my final templated
         | HTML through this to see if I can spot any lingering
         | malformations. Depending on how structured the corrections are,
         | I could make it a test-suite thing.
        
       | midzer wrote:
       | "Failed to Convert. Either the URL is invalid or the server is
       | too busy. Please try again later."
        
         | asadm wrote:
         | can you try now?
        
       | julienreszka wrote:
       | I think it was hugged to death
        
         | Animats wrote:
         | "Either the URL is invalid or the server is too busy".
         | 
         | Aw, tell the user which it is.
        
         | asadm wrote:
         | working on it!
        
           | asadm wrote:
           | I think it should be up now
        
       | franciscop wrote:
       | I also made one like this a while back; you can extract to
       | markdown, HTML, text, or PDF. I found that pages which are just
       | the tool itself are very hard to position for SEO, since there's
       | not a lot of text/content on the page, even if the tool is very
       | useful. Feedback welcome:
       | 
       | https://content-parser.com/
       | 
       | These are all "wrappers around readability" AFAIK (including
       | mine). Readability is the Mozilla project that makes sites look
       | clean, and I use it often.
        
       | remorses wrote:
       | Is the code open source?
        
       | dazzaji wrote:
       | I'm getting the server overload error but assuming this mostly
       | works I'd use it _every day_!
        
         | asadm wrote:
         | can you try now?
        
       | katehikes88 wrote:
       | links -dump
       | 
       | elinks -dump
       | 
       | lynx -dump
       | 
       | let me guess, you need more?
        
         | kwhitefoot wrote:
         | If the page is rendered using JavaScript then yes you need
         | more.
         | 
         | If there were a version of links, elinks, or lynx that executed
         | JS, that would be wonderful.
        
       | rubymamis wrote:
       | If anyone looking for a C++ solution to convert HTML to Markdown,
       | I'm using this library https://github.com/tim-gromeyer/html2md in
       | my app.
        
       | eshoyuan wrote:
       | I tried a lot of tools, but none of them work on websites like
       | https://www.globalrelay.com/company/careers/jobs/?gh_jid=507...
        
       | blobcode wrote:
       | This is one of those things that the ever-amazing pandoc
       | (https://pandoc.org/) does very well, on top of supporting
       | virtually every other document format.
        
         | rubyn00bie wrote:
         | I second this. Pandoc is up there as one of the most useful
         | tools in existence, yet almost no one talks about it. It's
         | amazing, easy to use, and it works. I regularly see new tools in
         | the space pop up, but someone would have to have a REALLY unique
         | and compelling feature, or a highly optimized use case, to get
         | me to use anything else (besides Pandoc).
        
           | thangalin wrote:
           | I wrote a series of blog posts about typesetting Markdown
           | using pandoc:
           | 
           | https://dave.autonoma.ca/blog
           | 
           | Eventually, I found pandoc to be a little limiting:
           | 
           | * Awkward to use interpolated variables within prose.
           | 
           | * No real-time preview prior to rendering the final document.
           | 
           | * Limited options for TeX support (e.g., SVG vs. inline;
           | ConTeXt vs. LaTeX).
           | 
           | * Inconsistent syntax for captions and cross-references.
           | 
           | * Requires glue to apply a single YAML metadata source file
           | to multiple documents (e.g., book chapters).
           | 
           | * Does not (reliably) convert straight quotes to curly
           | quotes.
           | 
           | For my purposes, I wanted to convert variable-laden Markdown
           | and R Markdown to text, XHTML, and PDF formats. Eventually I
           | replaced my toolchain of yamlp + pandoc + knitr by writing
           | an integrated FOSS cross-platform desktop editor.
           | 
           | https://keenwrite.com/
           | 
           | KeenWrite uses flexmark-java + Renjin + KeenTeX + KeenQuotes
           | to provide a solution that can replace pandoc + knitr in some
           | situations.
           | 
           | Note how the caption and cross-reference syntax for images,
           | tables, and equations is unified to use a double-colon sigil:
           | 
           | https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...
           | 
           | There's also command-line usage for integrating into build
           | pipelines:
           | 
           | https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd...
        
       | jamesstidard wrote:
       | Nice. Turn this into a browser extension and I'd install it. Feel
       | like I'd forget about it otherwise.
        
       | osener wrote:
       | Here is an open source alternative to this tool:
       | https://github.com/ozanmakes/scrapedown
        
         | asadm wrote:
         | I think this doesn't load JS before scraping. Many sites
         | hydrate content using JS.
        
           | ushakov wrote:
           | use something like Browserbase?
        
       | jot wrote:
       | Great idea to offer image downloads and filtering with GPT!
       | 
       | I built a similar tool last year that doesn't have those
       | features: https://url2text.com/
       | 
       | Apologies if the UI is slow - you can see some example output on
       | the homepage.
       | 
       | The API it's built on is Urlbox's website screenshot API, which
       | performs far better when used directly. You can request markdown
       | along with JS rendered HTML, metadata and screenshot all in one
       | go: https://urlbox.com/extracting-text
       | 
       | You can even have it all saved directly to your S3-compatible
       | storage: https://urlbox.com/s3
       | 
       | And/or delivered by webhook: https://urlbox.com/webhooks
       | 
       | I've been running over 1 million renders per month using Urlbox's
       | markdown feature for a side project. It's so much better using
       | markdown like this for embeddings and in prompts.
       | 
       | If you want to scrape whole websites like this, you might also
       | want to check out this new tool by dctanner:
       | https://usescraper.com/
        
         | jph00 wrote:
         | Looks nice, but url2text doesn't seem to have an API, and
         | urlbox doesn't seem to have an option to skip the screenshot if
         | you only want the text. And for just the text, it looks to be
         | really expensive.
        
           | jot wrote:
           | Thanks!
           | 
           | Sorry it's not clearer, but you can skip the screenshot in
           | the Urlbox API if you want to with:
           | 
           |     curl -X POST https://api.urlbox.io/v1/render/sync \
           |       -H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
           |       -H 'Content-Type: application/json' \
           |       -d '{"url": "example.com", "format": "md"}'
           | 
           | Here's the result of that:
           | https://renders.urlbox.io/urlbox1/renders/5799274d37a8b4e604...
           | 
           | Sorry the pricing isn't a good fit for you. Urlbox has been
           | running for over 11 years. We're bootstrapped and profitable
           | with a team of 3 (plus a few contractors). We're priced to be
           | sustainable so our customers can depend on us in the long
           | term. We automatically give volume discounts as your usage
           | grows.
        
       | old_dev wrote:
       | It looks like, if the website presents a cookie message, the tool
       | just gets stuck on that and does not parse the actual content. As
       | an example, I tried https://www.cnbc.com/ and all it created was
       | a markdown of the cookie message and some legalese around it.
        
         | jot wrote:
         | It's not easy working around things like that. But here's how
         | it could work: https://url2text.com/u/wYVake
         | 
         | We were lucky to build this on a mature API that already solves
         | loads of the edge cases around rendering different kinds of
         | pages.
        
       | cpeffer wrote:
       | Very cool. We posted about a similar tool we built yesterday
       | 
       | https://www.firecrawl.dev/
       | 
       | It also crawls (although you can scrape single pages as well)
        
       | ben_ja_min wrote:
       | I've been looking for this! My method requires too many steps. I
       | look forward to seeing if this improves my results. Thanks!
        
       | sigio wrote:
       | Tried it earlier today, but it was hugged to death. Tried it now,
       | but it only gave me the contents of the cookie wall, not the page
       | I was looking for.
       | 
       | Tried another page on the same site; then it only gave me the
       | last article on a 6-article page. Some weird things going on.
        
       | brokenmachine wrote:
       | I use markdownload extension for Firefox. Seems to work pretty
       | ok.
        
       | areichert wrote:
       | Nice! I did something similar a while back, but just for substack
       | :)
       | 
       | One example --
       | 
       | URL: https://newsletter.pragmaticengineer.com/p/zirp-software-eng...
       | 
       | Sanitized output: https://substack-ai.vercel.app/u/pragmaticengineer/p/zirp-so...
       | 
       | Raw markdown: https://substack-ai.vercel.app/api/users/pragmaticengineer/p...
       | 
       | (Would be happy to open source it if anyone cares!)
        
         | nbbaier wrote:
         | I'd be interested in seeing the code, I love working on tools
         | like this
        
       | radicalriddler wrote:
       | Is this open sourced anywhere, by any chance? Are you using GPT to
       | do the conversion, or just doing it yourself by way of HTML ->
       | Markdown substitutions?
        
       | selfie wrote:
       | Vercel!! Watch out for your bill now that this is being hugged.
       | Hopefully you are not using <Image /> like they pester you to do.
        
       ___________________________________________________________________
       (page generated 2024-04-14 23:00 UTC)