[HN Gopher] Show HN: I made a tool to clean and convert any webp...
___________________________________________________________________
Show HN: I made a tool to clean and convert any webpage to Markdown
My partner usually writes substack posts which I then mirror to our
website's blog section. To automate this, I made a simple tool to
scrape the post and clean it so that I can drop it into our blog
easily. This might be useful to others as well. Oh, and of course
you can instruct GPT to make any final edits :D
Author : asadalt
Score : 203 points
Date : 2024-04-14 19:03 UTC (3 hours ago)
(HTM) web link (markdowndown.vercel.app)
(TXT) w3m dump (markdowndown.vercel.app)
| LeonidBugaev wrote:
| One of the cases where AI is not needed. There is a very good
| algorithm for extracting content from pages; one implementation:
| https://github.com/buriy/python-readability
| asadalt wrote:
| oh AI is optional here. I do use readability to clean the html
| before converting to .md.
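|
| A minimal sketch of that kind of readability -> Markdown pipeline
| (not the site's actual code; requests and markdownify are assumed
| extras, and readability-lxml is the pip package for
| buriy/python-readability):
|
|     import requests                      # assumed fetch step
|     from readability import Document     # pip install readability-lxml
|     from markdownify import markdownify  # pip install markdownify
|
|     def page_to_markdown(url: str) -> str:
|         html = requests.get(url, timeout=30).text
|         doc = Document(html)       # readability boilerplate removal
|         title = doc.title()
|         body = doc.summary()       # cleaned article HTML
|         return f"# {title}\n\n" + markdownify(body)
|
|     print(page_to_markdown("https://example.com/some-post"))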
| haddr wrote:
| Some years ago I compared those boilerplate removal tools and I
| remember that jusText was giving me the best results out of the
| box (tried readability and a few other libraries too). I wonder
| what is the state of the art today?
| IanCal wrote:
| How do you achieve the same things without AI here using that
| tool?
| chrisweekly wrote:
| "How do you do it without AI" is a question I (sadly) expect
| to see more often.
| IanCal wrote:
| Feel free to answer, then: how do you do the same things this
| does with gpt(3/4), without AI?
|
| Edit -
|
| This is an excellent use of it: a free-text human input capable
| of doing things like extracting summaries. It does not seem to
| be used at all for the basic task of extracting content, only
| for post-filtering.
| cactusfrog wrote:
| I think "copy from a PDF" could be improved with AI. It's
| been 30 years and I still get new lines in the middle of
| sentences when I try to copy from one.
| IanCal wrote:
| That's a great use case. You might be able to do this if you've
| got copy and paste on the command line, with
|
| https://github.com/simonw/llm
|
| in between: an alias like pdfwtf translating to "paste |
| llm _command_ | copy".
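|
| A rough Python take on that idea (pyperclip for the clipboard and
| the model name are assumptions; llm's Python API stands in for
| the shell pipe):
|
|     import llm        # pip install llm  (simonw/llm)
|     import pyperclip  # pip install pyperclip -- assumed clipboard
|
|     def fix_pdf_paste() -> None:
|         text = pyperclip.paste()
|         model = llm.get_model("gpt-4o-mini")  # assumed model name
|         response = model.prompt(
|             "Rejoin lines broken mid-sentence; keep real paragraph "
|             "breaks. Return only the fixed text.\n\n" + text
|         )
|         pyperclip.copy(response.text())
|
|     fix_pdf_paste()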
| genewitch wrote:
| i've long assumed that is a "feature" of PDF akin to DRM.
| Making it hard to copy text from a PDF makes sense from a
| publisher's standpoint.
| hombre_fatal wrote:
| Meh, it's just the "how does it work?" question. How
| content extractors work is interesting and neither obvious nor
| trivial.
|
| And even when you see how the readability parser works, AI
| handles most of the edge cases that content extractors fail
| on, so they are genuinely superseded by LLMs.
| foundzen wrote:
| how does it compare to mozilla/readability?
| asadm wrote:
| it uses readability but does some additional stuff like
| relinking images to local paths etc., which I needed
| fbdab103 wrote:
| I was honestly expecting it to be mostly black magic, but it
| looks like the meat of the project is a bunch of (surely hard
| won) regexes. Nifty.
| jot wrote:
| Last time I tried readability it worked well with articles but
| struggled with other kinds of pages. Took away far more content
| than I wanted it to.
| ChadNauseam wrote:
| This is awesome. I kind of want a browser extension that does
| this to every page I read and saves them somewhere.
| ElFitz wrote:
| Wouldn't that describe Pocket, Readwise Reader, Matter, or
| another one of those many apps?
|
| Edit: read too fast. Didn't notice the automatic and systematic
| aspects.
| BHSPitMonkey wrote:
| Pocket saves the address; I don't think they save the
| content.
| ZunarJ5 wrote:
| Wallabag + Obsidian + Wallabag Browser Ext. It's a manual
| trigger but it's great.
| smartmic wrote:
| My choice (manual): Markdown clipper
|
| https://github.com/deathau/markdown-clipper
|
| I guess there are dozens of alternative extensions available
| out there ...
| Terretta wrote:
| This fork:
|
| https://github.com/deathau/markdownload
|
| With extensions available for Firefox, Google Chrome,
| Microsoft Edge, and Safari.
| Davidbrcz wrote:
| Singlefile for Firefox
| https://addons.mozilla.org/en-US/firefox/addon/single-file/
| kwhitefoot wrote:
| Also WebScrapBook: https://github.com/danny0838/webscrapbook
| jwoq9118 wrote:
| Omnivore. Saves a copy using web archive.
|
| https://omnivore.app/
| cratermoon wrote:
| I've found htmltidy [1] and pandoc [2] html->markdown
| sufficiently capable.
|
| [1] http://www.html-tidy.org/
|
| [2] https://pandoc.org/
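|
| A minimal sketch of that tidy -> pandoc pipeline driven from
| Python (the file names are placeholders; gfm is just one of
| pandoc's Markdown output flavours):
|
|     import subprocess
|
|     def html_to_markdown(src: str = "page.html",
|                          dst: str = "page.md") -> None:
|         # tidy -q -asxhtml: quietly repair the HTML into XHTML
|         tidy = subprocess.run(
|             ["tidy", "-q", "-asxhtml", src],
|             capture_output=True, text=True,
|         )
|         # pandoc: convert the tidied HTML on stdin to Markdown
|         subprocess.run(
|             ["pandoc", "-f", "html", "-t", "gfm", "-o", dst],
|             input=tidy.stdout, text=True, check=True,
|         )
|
|     html_to_markdown()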
| fbdab103 wrote:
| Never heard of tidy, this looks promising.
|
| I am kind of tempted/horrified to run all of my final templated
| HTML through this and see if I can spot any lingering
| malformations. Depending on how structured the corrections are,
| could make it a test-suite thing.
| midzer wrote:
| "Failed to Convert. Either the URL is invalid or the server is
| too busy. Please try again later."
| asadm wrote:
| can you try now?
| julienreszka wrote:
| I think it was hugged to death
| Animats wrote:
| "Either the URL is invalid or the server is too busy".
|
| Aw, tell the user which it is.
| asadm wrote:
| working on it!
| asadm wrote:
| I think it should be up now
| franciscop wrote:
| I also made one like this a while back; you can extract to
| markdown, HTML, text or PDF. I found that pages which are just
| the bare tool are very hard to position for SEO, since there's
| not a lot of text/content on the page, even if the tool itself
| could be very useful. Feedback welcome:
|
| https://content-parser.com/
|
| These are all "wrappers around readability" AFAIK (including
| mine), which is the Mozilla project that makes sites look clean,
| and which I use often.
| remorses wrote:
| Is the code open source?
| dazzaji wrote:
| I'm getting the server overload error but assuming this mostly
| works I'd use it _every day_!
| asadm wrote:
| can you try now?
| katehikes88 wrote:
| links -dump
|
| elinks -dump
|
| lynx -dump
|
| let me guess, you need more?
| kwhitefoot wrote:
| If the page is rendered using JavaScript then yes you need
| more.
|
| If there were a version of links, elinks, or lynx that executed
| JS that would be wonderful.
| rubymamis wrote:
| If anyone is looking for a C++ solution to convert HTML to Markdown,
| I'm using this library https://github.com/tim-gromeyer/html2md in
| my app.
| eshoyuan wrote:
| I tried a lot of tools, but none of them work on websites like
| https://www.globalrelay.com/company/careers/jobs/?gh_jid=507...
| blobcode wrote:
| This is one of those things that the ever-amazing pandoc
| (https://pandoc.org/) does very well, on top of supporting
| virtually every other document format.
| rubyn00bie wrote:
| I second this. Pandoc is up there as one of the most useful
| tools in existence, yet almost no one talks about it. It's
| amazing, easy to use, and it works. I regularly see new tools
| in the space pop up, but someone would have to have a REALLY
| unique and compelling feature, or a highly optimized use case,
| to get me to use anything else (besides Pandoc).
| thangalin wrote:
| I wrote a series of blog posts about typesetting Markdown
| using pandoc:
|
| https://dave.autonoma.ca/blog
|
| Eventually, I found pandoc to be a little limiting:
|
| * Awkward to use interpolated variables within prose.
|
| * No real-time preview prior to rendering the final document.
|
| * Limited options for TeX support (e.g., SVG vs. inline;
| ConTeXt vs. LaTeX).
|
| * Inconsistent syntax for captions and cross-references.
|
| * Requires glue to apply a single YAML metadata source file
| to multiple documents (e.g., book chapters).
|
| * Does not (reliably) convert straight quotes to curly
| quotes.
|
| For my purposes, I wanted to convert variable-laden Markdown
| and R Markdown to text, XHTML, and PDF formats. Eventually I
| replaced my tool chain of yamlp + pandoc + knitr by writing
| an integrated FOSS cross-platform desktop editor.
|
| https://keenwrite.com/
|
| KeenWrite uses flexmark-java + Renjin + KeenTeX + KeenQuotes
| to provide a solution that can replace pandoc + knitr in some
| situations.
|
| Note how the captions and cross-reference syntax for images,
| tables, and equations is unified to use a double-colon sigil:
|
| https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref.
| ..
|
| There's also command-line usage for integrating into build
| pipelines:
|
| https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd.
| ..
| jamesstidard wrote:
| Nice. Turn this into a browser extension and I'd install it. I
| feel like I'd forget about it otherwise.
| osener wrote:
| Here is an open source alternative to this tool:
| https://github.com/ozanmakes/scrapedown
| asadm wrote:
| I think this doesn't load js before scraping. Many sites
| hydrate content using JS.
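|
| One common way around that (Playwright here is an assumption,
| not what the site actually uses) is to render the page in a
| headless browser first and only then run readability over it:
|
|     # pip install playwright readability-lxml
|     # then run once: playwright install chromium
|     from playwright.sync_api import sync_playwright
|     from readability import Document
|
|     def rendered_article_html(url: str) -> str:
|         with sync_playwright() as p:
|             browser = p.chromium.launch()
|             page = browser.new_page()
|             page.goto(url, wait_until="networkidle")  # let JS hydrate
|             html = page.content()
|             browser.close()
|         return Document(html).summary()  # clean the rendered HTML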
| ushakov wrote:
| use something like Browserbase?
| jot wrote:
| Great idea to offer image downloads and filtering with GPT!
|
| I built a similar tool last year that doesn't have those
| features: https://url2text.com/
|
| Apologies if the UI is slow - you can see some example output on
| the homepage.
|
| The API it's built on is Urlbox's website screenshot API which
| performs far better when used directly. You can request markdown
| along with JS rendered HTML, metadata and screenshot all in one
| go: https://urlbox.com/extracting-text
|
| You can even have it all saved directly to your S3-compatible
| storage: https://urlbox.com/s3
|
| And/or delivered by webhook: https://urlbox.com/webhooks
|
| I've been running over 1 million renders per month using Urlbox's
| markdown feature for a side project. It's so much better using
| markdown like this for embeddings and in prompts.
|
| If you want to scrape whole websites like this you might also
| want to check out this new tool by dctanner:
| https://usescraper.com/
| jph00 wrote:
| Looks nice, but url2text doesn't seem to have an API, and
| urlbox doesn't seem to have an option to skip the screenshot if
| you only want the text. And for just the text, it looks to be
| really expensive.
| jot wrote:
| Thanks!
|
| Sorry it's not clearer but you can skip the screenshot in the
| Urlbox API if you want to with:
|
|     curl -X POST https://api.urlbox.io/v1/render/sync \
|       -H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
|       -H 'Content-Type: application/json' \
|       -d '{ "url": "example.com", "format": "md" }'
|
| Here's the result of that: https://renders.urlbox.io/urlbox1/
| renders/5799274d37a8b4e604...
|
| Sorry the pricing isn't a good fit for you. Urlbox has been
| running for over 11 years. We're bootstrapped and profitable
| with a team of 3 (plus a few contractors). We're priced to be
| sustainable so our customers can depend on us in the long
| term. We automatically give volume discounts as your usage
| grows.
| old_dev wrote:
| It looks like, if the website presents a cookie message, the tool
| just gets stuck on that and does not parse the actual content. As
| an example, I tried https://www.cnbc.com/ and all it created was
| a markdown of the cookie message and some legalese around it.
| jot wrote:
| It's not easy working around things like that. But here's how
| it could work: https://url2text.com/u/wYVake
|
| We were lucky to build this on a mature API that already solves
| loads of the edge cases around rendering different kinds of
| pages.
| cpeffer wrote:
| Very cool. We posted about a similar tool we built yesterday
|
| https://www.firecrawl.dev/
|
| It also crawls (although you can scrape single pages as well)
| ben_ja_min wrote:
| I've been looking for this! My method requires too many steps. I
| look forward to seeing if this improves my results. Thanks!
| sigio wrote:
| Tried it earlier today, but it was hugged to death. Tried it now,
| but it only gave me the contents of the cookie-wall, not the page
| I was looking for.
|
| Tried it on another page of the same site; then it only gave me
| the last article on a 6-article page. Some weird things going on.
| brokenmachine wrote:
| I use markdownload extension for Firefox. Seems to work pretty
| ok.
| areichert wrote:
| Nice! I did something similar a while back, but just for substack
| :)
|
| One example --
|
| URL: https://newsletter.pragmaticengineer.com/p/zirp-software-
| eng...
|
| Sanitized output:
| https://substack-ai.vercel.app/u/pragmaticengineer/p/zirp-so...
|
| Raw markdown:
| https://substack-ai.vercel.app/api/users/pragmaticengineer/p...
|
| (Would be happy to open source it if anyone cares!)
| nbbaier wrote:
| I'd be interested in seeing the code, I love working on tools
| like this
| radicalriddler wrote:
| Is this open sourced anywhere by any chance? Are you using GPT to
| do the conversion, or just doing it yourself by way of HTML ->
| Markdown substitutions?
| selfie wrote:
| Vercel!! Watch out for your bill now that this is being hugged.
| Hopefully you are not using <Image /> like they pester you to do.
___________________________________________________________________
(page generated 2024-04-14 23:00 UTC)