[HN Gopher] Reader-LM: Small Language Models for Cleaning and Co...
___________________________________________________________________
Reader-LM: Small Language Models for Cleaning and Converting HTML
to Markdown
Author : matteogauthier
Score : 184 points
Date : 2024-09-11 22:07 UTC (1 day ago)
(HTM) web link (jina.ai)
(TXT) w3m dump (jina.ai)
| choeger wrote:
| Maybe I am missing something here, but why would you run "AI" on
| that task when you go from formal language to formal language?
|
| I don't get the usage of "regex/heuristics" either. Why can that
| task not be completely handled by a classical algorithm?
|
| Is it about the removal of non-content parts?
| nickpsecurity wrote:
| It's informal language that has formal language mixed in. The
| informal parts determine how the final document should look.
| So, a simple formal-to-formal translation won't meet their
| needs.
| baq wrote:
| There's HTML and then there's... HTML.
|
| A nicely formatted subset of HTML is very different from the
| DOM tag soup that is more or less the default nowadays.
| JimDabell wrote:
| Tag soup hasn't been a problem for years. The HTML 5
| specification goes into a lot more detail than previous
| specifications when it comes to parsing malformed markup and
| browsers follow it. So no matter the quality of the markup,
| if you throw it at any HTML 5 implementation, you will get
| the same consistent, unambiguous DOM structure.
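|
| A minimal sketch of that guarantee, assuming the html5lib
| package (a pure-Python implementation of the WHATWG parsing
| algorithm):
|
|     import bs4  # pip install beautifulsoup4 html5lib
|
|     # Deliberate tag soup: unclosed <p>s and a stray <td>.
|     soup = bs4.BeautifulSoup("<p>one<p>two<td>cell", "html5lib")
|
|     # The spec's error-recovery rules are deterministic, so every
|     # conforming parser produces this same DOM tree.
|     print(soup.prettify())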
| mithametacs wrote:
| Yeah, you could just pull the parser out of any open-source
| browser and voila: a parser that's not only battle-tested, but
| probably the one the page was developed against.
| faangguyindia wrote:
| That's why the best strategy is to feed the whole page into
| the LLM (after removing HTML tags) and just ask the LLM to
| give you the data you need in the format you need.
|
| If there is lots of JavaScript DOM manipulation happening
| after page load, then just render in a webdriver, screenshot,
| OCR, and feed the result into the LLM and ask it the right
| questions.
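|
| A rough sketch of that pipeline (my_llm_client is a stand-in
| for whatever API you actually call):
|
|     import bs4
|
|     def page_to_prompt(html: str, question: str) -> str:
|         # Drop invisible machinery, then flatten to plain text.
|         soup = bs4.BeautifulSoup(html, "html.parser")
|         for tag in soup(["script", "style", "noscript"]):
|             tag.decompose()
|         text = soup.get_text(separator="\n", strip=True)
|         return f"{question}\n\n---\n{text}"
|
|     # prompt = page_to_prompt(html, "Extract the publication date.")
|     # answer = my_llm_client.complete(prompt)  # hypothetical client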
| mithametacs wrote:
| My intuition is that you'd get better results emptying the
| tags or replacing them with some other delimiter.
|
| Keep the structural hint, remove the noise.
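|
| A toy illustration of that idea, assuming a plain regex pass:
|
|     import re
|
|     html = "<h1>Title</h1><p>Body text.</p>"
|     # Swap each tag for an arbitrary visible delimiter instead of
|     # deleting it, so the model still sees where boundaries were.
|     print(re.sub(r"<[^>]+>", " [SEP] ", html))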
| valstu wrote:
| So the regex version still beats the LLM solution. There's
| also the risk of hallucinations. I wonder if they tried to
| make an SLM which would rewrite or update the existing regex
| solution instead of generating the whole content again? This
| would mean fewer output tokens, faster inference, and output
| that wouldn't contain hallucinations. Although, I'm not sure
| if small language models are capable of writing regex.
| rockstarflo wrote:
| I think regex can beat an SLM for a specific use case. But for
| the general case, there is no chance you come up with a
| pattern that works for all sites.
| fsndz wrote:
| I can't say that enough. Small Language Models are the future.
| https://www.lycee.ai/blog/why-small-language-models-are-the-...
| Diti wrote:
| An _aligned_ future, for sure. Current commercial LLMs refuse
| to talk about "keeping secrets" (protection of identity) or
| pornographic topics (which, in the communities I frequent,
| made up of individuals who have been oppressed partly because
| of their sexuality, are an important subject). And uncensored
| AIs are not really a solution either.
| alexdoesstuff wrote:
| It feels surprising that there isn't a modern best-in-class
| non-LLM alternative for this task. Even in the post, they
| describe using a hodgepodge of headless Chrome, Readability,
| and lots of regex to create content-only HTML.
|
| Best I can tell, everyone is doing something similar, only
| differing in the amount of custom, situation-specific regex
| being used.
| monacobolid wrote:
| How could there possibly be a better solution when there are X
| different ways to do any single thing in HTML(/CSS/JS)? If you
| have a website that uses a canvas to showcase the content
| (think a presentation or something like that), where would you
| even start? People are still discussing whether the semantic
| web is important; not every page is UTF-8 encoded, etc. IMHO
| small LLMs (trained specifically for this) combined with some
| other (more predictable) techniques are the best solution we
| are going to get.
| alexdoesstuff wrote:
| Fully agree on the premise: there are X different ways to do
| anything on the web. But, prior to this, the solution seemed
| to be: everyone starts from scratch with some ad-hoc regex and
| plays a game of whack-a-mole to cover the first n of the X
| different ways to do things.
|
| To the best of my knowledge there isn't anything more modern
| than Mozilla's Readability, and that's essentially a tool from
| the early 2010s.
| foul wrote:
| When does this SLM perform better than hxdelete (or xmlstarlet
| or whatever) + rdrview + pandoc?
| fsiefken wrote:
| The answer is in the OP's Reader-LM report:
|
| About their readability-markdown pipeline: "Some users found it
| too detailed, while others felt it wasn't detailed enough.
| There were also reports that the Readability filter removed the
| wrong content or that Turndown struggled to convert certain
| parts of the HTML into markdown. Fortunately, many of these
| issues were successfully resolved by patching the existing
| pipeline with new regex patterns or heuristics."
|
| To answer their question about the potential of an SLM doing
| this, they see 'room for improvement' - but as their benchmark
| shows, it's not yet up to their classic pipeline.
|
| You echo their research question: "instead of patching it with
| more heuristics and regex (which becomes increasingly difficult
| to maintain and isn't multilingual friendly), can we solve this
| problem end-to-end with a language model?"
| siscia wrote:
| The more I think about it, the less I am against this
| approach.
|
| Instead of applying an obscure set of heuristics by hand, let
| the LM figure out the best way starting from a lot of data.
|
| For experts, the model is bound to be less debuggable and much
| more difficult to update.
|
| But in the general case it will work well enough.
| faangguyindia wrote:
| My brother made: https://github.com/zerocorebeta/Option-K
|
| Basically, it's a utility which completes command lines for
| you.
|
| While playing with it, we thought about creating a custom
| small model for this.
|
| But it was really limiting! If we use a small model trained on
| man pages, bash scripts, Stack Overflow, forums, etc., we miss
| the key component: a larger model like Flash is more effective
| because it knows a lot more about other things.
|
| For example, I can ask this model to simply generate a command
| that lets me download audio from a YouTube URL.
| sippeangelo wrote:
| As much as I would love for this to work, I'm not getting
| great results trying out the 1.5b model in their example
| notebook on Colab.
|
| It is impressively fast, but testing it on an arxiv.org page
| (specifically https://arxiv.org/abs/2306.03872) only gives me a
| short markdown file containing the abstract, the "View PDF" link
| and the submission history. It completely leaves out the title
| (!), authors and other links, which are definitely present in the
| HTML in multiple places!
|
| I'd argue that Arxiv.org is a reasonable example in the age of
| webapps, so what gives?
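|
| For reference, my test boils down to roughly this (model id
| assumed to be jinaai/reader-lm-1.5b on Hugging Face):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "jinaai/reader-lm-1.5b"  # assumed checkpoint id
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     # Raw HTML goes in as the user message; Markdown comes out.
|     html = open("arxiv_2306.03872.html").read()
|     ids = tok.apply_chat_template(
|         [{"role": "user", "content": html}],
|         add_generation_prompt=True, return_tensors="pt")
|     out = model.generate(ids, max_new_tokens=1024)
|     print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))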
| faangguyindia wrote:
| The question is why even use these small models?
|
| When you've got Google Flash, which is lightning fast and
| cheap.
|
| My brother implemented it in option-k:
| https://github.com/zerocorebeta/Option-K
|
| It's near instant. So why waste time on small models? They're
| going to cost more than Google Flash.
| FL33TW00D wrote:
| Privacy, Cost, Latency, Connectivity.
| oezi wrote:
| What is Google Flash? Do you mean Gemini Flash? If so, the
| article points out that general-purpose LLMs are worse than
| this specialized LLM at Markdown conversion.
| sippeangelo wrote:
| In this case it is not, though. As much as I'd like a self-
| hostable, cheap and lean model for this specific task, instead
| we have a completely inflexible model that I can't just
| prompt-tweak to behave better in even not-so-special cases
| like the one above.
|
| I'm sure there are good examples of specialised LLMs that do
| work well (like ones trained on specific sciences), but here
| the model doesn't have enough language comprehension to
| understand plain English instructions. How do I tweak it
| without fine-tuning? With a traditional approach to scraping
| this is trivial, but here it's unfeasible for the end user.
| dartos wrote:
| Sometimes you don't want to share all your data with the
| largest corporations on the planet.
| randomdata wrote:
| Small models often do a much better job when you have a well-
| defined task.
| tatsuya4 wrote:
| In real-world use cases, it seems more appropriate to use
| advanced models to generate suitable rule trees or regular
| expressions for HTML-to-Markdown processing, rather than
| directly using a smaller model to handle each HTML instance.
| The reasons for this approach include:
|
| 1. The quality of HTML-to-Markdown conversion results is
| easier to evaluate.
|
| 2. The HTML-to-Markdown process is essentially a more
| sophisticated form of copy-and-paste, where AI generates
| specific symbols (such as ##, *) rather than content.
|
| 3. Rule-based systems are significantly more cost-effective
| and faster than running an LLM, making them applicable to a
| wider range of scenarios.
|
| These are just my assumptions and judgments. If you have
| practical experience, I'd welcome your insights.
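|
| A toy sketch of the generate-once, apply-many idea (big_llm is
| a stand-in for whatever strong model writes the rule):
|
|     import re
|
|     # One-time, per-site step: ask a strong model for a rule, e.g.
|     # rule_src = big_llm.complete("Write a regex matching the
|     #     heading tags of pages shaped like this sample: ...")
|     rule = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.S)
|
|     # Every page after that is cheap, deterministic matching:
|     def headings_to_markdown(html: str) -> str:
|         return rule.sub(
|             lambda m: "#" * int(m.group(1)) + " " + m.group(2) + "\n",
|             html)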
| igorzij wrote:
| Why Claude 3.5 Sonnet is missing from the benchmark? Even if the
| real reason is different and completely legitimate, or perhaps
| purely random, it comes across as "claude does better than our
| new model so we omitted it because we wanted the tallest bars on
| the chart to be ours". And as soon as the reader thinks that,
| they may start to question everything else in your work, which is
| genuinely awesome!
| faangguyindia wrote:
| It's damn slow and overkill for such a task.
| smusamashah wrote:
| As per Reddit, their API that converts HTML to Markdown can be
| used by appending a URL to https://r.jina.ai, like
| https://r.jina.ai/https://news.ycombinator.com/item?id=41515...
|
| I don't know if it's using their new model or their engine.
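|
| Based on that usage, fetching a page as Markdown is a one-
| liner:
|
|     import requests
|
|     # Prepend the target URL to the Reader endpoint.
|     md = requests.get(
|         "https://r.jina.ai/https://news.ycombinator.com/").text
|     print(md[:500])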
| vladde wrote:
| Unfortunately I'm not getting any good results for RFC 3339
| (https://www.rfc-editor.org/rfc/rfc3339), a page where I think
| it would be great to convert the text into readable Markdown.
|
| The end result is just like the original site but without any
| headings and with a lot of whitespace still remaining (but
| with some non-working links inserted) :/
|
| Using their API link, this is what it looks like:
| https://r.jina.ai/https://www.rfc-editor.org/rfc/rfc3339
| lelandfe wrote:
| That's their existing API (which I also tried, with... less
| than desirable results). This post is about a new model,
| `reader-lm`, which isn't in production yet.
| bberenberg wrote:
| Tested it using the model in Google Colab and it did ok, but
| the output is truncated at the following line:
|
| > [Appendix B](#appendix-B). Day
|
| So not sure if it's the length of the page, or something else,
| but in the end, it doesn't really work?
| Dowwie wrote:
| I'm curious about the dataset. What scenarios need to be covered
| during training?
| WesolyKubeczek wrote:
| Whatever happened to parsing HTML with regexes, that you now
| need a beefy GPU/CPU/NPU to convert HTML to Markdown?
| denidoman wrote:
| Next step: websites add irrelevant text and prompt injections
| into hidden DOM nodes, tag attributes, etc. to prevent
| LLM-based scraping.
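|
| The obvious scraper-side counter is to drop nodes a human
| couldn't see before the page reaches the model; a crude sketch
| (real CSS visibility would need a renderer):
|
|     import bs4
|
|     html = open("page.html").read()
|     soup = bs4.BeautifulSoup(html, "html.parser")
|     # Remove elements hidden via common attributes/inline styles.
|     for node in soup.select(
|             '[hidden], [aria-hidden="true"], [style*="display:none"]'):
|         node.decompose()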
| rwl4 wrote:
| Not sure about the quality of the model's output. But I really
| appreciate this little mini-paper they produced. It gives a nice
| concise description of their goals, benchmarks, dataset
| preparation, model sizes, challenges and conclusion. And the
| whole thing is about a 5-10 minute read.
| MantisShrimp90 wrote:
| I never really understand this reasoning of "regex is hard to
| reason about, so we just use a custom-made LLM instead!" I get
| that it's trendy, but reasoning about LLMs is impossible for
| many devs; the idea that this makes it more maintainable is
| pretty hilarious.
| nickpsecurity wrote:
| Regexes require you to understand what the obscure-looking
| patterns do, character by character, in a pile of text. Then
| across different piles of text. Then while juggling different
| regexes.
|
| For an LLM, you can just tune it to produce the right output
| using examples. Your brain doesn't have to understand the
| tedious things it's doing.
|
| This also replaces a boring, tedious job with one (LLMs)
| that's more interesting. Programmers enjoy those
| opportunities.
| generalizations wrote:
| In either case you end up with an inscrutable black box into
| which you pass your HTML... honestly, I'd prefer the black box
| that runs more efficiently and is intelligible to at least
| some people (or most, with the help of a big LLM).
| Onavo wrote:
| We need one that operates on the visual output.
| coreypreston wrote:
| Pandoc does this very well.
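|
| E.g., shelling out to it (assuming pandoc is on your PATH):
|
|     import subprocess
|
|     # -f html -t gfm: HTML in, GitHub-flavored Markdown out.
|     md = subprocess.run(
|         ["pandoc", "-f", "html", "-t", "gfm", "page.html"],
|         capture_output=True, text=True, check=True).stdout
|     print(md)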
___________________________________________________________________
(page generated 2024-09-12 23:02 UTC)