[HN Gopher] Reader-LM: Small Language Models for Cleaning and Co...
       ___________________________________________________________________
        
       Reader-LM: Small Language Models for Cleaning and Converting HTML
       to Markdown
        
       Author : matteogauthier
       Score  : 184 points
        Date   : 2024-09-11 22:07 UTC (1 day ago)
        
 (HTM) web link (jina.ai)
 (TXT) w3m dump (jina.ai)
        
       | choeger wrote:
       | Maybe I am missing something here, but why would you run "AI" on
       | that task when you go from formal language to formal language?
       | 
       | I don't get the usage of "regex/heuristics" either. Why can that
       | task not be completely handled by a classical algorithm?
       | 
       | Is it about the removal of non-content parts?
        
         | nickpsecurity wrote:
         | It's informal language that has formal language mixed in. The
         | informal parts determine how the final document should look.
         | So, a simple formal-to-formal translation won't meet their
         | needs.
        
         | baq wrote:
         | There's html and then there's... html.
         | 
         | A nicely formatted subset of html is very different from a dom
         | tag soup that is more or less the default nowadays.
        
           | JimDabell wrote:
           | Tag soup hasn't been a problem for years. The HTML 5
           | specification goes into a lot more detail than previous
           | specifications when it comes to parsing malformed markup and
           | browsers follow it. So no matter the quality of the markup,
           | if you throw it at any HTML 5 implementation, you will get
           | the same consistent, unambiguous DOM structure.
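            | 
            | For example, a minimal sketch using html5lib, which
            | implements the same WHATWG parsing algorithm the browsers
            | do (any spec-compliant parser should give the same tree):
            | 
            |       # pip install html5lib
            |       import html5lib
            |       import xml.etree.ElementTree as ET
            | 
            |       # Deliberately malformed markup: mis-nested tags
            |       broken = "<p><b>bold <i>both</b> italic</p>"
            | 
            |       tree = html5lib.parse(broken, treebuilder="etree",
            |                             namespaceHTMLElements=False)
            |       # The spec defines exactly how the mis-nesting is
            |       # repaired, so every HTML5 parser yields the same DOM
            |       print(ET.tostring(tree, encoding="unicode"))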
        
             | mithametacs wrote:
              | Yeah, you could just pull the parser out of any open-
              | source browser, and voila: a parser that's not only
              | battle-tested, but probably the one the page was
              | developed against.
        
           | faangguyindia wrote:
            | That's why the best strategy is to feed the whole page
            | into the LLM (after removing HTML tags) and just ask the
            | LLM to give you the data you need in the format you need.
            | 
            | If there's lots of JavaScript DOM manipulation happening
            | after page load, then just render in a webdriver,
            | screenshot, OCR, and feed the result into the LLM and ask
            | it the right questions.
        
             | mithametacs wrote:
             | My intuition is that you'd get better results emptying the
             | tags or replacing them with some other delimiter.
             | 
             | Keep the structural hint, remove the noise.
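              | 
              | Crude sketch of what I mean (yes, regex over HTML, but
              | only to illustrate the delimiter idea):
              | 
              |       import re
              | 
              |       # Swap every tag for one delimiter: the model
              |       # keeps the boundary hint, loses the noise
              |       def tags_to_delims(html, delim=" § "):
              |           return re.sub(r"<[^>]+>", delim, html)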
        
       | valstu wrote:
        | So the regex version still beats the LLM solution. There's
        | also the risk of hallucinations. I wonder if they tried to
        | make an SLM which would rewrite or update the existing regex
        | solution instead of generating the whole content again? This
        | would mean fewer output tokens, faster inference, and output
        | that wouldn't contain hallucinations. Although, I'm not sure
        | small language models are capable of writing regex.
        
         | rockstarflo wrote:
          | I think regex can beat an SLM for a specific use case. But
          | for the general case, there is no chance you'd come up with
          | a pattern that works for all sites.
        
       | fsndz wrote:
        | I can't say that enough. Small Language Models are the future.
       | https://www.lycee.ai/blog/why-small-language-models-are-the-...
        
         | Diti wrote:
         | An _aligned_ future, for sure. Current commercial LLMs refuse
         | to talk about "keeping secrets" (protection of identity) or
         | pornographic topics (which, in the communities I frequent -
         | made of individuals who have been oppressed partly because of
         | their sexuality -, is an important subject). And uncensored AIs
         | are not really a solution either.
        
       | alexdoesstuff wrote:
       | Feels surprising that there isn't a modern best-in-class non-LLM
        | alternative for this task. Even in the post, they describe
        | using a hodgepodge of headless Chrome, Readability, and lots
        | of regex to create content-only HTML.
       | 
        | Best I can tell, everyone is doing something similar, only
        | differing in the amount of custom, situation-specific regex
        | being used.
        
         | monacobolid wrote:
         | How could it possibly be (a better solution) when there are X
         | different ways to do any single thing in html(/css/js)? If you
         | have a website that uses a canvas to showcase the content
         | (think presentation or something like that), where would you
         | even start? People are still discussing whether the semantic
          | web is important; not every page is UTF-8 encoded, etc. IMHO
          | small LLMs (trained specifically for this) combined with some
         | other (more predictable) techniques are the best solution we
         | are going to get.
        
           | alexdoesstuff wrote:
           | Fully agree on the premise: there are X different ways to do
           | anything on the web. But - prior to this - the solution
            | seemed to be: everyone starts from scratch with some ad-hoc
            | regex and plays a game of whack-a-mole to cover the first n
            | of the X different ways to do things.
            | 
            | To the best of my knowledge, there isn't anything more
            | modern than Mozilla's Readability, and that's essentially a
            | tool from the early 2010s.
        
       | foul wrote:
        | When does this SLM perform better than hxdelete (or xmlstarlet or
       | whatever) + rdrview + pandoc?
        
         | fsiefken wrote:
         | The answer is in the OP's Reader-LM report:
         | 
         | About their readability-markdown pipeline: "Some users found it
         | too detailed, while others felt it wasn't detailed enough.
         | There were also reports that the Readability filter removed the
         | wrong content or that Turndown struggled to convert certain
         | parts of the HTML into markdown. Fortunately, many of these
         | issues were successfully resolved by patching the existing
         | pipeline with new regex patterns or heuristics."
         | 
          | To answer their question about the potential of an SLM doing
          | this, they see 'room for improvement' - but as their
          | benchmark shows, it's not yet up to their classic pipeline.
         | 
         | You echo their research question: "instead of patching it with
         | more heuristics and regex (which becomes increasingly difficult
         | to maintain and isn't multilingual friendly), can we solve this
         | problem end-to-end with a language model?"
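          | 
          | For reference, a minimal sketch of that kind of classic
          | pipeline in Python (readability-lxml standing in for
          | Readability, markdownify for Turndown):
          | 
          |       # pip install readability-lxml markdownify
          |       from readability import Document
          |       from markdownify import markdownify as md
          | 
          |       def html_to_markdown(html: str) -> str:
          |           doc = Document(html)       # main-content extraction
          |           main_html = doc.summary()  # cleaned-up HTML fragment
          |           body = md(main_html, heading_style="ATX")
          |           # ...followed by the regex/heuristic patches the
          |           # report describes, which is where maintenance hurts
          |           return f"# {doc.title()}\n\n{body}"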
        
       | siscia wrote:
        | The more I think about it, the less I am against this
        | approach.
        | 
        | Instead of applying an obscure set of heuristics by hand, let
        | the LM figure out the best way, starting from a lot of data.
        | 
        | For experts, the model is bound to be less debuggable and
        | much more difficult to update.
       | 
       | But in the general case it will work well enough.
        
       | faangguyindia wrote:
       | My brother made: https://github.com/zerocorebeta/Option-K
       | 
        | Basically, it's a utility which completes command lines for you.
       | 
       | While playing with it, we thought about creating a custom small
       | model for this.
       | 
        | But it was really limiting! If we use a small model trained
        | on man pages, bash scripts, Stack Overflow, forums, etc...
        | 
        | We'd miss the key component: using a larger model like Flash
        | is more effective, as that model knows a lot more about other
        | things.
       | 
        | For example, I can ask this model to simply generate a
        | command that lets me download audio from a YouTube URL.
        
       | sippeangelo wrote:
       | For as much as I would love for this to work, I'm not getting
       | great results trying out the 1.5b model in their example notebook
       | on Colab.
       | 
       | It is impressively fast, but testing it on an arxiv.org page
       | (specifically https://arxiv.org/abs/2306.03872) only gives me a
       | short markdown file containing the abstract, the "View PDF" link
       | and the submission history. It completely leaves out the title
       | (!), authors and other links, which are definitely present in the
       | HTML in multiple places!
       | 
       | I'd argue that Arxiv.org is a reasonable example in the age of
       | webapps, so what gives?
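        | 
        | For anyone who wants to reproduce this, the notebook boils
        | down to roughly the following (model id and chat-style input
        | as I understand the Hugging Face release; treat it as a
        | sketch, not gospel):
        | 
        |       # pip install transformers torch
        |       from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |       checkpoint = "jinaai/reader-lm-1.5b"
        |       tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        |       model = AutoModelForCausalLM.from_pretrained(checkpoint)
        | 
        |       # Raw HTML goes in as the user message
        |       html = "<html><body><h1>Hello</h1></body></html>"
        |       messages = [{"role": "user", "content": html}]
        |       inputs = tokenizer.apply_chat_template(
        |           messages, tokenize=True, add_generation_prompt=True,
        |           return_tensors="pt")
        |       outputs = model.generate(inputs, max_new_tokens=1024)
        |       print(tokenizer.decode(outputs[0], skip_special_tokens=True))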
        
         | faangguyindia wrote:
          | The question is: why even use these small models?
          | 
          | When you've got Google Flash, which is lightning fast and
          | cheap.
          | 
          | My brother implemented it in Option-K:
          | https://github.com/zerocorebeta/Option-K
          | 
          | It's near instant. So why waste time on small models? It's
          | going to cost more than Google Flash.
        
           | FL33TW00D wrote:
           | Privacy, Cost, Latency, Connectivity.
        
           | oezi wrote:
            | What is Google Flash? Do you mean Gemini Flash? If so, the
            | article says that general-purpose LLMs are worse than this
            | specialized LLM for Markdown conversion.
        
             | sippeangelo wrote:
             | In this case it is not, though. As much as I'd like a self-
             | hostable, cheap and lean model for this specific task,
             | instead we have a completely inflexible model that I can't
             | just prompt tweak to behave better in even not-so-special
             | cases like above.
             | 
             | I'm sure there are good examples of specialised LLMs that
             | do work well (like ones that are trained on specific
             | sciences), but here the model doesn't have enough language
             | comprehension to understand plain English instructions. How
             | do I tweak it without fine-tuning? With a traditional
              | approach to scraping this is trivial, but here it's
              | infeasible for the end user.
        
           | dartos wrote:
           | Sometimes you don't want to share all your data with the
           | largest corporations on the planet.
        
           | randomdata wrote:
           | Small models often do a much better job when you have a well-
           | defined task.
        
       | tatsuya4 wrote:
       | In real-world use cases, it seems more appropriate to use
       | advanced models to generate suitable rule trees or regular
        | expressions for HTML-to-Markdown processing, rather than directly
       | using a smaller model to handle each HTML instance. The reasons
       | for this approach include:
       | 
        | 1. The quality of HTML-to-Markdown conversion results is easier to
       | evaluate.
       | 
        | 2. The HTML-to-Markdown process is essentially a more
       | sophisticated form of copy-and-paste, where AI generates specific
       | symbols (such as ##, *) rather than content.
       | 
       | 3. Rule-based systems are significantly more cost-effective and
       | faster than running an LLM, making them applicable to a wider
       | range of scenarios.
       | 
       | These are just my assumptions and judgments. If you have
       | practical experience, I'd welcome your insights.
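        | 
        | As a sketch of what I mean by a rule tree (the tag-to-prefix
        | mapping here stands in for rules a big model would generate
        | once per site, then get cached and reviewed):
        | 
        |       # pip install beautifulsoup4
        |       from bs4 import BeautifulSoup
        | 
        |       # Hypothetical, model-generated ruleset for one site
        |       RULES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": ""}
        | 
        |       def apply_rules(html: str) -> str:
        |           soup = BeautifulSoup(html, "html.parser")
        |           out = []
        |           # Walk matching elements in document order
        |           for el in soup.find_all(list(RULES)):
        |               out.append(RULES[el.name] + el.get_text(strip=True))
        |           return "\n\n".join(out)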
        
       | igorzij wrote:
        | Why is Claude 3.5 Sonnet missing from the benchmark? Even if the
       | real reason is different and completely legitimate, or perhaps
       | purely random, it comes across as "claude does better than our
       | new model so we omitted it because we wanted the tallest bars on
       | the chart to be ours". And as soon as the reader thinks that,
       | they may start to question everything else in your work, which is
       | genuinely awesome!
        
         | faangguyindia wrote:
          | It's damn slow and overkill for such a task.
        
       | smusamashah wrote:
        | As per Reddit, their API that converts HTML to Markdown can be
        | used by appending a URL to https://r.jina.ai, like
        | https://r.jina.ai/https://news.ycombinator.com/item?id=41515...
        | 
        | I don't know if it's using their new model or their engine.
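        | 
        | In code it's essentially a one-liner, e.g. (sketch):
        | 
        |       import requests
        | 
        |       target = "https://news.ycombinator.com"
        |       markdown = requests.get("https://r.jina.ai/" + target).text
        |       print(markdown)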
        
       | vladde wrote:
        | Unfortunately I'm not getting any good results for RFC 3339
        | (https://www.rfc-editor.org/rfc/rfc3339), a page where I think
        | it would be great to convert the text into readable Markdown.
        | 
        | The end result is just like the original site, but without any
        | headings and with a lot of whitespace still remaining (and
        | some non-working links inserted) :/
        | 
        | Using their API link, this is what it looks like:
        | https://r.jina.ai/https://www.rfc-editor.org/rfc/rfc3339
        
         | lelandfe wrote:
         | That's their existing API (which I also tried, with... less
         | than desirable results). This post is about a new model,
         | `reader-lm`, which isn't in production yet.
        
         | bberenberg wrote:
         | Tested it using the model in Google Colab and it did ok, but
         | the output is truncated at the following line:
         | 
         | > [Appendix B](#appendix-B). Day
         | 
         | So not sure if it's the length of the page, or something else,
         | but in the end, it doesn't really work?
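          | 
          | (One guess, unverified: the notebook's generation cap. If
          | that's it, raising max_new_tokens in the generate call
          | should get further, e.g.
          | 
          |       outputs = model.generate(inputs, max_new_tokens=4096)
          | 
          | though the model's context window still bounds how much
          | HTML fits in the first place.)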
        
       | Dowwie wrote:
       | I'm curious about the dataset. What scenarios need to be covered
       | during training?
        
       | WesolyKubeczek wrote:
        | Whatever happened to parsing HTML with regexes, that you now
        | need a beefy GPU/CPU/NPU to convert HTML to Markdown?
        
       | denidoman wrote:
        | Next step: websites add irrelevant text and prompt injections
        | into hidden DOM nodes, tag attributes, etc. to prevent
        | LLM-based scraping.
        
       | rwl4 wrote:
       | Not sure about the quality of the model's output. But I really
       | appreciate this little mini-paper they produced. It gives a nice
       | concise description of their goals, benchmarks, dataset
       | preparation, model sizes, challenges and conclusion. And the
       | whole thing is about a 5-10 minute read.
        
       | MantisShrimp90 wrote:
        | I never really understand this reasoning of "regex is hard to
        | reason about, so we just use an LLM we custom-made instead!"
        | I get that it's trendy, but reasoning about LLMs is impossible
        | for many devs; the idea that this makes it more maintainable
        | is pretty hilarious.
        
         | nickpsecurity wrote:
          | Regexes require you to understand what the obscure-looking
          | patterns do, character by character, in a pile of text. Then
          | across different piles of text. Then juggling different
          | regexes.
          | 
          | For an LLM, you can just tune it to produce the right output
          | using examples. Your brain doesn't have to understand the
          | tedious things it's doing.
          | 
          | This also replaces a boring, tedious job with one (LLMs)
          | that's more interesting. Programmers enjoy those
          | opportunities.
        
           | generalizations wrote:
           | In either case you end up with an inscrutable black box into
           | which you pass your html...honestly I'd prefer the black box
           | that runs more efficiently and is intelligible to at least
           | some people (or most, with the help of a big LLM).
        
       | Onavo wrote:
        | We need one that operates on the visual output.
        
       | coreypreston wrote:
       | Pandoc does this very well.
        
       ___________________________________________________________________
       (page generated 2024-09-12 23:02 UTC)