[HN Gopher] Nougat: Neural Optical Understanding for Academic Do...
       ___________________________________________________________________
        
       Nougat: Neural Optical Understanding for Academic Documents
        
       Author : falkaer
       Score  : 24 points
       Date   : 2023-08-31 19:07 UTC (3 hours ago)
        
 (HTM) web link (facebookresearch.github.io)
 (TXT) w3m dump (facebookresearch.github.io)
        
       | stevenae wrote:
       | Missed opportunity to call it "...Texts" instead of "Documents".
        
       | czbond wrote:
       | pretty impressive
        
       | echo_time wrote:
       | Funnily enough, Example Page 1 is wrong: it renders du^n as
       | du^*, and then nu^n-1 as nw^*-1.
       | 
       | It is impressive but...it really feels like those are the details
       | that really really matter.
        
         | nicodjimenez wrote:
         | Second page is even worse. It ends in repeated \cdots and doesn't
         | finish parsing the page. Also, it read the number 73 as 3, I
         | guess because the previous section number was 2.
        
       | gremlinunderway wrote:
       | This is great, but when are academia, business and government
       | going to finally get off PDF as a typical standard? It's awful,
       | not adaptive for mobile, and a pain in the ass to work with for
       | any kind of development.
        
         | adr1an wrote:
         | We need more DjVu!
        
         | froh wrote:
         | pdf looks the same everywhere and is self-contained. an
         | "immutable" document which looks the same for everyone if it
         | hashes to the same sha... key. which has a value on its own.
        
           | harshreality wrote:
           | There's nothing immutable about pdfs. If you have an
           | "original" document, it'll always hash to whatever it hashes
           | to. I fail to see the point. You can cite md5 hashes on LG
           | the same whether they're pdfs or epubs or, heaven forbid,
           | azw3 (amazon's proprietary epub-like format).
           | 
           | What's the obsession with "looking the same everywhere"?
           | 
           | Page references: this shouldn't be a thing. Academia has
           | already solved this problem for notable texts. Rather than
           | nearly uncountable numbers of paragraphs that all run
           | together, paragraphs or short sections or lines are numbered.
           | See any good edition of Plato or Aristotle, or just about any
           | notable play or longer poem ever translated. Relying on a
           | single published layout of a work to reference is dumb.
           | 
           | Citing exact line numbers isn't even necessary for native-
           | language works. When they're digital, search works. It works
           | even better in flowed-format texts than it does in pdfs,
           | which sometimes, depending on how the pdf was constructed,
           | won't match text properly across newlines.
           | 
           | Visual quality: As long as images--data, charts, graphs,
           | photographs--are not degraded beyond usefulness, the actual
           | text, and its display, is up to the reader application.
           | Everyone uses the web complete with mathjax, and those
           | don't have Knuth-approved formatting in every respect. But
           | they're good enough, and they work _everywhere_ on _every_
           | device without squinting or pinch to zoom. There are some
           | people who insist on putting pre-rendered images of math in
           | html, and they always look worse, because they don't match
           | the text without a lot of work to have extra high-res images
           | that are auto-scaled according to viewport and surrounding
           | font size--work that I bet not many people have ever done in
           | the history of html publishing.
        
             | froh wrote:
             | that's all missing the point.
             | 
             | mhtml would somewhat fit part of the bill of what PDF
             | offers: a single downloadable "file" you can archive or
             | forward and you know: the recipient will see exactly what
             | you saw.
             | 
             | however the mhtml doesn't look the same, depending on the
             | device. and looking exactly the same helps a great deal in
             | convincing a judge that we all talk about the same thing.
             | 
             | get me right.
             | 
             | I hate PDF with all the passion of my heart. epub (similar to
             | mhtml) imho is a much better format for many intents and
             | purposes and it allows reflowing the contents depending on
             | the device.
             | 
             | but the claim was "PDF is useless and shall go" and that's
             | cutting it too short.
        
             | etrautmann wrote:
             | how does line number citation work for responsive text?
        
               | harshreality wrote:
               | You put line numbers in the margin (with css styling), or
               | I've also seen it as [#] inline in the text, possibly
               | styled differently to make it more intuitive that it's
               | not part of the source text.
               | 
               | For the vast majority of works that are untranslated,
               | that isn't necessary, because, as mentioned, search works
               | fine, and it's faster, too. For translated works, the
               | concept of one published source of truth for page numbers
               | is already broken, so you need some alternative to page
               | numbers anyway.
        
             | esafak wrote:
             | Most academic PDFs are typeset and consequently look better
             | than typical web sites. There are notable exceptions, such
             | as distill.pub.
        
         | vosper wrote:
         | What's a good alternative, for users and developers?
         | 
         | I don't have any love for PDF, but I'm actually not sure what's
         | more cross-platform. Any browser will render PDF, so everyone
         | already has a viewer on their computer. A browser will also
         | print any document to PDF, and many other editors can export to
         | PDF (though perhaps not import for editing)
         | 
         | It can't be replaced by an Office format, like docx, because
         | even today apps like Pages can't render MS Office docs
         | correctly half the time.
         | 
         | Doesn't seem like HTML would fly, either, given all the kinds
         | of things that get embedded into PDF.
        
           | harshreality wrote:
           | HTML and various javascript libraries like mathjax or other
           | libraries for charts and graphs.
           | 
           | > Doesn't seem like HTML would fly, either, given all the
           | kinds of things that get embedded into PDF.
           | 
           | That's ironic. Browser PDF readers, at least open source
           | ones, render PDFs as HTML using javascript. At least I'm sure
           | about FF because I just checked that text from a native-
           | digital pdf showed up in the DOM in developer tools.
        
       ___________________________________________________________________
       (page generated 2023-08-31 23:01 UTC)