[HN Gopher] Nougat: Neural Optical Understanding for Academic Do...
___________________________________________________________________
Nougat: Neural Optical Understanding for Academic Documents
Author : falkaer
Score : 24 points
Date : 2023-08-31 19:07 UTC (3 hours ago)
(HTM) web link (facebookresearch.github.io)
(TXT) w3m dump (facebookresearch.github.io)
| stevenae wrote:
| Missed opportunity to call it "...Texts" instead of "Documents".
| czbond wrote:
| pretty impressive
| echo_time wrote:
| Funnily enough, Example Page 1 is wrong: it renders du^n as
| du^*, and then nu^n-1 as nw^*-1.
|
| It is impressive but...it really feels like those are the details
| that really really matter.
| nicodjimenez wrote:
| The second page is even worse. It ends in repeated \cdots and doesn't
| finish parsing the page. Also, it read the number 73 as 3, I guess
| because the previous section number was 2.
| gremlinunderway wrote:
| This is great, but when are academia, business and government
| going to finally get off PDF as a typical standard? It's awful,
| not adaptive for mobile, and a pain in the ass to work with for
| any kind of development.
| adr1an wrote:
| We need more DjVu!
| froh wrote:
| pdf looks the same everywhere and is self-contained: an
| "immutable" document which looks the same for everyone if it
| hashes to the same sha... key, which has value on its own.
| harshreality wrote:
| There's nothing immutable about pdfs. If you have an
| "original" document, it'll always hash to whatever it hashes
| to. I fail to see the point. You can cite md5 hashes on LG
| the same whether they're pdfs or epubs or, heaven forbid,
| azw3 (amazon's proprietary epub-like format).
|
| What's the obsession with "looking the same everywhere"?
|
| Page references: this shouldn't be a thing. Academia has
| already solved this problem for notable texts. Rather than
| nearly uncountable numbers of paragraphs that all run
| together, paragraphs or short sections or lines are numbered.
| See any good edition of Plato or Aristotle, or just about any
| notable play or longer poem ever translated. Relying on a
| single published layout of a work to reference is dumb.
|
| Citing exact line numbers isn't even necessary for native-
| language works. When they're digital, search works. It works
| even better in flowed-format texts than it does in pdfs,
| which sometimes, depending on how the pdf was constructed,
| won't match text properly across newlines.
|
| Visual quality: As long as images--data, charts, graphs,
| photographs--are not degraded beyond usefulness, the actual
| text, and its display, is up to the reader application.
| Everyone uses the web complete with mathjax, and those pages
| don't have Knuth-approved formatting in every respect. But
| they're good enough, and they work _everywhere_ on _every_
| device without squinting or pinch to zoom. There are some
| people who insist on putting pre-rendered images of math in
| html, and they always look worse, because they don't match
| the text without a lot of work to have extra high-res images
| that are auto-scaled according to viewport and surrounding
| font size--work that I bet not many people have ever done in
| the history of html publishing.
| froh wrote:
| that's all missing the point.
|
| mhtml would somewhat fit part of the bill of what PDF
| offers: a single downloadable "file" you can archive or
| forward and you know: the recipient will see exactly what
| you saw.
|
| however, the mhtml doesn't look the same, depending on the
| device. and looking exactly the same helps a great deal in
| convincing a judge that we all talk about the same thing.
|
| get me right.
|
| I hate PDF with all the passion of my heart. epub (similar to
| mhtml) imho is a much better format for many intents and
| purposes, and it allows the contents to reflow depending on
| the device.
|
| but the claim was "PDF is useless and shall go" and that's
| cutting it too short.
| etrautmann wrote:
| how does line number citation work for responsive text?
| harshreality wrote:
| You put line numbers in the margin (with css styling), or
| I've also seen it as [#] inline in the text, possibly
| styled differently to make it more intuitive that it's
| not part of the source text.
|
| For the vast majority of works that are untranslated,
| that isn't necessary, because, as mentioned, search works
| fine, and it's faster, too. For translated works, the
| concept of one published source of truth for page numbers
| is already broken, so you need some alternative to page
| numbers anyway.
| esafak wrote:
| Most academic PDFs are typeset and consequently look better
| than typical web sites. There are notable exceptions, such
| as distill.pub.
| vosper wrote:
| What's a good alternative, for users and developers?
|
| I don't have any love for PDF, but I'm actually not sure what's
| more cross-platform. Any browser will render PDF, so everyone
| already has a viewer on their computer. A browser will also
| print any document to PDF, and many other editors can export to
| PDF (though perhaps not import for editing)
|
| It can't be replaced by an Office format, like docx, because
| even today apps like Pages can't render MS Office docs
| correctly half the time.
|
| Doesn't seem like HTML would fly, either, given all the kinds
| of things that get embedded into PDF.
| harshreality wrote:
| HTML and various javascript libraries like mathjax or other
| libraries for charts and graphs.
|
| > Doesn't seem like HTML would fly, either, given all the
| kinds of things that get embedded into PDF.
|
| That's ironic. Browser PDF readers, at least open source
| ones, render PDFs as HTML using javascript. At least I'm sure
| about FF because I just checked that text from a native-
| digital pdf showed up in the DOM in developer tools.
___________________________________________________________________
(page generated 2023-08-31 23:01 UTC)