[HN Gopher] Show HN: Paper to HTML Converter
       ___________________________________________________________________
        
       Show HN: Paper to HTML Converter
        
       Author : codeviking
       Score  : 48 points
       Date   : 2021-09-15 19:01 UTC (3 hours ago)
        
 (HTM) web link (papertohtml.org)
 (TXT) w3m dump (papertohtml.org)
        
       | p4bl0 wrote:
       | I tried that a few days ago with one of my papers (a PDF
       | generated using pdflatex) and it didn't work that well: the text
       | was fine but some section titles were off, and all of the math
       | and code parts were broken.
       | 
       | But clearly it is a nice idea and I can't wait for such tools to
       | work better!
        
         | codeviking wrote:
         | > all of the math and code parts were broken.
         | 
         | Yup, this is a known issue that we're working towards fixing.
         | 
         | > But clearly it is a nice idea and I can't wait for such
         | tools to work better!
         | 
         | Glad to hear it!
        
       | jimmySixDOF wrote:
       | I am so amazed at the work you guys are doing at AI2 & the
       | Semantic Scholar project. You guys are really fixing a broken
       | system of research and discovery which suffers from organization
       | design principles based on university library index card filing
       | cabinets as magnified by the exponential content growth.
       | 
       | Can't wait to see what people do with this...
        
         | codeviking wrote:
         | Thanks!
         | 
         | There are a lot of amazing people here, doing really great work.
         | It's a really inspiring place to be. I feel really lucky to
         | work with such great people on interesting, important problems.
         | 
         | Also, I should mention...we're hiring!
         | 
         | https://allenai.org/careers#current-openings
        
       | codeviking wrote:
       | Hi all,
       | 
       | I'm one of the engineers at AI2 who helped make this happen.
       | We're excited about this for several reasons, which I'll explain
       | below.
       | 
       | Most academic papers are currently inaccessible. This means, for
       | instance, that researchers who are vision impaired can't access
       | that research. Not only is this unfair, but it probably prevents
       | breakthroughs from happening by limiting opportunities for
       | collaboration.
       | 
       | We think this is partly because the PDF format isn't easy to work
       | with, and is therefore hard to make accessible. HTML, on the
       | other hand, has benefited from years of open contributions.
       | There are a lot of accessibility affordances, and they're well
       | documented and easy to add. In fact, our long-term hope is to use
       | ML to make papers more accessible without (much) effort on the
       | author's part.
       | 
       | We're also excited about distributing papers in their HTML form
       | as we think it'll allow us to greatly improve the UX of reading
       | papers. We think papers should be easy to read regardless of the
       | device you're on, and we want to provide interactive, ML-powered
       | enhancements to the reading experience like those provided via
       | the Semantic Reader.
       | 
       | We're eager to hear what you think, and happy to answer
       | questions.
        
         | isaacimagine wrote:
         | Looks great! Have you considered linking this up to something
         | like arxiv or other preprint sites?
        
           | codeviking wrote:
           | Yup, we're definitely thinking about this.
           | 
           | Our focus right now is on providing a tool folks can run on
           | whatever papers they have access to. For instance, some
           | researchers might have access to documents that aren't
           | available to the public. We want them to be able to run this
           | against those.
           | 
           | That said, as we expand the effort I imagine we'll eventually
           | pre-convert things that are publicly available, like those on
           | arXiv, etc.
        
         | politelemon wrote:
         | I've never actually questioned the why, so maybe you could
         | shine some light... why are they usually published as PDFs?
        
           | kartoshechka wrote:
           | Unfortunately for my mental health, my thesis was exactly
           | about converting arXiv papers to modern-looking HTML, and
           | there are so many more broken, unjust and ugly things in
           | academia than using PDFs...
           | 
           | Regarding your question, I'd say that it is a natural
           | continuation of the centuries-long tradition of writing on
           | actual paper. The invention of TeX made it easier to produce
           | more papers; then came PDF, and you could produce virtual
           | papers. Also, science journals pretty much have a monopoly on
           | scientific knowledge distribution, and they are mostly paper
           | too.
        
           | DoreenMichele wrote:
           | I have no idea at all but as a wild guess, I would assume
           | it's because you can't edit PDFs. So you know it says the
           | same thing forever and no one went and changed it in response
           | to reading criticism of their paper or something.
        
           | codeviking wrote:
           | Y'know, that's a good question. I'm not sure I know the
           | answer.
           | 
           | My guess is it's largely for historical reasons. At the time
           | most venues were organized, PDF was probably the best (or
           | only) mechanism for sharing documents for print distribution.
           | 
           | But we think it's time to change that :).
        
           | ephbit wrote:
           | I always assumed the main reason for using PDFs is that an
           | author/distributor can be pretty sure that they're rendered
           | almost exactly the same (fonts, layout) no matter which
           | viewer they're viewed with.
           | 
           | This probably evokes some sense of authenticity. Like a
           | physical paper document, it has exactly one appearance.
        
           | temp8964 wrote:
           | What alternative do you have? Word file?
           | 
             | PDF is the only widely supported format that can guarantee
             | accurate reprints.
        
             | miohtama wrote:
             | Are papers printed anymore?
             | 
             | HTML for text.
             | 
             | SVGs for diagrams.
             | 
             | Equations can be exported as images if needed.
        
         | kahon65 wrote:
         | Do you remove the pdf files we send to your servers?
         | 
         | Edit: per https://allenai.org/terms point 5, you own all the
         | uploads! So if by mistake we send a medical PDF, for example,
         | or something else that is under GDPR, we can't ask you to
         | delete it? WTF
        
       | nanis wrote:
       | This seems to be pdf2tohtml combined with GROBID[1].
       | 
       | It seems to me the masheen learningz technikz boil down to a
       | generalization of my lightbulb moment here[2].
       | 
       | [1]: https://grobid.readthedocs.io/en/latest/
       | 
       | [2]: https://www.nu42.com/2014/09/scraping-pdf-documents-
       | without-...
        
         | codeviking wrote:
         | Yup, right now we use GROBID, do some post processing and
         | combine the output with other extraction techniques. For
         | instance, we use a model to extract document figures[1], so
         | that we can render them in the resulting HTML document.
         | 
         | Also, we're working hard on a new extraction mechanism that
         | should allow us to replace GROBID [2].
         | 
         | There are a lot of really smart people at AI2 working on this.
         | I'm excited to see the resulting improvements and the cool
         | things (like this) that we build with the results!
         | 
         | [1]: https://api.semanticscholar.org/CorpusID:4698432
         | 
         | [2]: https://api.semanticscholar.org/CorpusID:235265639
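
A minimal sketch of the "GROBID output plus post-processing" step
described above, not AI2's actual pipeline. It assumes GROBID's TEI XML
output (namespace http://www.tei-c.org/ns/1.0) with section headings in
<head> and paragraphs in <p>. SAMPLE_TEI is a hand-written, hypothetical
fragment and tei_to_html is an illustrative name; figures, math and
references are ignored.

```python
import html
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written fragment in the general shape of a TEI document.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>An Example Paper</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>PDFs are hard to parse &amp; reflow.</p></div>
      <div><head>Method</head><p>We extract text.</p><p>Then we render it.</p></div>
    </body>
  </text>
</TEI>"""


def tei_to_html(tei_xml: str) -> str:
    """Render the title, section headings and paragraphs of a TEI document as HTML."""
    root = ET.fromstring(tei_xml)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else "Untitled"
    parts = [f"<h1>{html.escape(title)}</h1>"]

    # Each top-level <div> in <body> is treated as one section.
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        for child in div:
            tag = child.tag.split("}")[-1]  # drop the "{namespace}" prefix
            text = html.escape("".join(child.itertext()).strip())
            if tag == "head":
                parts.append(f"<h2>{text}</h2>")
            elif tag == "p":
                parts.append(f"<p>{text}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(tei_to_html(SAMPLE_TEI))
```

Escaping the extracted text before emitting it keeps stray "<" or "&"
characters from a paper from breaking the generated page.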
        
       | kartoshechka wrote:
       | Looks like exactly the type of crunch work ML would do, but have
       | you considered using brute-force converters like LaTeXML or
       | pandoc where appropriate?
        
       | chrisMyzel wrote:
       | This is amazing! Will make my (offline-only) Kindle finally
       | display scientific papers. Took a random link from arXiv and it
       | worked like a charm, including TOC. Will this be open-sourced?
        
         | mintplant wrote:
         | See also KOReader [0], if jailbreaking is an option for you.
         | The built-in column splitter works pretty well for the papers
         | I've used it to read.
         | 
         | [0] https://github.com/koreader/koreader
        
         | chrisMyzel wrote:
         | (HTML->Mobi is totally possible)
        
         | kartoshechka wrote:
         | You may check out https://arxiv-vanity.com as well. It's open
         | source; conversion rates are close to 70% on a random arXiv
         | paper if I'm not mistaken, but it can hardly be called stable.
        
         | codeviking wrote:
         | Yay, glad to hear it! If you end up viewing one of these on
         | your Kindle, let us know how well (or not) things work.
         | 
         | We're not sure if it's something that we can distribute as OSS
         | just yet. It relies on a few internal libraries that would also
         | need to be publicly released, so it's not as simple as adjusting
         | a single repository's visibility.
        
       | oolonthegreat wrote:
       | Cool project, though the name was confusing for me: I believe
       | that to most people "paper" first means actual paper, so I
       | thought this was some kind of OCR system converting printed
       | material to HTML?
        
         | codeviking wrote:
         | Thanks for the feedback. There's two hard problems n' all
         | that... :)
        
       | gregsadetsky wrote:
       | Great site, congrats!
       | 
       | One comment is that the slowest page to load was the Gallery [0]
       | as it loads an ungodly amount of PNG files from what appears to
       | be a single IP (a GCP Compute instance?)
       | 
       | I see 421 requests and 150 MB loaded. As it seems to be mostly
       | thumbnails, have you considered using JPEGs instead of PNGs,
       | potentially using lazy loading (i.e. not loading images outside
       | of the viewport) and potentially using a CDN offering from GCP
       | (or another provider)?
       | 
       | Once I clicked a thumbnail, loading the article itself (for
       | example [1]) was quite breezy.
       | 
       | The gallery is a great showcase of what your site does -- I think
       | that it'd be worth making it snappier :-)
       | 
       | Cheers and congrats again
       | 
       | P.S. Also, the paper linked below [1] seems to have a few
       | conversion problems -- I see "EQUATION (1): Not extracted; please
       | refer to original document", and also some (formula? Greek?)
       | characters that seem out of place after the words "and the next
       | token is generated by sampling"
       | 
       | [0] https://papertohtml.org/gallery
       | 
       | [1]
       | https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...
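
A minimal sketch of the lazy-loading suggestion above, assuming the
gallery is server-rendered HTML. It uses the standard loading="lazy"
attribute on <img>, which asks the browser to defer fetching images
until they are near the viewport. gallery_html and the /thumbs/ paths
are hypothetical names for illustration, not the site's actual code.

```python
import html


def gallery_html(thumbnails):
    """Build <img> tags for (src, alt, width, height) tuples with lazy loading enabled."""
    items = []
    for src, alt, width, height in thumbnails:
        items.append(
            f'<img src="{html.escape(src, quote=True)}" '
            f'alt="{html.escape(alt, quote=True)}" '
            # Explicit dimensions let the browser reserve space before the image loads.
            f'width="{width}" height="{height}" '
            # Native lazy loading: off-screen images are not fetched up front.
            f'loading="lazy" decoding="async">'
        )
    return "\n".join(items)


if __name__ == "__main__":
    # Hypothetical thumbnail paths and titles.
    print(gallery_html([
        ("/thumbs/paper-a.jpg", 'Paper "A"', 200, 260),
        ("/thumbs/paper-b.jpg", "Paper B", 200, 260),
    ]))
```

Serving the thumbnails as JPEGs and putting them behind a CDN are
separate changes; this sketch only covers the markup side.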
        
         | codeviking wrote:
         | > One comment is that the slowest page to load was the Gallery
         | [0] as it loads an ungodly amount of PNG files from what
         | appears to be a single IP (a GCP Compute instance?)
         | 
         | Yup. There's no CDN or anything like that right now. We kept
         | things simple to get this out the door. But we definitely
         | intend to make improvements like this as we improve the tool.
         | 
         | The more adoption we see, the more it motivates these types of
         | fixes!
         | 
         | > P.S. Also, the paper linked below [1] seems to have a few
         | conversion problems -- I see "EQUATION (1): Not extracted;
         | please refer to original document", and also some (formula?
         | Greek?) characters that seem out of place after the words "and
         | the next token is generated by sampling"
         | 
         | Thanks for the catch. As you noted there's still a fair number
         | of extraction errors for us to correct!
        
           | mintplant wrote:
           | Another sample paper that caused some trouble with figure
           | extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf
           | 
           | Very cool project, looking forward to seeing how it develops!
        
             | codeviking wrote:
             | Thanks, I'll pass this example along!
        
       ___________________________________________________________________
       (page generated 2021-09-15 23:00 UTC)