[HN Gopher] ArXiv now offers papers in HTML format
       ___________________________________________________________________
        
       ArXiv now offers papers in HTML format
        
       Author : programd
       Score  : 1089 points
       Date   : 2023-12-21 18:34 UTC (1 days ago)
        
 (HTM) web link (blog.arxiv.org)
 (TXT) w3m dump (blog.arxiv.org)
        
       | shrimpx wrote:
       | Since the article doesn't link to any example HTML article,
       | here's a random link:
       | 
       | https://browse.arxiv.org/html/2312.12451v1
       | 
       | It's cool that it has a dark mode. Didn't see a toggle but
       | renders in the system mode.
       | 
       | Overall will make arXiv a lot more accessible on mobile.
        
         | burkaman wrote:
         | And here's the PDF of the same paper for comparison:
         | https://arxiv.org/pdf/2312.12451.pdf
        
           | FredPret wrote:
           | The contrast is massive. I'm much more likely to read the
           | html version; that PDF is deeply off-putting in some hard to
           | define way. Maybe it's the two columns, or the font, or the
           | fact that the format doesn't adjust to fit different screen
           | sizes.
        
             | ForkMeOnTinder wrote:
             | Definitely the two columns for me. It's super annoying
             | skimming a paper and having to scroll down and back up
             | again in a zig-zag pattern.
        
               | mmis1000 wrote:
               | I think the consuming device matters. An iPad or computer
               | has a much wider screen. A one-column layout is too wide
               | there for the average person to scan lines of text
               | quickly.
               | 
               | It looks perfectly fine on a phone, though. A two-column
               | layout looks terrible on a smartphone; the text is too
               | tiny to read comfortably.
               | 
               | It would probably be even better if you could flip pages
               | left and right like an ebook instead of scrolling, to
               | locate content faster. But the current design is good
               | enough IMO (compared to reading a PDF on a cellphone).
        
               | kjkjadksj wrote:
               | Just zoom the smartphone into one column. Problem solved.
        
               | mmis1000 wrote:
               | And then you will have to scroll both up-down and
               | left-right, an even worse experience.
        
               | GoblinSlayer wrote:
               | It's about "One column layout is too wide" - if you zoom,
               | it's not too wide anymore, also smartphones have narrow
               | screen, not wide, and tablets can do that too afaik.
        
               | GoblinSlayer wrote:
               | To display two column layout you need a tall screen, not
               | wide. If you display two column layout on a short wide
               | screen, you have to scroll it up and down in zigzag
               | pattern to read one page.
        
             | tobias2014 wrote:
             | This is very interesting, because for me it's just the
             | opposite. In particular the two column layout is just more
             | readable and approachable for me. The PDF version also
             | allows for a presentation just as the authors intended. I
             | guess it's good that they offer both now.
        
               | kjkjadksj wrote:
               | The authors don't format the pdf, the editor does.
               | Authors probably sent a double spaced word document with
               | figures and tables on another file.
        
               | tonyg wrote:
               | In computer science, the usual case is that the author
               | fully formats the paper.
        
               | z2h-a6n wrote:
               | Not on arXiv (unless I'm much mistaken), which is a
               | preprint server, not a conventional journal.
               | 
               | arXiv accepts various flavors of TeX, or PDFs not
               | produced by TeX [0], and automatically produces PDFs and
               | HTML where possible (e.g. if TeX is submitted). In the
               | case of the example paper under discussion, the authors
               | submitted TeX with PDF figures [1], and the PDF version
               | of the paper was produced by arXiv. The formatting was
               | mainly set by using REVTeX, which is a set of macros for
               | LaTeX intended for American Physical Society journals.
               | 
               | [0] https://info.arxiv.org/help/submit/index.html#formats-for-te...
               | [1] https://arxiv.org/format/2312.12451
        
               | smartmic wrote:
               | FWIW, I recently learned that it is also possible to
               | produce nice PDF papers with GNU roff (groff); have a
               | look at this example:
               | https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....
        
               | pimlottc wrote:
               | Looks nice but seems strange to switch from two columns
               | to one column after the first page? Although maybe
               | they're just trying to demonstrate its capabilities.
        
               | macintux wrote:
               | W. Richard Stevens (RIP, still hurts) famously used troff
               | for his books.
        
               | frocmlol wrote:
               | You are very confidently wrong.
               | 
               | In the arxiv you use latex and do everything yourself.
               | There is no editor.
        
               | cozzyd wrote:
               | You typically send a .tar.gz of tex files (and figures,
               | .bbl, etc.) to the journal. And then you typically upload
               | something very similar to the arxiv (I have an arxivify
               | Makefile target for my papers that handles some arxiv
               | idiosyncrasies like requiring all figures to be in the
               | same folder as the .tex file, and it also clears all the
               | comments; sometimes you can find amusing things in source
               | file comments for some papers).
               | 
               | Some fields may use Word files, but in most of physics
               | you would get laughed at...
               | 
               | It is true that most journals will typically reformat
               | your .tex in a different way than is displayed on the
               | arXiv.
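
A minimal sketch of the comment-clearing step mentioned above, written in Python rather than as a Makefile target. The function name strip_tex_comments is hypothetical, and the sketch assumes the source has no verbatim-style environments and no "\\%" sequences, both of which this simple pattern would mishandle.

```python
import re

# An unescaped % starts a TeX comment that runs to the end of the line.
# Assumption: "\%" is the only escape we need to respect.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_tex_comments(source: str) -> str:
    """Return TeX source with comment text removed.

    Lines that consist only of a comment are dropped. For a trailing
    comment the % itself is kept, because in TeX it also swallows the
    line break and removing it could introduce an unwanted space.
    """
    cleaned = []
    for line in source.splitlines():
        if line.lstrip().startswith("%"):
            continue  # whole-line comment: drop the line
        cleaned.append(_COMMENT.sub("%", line))
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "a % secret note\n% TODO: fix this\n50\\% done"
    print(strip_tex_comments(sample))
```

Keeping the bare % is a deliberate choice here: it leaves the typeset output unchanged while still removing the comment text.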
        
               | eigenket wrote:
               | You are completely wrong. ArXiv doesn't work like that.
        
               | aragilar wrote:
               | Not only is this wrong about physics/astronomy, I
               | regularly use the arxiv version because the typography is
               | better (e.g. in the published paper an equation is split
               | with part of the equation being at the bottom of one
               | column, and the top of the next, whereas the equation is
               | on one line in the arxiv version).
        
               | JumpCrisscross wrote:
               | Do you work extensively with LaTeX?
               | 
               | Two columns is good, albeit annoying on mobile. But the
               | font. The typeface kills me, and almost every LaTeX-
               | generated document sports it.
        
               | saurik wrote:
               | Hilariously, I would probably tolerate the HTML version a
               | lot better if it had the font from the PDF (and FWIW, the
               | answer for me is "no: I don't work with LaTeX at all... I
               | just read a lot of papers").
        
               | folmar wrote:
               | If you disable the font rule
               | 
               |     :root, [data-theme=light] {
               |         /* --text-font-family: "freight-sans-pro"; */
               |     }
               | 
               | it switches to "Noto Serif" that is way easier on the
               | eyes.
        
               | GoblinSlayer wrote:
               | I hard override the font in browser, designers never get
               | it right.
        
               | borg16 wrote:
               | what is your font of choice?
        
               | GoblinSlayer wrote:
               | Verdana
        
               | westurner wrote:
               | https://github.com/neilpanchal/spinzero-jupyter-theme
               | /fonts/{cmu-text,cmu-mono} :
               | 
               | > _"Computer Modern" is used for body text to give it a
               | professional/academic look_
        
               | cozzyd wrote:
               | Hating on Computer Modern (ok, probably now Latin Modern)
               | is something close to blasphemy.
        
               | hollerith wrote:
               | I hate Computer Modern, and I'm not even particularly
               | fussy about typefaces.
        
               | kibwen wrote:
               | Computer Modern was not designed for easy viewing on
               | screens (think about the screens Knuth would have been
               | using in 1977), it was designed for printing in books.
        
               | isaacfung wrote:
               | What device and app are you using to read the document?
        
             | kjkjadksj wrote:
             | If you read a lot of papers in your line of work you will
             | quickly appreciate the two columns and justification.
        
               | FredPret wrote:
               | Admittedly, I don't read research papers. But with HTML,
               | surely the choice between one or two columns is a
               | checkbox away.
        
               | IlliOnato wrote:
               | Which checkbox?
               | 
               | I cannot find anything relevant in any of the 3 browsers
               | I use (Vivaldi, Firefox, Chrome). Would really
               | appreciate this option.
               | 
               | A quick search gave some apparently unmaintained browser
               | extensions, and that's it.
        
               | FredPret wrote:
               | No, I'm saying there _should_ be a checkbox. That way,
               | you can switch between two columns formatted like LaTeX
               | and that font they always use, and one column with
               | Helvetica / Arial.
        
               | IlliOnato wrote:
               | It would be nice, but I am not holding my breath.
        
               | jabroni_salad wrote:
               | Only problem is jagoffs like me who need the text to be
               | bigger. On PDFs you now get to experience a horizontal
               | scrollbar. HTML has text reflow and I can set the line
               | length by resizing the window. I'm willing to make a lot
               | of sacrifices for that experience.
        
             | z2h-a6n wrote:
             | For what it's worth, two column layouts are very common in
             | the physical sciences, or at least in physics which I'm
             | more familiar with. I have a feeling that the reason is at
             | least partly to save page space when using displayed math
             | (e.g. equations that are formatted in a break between
             | blocks of text), which use the full text width (i.e. the
             | width of one column) to display what may be much less than
             | half a page wide.
        
               | FredPret wrote:
               | It makes sense - for paper. But pixels are infinite -
               | HTML is far better for screen display, which is how
               | people read things nowadays.
               | 
               | The extra column next to the one I'm reading introduces a
               | lot of visual noise, and the content is hard enough as it
               | is. I'm sure physicists have all gotten used to it, but
               | it certainly trips me up.
        
               | nyssos wrote:
               | > The extra column next to the one I'm reading introduces
               | a lot of visual noise
               | 
               | Papers are generally not read start to finish in one go:
               | there's lots of rereading and jumping back and forth
               | between key parts, and anything that moves them further
               | apart makes this harder.
        
               | FredPret wrote:
               | Ah, that makes more sense. I imagined scientists just
               | reading the whole thing start-to-finish.
               | 
               | I still think a flexible layout is best. If you like
               | multi-columns and have a wide screen, why not display 12
               | columns next to each other?
               | 
               | With PDF this is not possible. With HTML the content can
               | in principle be sliced and diced how you like it.
        
               | fuck_google wrote:
               | One can also view PDF pages side by side, which works
               | pretty well with a 4K monitor.
        
               | arp242 wrote:
               | I need to scroll up and down a lot more with two-column
               | layout because a single page doesn't fit on my screen in
               | my chosen font size (which is fairly large).
               | 
               | But HTML is so much more flexible, and ideally people can
               | choose how they want it, although at this point it seems
               | that's not (yet) implemented.
               | 
               | I find jumping back and forth is always a pain on
               | computer screens and ebooks by the way, and is the major
               | reason I much prefer print for this type of thing.
        
               | aragilar wrote:
               | Two column is the default in astronomy also.
        
             | mastazi wrote:
             | I wonder if perhaps it's a generational thing, I prefer the
             | PDF because it reminds me of printed paper, which is what I
             | used growing up.
             | 
             | (For reference: I am at the end of Gen X, people 3-4 years
             | younger than me are considered Millennials).
        
             | Blikkentrekker wrote:
             | Quite so. The font annoys me. This is one of the reasons I
             | hate PDF and why I believe these things should be
             | controlled by the person reading it, not the publisher.
             | 
             | I do not much care what font the author finds pleasant to
             | read, but what I find pleasant to read, and this font isn't
             | it, and neither are the colors.
        
             | wruza wrote:
             | Seconded. I can (will) actually just read referenced papers
             | now instead of hesitating to either get a headache or stay
             | uninformed.
             | 
             | Defaults and UX rule the world. It's unfortunate that $subj
             | wasn't a thing for so long and probably scared millions of
             | curious minds from material. It is _so_ important.
        
           | znpy wrote:
           | I prefer the pdf version, mostly. I can annotate it on the
           | side both in print and digitally with my iPad. I can also
           | invert colors in pdf readers to get some kind of "dark mode"
           | easily.
           | 
           | The html version is wasting a lot of space on the right side
           | and the color scheme is awful (dark grey on a brown
           | background, seriously? How is that any better? Edit:
           | disabling dark mode yields a better reading experience wrt
           | color scheme). Also, somehow links to references make another
           | http request and have no backlink?
           | 
           | The html version could make sense if it had more dynamic
           | functionalities: change fonts/line spacing, toggle color
           | schemes, maybe a mini map or some other navigational tool?
           | Also, some kind of support for highlighting and/or
           | annotating?
        
       | alephnerd wrote:
       | This is a great UX addition. Why did it take them so long?
        
         | gwern wrote:
         | The conversion is still very error-prone. It can't convert a
         | lot of packages, and the last paper I read, StarVector, half
         | the HTML version is just missing. (I think it hit an error at a
         | figure of some sort.) I reported an error, but I've been
         | reporting errors against the ar5iv and abstracts for years now
         | and the long tail of problems just seems like an incredible
         | slog.
        
           | KRAKRISMOTT wrote:
           | Where are the computer vision people? This is the perfect
           | type of problem for multi modal LLMs
        
             | IlliOnato wrote:
             | Except that the errors made by an LLM might be harder to
             | spot than converter errors that typically are very blatant,
             | and don't usually alter text (perhaps just drop parts of
             | it).
             | 
             | Also, a bug in a converter is conceptually much easier to
             | fix than to re-train your LLM.
             | 
             | I am not sure that AI in its current state is useful when
             | "high fidelity" is required.
        
           | dginev wrote:
           | Can confirm. From an ar5iv standpoint, 2.56% of articles
           | currently fail to convert entirely, and 22.9% have errors
           | known to the converter. That leaves 74.5% of nominally
           | usable articles. This success rate is noticeably _lower_ for
           | the newest batches of arXiv submissions, as the converter
           | hasn't caught up with the most recent package innovations.
           | 
           | We have a plan in place to meaningfully fall back for unknown
           | packages, but that will take at least another year to put in
           | place, and likely another couple of years to stabilize.
           | 
           | Meanwhile, there is some hope that with arXiv launching the
           | HTML Beta we will get more contributions for package support
           | (LaTeXML is an open source project, with public domain
           | licensing, everybody benefits).
           | 
           | But again the original point is spot on. Coverage will be
           | hit-or-miss for a while longer yet, for an arbitrary arXiv
           | submission. The good news is that authors _could_ work
           | towards better support for their articles, if they wanted to.
        
         | eviks wrote:
         | Because this is a rather conservative field with little
         | dependency on the general public, so without much interest in
         | helping disseminate the knowledge broadly & accessibly
         | (relative to other priorities, not absolute)
        
         | Strilanc wrote:
         | How would you do it quickly?
         | 
         | For example, HTML isn't divided into numbered pages while
         | PDFs are. A lot of latex interacts with page boundaries.
         | Figures tend towards the tops of pages. And there's \clearpage.
         | And the reference list might say which page each citation
         | appeared on. All that stuff needs someone to decide how to
         | handle it and then to implement that handling. Like... what
         | value does \pageheight return? Sometimes I resize things to fit
         | the page height, and if it was doubled then I should have
         | resized to fit the width instead.
        
         | lynndotpy wrote:
         | Almost universally, we prepare conference papers as LaTeX files
         | made to export to PDFs which fit within the conference's
         | template.
         | 
         | It's nontrivial to export this to HTML in all cases, and even
         | then, nobody is asking for HTML from us even though we all want
         | it. I'm guessing Arxiv is using some kind of converter which
         | _usually_ but not _always_ works.
         | 
         | That said, this is a long time coming and PDF as the standard
         | should've died a decade ago. I wish I had this when I was in my
         | PhD program.
        
         | alright2565 wrote:
         | Latex is a very complicated programming language for creating
         | documents. It is not easy to create a new backend for it.
         | 
         | As a glimpse into the very tip of the iceberg, this diagram
         | https://tex.stackexchange.com/a/158740/ is generated with 100%
         | LaTeX code.
        
       | binarymax wrote:
       | Nice! Now I don't need to manually replace arxiv with ar5iv.
       | Congrats to the team.
        
         | imjonse wrote:
         | "Our ultimate goal is to backfill arXiv's entire corpus so that
         | every paper will have an HTML version, but for now this feature
         | is reserved for new papers."
         | 
         | For now it only works for papers submitted this month. But it's
         | great to have this feature, makes it so much easier to read on
         | phones.
        
       | eviks wrote:
       | Finally a modern format you can copy&paste from and read on one
       | of the most popular computing platforms!!!
        
       | shusaku wrote:
       | Seems like the references aren't working very well.
       | 
       | I really want journals to have two way links in a paper. I get
       | google scholar alerts about certain papers being cited, and I
       | want to skip to "why did they cite this? Did they use it, improve
       | it, or just mention it?"
        
         | r3trohack3r wrote:
         | I'd never considered setting up citation alerts like this.
         | 
         | Thank you for the idea!
        
         | shrimpx wrote:
         | Looks like clicking a reference adds the hash to the URL but
         | doesn't scroll to the reference. If you load the hash URL
         | directly in the browser you get a 404 page...
        
           | burkaman wrote:
           | https://browse.arxiv.org/html/2312.12451v1#bib.bib1 works,
           | but https://browse.arxiv.org/html/2312.12451v1/#bib.bib1
           | doesn't.
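
A minimal sketch of why the two URLs above can behave differently, using Python's standard urllib.parse. The fragment after # stays in the browser and is not sent in the request, so the server only sees the path, and the trailing slash makes that a different path. The helper name server_view is hypothetical; how arXiv routes each path is not shown here.

```python
from urllib.parse import urlsplit


def server_view(url: str) -> tuple[str, str]:
    """Split a URL into (path the server sees, fragment kept by the browser)."""
    parts = urlsplit(url)
    return parts.path, parts.fragment


if __name__ == "__main__":
    working = "https://browse.arxiv.org/html/2312.12451v1#bib.bib1"
    failing = "https://browse.arxiv.org/html/2312.12451v1/#bib.bib1"
    for url in (working, failing):
        path, fragment = server_view(url)
        print(f"path={path!r} fragment={fragment!r}")
```

Both URLs carry the same fragment; only the path differs, which is consistent with the problem being on the generator or routing side rather than in the anchor itself.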
        
             | IlliOnato wrote:
             | Yeah, it seems like a bug in HTML generator...
        
               | cbf66 wrote:
               | It is a bug. Will be fixed soon.
        
       | pushfoo wrote:
       | Previously discussed:
       | https://news.ycombinator.com/item?id=38713215
        
       | carlosjobim wrote:
       | With the 2024 browser update, this means I can read these
       | articles on my ancient Kindle perfectly fine.
        
       | ChrisArchitect wrote:
       | [dupe] from yesterday
       | 
       | More here: https://news.ycombinator.com/item?id=38713215
        
       | winwang wrote:
       | Probably more accessible in general. (PDF) Papers are
       | psychologically scary.
        
         | mmis1000 wrote:
         | PDF is by design an image format that can also embed text. It
         | just doesn't have the primitives to properly retain the article
         | structure.
        
           | PaulHoule wrote:
           | Nah, it's a super-complex system that creates a graph of
           | components, can draw vectors like PostScript, can embed 3-d
           | models, etc. The spec is here
           | 
           | https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
           | 
           | if you look at sections 14.6 through 14.10 you will find
           | quite baroque facilities for representing the structure of
           | documents in great detail, making documents with
           | accessibility data, making documents that can reflow with
           | HTML, etc. Not to mention the 14.11 stuff which addresses
           | problems with high end printing (say you want to make litho
           | plates for a book.)
           | 
           | For that matter sections 14.4 and 14.5 describe facilities
           | that can be used to add additional private data to PDF files
           | for particular applications. For instance Adobe Illustrator's
           | files are PDF files with some extra private data, as are
           | GeoPDF files: https://en.wikipedia.org/wiki/GeoPDF
           | 
           | I like to complain that PDF has no facility to draw a circle
           | but instead makes you approximate a circle with (accursed)
           | Bezier curves but other than that the main complaint people
           | make about PDF is that it is too complicated not that it is
           | lacking this feature or that feature.
           | 
           | Contrast that to a highly opinionated document format like
           | DjVu
           | 
           | https://en.wikipedia.org/wiki/DjVu
           | 
           | which came out around the same time as PDF and is specialized
           | for the problem of scanned documents and works by decomposing
           | the document into three layers, one of which is a bilevel
           | layer intended to represent text. All three layers have
           | specialized coding schemes, the text layer in particular
           | tries to identify that every copy of (say) the letter "e" or
           | the character "Han " is the same and reuses the same bitmap
           | for them.
        
             | anonimo37 wrote:
             | You would normally use a library to create the PDF so you
             | don't need to deal with the complexity of the format. A
             | library would likely provide a function for drawing circles
             | that translates the circle into Bezier curves.
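
A minimal sketch of the kind of helper such a library might provide: it approximates a circle with four cubic Bezier segments, one per quadrant, using the standard control-point offset 4/3 * (sqrt(2) - 1) times the radius. The function names are hypothetical and not taken from any particular PDF library.

```python
import math

# Control-point distance for a quarter-circle arc on a unit circle.
KAPPA = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)


def circle_as_beziers(cx, cy, r):
    """Return four cubic Bezier segments (p0, p1, p2, p3) tracing a circle."""
    k = KAPPA * r
    return [
        ((cx + r, cy), (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ((cx, cy + r), (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ((cx - r, cy), (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ((cx, cy - r), (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
    ]


def bezier_point(segment, t):
    """Evaluate a cubic Bezier segment at parameter t in [0, 1]."""
    p0, p1, p2, p3 = segment
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    x = sum(w * p[0] for w, p in zip(weights, (p0, p1, p2, p3)))
    y = sum(w * p[1] for w, p in zip(weights, (p0, p1, p2, p3)))
    return x, y


def max_radial_error(cx, cy, r, samples=200):
    """Largest relative deviation of the sampled curve from the true radius."""
    worst = 0.0
    for segment in circle_as_beziers(cx, cy, r):
        for i in range(samples + 1):
            x, y = bezier_point(segment, i / samples)
            worst = max(worst, abs(math.hypot(x - cx, y - cy) - r) / r)
    return worst


if __name__ == "__main__":
    print(f"max relative error: {max_radial_error(0.0, 0.0, 1.0):.6f}")
```

The deviation from a true circle stays well under a tenth of a percent of the radius, which is why the approximation is rarely visible in practice.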
        
             | mmis1000 wrote:
             | Adobe can surely add whatever extensions it wants to
             | address whatever problem. But unfortunately, most
             | implementations outside of Adobe Acrobat itself won't
             | implement all of them. Most libraries would just implement
             | the basic parts for printing and marking (at best, they
             | support forms and JavaScript). That part is basically
             | nonexistent for anyone else.
             | 
             | There is a reason that most people still use docx for forms
             | even though PDF technically supports forms.
             | 
             | PS: the PDF readers in Firefox and Chrome didn't really
             | support forms until very recent versions.
        
       | ZeroCool2u wrote:
       | Wow, this is _so_ much better!
        
       | choppaface wrote:
       | Hope they benefit from CDN caching now too.
       | 
       | Edit: aaaand they got Fastly
       | https://news.ycombinator.com/item?id=38723373
        
       | cozzyd wrote:
       | doesn't work great with long author lists...
       | 
       | https://browse.arxiv.org/html/2312.12907v1
        
         | degenerate wrote:
         | The PDF is worse, so there is no simple answer to this:
         | https://arxiv.org/pdf/2312.12907v1.pdf
         | 
         | At least the HTML version pairs each author with their
         | affiliations, instead of the PDF which has all the names on
         | page 1, and all the affiliations on page 2. That's completely
         | unreadable.
        
           | cozzyd wrote:
           | The PDF is better because I'm trained to scroll past the
           | author list. That takes forever on the html version.
        
             | mattigames wrote:
             | You can click the "Introduction" anchor on the left side
             | and it scrolls for you past the author list
        
               | cozzyd wrote:
               | well it skips the abstract too, but yes, you can scroll
               | back up to see it.
        
               | mattigames wrote:
               | Yeah, it's a bit weird that the abstract doesn't have a
               | link on the left
        
               | cozzyd wrote:
               | Probably because \abstract{ } is treated differently than
               | \section{ }, I guess...
        
           | IlliOnato wrote:
           | For me the PDF is much better. It's compact and clean, if I
           | really need to see an affiliation for a particular author,
           | it's really easy to do so in the PDF, not so in the HTML.
           | 
           | It's highly unlikely anybody will read an entire author list
           | this long; typically you would read the first two or three
           | names, or check if some particular name is on the list. So
           | the compactness of the list and being able to quickly get to
           | the article contents is important.
        
       | Al-Khwarizmi wrote:
       | Nice! It would be even better if they offered authors of previous
       | papers the option of converting to HTML, as the latex sources are
       | already in the system.
        
         | fprog wrote:
         | The article states they're going to backfill all, or nearly
         | all, previously submitted papers!
        
       | FredPret wrote:
       | This is brilliant. I don't share academia's love of LaTeX
       | multi-column PDFs.
        
         | tiagod wrote:
         | I like multi-column text on paper (literally), but it's awkward
         | in digital where you can just shape text on the fly to whatever
         | column size you want
        
           | golol wrote:
           | The problem is that gaining this responsiveness fundamentally
           | makes your task much more difficult. Instead of just creating
           | a picture you're now writing code that has to be maintained.
           | In my philosophy arxiv is for documents which are set in
           | granite - pictures.
        
       | leoncaet wrote:
       | I just hope they don't stop offering the papers in PDF. Even when
       | I'm on a computer, I still prefer to read PDFs.
        
         | creatonez wrote:
         | There is a taste component to it of course, but the history of
         | PDF shows that it's the wrong format for reading on a computer.
         | It was originally meant to be the end result of a publishing
         | process before printing, a layer that sits right between the
         | publishing software and the postscript that gets sent to the
         | printer. This makes the PDF format quite inflexible for reading
         | on a computer, with it being impossible to properly zoom or
         | adjust the reading experience.
         | 
         | Unfortunately many institutions and businesses have ignored its
         | limitations because PDF turned out to be an obvious-but-naive
         | way to put a 'sheets of paper' metaphor into a computer system, which
         | in the 1990s appealed to tech illiterate folks doing bare-bones
         | computerization of existing paper systems. So later we got
         | complicated and error-prone tools for editing PDFs, and many
         | random additions to the spec to allow for unusual use cases.
        
           | impendia wrote:
           | > This makes the PDF format quite inflexible for reading on a
           | computer, with it being impossible to properly zoom or adjust
           | the reading experience.
           | 
           | As an academic researcher, generally speaking I also prefer
           | PDF, and the inflexibility and static nature is a feature,
           | not a bug. I appreciate the fact that a paper will appear the
           | same everywhere, that I can refer to "the top of page 7",
           | etc.
           | 
           | The exception is if I wanted to just skim a paper; in this
           | case, I think I'd prefer HTML.
           | 
           | I'm a huge fan of what arXiv is doing here. It effectively
           | preserves the status quo, while adding an additional option
           | on the side. The HTML option might prove a little bit useful
           | for me, and it is likely to prove extremely useful for people
           | with disabilities.
        
         | aragilar wrote:
         | I know of no-one who provides only HTML to arxiv, it's either
         | latex, or doc/odt, so the PDFs should always be there.
        
       | sylware wrote:
       | Like the maths noscript/basic (x)html wikipedia generator:
       | 
       | The magic of inline images at a known DPI, of course you can
       | provide images for different DPIs.
       | 
       | Reading maths/science noscript/basic (x)html documents on my 100
       | DPI monitor, on wikipedia. Not yet fully ready on arxiv.
        
       | gms7777 wrote:
       | About time. Biorxiv and medrxiv have been doing this for probably
       | half a decade at this point?
        
         | dginev wrote:
         | Wrong, arXiv was first. Check this HTML paper from 1997:
         | 
         | https://arxiv.org/html/astro-ph/9708066
        
         | cbf66 wrote:
         | medRxiv and bioRxiv get most of their submissions as Word
         | files. It's a much easier conversion, and if necessary they
         | have manual touch-up. Not feasible for arXiv's volume.
        
       | jez wrote:
       | It would be neat if they offered submitters the chance to upload
       | their own HTML version alongside the PDF version, instead of
       | always relying on an automatic conversion process.
       | 
       | - I can imagine authors feeling frustrated if someone reaches out
       | about a problem in the HTML version of their paper, but they have
       | no way to correct it except by hoping that a change to the PDF
       | fixes the generated HTML. Easier to just fix the formatting
       | problem in the HTML outright.
       | 
       | - It would be neat to allow people to experiment with alternative
       | formatting for their papers. For example, imagine a paper about a
       | programming language that embeds a sandbox you can use to play
       | around with the language under discussion. Or a paper about
       | multivariable calculus and you can interact with a three
       | dimensional plot of some function.
        
         | layer8 wrote:
         | They'd have to define and document a "safe" subset of HTML, and
         | implement a filter/checker for it. Otherwise we'd end up with
         | papers containing ads and tracking and XSS vulnerabilities and
         | whatnot.
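         | 
         | A minimal sketch of what such a checker might look like,
         | assuming a plain allowlist approach. The tag and attribute
         | sets and the `find_violations` helper are hypothetical, not
         | anything arXiv has published; a real policy would need far
         | more care (SVG, MathML attributes, URL schemes, and so on).

```python
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; not an arXiv policy.
ALLOWED_TAGS = {
    "article", "section", "h1", "h2", "h3", "p", "a", "em", "strong",
    "ul", "ol", "li", "table", "tr", "td", "th", "figure", "figcaption",
    "code", "pre", "span", "div", "br",
}
ALLOWED_ATTRS = {"href", "id", "class", "title"}


def _is_safe_href(value):
    """Accept only http(s) links and in-page anchors."""
    value = (value or "").strip().lower()
    return value.startswith(("https://", "http://", "#"))


class AllowlistChecker(HTMLParser):
    """Collects every tag or attribute that falls outside the allowlist."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"tag <{tag}>")
        for name, value in attrs:
            if name not in ALLOWED_ATTRS:
                # Rejects style="...", src="...", and on* handlers alike.
                self.violations.append(f"attribute {name} on <{tag}>")
            elif name == "href" and not _is_safe_href(value):
                self.violations.append(f"unsafe href on <{tag}>")


def find_violations(html):
    """Return a list of human-readable violations; empty means accepted."""
    checker = AllowlistChecker()
    checker.feed(html)
    checker.close()
    return checker.violations
```

         | Rejecting a submission whenever the list is non-empty is the
         | conservative choice here: unknown markup is refused rather
         | than rewritten.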
        
           | digging wrote:
           | Those are issues with JavaScript, not HTML. Wouldn't
           | filtering out iframes pretty much keep us in the clear?
        
             | layer8 wrote:
             | The parent wanted interactive 3D plots, which means
             | JavaScript embedded in or linked from the HTML. Then
             | there's stuff like JavaScript embedded in SVG.
        
             | CaptainOfCoit wrote:
             | > Those are issues with JavaScript, not HTML
             | 
             | What about various HTML tags that remote load resources?
             | From script and link to things like img, or the CSS
             | `background-image` property set in a `style` attribute.
             | 
             | There are a bunch of ways to do remote requests even without
             | JavaScript.
        
               | quickthrower2 wrote:
               | The same _problem_ exists in HN comments. This comment
               | _gets_ converted _to_ html. But it is fine!
        
               | fdupress wrote:
               | "gets converted to" and "gets rendered as uploaded by the
               | user" are two different things.
               | 
               | There are no issues with arXiv generating the HTML and
               | sending that over: they control the generation process,
               | and users who visit arXiv already trust it to not be
               | malicious. The issue is with letting the user upload
               | their own and having it sent on to other users as is.
        
         | diffeomorphism wrote:
         | > It would be neat if they offered submitters the chance to
         | upload their own HTML version alongside the PDF version,
         | instead of always relying on an automatic conversion process.
         | 
         | Please don't. Then you will have a mismatch between the source
         | and the "own html" which ruins the point of uploading the
         | source.
        
           | eviks wrote:
           | Pdf isn't the source
        
             | IlliOnato wrote:
             | But the PDF is also generated. LaTeX is the single source
             | of truth.
        
         | kjkjadksj wrote:
         | Most authors probably have no interest in learning html. Also
         | most authors want nothing to do with the work by the time it's
         | submitted. It was probably hell getting the project to that
         | point of publishing, they want to be done with it and move on
         | to the next thing going on in their career asap.
        
           | jez wrote:
           | I think this is an argument in favor of doing automatic PDF
           | -> HTML conversion for the authors that don't want to touch
           | it, but I don't think it's an argument against letting those
           | who are fine with HTML provide their own.
        
             | IlliOnato wrote:
             | HTML is not generated from PDF. Both PDF and HTML are
             | generated from LaTeX.
        
           | bookofjoe wrote:
           | You hit on an unappreciated truth. By the time my papers
           | appeared in print, I was so sick of them and the endless
           | effort involved in taking them from raw data to finished,
           | edited, proofed, rewritten a zillion times to meet the
           | reviewers' and editors' requests and corrections and
           | suggestions, that I didn't even read the published paper when
           | it arrived as preprints and in the journal.
           | 
           | Enough!
           | 
           | My proof:
           | https://scholar.google.com/citations?user=5DdrMc8AAAAJ&hl=en
        
         | tiagod wrote:
         | I was under the impression the source authors publish to arxiv
         | was a latex file
        
           | jraph wrote:
           | It is.
        
           | jez wrote:
           | Ah, thanks for clarifying!
           | 
           | I looked up the submission formats, and it looks like if you
           | authored the paper in TeX/LaTeX, they do not accept pre-
           | rendered versions of the document.
           | 
           | https://info.arxiv.org/help/submit/index.html#formats-for-
           | te...
           | 
           | But if you did not author it in TeX/LaTeX (e.g., Word, Google
           | Docs, etc.) it appears you can upload a PDF or HTML yourself.
        
             | IlliOnato wrote:
             | But it's still a single source of truth. Only one document
             | is submitted. So for works submitted as HTML no PDF or
             | LaTeX version is available.
        
         | IlliOnato wrote:
         | No, it would not. It's critically important that there is only
         | one "logical" article, albeit with different representations.
         | In other words, a single "source of truth".
         | 
         | With "sideloading" of HTML there is no way in general to make
         | sure that the _contents_ of LaTeX (and PDF) on one side and
         | HTML on the other side is the same.
        
           | dataflow wrote:
           | > With "sideloading" of HTML there is no way in general to
           | make sure that the contents of LaTeX (and PDF) on one side
           | and HTML on the other side is the same.
           | 
           | Is it not possible to write LaTeX code that produces
           | different contents in HTML vs. PDF?
        
             | IlliOnato wrote:
             | Well, perhaps by exploiting bugs/shortcomings in PDF and
             | HTML converters. Not by design.
             | 
             | However, bugs get fixed, and since the PDF and HTML are
             | generated dynamically, any such hack would be extremely
             | fragile.
             | 
             | And while "single source of truth" can help to prevent such
             | malicious discrepancy, it's unlikely that people would try
             | to hack the system this way: what for?
             | 
             | A far more likely scenario is unintentional discrepancy, and
             | single source of truth definitely helps to prevent that!
        
           | GoblinSlayer wrote:
           | Huh? What's the point of html version if you define it as
           | source of deception?
        
         | thomasahle wrote:
         | > It would be neat if they offered submitters the chance to
         | upload their own HTML version alongside the PDF version,
         | instead of always relying on an automatic conversion process.
         | 
         | Can you recommend a system I can use to compile my latex, while
         | also making sure the html is going to look good? I'd like some
         | kinds of css style @media queries to switch between certain
         | parts of the layout, while keeping a single latex file.
        
         | turing_complete wrote:
         | With the shelf life of web technologies, authors would
         | constantly have to maintain their "papers" or they just would
         | not be accessible after a while.
        
       | endergen wrote:
       | I was hoping this meant that html native submissions would be
       | possible, so that people made interactive explanations.
        
       | lucidrains wrote:
       | nice! will make reading papers on the phone so much more
       | pleasant!
        
       | tarboreus wrote:
       | One of the reasons is to make the papers more accessible to
       | people with disabilities, especially the blind. I participated in
       | a conference they hosted on this a few months ago, I recommend
       | taking a look at the recordings if you're interested in thinking
       | on this.
       | 
       | https://accessibility2023.arxiv.org/
        
         | miki123211 wrote:
         | Blind person here, can confirm this. Reading PDFs with a screen
         | reader is bad, reading PDFs that come from LaTeX is worse,
         | reading LaTeX math is pretty much impossible. All the semantic
         | info you need is just thrown away.
         | 
         | You _can_ make decently accessible PDFs but it's lots of work,
         | you need Acrobat on the producer's side and might also need it
         | on the consumer's side. Free tools don't even come close.
         | There's also the fact that the process of making accessible
         | PDFs in Acrobat isn't itself accessible.
         | 
         | With that said, the way screen readers treat HTML math
         | certainly isn't perfect, it's geared more towards school
         | children than anything above calculus. I'm probably going to
         | stay with my LaTeX source files for now. At least ArXiv offers
         | those, not many sites do. To be fair, that approach also has
         | its own set of problems (particularly when people use some
         | extra fancy formatting in their math equations, making the
         | markup hard to read), but I find this to be the best approach
         | for me so far, at least on AI/ML papers.
        
           | saurik wrote:
           | Huh. It would seem like, of all the things which should make
           | it easy to generate the correct accessibility information,
           | the pipeline of compiling a paper from source code in LaTeX
           | should nail it... maybe we should all pitch in to some pool
           | to pay someone to put in the required effort to connect all
           | the dots?
        
             | semi-extrinsic wrote:
             | Kind of tangential, but it's also kind of surprising how
             | difficult it is in LaTeX to make a plot of an equation.
             | 
             | Say I have Equation \ref{eq}. Why can't I just say "plot
             | \ref{eq} for x from -6 to 11" and get my graph?
             | 
             | And yes, I know about pgfplots, PSTricks, TikZ etc. But in
             | all those cases, I need to define the same equation twice,
             | in different syntax to boot. It's kind of unsatisfying.
        
               | fsh wrote:
               | TeX is a very arcane language, and it doesn't support
               | floating point numbers. Few languages would be less
               | suited for making a plotting library.
        
               | semi-extrinsic wrote:
               | pgfplots, PSTricks, and TikZ are all plotting libraries. It
               | seems like it shouldn't be that hard to let them plot an
               | equation written elsewhere in different syntax.
        
               | IlliOnato wrote:
               | > Say I have Equation \ref{eq}. Why can't I just say
               | "plot \ref{eq} for x from -6 to 11" and get my graph?
               | 
               | Pretty much for the same reason you cannot press a word
               | and get a pop-up dictionary definition in a paper book.
        
               | semi-extrinsic wrote:
               | To be clear, I meant in the LaTeX source code. And there
               | I can already write code that plots equations, I just
               | have to re-type the equation in a new syntax.
        
             | jahewson wrote:
             | Surprisingly it's not easy, and depending on the field it
             | can be quite challenging. The reason for this is that TeX
             | captures the visual aspects of typesetting, not the
             | semantic meaning of the mathematics.
             | 
             | A simple example is '\sum' which provides no way to capture
             | the expression being summed over - because that's not
             | necessary for typesetting. That's not the case in, say,
             | MathML.
             | 
             | Writing MathML is no fun though because mathematical
             | formulae are visually ambiguous and we rely on the context
             | to know how to read them, e.g. does 'f(x - 1)' mean
             | function f called with argument x - 1, or does it mean
             | variable f multiplied by x - 1?
        
           | ldenoue wrote:
           | I wrote an app called PDF Reflow that reflows the original
           | PDF using image processing to cut out words into tiles so you
           | see the reflowed version of the text in their original look.
           | 
           | https://www.appblit.com/pdfreflow
        
             | sydbarrett74 wrote:
             | Any chance of releasing an Android version?
        
               | no_identd wrote:
               | +1
        
               | hedora wrote:
               | Gv (part of ghostscript) used to do a good job of this
               | for two column documents. When zoomed in to show one
               | column width of text, the spacebar ran through the top of
               | column 1, then the bottom of column 1, then the top of
               | column 2 and so on.
               | 
               | The amount it scrolled probably depended on the aspect
               | ratio of the window, so it might be multiple key presses
               | to scroll an entire column.
        
               | ldenoue wrote:
               | It's using web technologies so yes it could also be on
               | Android. I'll see what can be done.
        
               | IlliOnato wrote:
               | +1
        
           | jakderrida wrote:
           | Hold on... Are you telling me that all these complex
           | sentences are being typed out based on your voice alone?
           | That's insane.
        
             | ehPReth wrote:
             | ? blind people can use keyboards
        
             | kzrdude wrote:
             | Hm tangential question but shouldn't touch typing be well
             | accessible for many blind computer users?
        
             | topato wrote:
             | I'd say it would be simple to talk type these using windows
             | 11's redux of voice typing. Pretty damn accurate and easy
             | to modify/variate text/options. I use it all the time to
             | make tech/engineering blog posts, faster and more organic
             | than typing, typically, and it learns your technoacronyms.
             | Combined with voice access, it makes it trivial to fully
             | operate your computer (well, at least, browse the web,
             | email, and media apps) from across the room. For anyone who
             | hasn't tried the updated version, highly suggest hitting
             | windowskey+h and giving it a shot.
        
             | spookie wrote:
             | There are braille keyboards too
        
               | Blikkentrekker wrote:
               | Or normal keyboards? Many people can type blind. Some
               | learned to do so while born blind, others became blind
               | after they had already learned this skill.
               | 
               | I would assume that the majority of persons on HN are not
               | looking at their keyboard as they type.
        
               | spookie wrote:
               | I was just giving an additional way to use a computer not
               | known by many. Either way, we shouldn't rely on the
               | skills of a few to interact with a computer.
        
           | anthk wrote:
           | Emacs with Emacspeak has a math reading module.
        
           | ahepp wrote:
           | Do you think there's potential for language models to play a
           | role here? I know that AI can get tossed around as a
           | buzzword, but hasn't it proved quite successful in fields
           | like computer vision?
           | 
           | I'm not deeply familiar with the state of that art, but it
           | seems like recovering the metadata from a PDF generated by
           | LaTeX would be no more impressive than many other things
           | we're currently seeing language models achieve?
        
             | staunton wrote:
             | I'm absolutely positive a few million dollars could get you
             | a system that can "read aloud" pdf math papers in no time.
             | I guess people will wait for it to become cheaper though.
        
               | hutzlibu wrote:
               | You can also have that cheaper already. But having it
               | stable and reliable - will take some time and possibly
               | more money, depending on your definition of reliable.
        
             | throwaway287391 wrote:
             | You wouldn't need to use computer vision on a picture of
             | the PDF. arXiv has the tex source for most of the papers.
             | An LLM trained on code could do a pretty good job of
             | translating tex to readable html with a bit of effort.
        
             | miki123211 wrote:
             | Mathpix is trying to achieve something like this, and they
             | do consider the visually impaired market AFAIK, but it's
             | pretty expensive and I have no experience with it
             | personally, so I can't say how good it is.
        
           | spookie wrote:
           | Yup LaTeX math doesn't make sense. I've been trying to hack
           | my way into getting a voice model to read it but no real
           | progress.
        
             | IlliOnato wrote:
             | LaTeX is a programming language for generating beautiful
             | pages, basically a typesetting system. It serves this
             | purpose fantastically well.
             | 
             | It was not designed to provide semantic information,
             | unfortunately. So getting anything other than visual
             | representation out of it is _hard_.
        
           | Blikkentrekker wrote:
           | I made these arguments two decades ago when I was still in
           | university that PDF is a horrible format because it's purely
           | praesentational, especially for people with disabilities
           | whose software relies on semantic information. LaTeX last
           | time I used it didn't even have a different symbol for
           | uppercase Alpha and A because the glyphs are
           | indistinguishable.
           | 
           | They argued that PDF was superior because the publisher could
           | control how it looked and it looked the same everywhere but
           | the point is that it should not. Things such as font size and
           | line spacing should be at the control of the consumer, not
           | the publisher. This isn't simply blind people but for
           | instance also persons with dyslexia who use particular fonts
           | to make it easier to read for them. Or in my case, someone
           | who simply gets a headache from fonts and line-spacing that
           | is too big. I've also been using darkmode everywhere for so
           | long now that reading black text on a white surface on a
           | screen gives me a headache.
        
             | seanhunter wrote:
             | To write uppercase Alpha you need a modern version of latex
             | (ie xelatex or lualatex) and to include the unicode-math
             | package
             | 
             | https://tex.stackexchange.com/questions/485593/how-to-
             | write-...
        
             | IlliOnato wrote:
             | For scientific articles pagination is still important,
             | because it's how you refer to a particular part of a paper.
             | If things like font size and line spacing are at the
             | control of the consumer, pagination is not preserved.
             | 
             | This problem is harder than one would naively think.
        
               | lsaferite wrote:
               | Seems like they should use detailed section numbering
               | like military documents and laws. Referring by page
               | number seems very coarse by comparison.
        
           | phlakaton wrote:
           | For the math equations, I'm curious: does MathML do any
           | better for you than LaTeX?
        
             | seanhunter wrote:
             | Not the person you're asking the question to, but it's
             | worth noting (if you don't already know) that MathML is
             | really not designed at all as an input language for
             | practitioners who just want to write a few equations in
             | some document. It's designed as an output/presentation
             | language so that devices that want to render some maths can
             | do so faithfully[1]. As such, if you're a human being who
             | wants to typeset some equation, you'll want to go to latex
             | every single time rather than mathml and then someone else
             | has to figure out the conversion.
             | 
             | [1] Great explanation here:
             | https://tex.stackexchange.com/questions/57717/relationship-b...
        
               | IlliOnato wrote:
               | On the other hand, "semantic" flavor of MathML (as
               | opposed to "presentation") is much easier than TeX for
               | things like screen readers, both conceptually and in
               | practice.
        
           | kkylin wrote:
           | I teach math at a university. A couple years ago I had two
           | blind students in my section of first-year calculus, and I
           | really struggled with the tooling. Using latexml, I could
           | produce documents that _one_ of the students could use with a
           | screen reader, but the other student never managed to make it
           | work on their machine. Both students prefer braille but I
           | didn't find anything open source that could typeset
           | mathematical braille easily. Our disability resource office
           | sends things out to a contractor to typeset into braille; the
           | turn-around is measured in _weeks_.
           | 
           | Anyway, if you (or anyone else reading this) have suggestions
           | I'd really appreciate it!
        
             | lostlogin wrote:
             | > Our disability resource office sends things out to a
             | contractor to typeset into braille; the turn-around is
             | measured in weeks.
             | 
             | This seems a massive gap in the market - many institutions
             | have funding earmarked for such things.
        
               | hedora wrote:
               | I wonder if this is a useful service that an llm could
               | actually outperform humans on.
        
             | Saigonautica wrote:
             | Interesting! I never thought about this, thank you for
             | sharing.
             | 
             | What kind of turn-around time would be practical? Could you
             | point me to any typeset mathematical braille that would be
             | an example of a solution to your problem? Is Nemeth the
             | only important standard, or are others important for you
             | too?
             | 
             | I'm wondering if it's practical to set this up as back-
             | office work here in Vietnam. There are some outlying
             | provinces here where there are very few job opportunities.
             | Job opportunities for the blind also round down to zero
             | here (e.g. I could hire for proofreading). Maybe there's
             | room to do something cool here.
        
               | miki123211 wrote:
               | What's English proficiency (and American braille code
               | proficiency) like in Vietnam?
               | 
               | Keep in mind that most blind people who speak English
               | fluently but don't live in an English-speaking country
               | (myself included) can't read English braille, or at least
               | not well. Because of how voluminous Braille is, it uses
               | contractions, single characters that replace common words
               | and character combinations like "the", "would", "ing" or
               | "ed". Those tend to be language specific, never taught
               | outside their country or countries of use, and hard to
               | get accessible electronic materials for. The math codes
               | are completely different too, we use something derived
               | from Marburg, while English-speaking countries use
               | Nemeth. Even basic characters like + and - differ between
               | those two, not to mention more complicated structures.
               | It's not just the dot patterns that are different but
               | also the design principles, like where you put spaces or
               | when you can omit "begin fraction" / "end fraction"
               | characters.
        
             | miki123211 wrote:
             | I learned (the basics of) LaTeX in my last year of middle
             | school, and stuck with it ever since. To be fair, I was
             | into computers since I was a child, played with Rockbox at
             | the age of 10, started to dabble in programming shortly
             | after, so this was a lot less scary than most of the things
             | I was doing already. I took my middle and high school
             | finals (they're kind of like SAT but matter a lot more) by
             | producing LaTeX output, which I then compiled to PDF and
             | printed. The test itself was in braille, as that was all
             | that our government could do.
             | 
             | Throughout college, my first question to most of my
             | professors of math subjects was "do you do LaTeX, and can
             | you give me your source code." Most said yes, and that's
             | how we worked. LaTeX in, LaTeX or PDF out, depending on
             | what the professor preferred.
             | 
             | The amount of LaTeX you need for calculus 1 isn't that
             | great, you could probably teach it to a relatively bright
             | student if you had an hour or two to spare, and then give
             | them the source files. If you have the time, I'd suggest
             | producing "stripped" versions of your files, with as little
             | markup as possible to get your point across and no fancy
             | formatting unless absolutely necessary. The amount of hoops
             | some books and papers jump through to "look nice" drives me
             | crazy.
             | 
             | You could also consider producing, teaching and consuming
             | ASCII math, which seems like an even simpler and friendlier
             | format. I couldn't really use it much in my school career
             | for boring technical reasons, but it looks like a promising
             | option.
        
         | wilg wrote:
         | For accessibility purposes (and regular reading), it would be
         | so much better to drop the justified text. Ragged edge is the
         | way to go!
         | 
         | https://www.boia.org/blog/why-justified-or-centered-text-is-...
        
           | jonatanheyman wrote:
           | Not necessarily:
           | 
           | https://heyman.info/2023/fill-justified-text-on-the-web
        
             | wilg wrote:
             | Perhaps someone can publish a paper to arXiv that provides
             | a meta-analysis. But still there doesn't seem to be a clear
             | reason _to_ justify it, given that almost all internet text
             | is not justified.
        
               | dginev wrote:
               | To me one of the exciting aspects of HTML is that we can
               | theme the same article in different ways, tailored to
               | individual preferences - just swap in a different CSS
               | file.
               | 
               | Having a two-column theme, or left-aligned vs justified
               | themes, could be workable in the long run. I hope that we
               | get to see some browser extensions modding the pages
               | before too long.
               | 
               | The reason for the current justified text is that it is
               | the default aesthetic for a LaTeX-based article, and a
               | lot of authors expect it.
        
       | odyssey7 wrote:
       | article {
       |   text-justify: Knuth-Plass;
       | }
        
         | SushiHippie wrote:
         | Mind explaining?
        
           | odyssey7 wrote:
           | The comment is invalid CSS to apply the Knuth-Plass algorithm
           | in rendering an HTML article. Knuth being a perfectionist's
           | perfectionist, TeX uses this algorithm to determine optimal
           | line breaks to provide for better text justification.
           | 
           | Here's a discussion of hacks to achieve the algorithm's
           | results on web pages and an upcoming CSS feature as of 2020.
           | https://mpetroff.net/2020/05/pre-calculated-line-breaks-for-...
        
         | computerfriend wrote:
         | If only.
        
       | matt1 wrote:
       | For anyone interested in staying informed about important new
       | AI/ML papers on arXiv, check out https://www.emergentmind.com, a
       | site I'm building that should help.
       | 
       | Emergent Mind works by checking social media for arXiv paper
       | mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks
       | the papers based on how much social media activity there has been
       | and how long since the paper was published (similar to how HN and
       | Reddit work, except using social media activity, not upvotes, for
       | the ranking). Then, for each paper, it summarizes it using GPT-4,
       | links to the social media discussions, paper references, and
       | related papers.
       | 
       | It's a fairly new site and I haven't shared it much yet. Would
       | love any feedback or requests you all have for improving it.
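A rough sketch of the kind of time-decayed ranking described above, assuming an HN-style "gravity" formula. The `Paper` fields, the `+ 2.0` offset, and the default `gravity` of 1.8 are illustrative assumptions, not Emergent Mind's actual scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    arxiv_id: str      # hypothetical identifier, for illustration only
    activity: float    # assumed aggregate count of social media mentions
    age_hours: float   # hours since the paper was published


def rank_score(paper: Paper, gravity: float = 1.8) -> float:
    """Time-decayed score: more activity raises it, greater age lowers it."""
    return paper.activity / (paper.age_hours + 2.0) ** gravity


def rank(papers: List[Paper]) -> List[Paper]:
    """Return papers sorted from highest to lowest score."""
    return sorted(papers, key=rank_score, reverse=True)
```

With this shape, a day-old paper with modest activity can outrank a week-old paper with much more, which is the behaviour the comment describes.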
        
         | raccoonDivider wrote:
         | That looks great. No real feedback yet, but it's the kind of
         | thing I've always been looking for as a better alternative to
         | Twitter.
        
           | matt1 wrote:
           | Thanks! I've got a lot more planned for it too. If anyone has
           | any feedback that doesn't make sense to share here, or if
           | you're a researcher who is open to some questions about how
           | you currently follow arXiv papers, drop me a note at
           | matt@emergentmind.com.
        
         | CodeCube wrote:
         | Love to see Emergent Mind continuing to innovate!
        
         | sureglymop wrote:
         | Love the clean design of the website! Looks amazing on mobile.
        
           | matt1 wrote:
           | Thanks! If you ever run into any issues or have any
           | suggestions for improving the site, drop me a note:
           | matt@emergentmind.com.
        
         | jakderrida wrote:
         | This is exactly what I was using HN for. But, yeah, it kinda
         | sucked compared to yours. Another thing I was trying to create
         | was some sort of NN model that could use the semanticscholar
         | h-index of authors along with the abstract text and T5 to
         | estimate the one-year out citations. Just for personal use,
         | though. That whole thing fell apart because semanticscholar is
         | kinda crap for associating author links to the same author. I
         | frequently ended up with the wrong professors, which I'd think
         | would be easily fixable for them.
        
           | carlossouza wrote:
           | I did that (used other features). This is how new papers are
           | ranked here:
           | 
           | https://trendingpapers.com
        
             | matt1 wrote:
             | Great site, thanks for sharing. Can you explain how you're
             | determining how many times a paper is cited? Obviously
             | papers include a list of references, but extracting them
             | accurately from the PDF is difficult in my experience (two
             | column formats, ugh) - though the new HTML versions help.
             | And even if you have a list, many authors just mention
             | arXiv paper titles, not their ids, making identifying
             | specific references tricky.
        
               | carlossouza wrote:
               | Difficult, yes... but not impossible :)
               | 
               | I just extract the titles and look for their respective
               | ids.
               | 
               | The real challenge was how to do that at scale. In CS
               | alone there are well over half a million papers.
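A minimal sketch of the "extract the titles and look for their ids" step, assuming a catalogue of known `{id: title}` pairs is already available. The function names and the normalization rule are hypothetical, and this is not necessarily how trendingpapers.com does it.

```python
import re
from typing import Dict, List, Optional


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so small differences still match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_index(known_papers: Dict[str, str]) -> Dict[str, str]:
    """Map normalized title -> paper id, from an {id: title} catalogue."""
    return {normalize_title(title): pid for pid, title in known_papers.items()}


def resolve_references(ref_titles: List[str], index: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Look up each extracted reference title; unmatched titles map to None."""
    return {title: index.get(normalize_title(title)) for title in ref_titles}
```

A dictionary lookup keeps each match constant-time, which matters at the half-million-paper scale mentioned above. Fuzzy matching for misspelled or truncated titles would need something more elaborate.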
        
           | matt1 wrote:
           | Just a note to say that factoring authors into the ranking
           | system is high on my todo list. v1 won't be too fancy - just
           | a hardcoded list of prominent authors whose papers warrant
           | extra visibility. A future version will likely automate it to
           | avoid the hardcoded list.
           | 
           | Also, soon-ish I'm going to add the ability for users to
           | follow specific authors, so you can get notified when they
           | publish new papers.
        
             | jakderrida wrote:
             | > Also, soon-ish I'm going to add the ability for users to
             | follow specific authors, so you can get notified when they
             | publish new papers.
             | 
             | If you could do it, this would be a dream. My original
             | intent was to be able to look through only papers citing a
             | popular one and filtering the results for ones having at
             | least one author with a set minimum h-index. Using Google
             | Scholar data required using SerpAPI, which has some
             | annoying limitations.
             | 
             | The core goal is obviously just not to miss out on a paper
             | that will very likely be influential while not having to
             | comb through the mountain of irrelevant papers.
             | 
             | What's funny is that Microsoft Academic was the best
             | suited, but was retired in 2021.
        
         | nojvek wrote:
         | Great site. Bookmarked it.
         | 
         | Would be nice if I could change timeframe. Top this week,
         | month, year, all time.
        
           | matt1 wrote:
           | I'm slowly adding older papers as I work out the kinks in the
           | site. Down the road when the database is more comprehensive,
           | this should definitely be possible.
        
         | team_dale wrote:
         | Would love to see a comments feature at the bottom there.
         | Reddit / HN style
         | 
         | Love the concept though. Added it to my Home Screen on iOS
        
           | matt1 wrote:
           | Thanks for the kind words, it's appreciated.
           | 
           | I might add comments down the road if there's enough interest
           | and if there's enough traffic to warrant it. Don't want to
           | add them just yet and have zero comments on everything and it
           | look like a ghost town.
           | 
           | Keep the suggestions coming though as you use it more:
           | matt@emergentmind.com.
        
         | danielbln wrote:
         | Works in Chrome, but does not seem to work in Firefox.
        
           | matt1 wrote:
           | Can you (or anyone experiencing similar issues) share any
           | details about what's not working in Firefox? I tested it and
           | all is well for me, though it's definitely possible there's
           | an issue with some other version of it.
        
         | keyle wrote:
         | I've got a somewhat related question:
         | 
         | is there a site that lists and rates the various LLM models of
         | huggingface.co alongside their various applications?
        
         | matt1 wrote:
         | FYI I started embedding the HTML pages in an iframe on Emergent
         | Mind when the HTML version is available:
         | https://www.emergentmind.com/papers/2312.11444 // should make
         | it even easier to stay informed about trending papers
        
       | apstats wrote:
       | I wonder if this could be used to train an LLM to convert PDFs
       | with rich charts into HTML?
        
       | reqo wrote:
       | A lot of AI/ML papers these days have an accompanying interactive
       | page like [0], will we see anything like these now directly in
       | arXiv?
       | 
       | [0] https://voyager.minedojo.org/
        
         | z2h-a6n wrote:
         | I think then arXiv would have to deal with maintaining the tech
         | stack and providing the presumably much higher server capacity
         | to serve the more varied web pages that would result, so it
         | seems like a tall order. arXiv already has an experimental
         | integration with Papers with Code [0], which I guess provides
         | similar results for the reader, though the authors have to
         | figure out their own web hosting.
         | 
         | [0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...
        
       | ansk wrote:
       | When I open a large pdf on arxiv (100+ MB, not uncommon for ML
       | papers focused on hi-res image generation), there is a
       | significant load time (10+ seconds) before anything is rendered
       | at all other than a loading bar. Does anyone know what the source
       | of this delay is? Is it network-bound or is Chrome just really
       | slow to render large PDFs? Do PDFs have to be fully downloaded to
       | begin rendering? In any case, this delay is my only gripe with
       | arxiv and a progressively rendered HTML doc that instantly loads
       | the document text would be a huge improvement.
        
         | IlliOnato wrote:
         | It may even be that the time is spent _generating_ the PDF.
         | 
         | The format in which articles are submitted and stored in arXiv
         | is LaTeX. PDF is automatically generated from it.
         | 
         | Probably arXiv does some caching of PDFs so they don't have to
         | be generated anew every time they are requested, but I don't
         | know how this caching works.
        
         | upbeat_general wrote:
         | I have the same issue. From what I can tell it's just network-
         | bound and the arXiv servers are slow. They theoretically allow
         | you to set up a caching server, but after spending a while
         | trying to get it set up, I haven't been able to get it to work.
         | 
         | https://info.arxiv.org/help/faq/cache.html
        
           | arccy wrote:
           | maybe it'll be faster now with fastly
           | 
           | https://news.ycombinator.com/item?id=38723373
        
         | 10000truths wrote:
         | > Does anyone know what the source of this delay is? Is it
         | network-bound or is Chrome just really slow to render large
         | PDFs? Do PDFs have to be fully downloaded to begin rendering?
         | In any case, this delay is my only gripe with arxiv and a
         | progressively rendered HTML doc that instantly loads the
         | document text would be a huge improvement.
         | 
         | The default PDF format puts the xref table at the end of the
         | file, forcing a full download before rendering can take place.
         | PDF-1.2 onwards supports linearized PDFs, and most PDF export
         | tools have some way of enabling it (usually an option like
         | "optimize for web").
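A small heuristic sketch for checking whether a PDF is linearized. It relies on the linearization parameter dictionary sitting near the start of the file, so it only inspects the first 1024 bytes. The function name is hypothetical, and this does not validate the hint tables or the rest of the file.

```python
def looks_linearized(pdf_bytes: bytes) -> bool:
    """Heuristic: a linearized PDF declares a /Linearized dictionary near the start."""
    head = pdf_bytes[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head
```

Because only the head of the file is needed, the same check could in principle run on a partial download, which is the point of linearization in the first place.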
        
       | ww520 wrote:
       | That's great. Now I can read the papers on my phone.
        
       | svag wrote:
       | The tool being used for this offering is this one,
       | https://github.com/arXiv/arxiv-readability, just to save a few
       | clicks :)
        
         | IshKebab wrote:
         | Wow I did not know they have the LaTeX for all the papers and
         | compile it themselves! That's pretty crazy. What if they don't
         | have packages you need? What if your paper isn't written with
         | LaTeX?
        
           | r4indeer wrote:
           | > What if they don't have packages you need?
           | 
           | Unlikely. But if so, you can provide the packages yourself:
           | https://info.arxiv.org/help/submit_tex.html#wegotem
           | 
           | > What if your paper isn't written with LaTeX?
           | 
           | Then they still accept PDF or HTML. See:
           | https://info.arxiv.org/help/submit/index.html#formats-for-te...
        
           | aragilar wrote:
           | They specify what version of texlive they use. This is
           | _significantly_ better than what publishers offer (usually a
           | _really_ old latex version, not even pdflatex).
        
         | ofou wrote:
         | I wonder how much better this is compared to Pandoc's.
        
         | dginev wrote:
         | That's it in spirit, but in practice it's refreshed:
         | 
         | https://github.com/arXiv/arxiv-view-as-html
        
       | WendyTheWillow wrote:
       | So far I'm left wanting an app that gives me a way to easily
       | track and consume newly published work of a given topic. The
       | existing apps are not great, and maybe this change will make it
       | easier to provide better "reader" views, and possibly even tts (I
       | like to listen+read).
        
       | codethief wrote:
       | Ugh. I don't belong to the target audience (people with
       | disabilities) but the typesetting doesn't exactly look pleasant
       | on my machine (Chrome on Linux).
        
       | aragonite wrote:
       | A lot of academic journals (say from Springer) also offer HTML
       | formats for papers published in the past decade or so, which I
       | personally often find more convenient for reading purposes than
       | PDFs. For example, I parse text a lot faster if I use a regex to
       | split each paragraph into sentences and place a linebreak after
       | each sentence, or if I do natural language "syntax highlighting"
       | by assigning a distinctive color to functional words indicating
       | logical structure like 'if/then', 'and', 'or', 'not', 'because',
       | and 'is'. And sometimes it really improves readability to be able
       | to do "semantic highlighting", in the sense of say assigning a
       | different hashed color to each proper name (or each labeled
       | thesis, etc) that occurs in the paper. Such manipulations are
       | basically impossible with PDFs. It makes me wish sci-hub would
       | start archiving HTML versions in addition to PDFs!
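A naive sketch of two of the manipulations described above: one sentence per line, and a stable "hashed color" per name. The regex and the function names are illustrative assumptions. The splitter will break wrongly after abbreviations such as "e.g." when followed by a capital letter.

```python
import hashlib
import re

# Break after ., ! or ? when followed by whitespace and a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph: str) -> str:
    """Put each sentence on its own line."""
    return SENTENCE_END.sub("\n", paragraph.strip())


def hashed_color(name: str) -> str:
    """Derive a stable hex color from a name, so the same name always gets the same color."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return "#" + digest[:6]
```

Applied to extracted HTML text, `hashed_color` could feed a `style="color: ..."` attribute wrapped around each occurrence of a proper name. Doing the same on a PDF would first require recovering the text flow.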
        
       | johnsillings wrote:
       | https://www.arxiv-vanity.com/
        
         | jakderrida wrote:
         | And, of course, https://ar5iv.labs.arxiv.org/html
         | 
         | However, ar5iv isn't a la carte like arxiv-vanity. They pretty
         | much do last month's papers every month or so. Something like
         | that.
        
           | dginev wrote:
           | Hi, ar5iv creator here.
           | 
           | You can think of both arxiv-vanity and ar5iv as the "alpha"
           | experiments that led into the official arXiv "beta" HTML
           | announced today.
           | 
           | Once a few rounds of feedback and improvements are
           | integrated, and the full collection of articles acquires HTML
           | in the main arXiv site, ar5iv will be decommissioned.
           | 
           | The plan is to turn all existing ar5iv links into redirects
           | to the official HTML, and free up the resources for
           | maintaining it. I am not sure what the plans are for
           | maintaining arxiv-vanity, but I suspect they may head down a
           | similar path some time later.
        
             | jakderrida wrote:
             | lmao! The actual creator of ar5iv? Sometimes I forget this
             | isn't reddit and legit accomplished people comment here.
             | 
             | Reminds me of Burning Man when people kept telling me, "Never
             | talk trash on the art at the main landmarks. The artists
             | are frequently within listening distance."
             | 
             | So, of course, I'd walk around talking about buying the art
             | for $50K-$60k, knowing it's already scheduled to be burned
             | with the landmark.
        
       | philipashlock wrote:
       | 30 years after HTML was invented to support accessibility and
       | collaboration for research and academia and the same day the
       | White House released their new accessibility guidance which
       | happens to be the first time they've published formal new policy
       | natively as HTML rather than PDF -
       | https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...
        
         | murphyslab wrote:
         | I feel surprised by how succinct, easy-to-understand, and
         | sensible the policy (M-23-22) is:
         | 
         | > Default to HTML: HyperText Markup Language (HTML) is the
         | standard for publishing documents designed to be displayed in a
         | web browser. HTML provides numerous advantages (e.g., easier to
         | make accessible, friendlier to assistive technology, more
         | dynamic and responsive, easier to maintain). When developing
         | information for the web, agencies should default to creating
         | and publishing content in an HTML format in lieu of publishing
         | content in other electronic document formats that are designed
         | for printing or preserving and protecting the content and
         | layout of the document (e.g., PDF and DOCX formats). An agency
         | should develop online content in a non-HTML format only if
         | necessitated by a specific user need.
         | 
         | https://www.whitehouse.gov/omb/management/ofcio/delivering-a...
        
           | wolverine876 wrote:
           | Hmmm ... accessibility is essential, but PDF is far better
           | for static documents: There's no straightforward, standard way
           | to read an html document on another platform. Also, the html
           | document may not be readable in 10+ years (unlike most PDFs),
           | and updates are too fluid and hard to track.
           | 
           | I think the general problem is that the end-user doesn't
           | control an html document, e.g., for annotation, as a local
           | record, etc.
        
             | shakow wrote:
             | > There's no straightforward, standard way to read an html
             | document on another platform.
             | 
             | What do you think of the epub format?
        
               | wolverine876 wrote:
               | I wish so much for it:
               | 
               | Despite all our advances, we lack an editable, local,
               | multimedia, platform (and form-factor) independent, self-
               | contained file - essentially a word-processing file for
               | the 21st century (and I mean it's almost a quarter-
               | century overdue). epub has that potential as a format,
               | and being based on web standards it has capability, a
               | universe of supporting tools and technology, and easy
               | adoption to different applications.
               | 
               | But I haven't heard anyone else express that particular
               | interest, and as of a few years ago epub doesn't allow
               | annotations and is not stable (i.e., I don't know that
               | today's epub file will be readable in 20 or 50 years) -
               | two essential requirements for serious local content,
               | imho.
               | 
               | And even if it meets those specifications, we need epub
               | editors that are the equivalent of word processors for
               | non-technical users.
        
             | LordDragonfang wrote:
             | ...What are you talking about? HTML files are readable on
             | basically _every_ platform, even moreso because they are
             | fundamentally text files (unlike PDFs, which are binaries).
             | PDFs need special software, html can be read on the
             | _command line_. Likewise, HTML is dead simple to edit and
             | annotate.
             | 
             | Seriously, name a single device that has PDF support that
             | doesn't allow you to view HTML.
             | 
             | I think you're conflating "html" and "things stored on a
             | server", because all of your objections apply to pdfs
             | stored on a server. The ability to save and annotate pdfs
             | is not an inherent feature of the file format, they exist
             | _because_ the format is such a PITA to interact with that
             | specialized programs have to be written. HTML can be saved
             | just as easily, and _usually is_ (on archive.org).
        
               | wolverine876 wrote:
               | How do I save an HTML document locally, and annotate it,
               | in an easily sharable form, and in a form that is stable
               | - i.e., in a way that will be readable and useable in
               | 20-50 years?
        
               | GoblinSlayer wrote:
               | You say it as if pdf is somehow better. To begin with
               | it's a proprietary format. If Adobe goes bankrupt or
               | obscure tomorrow, pdf will go out of use as a failed
               | technology.
        
               | LordDragonfang wrote:
               | Basically any _HTML_ document from 20-30 years ago (can't
               | go any further because it didn't exist 50 years ago)
               | will be completely readable and usable. The only issue is
               | people creating _content_ (not styling) in formats
               | besides HTML.
               | 
               | As far as annotations, you can use the native <ruby>[1]
               | tag, or strikethrough, but if you mean "literally drawing
               | on the text" then, yeah, you're looking for an image
               | format at that point (which is fundamentally what PDF
               | _is_), but _we shouldn't default to storing text in
               | image formats_ just because of one specific use case.
               | (Also, as I said above, the only reason tools exist to
               | easily do that in PDFs is because everyone insists
               | on using a format that's hard to edit.)
               | 
               | Also, note that the context I was responding to was _US
               | legal documents_, not something more presentation-heavy.
               | 
               | [1] https://twitter.com/antumbral/status/1730829756013375875
        
               | jpeloquin wrote:
               | I just tested saving
               | https://browse.arxiv.org/html/2312.12451v1 to disk using
               | Chrome, transferring it to my Android phone, and opening
               | it on the phone. Results:
               | 
               | 1. Saving as "Webpage, Single File" (.mhtml): Neither
               | Firefox nor Chrome even showed up in the list of
               | available apps to open it.
               | 
               | 2. Saving as "Webpage, Complete": Opened in Chrome but
               | images were broken. Also very difficult to open with the
               | default file browser because it uses a flat folder view
               | and the sidecar folder pollutes the file list.
               | 
               | I was hoping this would work, perhaps you will have
               | different findings. I agree that HTML is the superior
               | format in theory but usability in practice is often
               | lacking. I'm resigned to using both depending on context.
        
               | wolverine876 wrote:
               | Yes, that's the kind of issue I was talking about. I wish
               | it were otherwise. As a nearby comment pointed out, epub
               | is a potential solution (and I wish arXiv embraced it -
               | without my knowing their other requirements or epub's
               | accessibility features). It's essentially packaged html.
        
               | znpy wrote:
               | Of course, they're "just text files" only in theory...
               | but theory and practice diverge very very often.
        
             | nonethewiser wrote:
             | > There's no straightforward, standard way to read an html
             | document on another platform.
             | 
             | Such as? What doesn't have a browser but can render pdfs?
        
               | wolverine876 wrote:
               | I mean, how do I save it locally on one platform and read
               | it on any platform? Or share it with someone else to read
               | (without them downloading software)? I.e., we don't have
               | a standard, local, single-file html format.
        
               | thfuran wrote:
               | Print it to a pdf
        
               | Zuiii wrote:
               | You're right.
               | 
               | We could have such a format if browser and os vendors
               | were interested in supporting such a use case.
               | Unfortunately, they aren't.
               | 
               | On the browser side, supporting all-in-one html files can
               | be as simple as reading a single multipart-encoded page.
               | Heck, if they support automatically serializing all
               | external resources as data URIs when saving pages, then
               | most browsers will be able to open them without any
               | modification.
               | 
               | On the OS side, operating systems can treat html files as
               | first class citizens; execute them in an offline sandbox
               | (most operating systems have embedded webviews), then
               | extract icon, title, description and other metadata to
               | present to the user. An icon the consists of a blank page
               | with a small browser icon in the corner doesn't tell me
               | anything about what the page is about. This needs to
               | change.
               | 
               | In short, html can be easily made nicer to deal with
               | locally thanks to all the parts already being in place.
               | The problem is that no one (tech giants, os vendors) is
               | interested in doing this.
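A minimal sketch of the "serialize external resources as data URIs" idea, limited to local `<img src="...">` files next to a saved page. The regex-based approach and the function names are illustrative assumptions. A real tool would use an HTML parser and also handle CSS, fonts, and scripts.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="VALUE"> and captures the parts around VALUE.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI with a guessed MIME type."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local image references with data URIs; leave everything else untouched."""
    def replace(match: "re.Match[str]") -> str:
        src = match.group(2)
        if src.startswith(("data:", "http:", "https:")):
            return match.group(0)
        target = base_dir / src
        if not target.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(target) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

The result is a single `.html` file that any browser can open without the sidecar folder, at the cost of roughly a third more bytes per image from base64 encoding.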
        
               | GoblinSlayer wrote:
               | There's epub as one file html document.
        
               | CaptainOfCoit wrote:
               | > I mean, how do I save it locally on one platform and
               | read it on any platform?
               | 
               | Ctrl/Meta/Cmd + S should do the trick, or "File > Save
               | page", and you get an HTML file you can open in any
               | browser. If there are images, they'll most likely be
               | loaded remotely, or worst case not load at all. But the
               | rest of the structure is there.
        
               | blackoil wrote:
               | > If there are images, they'll most likely be loaded
               | remotely
               | 
               | Most sites have images as a relative path which won't
               | work with saved html and there is also CSS.
        
               | wolverine876 wrote:
               | A web page is much more than one file. Also, I'm looking
               | for something with end-user control, where they can save
               | the current document statically and long-term.
        
               | 8organicbits wrote:
               | If both devices have internet, you share the URL. If not,
               | see other replies.
        
       | jll29 wrote:
       | It's a cool feature because it makes the papers more findable,
       | more easily navigable, easier to read online and faster to
       | scroll through. I am also happy for blind people that they can
       | more easily use arXiv with Braille readers now.
       | 
       | (I'm still a fan of printing the PDFs, because I annotate on
       | paper and refer to page numbers, but the HTML feature is in
       | addition to PDF download, not a replacement.)
       | 
       | One thing that still sucks (not ArXiv related though) is reading
       | mathematical formulae on the Kindle - wonder if someone with
       | rendering expertise could have a look into the MOBI format.
        
         | isaacfung wrote:
         | This would never happen but in an ideal world, we should be
         | able to click on a citation to jump to the part of the paper
         | that is being referenced and each paper page should have a
         | discussion board so we can easily communicate with the authors
         | and group the discussion in one place instead of us having to
         | google to see if there is relevant discussion on
         | twitter/reddit. We can even put links to talks, tutorials,
         | blogs, github repo, demo, paperswithcode/google scholar/open
         | review, background material, a timeline of citations in tree
         | form on the same page (actually I am seeing more machine
         | learning papers that have a project page that does some of
         | these) or even turn it into a mini wiki. I just think html has
         | so much more potential (especially now with LLM we can do
         | semantic search). I wonder if there would be interest in such a
         | Chrome extension overlay.
         | 
         | Related projects:
         | 
         | https://github.com/ahrm/sioyek
         | 
         | https://github.com/arxiv-vanity/engrafo
         | 
         | https://github.com/dginev/ar5iv
         | 
         | https://academ.us/article/2111.15588/ (powered by
         | https://github.com/jgm/pandoc I believe)
        
           | me_jumper wrote:
           | I think https://web.hypothes.is/ would be of interest to you.
        
       | golol wrote:
       | IMO pdf and HTML optimize for different things. pdf is easy and
       | pretty. HTML is easy and responsive. But making pdf responsive is
       | impossible and making HTML pretty is not easy. I think arXiv
       | should be for well-polished, pretty documents, not responsive, ugly
       | documents. Most researchers don't have time to make an HTML
       | document responsive and pretty.
        
         | querez wrote:
         | Am researcher, care about responsiveness way more than pretty.
         | I am super glad for the option. Downloading PDFs is super
         | annoying. I'm stoked.
        
           | mmis1000 wrote:
           | Well... downloading html is even harder nowadays, because many
           | pages are dynamically generated. Although there are surely
           | some browser extensions that can help you to finish it in a
           | few clicks..
        
       | radicalriddler wrote:
       | FUCK YES (excuse my profanity). I have a tool that converts HTML
       | to Neural Speech and I always wanted to push arXiv papers through
       | it, but couldn't be bothered with a PDF implementation.
        
       | topicseed wrote:
       | What do they use to convert a PDF document to a clean, correct
       | HTML document? It's a difficult space, especially with the
       | variety of layouts you may find in PDF documents...
        
         | blackbear_ wrote:
         | Arxiv encourages users to submit the latex source of their
         | papers rather than the PDF
        
         | SushiHippie wrote:
         | > The tool being used for this offering is this one,
         | https://github.com/arXiv/arxiv-readability, just to save a few
         | clicks :)
         | 
         | https://news.ycombinator.com/item?id=38726582
        
       | vegabook wrote:
       | PDF is objectively much better than HTML at rendering text
       | documents. And it's not even close. This could easily have been
       | done 10, even 15-20 years ago. That it wasn't is not just
       | inertia. Latex and PDF have enormously better text rendering, and
       | the static format locks a state-commit in time that is much
       | easier to go back to and reference/critique. Unlike the
       | intrinsically fluid nature of HTML. For academic work, milestone-
       | like formats, that lock state in time, are useful for those who
       | later build on them. And again, the rendering just doesn't
       | compare and that imparts [sub]conscious quality signals.
        
       | imranq wrote:
       | At this point are academic papers simply peer-reviewed blog
       | posts?
        
       | acjohnson55 wrote:
       | This is great! I browse papers on mobile, and PDF is so bad for
       | that use case.
        
       | alecsm wrote:
       | I don't read many papers but this makes it easier for me to save
       | them in Joplin.
        
       | wolverine876 wrote:
       | Many here say they prefer html documents. How do you annotate
       | them? How do you make local copies? Also, how will you read them
       | in the decades to come?
       | 
       | I love PDF.
        
       | hollerith wrote:
       | I'm sad that the best they can do is HTML format. HTML is a mess.
        
       | nojvek wrote:
       | OMG. This is amazing. I legit hated reading two column pdfs on a
       | smartphone.
        
       | wildpeaks wrote:
       | Very good decision, always bet on the web.
        
       | sicariusnoctis wrote:
       | Personally, I would prefer the conventional Latin Modern math
       | font instead of Palatino math.
       | 
       | Latin Modern is used by:
       | 
       | - Wikipedia.
       | - Math.StackExchange.
       | - Nearly all papers, including the ones hosted on arxiv in PDF
       |   format.
       | - Nearly any math videos, slides/presentations, notes.
       | - Almost everything, really.
       | 
       | Palatino just looks weird.
       | 
       | Also, I imagine that authors might do math formatting hacks that
       | were only tested on Latin Modern, and might end up breaking on
       | Palatino.
       | 
       | TL;DR:
       | 
       | Palatino :(
       | 
       | Latin Modern :)
        
       | IHLayman wrote:
       | Fun fact: it seems that if you use Lockdown mode on Apple devices
       | you can't open PDFs from a browser (no official documentation
       | says it but there is anecdotal evidence). This would allow people
       | with Lockdown mode to open Arxiv papers more easily.
        
       | matrix2596 wrote:
       | That's great news. I was using arxiv vanity to read on mobile
       | phones. I am not seeing it on all articles; is it only for new
       | papers?
        
       | therealmarv wrote:
       | This is the reason I've never liked LaTeX from a data point of
       | view. It's made to be printed out or to look beautiful in a PDF,
       | but was never designed to get you to an HTML file or a Word file.
       | 
       | I've written my thesis in Markdown in the past because of this
       | (best for humans) which can be easily transformed to HTML, Word,
       | PDF and even LaTeX
       | https://github.com/tompollard/phd_thesis_markdown
       | 
       | And I think that XML is the best format for machines.
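
The "one Markdown source, many output formats" workflow described in the comment above could be sketched roughly as follows. This assumes pandoc as the converter, which the comment itself does not name; the function name `pandoc_commands`, the `build` directory and the `thesis.md` file name are hypothetical. The sketch only builds the command lines and does not run them.

```python
from pathlib import Path

# Output formats named in the comment: HTML, Word, PDF and LaTeX.
# pandoc infers the target format from the extension passed to -o.
TARGET_EXTENSIONS = ("html", "docx", "pdf", "tex")


def pandoc_commands(source, out_dir="build"):
    """Return one pandoc command line per target format (hypothetical helper)."""
    src = Path(source)
    commands = []
    for ext in TARGET_EXTENSIONS:
        target = Path(out_dir) / f"{src.stem}.{ext}"
        # --standalone asks pandoc for a complete document, not a fragment.
        commands.append(
            ["pandoc", src.as_posix(), "--standalone", "-o", target.as_posix()]
        )
    return commands


if __name__ == "__main__":
    for command in pandoc_commands("thesis.md"):
        print(" ".join(command))
```

Each command could then be passed to `subprocess.run(command, check=True)` on a machine where pandoc is installed; the PDF target additionally needs a LaTeX engine.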
        
       | delhanty wrote:
       | > If you are familiar with ar5iv, an arXivLabs collaboration, our
       | HTML offering is essentially bringing this impactful project
       | fully "in-house". Our ultimate goal is to backfill arXiv's entire
       | corpus so that every paper will have an HTML version, but for now
       | this feature is reserved for new papers.
       | 
       | IIRC, ar5iv was created on his own initiative by Deynan Ginev
       | 
       | https://twitter.com/dginev/status/1736792316675825981
       | 
       | and it seems that he has worked tirelessly to fix nearly all of
       | the edge cases during the collaboration.
       | 
       | This project creates huge value to humanity so Deynan is to be
       | heartily thanked.
        
         | dginev wrote:
         | Thanks for the kind words, but some corrections:
         | 
         | 1. My name is Deyan (hi!)
         | 
         | 2. ar5iv was the latest frontend incarnation, but our actual
         | work on converting LaTeX to HTML goes back nearly 20 years
         | behind the scenes.
         | 
         | 3. I was an undergraduate student when I was introduced to the
         | project back in 2007. It was started "in spirit" by 3 senior
         | co-conspirators back then: Michael Kohlhase, Bruce Miller and
         | Robert Miner. And I am by no means a solitary actor today, even
         | if I may be the chief online presence of the people involved.
         | Bruce is doing the bulk of the hard work on LaTeXML to this
         | day.
         | 
         | I documented some of the history in an invited talk for CICM
         | 2022, which you can find on youtube, or see the slides at:
         | 
         | https://prodg.org/talks/welcome_to_ar5iv
         | 
         | It's really great that the HTML has now reached "home base" in
         | arXiv, and I hope their team gets a lot more of the positive
         | attention going forward - today's achievement is entirely
         | theirs!
        
           | indrora wrote:
           | I remember stumbling upon your work long ago when I was
           | working on a project to have "e-zines" that consumed a series
           | of `article` class files and rendered them out into PDF and
           | HTML as a series package.
           | 
           | I had come across latex2html, Dan Gildea's project, and found
           | myself unpleasantly dissatisfied with how it worked. As I
           | understand it, it's more a "half implementation of lots of
           | packages" rather than what ar5iv seems to be, which is
           | "enough of the core LaTeX engine producing HTML instead of
           | DVI"? I'd love to know more about the nitty gritty of how the
           | engine does its thing.
           | 
           | I'm curious: How has modern web tech (e.g. WebAssembly,
           | Canvas, etc) helped or gotten in the way of getting _good_
           | LaTeX rendering in the browser?
        
             | dginev wrote:
             | Right, that's LaTeXML - it tries to emulate as much as
             | possible of the TeX typesetting system, while retaining
             | enough control to emit structured markup.
             | 
             | Which also allows us (and generally all contributors of
             | latexml package support) to conveniently maintain various
             | parallel data structures and metadata needed along the way.
             | 
             | Modern HTML is very often helpful to produce higher quality
             | article renderings. Examples:
             | 
             | 1. we recently started using flexbox for subfigures,
             | allowing them to reflow.
             | 
             | 2. we have started emitting ARIA accessibility annotations
             | (there is now an "alt" key for \includegraphics)
             | 
             | 3. MathML Core allowed us to have native web rendering for
             | math expressions in every browser.
             | 
             | As to LaTeX rendering _in_ the browser, there are various
             | other projects out there you could look up with partial
             | support. For latexml the WebAssembly route seems most
             | realistic, as we are undergoing a rewrite in Rust. But
             | there are quite a number of pieces to flesh out before we
             | get there.
        
       | trostaft wrote:
       | Taking a look at a paper I have that went up this month and
       | another that went up before the Dec cutoff on ar5iv, they look
       | 90% OK! Figures with side-by-side plots and algorithm
       | environments are the common culprits for being broken, though.
       | Particularly in figures, it seems like the width argument isn't
       | being interpreted correctly.
       | 
       | Interestingly, this review paper seems to have its side-by-side
       | figures intact (e.g. Fig. 2, Fig. 4). Maybe it's because the
       | author used a subfigure-like environment (judging by the
       | subcaptions)?
       | 
       | https://ar5iv.labs.arxiv.org/html/1609.04747
        
         | dginev wrote:
         | For the image widths, there is some CSS fine-tuning that is
         | still needed on the arXiv HTML side. I think that will get
         | fixed soon, just needs the right height directive set.
         | 
         | Getting subfigures emulated via flexbox is one of our more
         | recent LaTeXML enhancements, and still has some ongoing work
         | (working on it today actually). It can be a bit finicky to test
         | - there are easily 20 different ways people can write LaTeX for
         | subfigures in arXiv.
        
       | blackoil wrote:
       | > Didn't see a toggle
       | 
       | You can run toggleColorScheme() twice in the console to switch
       | to the light or dark theme.
        
       | charleshan wrote:
       | This is awesome! Push to Kindle (HTML to EPUB) isn't converting
       | the page properly but I'm sure it's coming soon
        
       | zerop wrote:
       | They should also add commenting capabilities under the paper.. a
       | good discussion will lead to more research and information
       | discovery
        
       | krick wrote:
       | Curious to see how well it will work. Does anybody here know a
       | robust and not crazy computationally expensive solution to
       | extract tables from fairly clean PDF files (especially
       | non-English)?
        
       | llamaInSouth wrote:
       | Nice.... a website that offers even more web pages.
        
       | happyyalda wrote:
       | Unfortunately, I am from Iran so I can't use this new feature. I
       | got a '403 Forbidden' message from the arXiv server. Worse than
       | that, I totally lost my access to arXiv since they changed their
       | CDN to Fastly, because fucking mullahs don't like Fastly!
        
       | forgingahead wrote:
       | What I would like is for ArXiv to have an LLM to rewrite all
       | papers away from the stodgy, stilted language prevalent in every
       | paper. Just write clearly gang, use proper paragraph breaks and
       | stop with the run-on sentences.
        
       | creatonez wrote:
       | I am glad to see a sans font being used, rather than trying to
       | replicate the serif font from the original papers. It's a bit
       | narrow and fuzzy on low resolutions, but a massive improvement
       | just by switching to sans.
        
         | dang wrote:
         | We detached this comment from
         | https://news.ycombinator.com/item?id=38724925.
        
       | SallyThinks wrote:
       | Saw it last night ! I was sooo happy ! Reading papers on phone is
       | a nightmare. Well done guys !
        
       | quickthrower2 wrote:
       | Reading papers on mobile now considered sane!
        
       | astrolx wrote:
       | This is excellent news. Their HTML formatting is also more
       | pleasant than the HTML articles offered by most journals in my
       | field (e.g. arXiv HTML footnotes are displayed as sidenotes on
       | large displays!)
        
       | amai wrote:
       | This will be one of the most popular applications written in Perl,
       | because it is based on the 20-year-old
       | https://en.wikipedia.org/wiki/LaTeXML.
        
       | jcq3 wrote:
       | It will ease data scraping, automated meta analysis...
        
       | alexmolas wrote:
       | This makes downloading and parsing paper data easy, which is
       | pretty handy in the LLM era.
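
To illustrate why HTML is easier to parse than PDF, here is a minimal sketch that pulls headings and visible text out of an HTML string using only Python's standard library. The sample markup, the class name `HeadingAndTextExtractor` and the helper `extract` are made up for illustration; nothing here assumes the actual structure of arXiv's HTML pages, and no network request is made.

```python
from html.parser import HTMLParser


class HeadingAndTextExtractor(HTMLParser):
    """Collect heading text and visible text from an HTML document."""

    SKIPPED_TAGS = {"script", "style"}
    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.text_chunks = []
        self._skip_depth = 0
        self._heading_parts = None  # a list while inside a heading, else None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.HEADING_TAGS:
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.HEADING_TAGS and self._heading_parts is not None:
            self.headings.append(" ".join(self._heading_parts))
            self._heading_parts = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore the contents of <script> and <style>
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        self.text_chunks.append(text)
        if self._heading_parts is not None:
            self._heading_parts.append(text)


def extract(html):
    """Return (headings, visible_text) for an HTML string."""
    parser = HeadingAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings, " ".join(parser.text_chunks)


# Made-up sample document, not real arXiv markup.
SAMPLE = """
<html><head><style>body { color: black; }</style></head>
<body>
<h1>A Made-Up Paper Title</h1>
<h2>1 Introduction</h2>
<p>We study <em>something</em> interesting.</p>
<script>console.log("ignored");</script>
</body></html>
"""

if __name__ == "__main__":
    headings, text = extract(SAMPLE)
    print(headings)
    print(text)
```

Getting the same result from a two-column PDF would require layout analysis to recover reading order, which is the difficulty several comments in this thread point to.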
        
       | HeavyStorm wrote:
       | Thank God. Maybe we can now adapt those for mobile?
        
       | injuly wrote:
       | For anyone who needs it, arxiv-vanity is amazing:
       | https://www.arxiv-vanity.com/
        
         | westurner wrote:
         | arxiv-sanity-_lite_:
         | https://github.com/karpathy/arxiv-sanity-lite
        
       | killjoywashere wrote:
       | So, I'm seeing a lot of chatter in the thread about LaTeX and
       | converting that to HTML and PDF, so LaTeX should be the superior
       | single source of truth. Please keep in mind that many areas of
       | science think of latex as an allergy. I even have a colleague, a
       | plasma physicist, who strongly encourages his team to not use
       | LaTeX because a) collaborators get confused and b) it can be a
       | massive time suck.
        
         | clircle wrote:
         | I agree with your colleague.
         | 
         | At my institution, all of the lowest quality drafts I read are
         | made with LaTeX. I think it's because the programs people use
         | to write LaTeX do not have spelling and grammar checking. Also,
         | the people who prefer LaTeX are the same types of people who
         | are more interested in technical things than in spelling and
         | grammar.
        
       | 101008 wrote:
       | Is there an open source tool to convert any PDF to something like
       | this?
        
         | mcpherrinm wrote:
         | It sounds like (from the shout-out in the post) they're using
         | https://math.nist.gov/~BMiller/LaTeXML/ to convert the paper's
         | LaTeX into HTML, not from PDF.
         | 
         | The most versatile tool I know of for converting various
         | document formats, including PDF to HTML, is the oss ebook tool
         | Calibre: https://manual.calibre-ebook.com/conversion.html
         | 
         | I have seen https://pdfbox.apache.org/ used for extracting text
         | from PDFs for analysis, but you won't get HTML output.
        
       ___________________________________________________________________
       (page generated 2023-12-22 23:01 UTC)