[HN Gopher] ArXiv now offers papers in HTML format
___________________________________________________________________
ArXiv now offers papers in HTML format
Author : programd
Score : 1089 points
Date : 2023-12-21 18:34 UTC (1 days ago)
(HTM) web link (blog.arxiv.org)
(TXT) w3m dump (blog.arxiv.org)
| shrimpx wrote:
| Since the article doesn't link to any example HTML article,
| here's a random link:
|
| https://browse.arxiv.org/html/2312.12451v1
|
| It's cool that it has a dark mode. Didn't see a toggle but
| renders in the system mode.
|
| Overall will make arXiv a lot more accessible on mobile.
| burkaman wrote:
| And here's the PDF of the same paper for comparison:
| https://arxiv.org/pdf/2312.12451.pdf
| FredPret wrote:
| The contrast is massive. I'm much more likely to read the
| html version; that PDF is deeply off-putting in some hard to
| define way. Maybe it's the two columns, or the font, or the
| fact that the format doesn't adjust to fit different screen
| sizes.
| ForkMeOnTinder wrote:
| Definitely the two columns for me. It's super annoying
| skimming a paper and having to scroll down and back up
| again in a zig-zag pattern.
| mmis1000 wrote:
| I think the consuming device matters. An iPad or computer
| has a much wider screen, and a one-column layout is too
| wide there for the average person to scan lines of text
| quickly.
|
| It looks perfectly fine on a phone, though. A two-column
| layout looks terrible on a smartphone; the text is too
| tiny to read comfortably.
|
| It would probably be even better if you could flip left
| and right like an ebook instead of scrolling, to locate
| content faster. But the current design is good enough
| IMO. (Compare it to reading a PDF on a cellphone.)
| kjkjadksj wrote:
| Just zoom the smartphone into one column. Problem solved.
| mmis1000 wrote:
| And then you will have to scroll both up and down and left
| and right, an even worse experience.
| GoblinSlayer wrote:
| It's about "One column layout is too wide" - if you zoom,
| it's not too wide anymore. Also, smartphones have narrow
| screens, not wide ones, and tablets can do that too afaik.
| GoblinSlayer wrote:
| To display a two-column layout you need a tall screen, not
| a wide one. If you display a two-column layout on a short,
| wide screen, you have to scroll up and down in a zigzag
| pattern to read one page.
| tobias2014 wrote:
| This is very interesting, because for me it's just the
| opposite. In particular the two column layout is just more
| readable and approachable for me. The PDF version also
| allows for a presentation just as the authors intended. I
| guess it's good that they offer both now.
| kjkjadksj wrote:
| The authors don't format the pdf, the editor does.
| Authors probably sent a double spaced word document with
| figures and tables on another file.
| tonyg wrote:
| In computer science, the usual case is that the author
| fully formats the paper.
| z2h-a6n wrote:
| Not on arXiv (unless I'm much mistaken), which is a
| preprint server, not a conventional journal.
|
| arXiv accepts various flavors of TeX, or PDFs not
| produced by TeX [0], and automatically produces PDFs and
| HTML where possible (e.g. if TeX is submitted). In the
| case of the example paper under discussion, the authors
| submitted TeX with PDF figures [1], and the PDF version
| of the paper was produced by arXiv. The formatting was
| mainly set by using REVTeX, which is a set of macros for
| LaTeX intended for American Physical Society journals.
|
| [0] https://info.arxiv.org/help/submit/index.html#formats-for-te...
| [1] https://arxiv.org/format/2312.12451
| smartmic wrote:
| FWIW, I recently learned that it is also possible to
| produce nice PDF papers with GNU roff (groff), have a
| look at this example:
| https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....
| pimlottc wrote:
| Looks nice but seems strange to switch from two columns
| to one column after the first page? Although maybe
| they're just trying to demonstrate its capabilities.
| macintux wrote:
| W. Richard Stevens (RIP, still hurts) famously used troff
| for his books.
| frocmlol wrote:
| You are very confidently wrong.
|
| In the arxiv you use latex and do everything yourself.
| There is no editor.
| cozzyd wrote:
| You typically send a .tar.gz of tex files (and figures,
| .bbl, etc.) to the journal. And then you typically upload
| something very similar to the arxiv (I have an arxivify
| Makefile target for my papers that handles some arxiv
| idiosyncrasies like requiring all figures to be in the
| same folder as the .tex file, and it also clears all the
| comments; sometimes you can find amusing things in source
| file comments for some papers).
|
| Some fields may use Word files, but in most of physics
| you would get laughed at...
|
| It is true that most journals will typically reformat
| your .tex in a different way than is displayed on the
| arXiv.
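| The comment-clearing step of an "arxivify" target like the one
| described above could look roughly like this. It is a minimal
| sketch, not cozzyd's actual Makefile: the function names are
| hypothetical, it keeps the "%" itself so TeX's end-of-line
| handling is unchanged, and it assumes no verbatim environments.

```python
def strip_comment(line):
    """Drop the text after the first unescaped % on one line of TeX."""
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            # A backslash escapes the next character (covers \% and \\).
            i += 2
            continue
        if ch == "%":
            # Keep the % so the line ending is still swallowed by TeX.
            return line[: i + 1]
        i += 1
    return line


def strip_tex_comments(source):
    """Apply strip_comment to every line of a TeX source string."""
    return "\n".join(strip_comment(line) for line in source.split("\n"))
```

| Usage would be something like writing
| strip_tex_comments(open("paper.tex").read()) to the upload copy.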
| eigenket wrote:
| You are completely wrong. ArXiv doesn't work like that.
| aragilar wrote:
| Not only is this wrong about physics/astronomy, I
| regularly use the arxiv version because the typography is
| better (e.g. in the published paper an equation is split
| with part of the equation being at the bottom of one
| column, and the top of the next, whereas the equation is
| on one line in the arxiv version).
| JumpCrisscross wrote:
| Do you work extensively with LaTeX?
|
| Two columns is good, albeit annoying on mobile. But the
| font. The typeface kills me, and almost every LaTeX-
| generated document sports it.
| saurik wrote:
| Hilariously, I would probably tolerate the HTML version a
| lot better if it had the font from the PDF (and FWIW, the
| answer for me is "no: I don't work with LaTeX at all... I
| just read a lot of papers").
| folmar wrote:
| If you disable the font rule
|
|   :root, [data-theme=light] {
|     /* --text-font-family: "freight-sans-pro"; */
|   }
|
| it switches to "Noto Serif" that is way easier on the
| eyes.
| GoblinSlayer wrote:
| I hard override the font in browser, designers never get
| it right.
| borg16 wrote:
| what is your font of choice?
| GoblinSlayer wrote:
| Verdana
| westurner wrote:
| https://github.com/neilpanchal/spinzero-jupyter-theme
| /fonts/{cmu-text,cmu-mono} :
|
| > _"Computer Modern" is used for body text to give it a
| professional/academic look_
| cozzyd wrote:
| Hating on Computer Modern (ok, probably now Latin Modern)
| is something close to blasphemy.
| hollerith wrote:
| I hate Computer Modern, and I'm not even particularly
| fussy about typefaces.
| kibwen wrote:
| Computer Modern was not designed for easy viewing on
| screens (think about the screens Knuth would have been
| using in 1977), it was designed for printing in books.
| isaacfung wrote:
| What device and app are you using to read the document?
| kjkjadksj wrote:
| If you read a lot of papers in your line of work you will
| quickly appreciate the two columns and justification.
| FredPret wrote:
| Admittedly, I don't read research papers. But with HTML,
| surely the choice between one or two columns is a
| checkbox away.
| IlliOnato wrote:
| Which checkbox?
|
| I cannot find anything relevant in any of the 3 browsers
| I use (Vivaldi, Firefox, Chrome). Would really
| appreciate this option.
|
| A quick search gave some apparently unmaintained browser
| extensions, and that's it.
| FredPret wrote:
| No, I'm saying there _should_ be a checkbox. That way,
| you can switch between two columns formatted like LaTeX
| and that font they always use, and one column with
| Helvetica / Arial.
| IlliOnato wrote:
| It would be nice, but I am not holding my breath.
| jabroni_salad wrote:
| Only problem is jagoffs like me who need the text to be
| bigger. On PDFs you now get to experience a horizontal
| scrollbar. HTML has text reflow and I can set the line
| length by resizing the window. I'm willing to make a lot
| of sacrifices for that experience.
| z2h-a6n wrote:
| For what it's worth, two column layouts are very common in
| the physical sciences, or at least in physics which I'm
| more familiar with. I have a feeling that the reason is at
| least partly to save page space when using displayed math
| (e.g. equations that are formatted in a break between
| blocks of text), which use the full text width (i.e. the
| width of one column) to display what may be much less than
| half a page wide.
| FredPret wrote:
| It makes sense - for paper. But pixels are infinite -
| HTML is far better for screen display, which is how
| people read things nowadays.
|
| The extra column next to the one I'm reading introduces a
| lot of visual noise, and the content is hard enough as it
| is. I'm sure physicists have all gotten used to it, but
| it certainly trips me up.
| nyssos wrote:
| > The extra column next to the one I'm reading introduces
| a lot of visual noise
|
| Papers are generally not read start to finish in one go:
| there's lots of rereading and jumping back and forth
| between key parts, and anything that moves them further
| apart makes this harder.
| FredPret wrote:
| Ah, that makes more sense. I imagined scientists just
| reading the whole thing start-to-finish.
|
| I still think a flexible layout is best. If you like
| multi-columns and have a wide screen, why not display 12
| columns next to each other?
|
| With PDF this is not possible. With HTML the content can
| in principle be sliced and diced how you like it.
| fuck_google wrote:
| One can also view PDF pages side by side, which works
| pretty well with a 4K monitor.
| arp242 wrote:
| I need to scroll up and down a lot more with two-column
| layout because a single page doesn't fit on my screen in
| my chosen font size (which is fairly large).
|
| But HTML is so much more flexible, and ideally people can
| choose how they want it, although at this point it seems
| that's not (yet) implemented.
|
| I find jumping back and forth is always a pain on
| computer screens and ebooks by the way, and is the major
| reason I much prefer print for this type of thing.
| aragilar wrote:
| Two column is the default in astronomy also.
| mastazi wrote:
| I wonder if perhaps it's a generational thing, I prefer the
| PDF because it reminds me of printed paper, which is what I
| used growing up.
|
| (For reference: I am at the end of Gen X, people 3-4 years
| younger than me are considered Millennials).
| Blikkentrekker wrote:
| Quite so. The font annoys me. This is one of the reasons I
| hate PDF and why I believe these things should be
| controlled by the person reading it, not the publisher.
|
| I do not much care what font the author finds pleasant to
| read, but what I find pleasant to read, and this font isn't
| it, and neither are the colors.
| wruza wrote:
| Seconded. I can (will) actually just read referenced papers
| now instead of hesitating to either get a headache or stay
| uninformed.
|
| Defaults and UX rule the world. It's unfortunate that $subj
| wasn't a thing for so long and probably scared millions of
| curious minds from material. It is _so_ important.
| znpy wrote:
| I prefer the pdf version, mostly. I can annotate it on the
| side both in print and digitally with my iPad. I can also
| invert colors in pdf readers to get some kind of "dark mode"
| easily.
|
| The html version is wasting a lot of space on the right side
| and the color scheme is awful (dark grey on a brown
| background, seriously? How is that any better? Edit:
| disabling dark mode yields a better reading experience wrt
| color scheme). Also, somehow links to references make another
| http request and have no backlink?
|
| The html version could make sense if it had more dynamic
| functionalities: change fonts/line spacing, toggle color
| schemes, maybe a mini map or some other navigational tool?
| Also, some kind of support for highlighting and/or
| annotating?
| alephnerd wrote:
| This is a great UX addition. Why did it take them so long?
| gwern wrote:
| The conversion is still very error-prone. It can't convert a
| lot of packages, and the last paper I read, StarVector, half
| the HTML version is just missing. (I think it hit an error at a
| figure of some sort.) I reported an error, but I've been
| reporting errors against the ar5iv and abstracts for years now
| and the long tail of problems just seems like an incredible
| slog.
| KRAKRISMOTT wrote:
| Where are the computer vision people? This is the perfect
| type of problem for multi modal LLMs
| IlliOnato wrote:
| Except that the errors made by an LLM might be harder to
| spot than converter errors that typically are very blatant,
| and don't usually alter text (perhaps just drop parts of
| it).
|
| Also, a bug in a converter is conceptually much easier to
| fix than to re-train your LLM.
|
| I am not sure that AI in its current state is useful when
| "high fidelity" is required.
| dginev wrote:
| Can confirm. From an ar5iv standpoint, 2.56% articles
| currently fail to convert entirely, and 22.9% have known
| errors to the converter. That leaves 74.5% of nominally
| usable articles. This success rate is noticeably _lower_ for
| the newest batches of arXiv submissions, as the converter
| hasn't caught up with the most recent package innovations.
|
| We have a plan in place to meaningfully fall back for unknown
| packages, but that will take at least another year to put in
| place, and likely another couple of years to stabilize.
|
| Meanwhile, there is some hope that with arXiv launching the
| HTML Beta we will get more contributions for package support
| (LaTeXML is an open source project, with public domain
| licensing, everybody benefits).
|
| But again the original point is spot on. Coverage will be
| hit-or-miss for a while longer yet, for an arbitrary arXiv
| submission. The good news is that authors _could_ work
| towards better support for their articles, if they wanted to.
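| The three percentages quoted above are meant to partition the
| corpus, so the "usable" share is just the remainder. A quick
| arithmetic check, using only the figures from the comment (the
| variable names are made up here):

```python
# Figures as quoted in the comment above, in percent of articles.
failed_entirely = 2.56
known_errors = 22.9

# Whatever neither fails nor has known errors counts as nominally usable.
nominally_usable = 100 - failed_entirely - known_errors

print(round(nominally_usable, 1))  # 74.5
```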
| eviks wrote:
| Because this is a rather conservative field with little
| dependency on the general public, so without much interest in
| helping disseminate the knowledge broadly & accessibly
| (relative to other priorities, not absolute)
| Strilanc wrote:
| How would you do it quickly?
|
| For example, HTML isn't divided into numbered pages while
| PDFs are. A lot of latex interacts with page boundaries.
| Figures tend towards the tops of pages. And there's \clearpage.
| And the reference list might say which page each citation
| appeared on. All that stuff needs someone to decide how to
| handle it and then to implement that handling. Like... what
| value does \pageheight return? Sometimes I resize things to fit
| the page height, and if it was doubled then I should have
| resized to fit the width instead.
| lynndotpy wrote:
| Almost universally, we prepare conference papers as LaTeX files
| made to export to PDFs which fit within the conference's
| template.
|
| It's nontrivial to export this to HTML in all cases, and even
| then, nobody is asking for HTML from us even though we all want
| it. I'm guessing Arxiv is using some kind of converter which
| _usually_ but not _always_ works.
|
| That said, this is a long time coming and PDF as the standard
| should've died a decade ago. I wish I had this when I was in my
| PhD program.
| alright2565 wrote:
| Latex is a very complicated programming language for creating
| documents. It is not easy to create a new backend for it.
|
| As a glimpse into the very tip of the iceberg, this diagram
| https://tex.stackexchange.com/a/158740/ is generated with
| 100% LaTeX code.
| binarymax wrote:
| Nice! Now I don't need to manually replace arxiv with ar5iv.
| Congrats to the team.
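| For older papers the manual swap mentioned above can still be
| scripted. A minimal sketch, assuming ar5iv.org serves the same
| /abs/<id> paths as arxiv.org; to_ar5iv is a hypothetical helper
| name, not part of any arXiv tooling.

```python
from urllib.parse import urlsplit, urlunsplit

ARXIV_HOSTS = {"arxiv.org", "www.arxiv.org"}


def to_ar5iv(url):
    """Rewrite an arxiv.org abstract URL to point at ar5iv.org."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ARXIV_HOSTS:
        raise ValueError(f"not an arxiv.org URL: {url}")
    # Only the host changes; path, query and fragment are kept as-is.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


print(to_ar5iv("https://arxiv.org/abs/2312.12451"))
```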
| imjonse wrote:
| "Our ultimate goal is to backfill arXiv's entire corpus so that
| every paper will have an HTML version, but for now this feature
| is reserved for new papers."
|
| For now it only works for papers submitted this month. But it's
| great to have this feature, makes it so much easier to read on
| phones.
| eviks wrote:
| Finally a modern format you can copy&paste from and read on one
| of the most popular computing platforms!!!
| shusaku wrote:
| Seems like the references aren't working very well.
|
| I really want journals to have two way links in a paper. I get
| google scholar alerts about certain papers being cited, and I
| want to skip to "why did they cite this? Did they use it, improve
| it, or just mention it?"
| r3trohack3r wrote:
| I'd never considered setting up citation alerts like this.
|
| Thank you for the idea!
| shrimpx wrote:
| Looks like clicking a reference adds the hash to the URL but
| doesn't scroll to the reference. If you load the hash URL
| directly in the browser you get a 404 page...
| burkaman wrote:
| https://browse.arxiv.org/html/2312.12451v1#bib.bib1 works,
| but https://browse.arxiv.org/html/2312.12451v1/#bib.bib1
| doesn't.
| IlliOnato wrote:
| Yeah, it seems like a bug in HTML generator...
| cbf66 wrote:
| It is a bug. Will be fixed soon.
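| Until that is fixed, the difference between the two URLs above
| is only the slash before the "#". A client-side workaround
| sketch (not arXiv's fix; drop_trailing_slash is a hypothetical
| name) that turns the failing form into the working one:

```python
from urllib.parse import urlsplit, urlunsplit


def drop_trailing_slash(url):
    """Remove a trailing slash from the path, keeping the #fragment."""
    parts = urlsplit(url)
    # Fall back to "/" so a bare host URL keeps a valid path.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(parts._replace(path=path))


print(drop_trailing_slash("https://browse.arxiv.org/html/2312.12451v1/#bib.bib1"))
```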
| pushfoo wrote:
| Previously discussed:
| https://news.ycombinator.com/item?id=38713215
| carlosjobim wrote:
| With the 2024 browser update, this means I can read these
| articles on my ancient Kindle perfectly fine.
| ChrisArchitect wrote:
| [dupe] from yesterday
|
| More here: https://news.ycombinator.com/item?id=38713215
| winwang wrote:
| Probably more accessible in general. (PDF) Papers are
| psychologically scary.
| mmis1000 wrote:
| PDF is by design an image format that can also embed text. It
| just doesn't have the primitives to properly retain the article
| structure.
| PaulHoule wrote:
| Nah, it's a super-complex system that creates a graph of
| components, can draw vectors like PostScript, can embed 3-d
| models, etc. The spec is here
|
| https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
|
| if you look at sections 14.6 through 14.10 you will find
| quite baroque facilities for representing the structure of
| documents in great detail, making documents with
| accessibility data, making documents that can reflow with
| HTML, etc. Not to mention the 14.11 stuff which addresses
| problems with high end printing (say you want to make litho
| plates for a book.)
|
| For that matter sections 14.4 and 14.5 describe facilities
| that can be used to add additional private data to PDF files
| for particular applications. For instance Adobe Illustrator's
| files are PDF files with some extra private data, and
| https://en.wikipedia.org/wiki/GeoPDF
|
| I like to complain that PDF has no facility to draw a circle
| but instead makes you approximate a circle with (accursed)
| Bezier curves but other than that the main complaint people
| make about PDF is that it is too complicated not that it is
| lacking this feature or that feature.
|
| Contrast that to a highly opinionated document format like
| DjVu
|
| https://en.wikipedia.org/wiki/DjVu
|
| which came out around the same time as PDF and is specialized
| for the problem of scanned documents and works by decomposing
| the document into three layers, one of which is a bilevel
| layer intended to represent text. All three layers have
| specialized coding schemes, the text layer in particular
| tries to identify that every copy of (say) the letter "e" or
| the character "Han " is the same and reuses the same bitmap
| for them.
| anonimo37 wrote:
| You would normally use a library to create the PDF so you
| don't need deal with the complexity of the format. A
| library would likely provide a function for drawing circles
| that translates the circle into Bezier curves.
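| What such a library function does under the hood can be
| sketched as follows. This is the textbook cubic Bezier
| approximation of a quarter circle, not code from any particular
| PDF library; the function names are made up for illustration.
| The control-point offset is the usual constant
| 4/3 * (sqrt(2) - 1), about 0.5523 times the radius.

```python
import math

# Standard control-point offset for a 90-degree arc, as a fraction of r.
KAPPA = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)


def quarter_circle_control_points(radius=1.0):
    """Control points of one cubic Bezier from (r, 0) to (0, r)."""
    k = KAPPA * radius
    return [(radius, 0.0), (radius, k), (k, radius), (0.0, radius)]


def cubic_bezier(points, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    u = 1.0 - t
    x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
    y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
    return x, y


def max_radial_error(radius=1.0, samples=1000):
    """Largest distance between the curve and the true circle."""
    points = quarter_circle_control_points(radius)
    return max(
        abs(math.hypot(*cubic_bezier(points, i / samples)) - radius)
        for i in range(samples + 1)
    )


print(KAPPA, max_radial_error())
```

| Four such segments, rotated by 90 degrees each, make the full
| circle; the deviation from a true circle is a small fraction of
| a percent of the radius, which is why the approximation is
| invisible in print.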
| mmis1000 wrote:
| Adobe can surely add whatever extensions they want to
| address whatever problem. But unfortunately, most
| implementations outside of Adobe Acrobat itself won't
| implement all of them. Most libraries just implement
| the basic parts for printing and marking (at best, they
| support forms and JavaScript). The rest basically doesn't
| exist for anyone.
|
| There is a reason that most people still use docx for forms
| even though PDF technically supports forms.
|
| PS: the PDF readers in Firefox and Chrome didn't really
| support forms until very late versions.
| ZeroCool2u wrote:
| Wow, this is _so_ much better!
| choppaface wrote:
| Hope they benefit from CDN caching now too.
|
| Edit: aaaand they got Fastly
| https://news.ycombinator.com/item?id=38723373
| cozzyd wrote:
| doesn't work great with long author lists...
|
| https://browse.arxiv.org/html/2312.12907v1
| degenerate wrote:
| The PDF is worse, so there is no simple answer to this:
| https://arxiv.org/pdf/2312.12907v1.pdf
|
| At least the HTML version pairs each author with their
| affiliations, instead of the PDF which has all the names on
| page 1, and all the affiliations on page 2. That's completely
| unreadable.
| cozzyd wrote:
| The PDF is better because I'm trained to scroll past the
| author list. That takes forever on the html version.
| mattigames wrote:
| You can click the "Introduction" anchor on the left side
| and it scrolls for you past the author list
| cozzyd wrote:
| well it skips the abstract too, but yes, you can scroll
| back up to see it.
| mattigames wrote:
| Yeah, it's a bit weird that the abstract doesn't have a
| link on the left
| cozzyd wrote:
| Probably because \abstract{ } is treated differently than
| \section{ }, I guess...
| IlliOnato wrote:
| For me the PDF is much better. It's compact and clean, if I
| really need to see an affiliation for a particular author,
| it's really easy to do so in the PDF, not so in the HTML.
|
| It's highly unlikely anybody will read an entire author list
| this long; typically you would read the first two or three
| names, or check if some particular name is on the list. So
| the compactness of the list and being able to quickly get to
| the article contents is important.
| Al-Khwarizmi wrote:
| Nice! It would be even better if they offered authors of previous
| papers the option of converting to HTML, as the latex sources are
| already in the system.
| fprog wrote:
| The article states they're going to backfill all, or nearly
| all, previously submitted papers!
| FredPret wrote:
| This is brilliant. I don't share academia's love of LaTeX
| multi-column PDFs.
| tiagod wrote:
| I like multi-column text on paper (literally), but it's awkward
| in digital where you can just shape text on the fly to whatever
| column size you want
| golol wrote:
| The problem is that gaining this responsiveness fundamentally
| makes your task much more difficult. Instead of just creating
| a picture you're now writing code that has to be maintained.
| In my philosophy arxiv is for documents which are set in
| granite - pictures.
| leoncaet wrote:
| I just hope they don't stop offering the papers in PDF. Even when
| I'm on a computer, I still prefer to read PDFs.
| creatonez wrote:
| There is a taste component to it of course, but the history of
| PDF shows that it's the wrong format for reading on a computer.
| It was originally meant to be the end result of a publishing
| process before printing, a layer that sits right between the
| publishing software and the postscript that gets sent to the
| printer. This makes the PDF format quite inflexible for reading
| on a computer, with it being impossible to properly zoom or
| adjust the reading experience.
|
| Unfortunately many institutions and businesses have ignored its
| limitations because PDF turned out to be an obvious-but-naive way
| to put a 'sheets of paper' metaphor into a computer system, which
| in the 1990s appealed to tech illiterate folks doing bare-bones
| computerization of existing paper systems. So later we got
| complicated and error-prone tools for editing PDFs, and many
| random additions to the spec to allow for unusual use cases.
| impendia wrote:
| > This makes the PDF format quite inflexible for reading on a
| computer, with it being impossible to properly zoom or adjust
| the reading experience.
|
| As an academic researcher, generally speaking I also prefer
| PDF, and the inflexibility and static nature is a feature,
| not a bug. I appreciate the fact that a paper will appear the
| same everywhere, that I can refer to "the top of page 7",
| etc.
|
| The exception is if I wanted to just skim a paper; in this
| case, I think I'd prefer HTML.
|
| I'm a huge fan of what arXiv is doing here. It effectively
| preserves the status quo, while adding an additional option
| on the side. The HTML option might prove a little bit useful
| for me, and it is likely to prove extremely useful for people
| with disabilities.
| aragilar wrote:
| I know of no-one who provides only HTML to arxiv, it's either
| latex, or doc/odt, so the PDFs should always be there.
| sylware wrote:
| Like the maths noscript/basic (x)html wikipedia generator:
|
| The magic of inline images at a known DPI, of course you can
| provide images for different DPIs.
|
| Reading maths/science noscript/basic (x)html documents on my 100
| DPI monitor, on wikipedia. Not yet fully ready on arxiv.
| gms7777 wrote:
| About time. Biorxiv and medrxiv have been doing this for probably
| half a decade at this point?
| dginev wrote:
| Wrong, arXiv was first. Check this HTML paper from 1997:
|
| https://arxiv.org/html/astro-ph/9708066
| cbf66 wrote:
| medRxiv and bioRxiv get most of their submissions as Word
| files. It's a much easier conversion, and if necessary they
| have manual touch-up. Not feasible for arXiv's volume.
| jez wrote:
| It would be neat if they offered submitters the chance to upload
| their own HTML version alongside the PDF version, instead of
| always relying on an automatic conversion process.
|
| - I can imagine authors feeling frustrated if someone reaches out
| about a problem in the HTML version of their paper, but they have
| no way to correct it except by hoping that a change to the PDF
| fixes a change to the generated HTML. Easier to just fix the
| formatting problem in the PDF outright.
|
| - It would be neat to allow people to experiment with alternative
| formatting for their papers. For example, imagine a paper about a
| programming language that embeds a sandbox you can use to play
| around with the language under discussion. Or a paper about
| multivariable calculus and you can interact with a three
| dimensional plot of some function.
| layer8 wrote:
| They'd have to define and document a "safe" subset of HTML, and
| implement a filter/checker for it. Otherwise we'd end up with
| papers containing ads and tracking and XSS vulnerabilities and
| whatnot.
| digging wrote:
| Those are issues with JavaScript, not HTML. Wouldn't
| filtering out iframes pretty much keep us in the clear?
| layer8 wrote:
| The parent wanted interactive 3D plots, which means
| JavaScript embedded in or linked from the HTML. Then
| there's stuff like JavaScript embedded in SVG.
| CaptainOfCoit wrote:
| > Those are issues with JavaScript, not HTML
|
| What about the various HTML tags that load remote resources?
| From script and link to things like img, or a CSS
| `background-image` property added in a `style` attribute.
|
| There are a bunch of ways to make remote requests even without
| JavaScript.
| quickthrower2 wrote:
| The same _problem_ exists in HN comments. This comment
| _gets_ converted _to_ html. But it is
| fine!
| fdupress wrote:
| "gets converted to" and "gets rendered as uploaded by the
| user" are two different things.
|
| There are no issues with arXiv generating the HTML and
| sending that over: they control the generation process,
| and users who visit arXiv already trust it to not be
| malicious. The issue is with letting the user upload
| their own and having it sent on to other users as is.
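| The "safe subset plus checker" idea from this subthread could be
| sketched with the standard library's html.parser. Everything
| here is illustrative: the allowlist is a made-up example, the
| names are hypothetical, and a real upload filter would need a
| much stricter, audited policy.

```python
from html.parser import HTMLParser

# Hypothetical allowlist for illustration; a real policy would be longer.
ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li",
                "h1", "h2", "img", "table", "tr", "td"}
URL_ATTRS = {"src", "href"}


class SafeSubsetChecker(HTMLParser):
    """Collects policy violations instead of rewriting the document."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"disallowed tag <{tag}>")
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on"):
                # Inline event handlers such as onclick run JavaScript.
                self.violations.append(f"event handler {name} on <{tag}>")
            elif name == "style":
                # Inline CSS can load remote resources via url(...).
                self.violations.append(f"inline style on <{tag}>")
            elif name in URL_ATTRS and (":" in value or value.startswith("//")):
                # Absolute, protocol-relative and javascript: URLs.
                self.violations.append(f"non-local URL in {name} on <{tag}>")


def check_html(source):
    """Return a list of violations; an empty list means the input passed."""
    checker = SafeSubsetChecker()
    checker.feed(source)
    checker.close()
    return checker.violations


print(check_html('<img src="https://tracker.example/p.gif">'))
```

| Relative paths and same-document "#" links pass; script tags,
| iframes, event handlers, inline styles and absolute URLs are all
| reported.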
| diffeomorphism wrote:
| > It would be neat if they offered submitters the chance to
| upload their own HTML version alongside the PDF version,
| instead of always relying on an automatic conversion process.
|
| Please don't. Then you will have a mismatch between the source
| and the "own html" which ruins the point of uploading the
| source.
| eviks wrote:
| Pdf isn't the source
| IlliOnato wrote:
| But the PDF is also generated. LaTeX is the single source
| of truth.
| kjkjadksj wrote:
| Most authors probably have no interest in learning html. Also
| most authors want nothing to do with the work by the time its
| submitted. It was probably hell getting the project to that
| point of publishing, they want to be done with it and move on
| to the next thing going on in their career asap.
| jez wrote:
| I think this is an argument in favor of doing automatic PDF
| -> HTML conversion for the authors that don't want to touch
| it, but I don't think it's an argument against letting those
| who are fine with HTML provide their own.
| IlliOnato wrote:
| HTML is not generated from PDF. Both PDF and HTML are
| generated from LaTeX.
| bookofjoe wrote:
| You hit on an unappreciated truth. By the time my papers
| appeared in print, I was so sick of them and the endless
| effort involved in taking them from raw data to finished,
| edited, proofed, rewritten a zillion times to meet the
| reviewers' and editors' requests and corrections and
| suggestions, that I didn't even read the published paper when
| it arrived as preprints and in the journal.
|
| Enough!
|
| My proof:
| https://scholar.google.com/citations?user=5DdrMc8AAAAJ&hl=en
| tiagod wrote:
| I was under the impression the source authors publish to arxiv
| was a latex file
| jraph wrote:
| It is.
| jez wrote:
| Ah, thanks for clarifying!
|
| I looked up the submission formats, and it looks like if you
| authored the paper in TeX/LaTeX, they do not accept pre-
| rendered versions of the document.
|
| https://info.arxiv.org/help/submit/index.html#formats-for-
| te...
|
| But if you did not author it in TeX/LaTeX (e.g., Word, Google
| Docs, etc.) it appears you can upload a PDF or HTML yourself.
| IlliOnato wrote:
| But it's still a single source of truth. Only one document
| is submitted. So for works submitted as HTML no PDF or
| LaTeX version is available.
| IlliOnato wrote:
| No, it would not. It's critically important that there is only
| one "logical" article, albeit with different representations.
| In other words, a single "source of truth".
|
| With "sideloading" of HTML there is no way in general to make
| sure that the _contents_ of LaTeX (and PDF) on one side and
| HTML on the other side is the same.
| dataflow wrote:
| > With "sideloading" of HTML there is no way in general to
| make sure that the contents of LaTeX (and PDF) on one side
| and HTML on the other side is the same.
|
| Is it not possible to write LaTeX code that produces
| different contents in HTML vs. PDF?
| IlliOnato wrote:
| Well, perhaps by exploiting bugs/shortcomings in PDF and
| HTML converters. Not by design.
|
| However, bugs get fixed, and since the PDF and HTML are
| generated dynamically, any such hack would be extremely
| fragile.
|
| And while "single source of truth" can help to prevent such
| malicious discrepancy, it's unlikely that people would try
| to hack the system this way: what for?
|
| Far more likely scenario is unintentional discrepancy, and
| single source of truth definitely helps to prevent that!
| GoblinSlayer wrote:
| Huh? What's the point of html version if you define it as
| source of deception?
| thomasahle wrote:
| > It would be neat if they offered submitters the chance to
| upload their own HTML version alongside the PDF version,
| instead of always relying on an automatic conversion process.
|
| Can you recommend a system I can use to compile my latex, while
| also making sure the html is going to look good? I'd like some
| kinds of css style @media queries to switch between certain
| parts of the layout, while keeping a single latex file.
| turing_complete wrote:
| With the shelf life of web technologies, authors would
| constantly have to maintain their "papers" or they just would
| not be accessible after a while.
| endergen wrote:
| I was hoping this meant that html native submissions would be
| possible, so that people made interactive explanations.
| lucidrains wrote:
| nice! will make reading papers on the phone so much more
| pleasant!
| tarboreus wrote:
| One of the reasons is to make the papers more accessible to
| people with disabilities, especially the blind. I participated in
| a conference they hosted on this a few months ago, I recommend
| taking a look at the recordings if you're interested in thinking
| on this.
|
| https://accessibility2023.arxiv.org/
| miki123211 wrote:
| Blind person here, can confirm this. Reading PDFs with a screen
| reader is bad, reading PDFs that come from LaTeX is worse,
| reading LaTeX math is pretty much impossible. All the semantic
| info you need is just thrown away.
|
| You _can_ make decently accessible PDFs but it's lots of work,
| you need Acrobat on the producer's side and might also need it
| on the consumer's side. Free tools don't even come close.
| There's also the fact that the process of making accessible
| PDFs in Acrobat isn't itself accessible.
|
| With that said, the way screen readers treat HTML math
| certainly isn't perfect, it's geared more towards school
| children than anything above calculus. I'm probably going to
| stay with my LaTeX source files for now. At least ArXiv offers
| those, not many sites do. To be fair, that approach also has
| its own set of problems (particularly when people use some
| extra fancy formatting in their math equations, making the
| markup hard to read), but I find this to be the best approach
| for me so far, at least on AI/ML papers.
| saurik wrote:
| Huh. It would seem like, of all the things which should make
| it easy to generate the correct accessibility information,
| the pipeline of compiling a paper from source code in LaTeX
| should nail it... maybe we should all pitch in to some pool
| to pay someone to put in the required effort to connect all
| the dots?
| semi-extrinsic wrote:
| Kind of tangential, but it's also kind of surprising how
| difficult it is in LaTeX to make a plot of an equation.
|
| Say I have Equation \ref{eq}. Why can't I just say "plot
| \ref{eq} for x from -6 to 11" and get my graph?
|
| And yes, I know about pgfplots, PSTricks, TikZ etc. But in
| all those cases, I need to define the same equation twice,
| in different syntax to boot. It's kind of unsatisfying.
| fsh wrote:
| TeX is a very arcane language, and it doesn't support
| floating point numbers. Few languages would be less
| suited for making a plotting library.
| semi-extrinsic wrote:
| Both pgfplots and PSTricks and TikZ are plotting
| libraries. It seems like it shouldn't be that hard to let
| them plot an equation written elsewhere in different
| syntax.
| IlliOnato wrote:
| > Say I have Equation \ref{eq}. Why can't I just say
| "plot \ref{eq} for x from -6 to 11" and get my graph?
|
| Pretty much for the same reason you cannot press a word
| and get a pop-up dictionary definition in a paper book.
| semi-extrinsic wrote:
| To be clear, I meant in the LaTeX source code. And there
| I can already write code that plots equations, I just
| have to re-type the equation in a new syntax.
| jahewson wrote:
| Surprisingly it's not easy, and depending on the field it
| can be quite challenging. The reason for this is that TeX
| captures the visual aspects of typesetting, not the
| semantic meaning of the mathematics.
|
| A simple example is '\sum' which provides no way to capture
| the expression being summed over - because that's not
| necessary for typesetting. That's not the case in, say,
| MathML.
|
| Writing MathML is no fun though because mathematical
| formulae are visually ambiguous and we rely on the context
| to know how to read them, e.g. does 'f(x - 1)' mean
| function f called with argument x - 1, or does it mean
| variable f multiplied by x - 1?
| ldenoue wrote:
| I wrote an app called PDF Reflow that reflows the original
| PDF using image processing to cut out words into tiles so you
| see the reflowed version of the text in their original look.
|
| https://www.appblit.com/pdfreflow
| sydbarrett74 wrote:
| Any chance of releasing an Android version?
| no_identd wrote:
| +1
| hedora wrote:
| Gv (part of ghostscript) used to do a good job of this
| for two column documents. When zoomed in to show one
| column width of text, the spacebar ran through the top of
| column 1, then the bottom of column 1, then the top of
| column 2 and so on.
|
| The amount it scrolled probably depended on the aspect
| ratio of the window, so it might be multiple key presses
| to scroll an entire column.
| ldenoue wrote:
| It's using web technologies so yes it could also be on
| Android. I'll see what can be done.
| IlliOnato wrote:
| +1
| jakderrida wrote:
| Hold on... Are you telling me that all these complex
| sentences are being typed out based on your voice alone?
| That's insane.
| ehPReth wrote:
| ? blind people can use keyboards
| kzrdude wrote:
| Hm tangential question but shouldn't touch typing be well
| accessible for many blind computer users?
| topato wrote:
| I'd say it would be simple to talk type these using windows
| 11's redux of voice typing. Pretty damn accurate and easy
| to modify/variate text/options. I use it all the time to
| make tech/engineering blog posts, faster and more organic
| than typing, typically, and it learns your technoacronyms.
| Combined with voice access, it makes it trivial to fully
| operate your computer (well, at least, browse the web,
| email, and media apps) from across the room. For anyone who
| hasn't tried the updated version, highly suggest hitting
| windowskey+h and giving it a shot.
| spookie wrote:
| There are braille keyboards too
| Blikkentrekker wrote:
| Or normal keyboards? Many people can type blind. Some
| learned to do so while born blind, others became blind
| after they had already learned this skill.
|
| I would assume that the majority of persons on HN are not
| looking at their keyboard as they type.
| spookie wrote:
| I was just giving an additional way to use a computer not
| known by many. Either way, we shouldn't rely on the
| skills of a few to interact with a computer.
| anthk wrote:
| Emacs with Emacspeak has a math reading module.
| ahepp wrote:
| Do you think there's potential for language models to play a
| role here? I know that AI can get tossed around as a
| buzzword, but hasn't it proved quite successful in fields
| like computer vision?
|
| I'm not deeply familiar with the state of that art, but it
| seems like recovering the metadata from a PDF generated by
| LaTeX would be no more impressive than many other things
| we're currently seeing language models achieve?
| staunton wrote:
| I'm absolutely positive a few million dollars could get you
| a system that can "read aloud" pdf math papers in no time.
| I guess people will wait for it to become cheaper though.
| hutzlibu wrote:
| You can also have that cheaper already. But having it
| stable and reliable - will take some time and possibly
| more money, depending on your definition of reliable.
| throwaway287391 wrote:
| You wouldn't need to use computer vision on a picture of
| the PDF. arXiv has the tex source for most of the papers.
| An LLM trained on code could do a pretty good job of
| translating tex to readable html with a bit of effort.
| miki123211 wrote:
| Mathpix is trying to achieve something like this, and they
| do consider the visually impaired market AFAIK, but it's
| pretty expensive and I have no experience with it
| personally, so I can't say how good it is.
| spookie wrote:
| Yup LaTeX math doesn't make sense. I've been trying to hack
| my way into getting a voice model to read it but no real
| progress.
| IlliOnato wrote:
| LaTeX is a programming language for generating beautiful
| pages, basically a typesetting system. It serves this
| purpose fantastically well.
|
| It was not designed to provide semantic information,
| unfortunately. So getting anything other than visual
| representation out of it is _hard_.
| Blikkentrekker wrote:
| I made these arguments two decades ago when I was still in
| university that PDF is a horrible format because it's purely
| presentational, especially for people with disabilities
| whose software relies on semantic information. LaTeX last
| time I used it didn't even have a different symbol for
| uppercase Alpha and A because the glyphs are
| indistinguishable.
|
| They argued that PDF was superior because the publisher could
| control how it looked and it looked the same everywhere but
| the point is that it should not. Things such as font size and
| line spacing should be at the control of the consumer, not
| the publisher. This isn't simply blind people but for
| instance also persons with dyslexia who use particular fonts
| to make it easier to read for them. Or in my case, someone
| who simply gets a headache from fonts and line-spacing that
| is too big. I've also been using darkmode everywhere for so
| long now that reading black text on a white surface on a
| screen gives me a headache.
| seanhunter wrote:
| To write uppercase Alpha you need a modern version of latex
| (ie xelatex or lualatex) and to include the unicode-math
| package
|
| https://tex.stackexchange.com/questions/485593/how-to-write-...
| IlliOnato wrote:
| For scientific articles pagination is still important,
| because it's how you refer to a particular part of a paper.
| If things like font size and line spacing are at the
| control of the consumer, pagination is not preserved.
|
| This problem is harder than one would naively think.
| lsaferite wrote:
| Seems like they should use detailed section numbering
| like military documents and laws. Referring by page
| number seems very coarse by comparison.
| phlakaton wrote:
| For the math equations, I'm curious: does MathML do any
| better for you than LaTeX?
| seanhunter wrote:
| Not the person you're asking the question to, but it's
| worth noting (if you don't already know) that MathML is
| really not designed at all as an input language for
| practitioners who just want to write a few equations in
| some document. It's designed as an output/presentation
| language so that devices that want to render some maths can
| do so faithfully[1]. As such, if you're a human being who
| wants to typeset some equation, you'll want to go to latex
| every single time rather than mathml and then someone else
| has to figure out the conversion.
|
| [1] Great explanation here:
| https://tex.stackexchange.com/questions/57717/relationship-b...
| IlliOnato wrote:
| On the other hand, "semantic" flavor of MathML (as
| opposed to "presentation") is much easier than TeX for
| things like screen readers, both conceptually and in
| practice.
| kkylin wrote:
| I teach math at a university. A couple years ago I had two
| blind students in my section of first-year calculus, and I
| really struggled with the tooling. Using latexml, I could
| produce documents that _one_ of the students could use with a
| screen reader, but the other student never managed to make it
| work on their machine. Both students prefer braille but I
| didn't find anything open source that could typeset
| mathematical braille easily. Our disability resource office
| sends things out to a contractor to typeset into braille; the
| turn-around is measured in _weeks_.
|
| Anyway, if you (or anyone else reading this) has suggestions
| I'd really appreciate it!
| lostlogin wrote:
| > Our disability resource office sends things out to a
| contractor to typeset into braille; the turn-around is
| measured in weeks.
|
| This seems a massive gap in the market - many institutions
| have funding earmarked for such things.
| hedora wrote:
| I wonder if this is a useful service that an llm could
| actually outperform humans on.
| Saigonautica wrote:
| Interesting! I never thought about this, thank you for
| sharing.
|
| What kind of turn-around time would be practical? Could you
| point me to any typeset mathematical braille that would be
| an example of a solution to your problem? Is Nemeth the
| only important standard, or are others important for you
| too?
|
| I'm wondering if it's practical to set this up as back-
| office work here in Vietnam. There are some outlying
| provinces here where there are very few job opportunities.
| Job opportunities for the blind also round down to zero
| here (e.g. I could hire for proofreading). Maybe there's
| room to do something cool here.
| miki123211 wrote:
| What's English proficiency (and American braille code
| proficiency) like in Vietnam?
|
| Keep in mind that most blind people who speak English
| fluently but don't live in an English-speaking country
| (myself included) can't read English braille, or at least
| not well. Because of how voluminous Braille is, it uses
| contractions, single characters that replace common words
| and character combinations like "the", "would", "ing" or
| "ed". Those tend to be language specific, never taught
| outside their country or countries of use, and hard to
| get accessible electronic materials for. The math codes
| are completely different too, we use something derived
| from Marburg, while English-speaking countries use
| Nemeth. Even basic characters like + and - differ between
| those two, not to mention more complicated structures.
| It's not just the dot patterns that are different but
| also the design principles, like where you put spaces or
| when you can omit "begin fraction" / "end fraction"
| characters.
| miki123211 wrote:
| I learned (the basics of) LaTeX in my last year of middle
| school, and stuck with it ever since. To be fair, I was
| into computers since I was a child, played with Rockbox at
| the age of 10, started to dabble in programming shortly
| after, so this was a lot less scary than most of the things
| I was doing already. I took my middle and high school
| finals (they're kind of like SAT but matter a lot more) by
| producing LaTeX output, which I then compiled to PDF and
| printed. The test itself was in braille, as that was all
| that our government could do.
|
| Throughout college, my first question to most of my
| professors of math subjects was "do you do LaTeX, and can
| you give me your source code." Most said yes, and that's
| how we worked. LaTeX in, LaTeX or PDF out, depending on
| what the professor preferred.
|
| The amount of LaTeX you need for calculus 1 isn't that
| great, you could probably teach it to a relatively bright
| student if you had an hour or two to spare, and then give
| them the source files. If you have the time, I'd suggest
| producing "stripped" versions of your files, with as little
| markup as possible to get your point across and no fancy
| formatting unless absolutely necessary. The amount of hoops
| some books and papers jump through to "look nice" drives me
| crazy.
|
| You could also consider producing, teaching and consuming
| ASCII math, which seems like an even simpler and friendlier
| format. I couldn't really use it much in my school career
| for boring technical reasons, but it looks like a promising
| option.
| wilg wrote:
| For accessibility purposes (and regular reading), it would be
| so much better to drop the justified text. Ragged edge is the
| way to go!
|
| https://www.boia.org/blog/why-justified-or-centered-text-is-...
| jonatanheyman wrote:
| Not necessarily:
|
| https://heyman.info/2023/fill-justified-text-on-the-web
| wilg wrote:
| Perhaps someone can publish a paper to arXiv that provides
| a meta-analysis. But still there doesn't seem to be a clear
| reason _to_ justify it, given that almost all internet text
| is not justified.
| dginev wrote:
| To me one of the exciting aspects of HTML is that we can
| theme the same article in different ways, tailored to
| individual preferences - just swap in a different CSS
| file.
|
| Having a two-column theme, or left-aligned vs justified
| themes, could be workable in the long run. I hope that we
| get to see some browser extensions modding the pages
| before too long.
|
| The reason for the current justified text is that it is
| the default aesthetic for a LaTeX-based article, and a
| lot of authors expect it.
| odyssey7 wrote:
| article { text-justify: Knuth-Plass; }
| SushiHippie wrote:
| Mind explaining?
| odyssey7 wrote:
| The comment is invalid CSS to apply the Knuth-Plass algorithm
| in rendering an HTML article. Knuth being a perfectionist's
| perfectionist, TeX uses this algorithm to determine optimal
| line breaks to provide for better text justification.
|
| Here's a discussion of hacks to achieve the algorithm's
| results on web pages and an upcoming CSS feature as of 2020.
| https://mpetroff.net/2020/05/pre-calculated-line-breaks-for-...
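|
| To make the "optimal line breaks" idea concrete, here is a
| minimal sketch of a total-fit line breaker. It is not the full
| Knuth-Plass algorithm (no boxes, glue, penalties or
| hyphenation); it only assumes a fixed-width font and charges
| every line except the last the square of its unused width. The
| name break_lines is made up for illustration.

```python
def break_lines(words: list[str], width: int) -> list[str]:
    """Choose line breaks minimising the total squared trailing space.

    Simplified 'total-fit' breaker: every line except the last is
    charged (width - line_length) ** 2, and the whole paragraph is
    optimised at once instead of line by line.
    """
    if any(len(w) > width for w in words):
        raise ValueError("a word is longer than the line width")

    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: where the line starting at i ends
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width:
                break
            # The last line is free: it may be as short as it likes.
            cost = 0.0 if j == n - 1 else float((width - length) ** 2)
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1

    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    for line in break_lines("aaa bb cc ddddd".split(), width=6):
        print(line)
```

| A greedy breaker would put "aaa bb" on the first line and leave
| "cc" alone on the second; the sketch above prefers a shorter
| first line because that makes the paragraph more even overall.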
| computerfriend wrote:
| If only.
| matt1 wrote:
| For anyone interested in staying informed about important new
| AI/ML papers on arXiv, check out https://www.emergentmind.com, a
| site I'm building that should help.
|
| Emergent Mind works by checking social media for arXiv paper
| mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks
| the papers based on how much social media activity there has been
| and how long since the paper was published (similar to how HN and
| Reddit work, except using social media activity, not upvotes, for
| the ranking). Then, for each paper, it summarizes it using GPT-4,
| links to the social media discussions, paper references, and
| related papers.
|
| It's a fairly new site and I haven't shared it much yet. Would
| love any feedback or requests you all have for improving it.
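|
| As a rough illustration of ranking by activity and age, here
| is a minimal sketch of a time-decayed score. It assumes the
| commonly cited HN-style formula (activity divided by a power of
| age); the Paper fields, the gravity value and the function
| names are hypothetical and are not taken from Emergent Mind.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    mentions: int     # hypothetical count of social media mentions
    age_hours: float  # hours since the paper was published


def rank_score(paper: Paper, gravity: float = 1.8) -> float:
    """Time-decayed score: more mentions raise it, age lowers it."""
    return paper.mentions / (paper.age_hours + 2.0) ** gravity


def rank(papers: list[Paper]) -> list[Paper]:
    """Return papers sorted from highest to lowest score."""
    return sorted(papers, key=rank_score, reverse=True)


papers = [
    Paper("old but popular", mentions=200, age_hours=72.0),
    Paper("new and quiet", mentions=5, age_hours=1.0),
    Paper("new and popular", mentions=50, age_hours=3.0),
]

if __name__ == "__main__":
    for p in rank(papers):
        print(f"{rank_score(p):8.3f}  {p.title}")
```

| With these made-up numbers the three-day-old paper ends up
| last despite having the most mentions, which is the point of
| the decay term.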
| raccoonDivider wrote:
| That looks great. No real feedback yet, but it's the kind of
| thing I've always been looking for as a better alternative to
| Twitter.
| matt1 wrote:
| Thanks! I've got a lot more planned for it too. If anyone has
| any feedback that doesn't make sense to share here, or if
| you're a researcher who is open to some questions about how
| you currently follow arXiv papers, drop me a note at
| matt@emergentmind.com.
| CodeCube wrote:
| Love to see Emergent Mind continuing to innovate!
| sureglymop wrote:
| Love the clean design of the website! Looks amazing on mobile.
| matt1 wrote:
| Thanks! If you ever run into any issues or have any
| suggestions for improving the site, drop me a note:
| matt@emergentmind.com.
| jakderrida wrote:
| This is exactly what I was using HN for. But, yeah, it kinda
| sucked compared to yours. Another thing I was trying to create
| was some sort of NN model that could use the semanticscholar
| h-index of authors along with the abstract text and T5 to
| estimate the one-year out citations. Just for personal use,
| though. That whole thing fell apart because semanticscholar is
| kinda crap for associating author links to the same author. I
| frequently ended up with the wrong professors, which I'd think
| would be easily fixable for them.
| carlossouza wrote:
| I did that (used other features). This is how new papers are
| ranked here:
|
| https://trendingpapers.com
| matt1 wrote:
| Great site, thanks for sharing. Can you explain how you're
| determining how many times a paper is cited? Obviously
| papers include a list of references, but extracting them
| accurately from the PDF is difficult in my experience (two
| column formats, ugh) - though the new HTML versions help.
| And even if you have a list, many authors just mention
| arXiv paper titles, not their ids, making identifying
| specific references tricky.
| carlossouza wrote:
| Difficult, yes... but not impossible :)
|
| I just extract the titles and look for their respective
| ids.
|
| The real challenge was how to do that at scale. Only in
| CS there are well over half a million papers
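|
| A minimal sketch of the "extract the titles and look for their
| ids" step, assuming you already have a {paper_id: title}
| mapping. The ids and titles below are invented, and the
| helper names are hypothetical; this is not the site's actual
| pipeline.

```python
import re
import unicodedata
from typing import Optional


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def build_index(known: dict[str, str]) -> dict[str, str]:
    """Map normalized title -> paper id, from a {paper_id: title} dict."""
    return {normalize_title(title): pid for pid, title in known.items()}


def resolve(titles: list[str], index: dict[str, str]) -> dict[str, Optional[str]]:
    """Look up each extracted reference title; None means no exact match."""
    return {t: index.get(normalize_title(t)) for t in titles}


if __name__ == "__main__":
    index = build_index({"id-1": "A Study of Widgets: Part II"})
    print(resolve(["A STUDY OF WIDGETS - Part II.", "Unknown Paper"], index))
```

| Exact matching after normalization is cheap (one dictionary
| lookup per reference) but misses truncated or misspelled
| titles; fuzzy matching would be a separate step.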
| matt1 wrote:
| Just a note to say that factoring authors into the ranking
| system is high on my todo list. v1 won't be too fancy - just
| a hardcoded list of prominent authors whose papers warrant
| extra visibility. A future version will likely automate it to
| avoid the hardcoded list.
|
| Also, soon-ish I'm going to add the ability for users to
| follow specific authors, so you can get notified when they
| publish new papers.
| jakderrida wrote:
| > Also, soon-ish I'm going to add the ability for users to
| follow specific authors, so you can get notified when they
| publish new papers.
|
| If you could do it, this would be a dream. My original
| intent was to be able to look through only papers citing a
| popular one and filtering the results for ones having at
| least one author with a set minimum h-index. Using Google
| Scholar data required using SerpAPI, which has some
| annoying limitations.
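|
| For reference, the h-index filter described above is simple
| once the per-author citation counts are available (getting
| reliable counts is the hard part). A minimal sketch, with
| hypothetical function names and made-up data:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h papers have >= h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def passes_filter(author_citations: dict[str, list[int]], min_h: int) -> bool:
    """True if at least one author reaches the minimum h-index."""
    return any(h_index(c) >= min_h for c in author_citations.values())


if __name__ == "__main__":
    authors = {"author A": [3, 0, 6, 1, 5], "author B": [1, 1]}
    print({name: h_index(c) for name, c in authors.items()})
    print(passes_filter(authors, min_h=3))
```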
|
| The core goal is obviously just not to miss out on a paper
| that will very likely be influential while not having to
| comb through the mountain of irrelevant papers.
|
| What's funny is that Microsoft Academic was the best
| suited, but was retired in 2021.
| nojvek wrote:
| Great site. Bookmarked it.
|
| Would be nice if I could change timeframe. Top this week,
| month, year, all time.
| matt1 wrote:
| I'm slowly adding older papers as I work out the kinks in the
| site. Down the road when the database is more comprehensive,
| this should definitely be possible.
| team_dale wrote:
| Would love to see a comments feature at the bottom there.
| Reddit / HN style
|
| Love the concept though. Added it to my Home Screen on iOS
| matt1 wrote:
| Thanks for the kind words, it's appreciated.
|
| I might add comments down the road if there's enough interest
| and if there's enough traffic to warrant it. Don't want to
| add them just yet and have zero comments on everything and it
| look like a ghost town.
|
| Keep the suggestions coming though as you use it more:
| matt@emergentmind.com.
| danielbln wrote:
| Works in Chrome, but does not seem to work in Firefox.
| matt1 wrote:
| Can you (or anyone experiencing similar issues) share any
| details about what's not working in Firefox? I tested it and
| all is well for me, though it's definitely possible there's
| an issue with some other version of it.
| keyle wrote:
| I've got a somewhat related question:
|
| is there a site that lists and rates the various LLM models of
| huggingface.co alongside their various applications?
| matt1 wrote:
| FYI I started embedding the HTML pages in an iframe on Emergent
| Mind when the HTML version is available:
| https://www.emergentmind.com/papers/2312.11444 // should make
| it even easier to stay informed about trending papers
| apstats wrote:
| I wonder if this could be used to train an LLM to convert PDFs
| with rich charts into HTML?
| reqo wrote:
| A lot of AI/ML papers these days have an accompanying interactive
| page like [0], will we see anything like these now directly in
| arXiv?
|
| [0] https://voyager.minedojo.org/
| z2h-a6n wrote:
| I think then arXiv would have to deal with maintaining the tech
| stack and providing the presumably much higher server capacity
| to serve the more varied web pages that would result, so it
| seems like a tall order. arXiv already has an experimental
| integration with Papers with Code [0], which I guess provides
| similar results for the reader, though the authors have to
| figure out their own web hosting.
|
| [0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...
| ansk wrote:
| When I open a large pdf on arxiv (100+ MB, not uncommon for ML
| papers focused on hi-res image generation), there is a
| significant load time (10+ seconds) before anything is rendered
| at all other than a loading bar. Does anyone know what the source
| of this delay is? Is it network-bound or is Chrome just really
| slow to render large PDFs? Do PDFs have to be fully downloaded to
| begin rendering? In any case, this delay is my only gripe with
| arxiv and a progressively rendered HTML doc that instantly loads
| the document text would be a huge improvement.
| IlliOnato wrote:
| It may be even that the time is taken to _generate_ a PDF.
|
| The format in which articles are submitted and stored in arXiv
| is LaTeX. PDF is automatically generated from it.
|
| Probably arXiv does some caching of PDFs so they don't have to
| be generated anew every time they are requested, but I don't
| know how this caching works.
| upbeat_general wrote:
| I have the same issue. From what I can tell it's just network-
| bound and the Arxiv servers are slow. They theoretically allow
| for you to set up a caching server but after spending a while
| trying to get it set up, I haven't been able to get it to work.
|
| https://info.arxiv.org/help/faq/cache.html
| arccy wrote:
| maybe it'll be faster now with fastly
|
| https://news.ycombinator.com/item?id=38723373
| 10000truths wrote:
| > Does anyone know what the source of this delay is? Is it
| network-bound or is Chrome just really slow to render large
| PDFs? Do PDFs have to be fully downloaded to begin rendering?
| In any case, this delay is my only gripe with arxiv and a
| progressively rendered HTML doc that instantly loads the
| document text would be a huge improvement.
|
| The default PDF format puts the xref table at the end of the
| file, forcing a full download before rendering can take place.
| PDF-1.2 onwards supports linearized PDFs, and most PDF export
| tools have some way of enabling it (usually an option like
| "optimize for web").
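|
| A quick way to check whether a given PDF is linearized: the
| spec requires the linearization parameter dictionary (which
| carries a /Linearized key) to sit within the first 1024 bytes
| of the file. The sketch below is a heuristic built on that
| rule, not a validating parser, and the function names are made
| up.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic: do the first 1024 bytes contain a /Linearized key?"""
    window = head[:1024]
    return window.startswith(b"%PDF-") and b"/Linearized" in window


def file_looks_linearized(path: str) -> bool:
    """Read only the start of the file; no need to load the whole PDF."""
    with open(path, "rb") as fh:
        return looks_linearized(fh.read(1024))


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
    print(looks_linearized(sample))
```

| A file that fails this check keeps its only cross-reference
| table at the end, so a viewer cannot locate page 1 until the
| last bytes have arrived.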
| ww520 wrote:
| That's great. Now I can read the papers on my phone.
| svag wrote:
| The tool that's being used for this offering is this one,
| https://github.com/arXiv/arxiv-readability, just to save a few
| clicks :)
| IshKebab wrote:
| Wow I did not know they have the LaTeX for all the papers and
| compile it themselves! That's pretty crazy. What if they don't
| have packages you need? What if your paper isn't written with
| LaTeX?
| r4indeer wrote:
| > What if they don't have packages you need?
|
| Unlikely. But if so, you can provide the packages yourself:
| https://info.arxiv.org/help/submit_tex.html#wegotem
|
| > What if your paper isn't written with LaTeX?
|
| Then they still accept PDF or HTML. See:
| https://info.arxiv.org/help/submit/index.html#formats-for-te...
| aragilar wrote:
| They specify what version of texlive they use. This is
| _significantly_ better than what publishers offer (usually a
| _really_ old latex version, not even pdflatex).
| ofou wrote:
| I wonder how much better this is compared to Pandoc's
| dginev wrote:
| That's it in spirit, but in practice it's refreshed:
|
| https://github.com/arXiv/arxiv-view-as-html
| WendyTheWillow wrote:
| I'm so far left wanting for an app that gives me a way to easily
| track and consume newly published work of a given topic. The
| existing apps are not great, and maybe this change will make it
| easier to provide better "reader" views, and possibly even tts (I
| like to listen+read).
| codethief wrote:
| Ugh. I don't belong to the target audience (people with
| disabilities) but the typesetting doesn't exactly look pleasant
| on my machine (Chrome on Linux).
| aragonite wrote:
| A lot of academic journals (say from Springer) also offer HTML
| formats for papers published in the past decade or so, which I
| personally often find more convenient for reading purposes than
| PDFs. For example, I parse text a lot faster if I use a regex to
| split each paragraph into sentences and place a linebreak after
| each sentence, or if I do natural language "syntax highlighting"
| by assigning a distinctive color to functional words indicating
| logical structure like 'if/then', 'and', 'or', 'not', 'because',
| and 'is'. And sometimes it really improves readability to be able
| to do "semantic highlighting", in the sense of say assigning a
| different hashed color to each proper name (or each labeled
| thesis, etc) that occurs in the paper. Such manipulations are
| basically impossible with PDFs. It makes me wish sci-hub would
| start archiving HTML versions in addition to PDFs!
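|
| A minimal sketch of the two manipulations described above:
| one sentence per line, and a stable hashed color per proper
| name. The regex is deliberately naive (it will split after
| abbreviations such as "e.g." when a capital follows), the list
| of names is assumed to be supplied by the reader, and all
| function names are hypothetical.

```python
import hashlib
import html
import re

# Split on whitespace that follows ., ! or ? and precedes a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph: str) -> list[str]:
    """Return the paragraph as a list of sentences (naive heuristic)."""
    return [s for s in SENTENCE_END.split(paragraph.strip()) if s]


def color_for(name: str) -> str:
    """Stable hex color derived from a hash of the name."""
    return "#" + hashlib.sha256(name.encode("utf-8")).hexdigest()[:6]


def highlight_names(sentence: str, names: list[str]) -> str:
    """Wrap each whole-word occurrence of a name in a colored <span>."""
    out = html.escape(sentence)
    for name in names:
        escaped = html.escape(name)
        span = f'<span style="color:{color_for(name)}">{escaped}</span>'
        pattern = re.compile(r"\b" + re.escape(escaped) + r"\b")
        out = pattern.sub(lambda m: span, out)
    return out


if __name__ == "__main__":
    text = "Kant disagrees. Hume does not! Why? Because."
    for sentence in split_sentences(text):
        print(highlight_names(sentence, ["Kant", "Hume"]))
```

| The same name always gets the same color, so a name is
| recognisable wherever it reappears in the paper.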
| johnsillings wrote:
| https://www.arxiv-vanity.com/
| jakderrida wrote:
| And, of course, https://ar5iv.labs.arxiv.org/html
|
| However, ar5iv isn't a la carte like arxiv-vanity. They pretty
| much do last month's papers every month or so. Something like
| that.
| dginev wrote:
| Hi, ar5iv creator here.
|
| You can think of both arxiv-vanity and ar5iv as the "alpha"
| experiments that lead into the official arXiv "beta" HTML
| announced today.
|
| Once a few rounds of feedback and improvements are
| integrated, and the full collection of articles acquires HTML
| in the main arXiv site, ar5iv will be decommissioned.
|
| The plan is to turn all existing ar5iv links into redirects
| to the official HTML, and free up the resources for
| maintaining it. I am not sure what are the plans for
| maintaining arxiv-vanity, but I suspect they may head down a
| similar path some time later.
| jakderrida wrote:
| lmao! The actual creator of ar5iv? Sometimes I forget this
| isn't reddit and legit accomplished people comment here.
|
| Reminds me of Burning Man when people kept telling me, "Never
| talk trash on the art at the main landmarks. The artists
| are frequently within listening distance."
|
| So, of course, I'd walk around talking about buying the art
| for $50K-$60k, knowing it's already scheduled to be burned
| with the landmark.
| philipashlock wrote:
| 30 years after HTML was invented to support accessibility and
| collaboration for research and academia and the same day the
| White House released their new accessibility guidance which
| happens to be the first time they've published formal new policy
| natively as HTML rather than PDF -
| https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...
| murphyslab wrote:
| I feel surprised by how succinct, easy-to-understand, and
| sensible the policy (M-23-22) is:
|
| > Default to HTML: HyperText Markup Language (HTML) is the
| standard for publishing documents designed to be displayed in a
| web browser. HTML provides numerous advantages (e.g., easier to
| make accessible, friendlier to assistive technology, more
| dynamic and responsive, easier to maintain). When developing
| information for the web, agencies should default to creating
| and publishing content in an HTML format in lieu of publishing
| content in other electronic document formats that are designed
| for printing or preserving and protecting the content and
| layout of the document (e.g., PDF and DOCX formats). An agency
| should develop online content in a non-HTML format only if
| necessitated by a specific user need.
|
| https://www.whitehouse.gov/omb/management/ofcio/delivering-a...
| wolverine876 wrote:
| Hmmm ... accessibility is essential, but PDF is far better
| for static documents: There's no straightforward, standard way
| to read an html document on another platform. Also, the html
| document may not be readable in 10+ years (unlike most PDFs),
| and updates are too fluid and hard to track.
|
| I think the general problem is that the end-user doesn't
| control an html document, e.g., for annotation, as a local
| record, etc.
| shakow wrote:
| > There's no straightforward, standard way to read an html
| document on another platform.
|
| What do you think of the epub format?
| wolverine876 wrote:
| I wish so much for it:
|
| Despite all our advances, we lack an editable, local,
| multimedia, platform (and form-factor) independent, self-
| contained file - essentially a word-processing file for
| the 21st century (and I mean it's almost a quarter-
| century overdue). epub has that potential as a format,
| and being based on web standards it has capability, a
| universe of supporting tools and technology, and easy
| adoption to different applications.
|
| But I haven't heard anyone else express that particular
| interest, and as of a few years ago epub doesn't allow
| annotations and is not stable (i.e., I don't know that
| today's epub file will be readable in 20 or 50 years) -
| two essential requirements for a serious local content,
| imho.
|
| And even if it meets those specifications, we need epub
| editors that are the equivalent of word processors for
| non-technical users.
| LordDragonfang wrote:
| ...What are you talking about? HTML files are readable on
| basically _every_ platform, even moreso because they are
| fundamentally text files (unlike PDFs, which are binaries).
| PDFs need special software, html can be read on the
| _command line_. Likewise, HTML is dead simple to edit and
| annotate.
|
| Seriously, name a single device that has PDF support that
| doesn't allow you to view HTML.
|
| I think you're conflating "html" and "things stored on a
| server", because all of your objections apply to pdfs
| stored on a server. The ability to save and annotate pdfs
| is not an inherent feature of the file format, they exist
| _because_ the format is such a PITA to interact with that
| specialized programs have to be written. HTML can be saved
| just as easily, and _usually is_ (on archive.org).
| wolverine876 wrote:
| How do I save an HTML document locally, and annotate it,
| in an easily sharable form, and in a form that is stable
| - i.e., in a way that will be readable and useable in
| 20-50 years?
| GoblinSlayer wrote:
| You say it as if pdf is somehow better. To begin with
| it's a proprietary format. If Adobe goes bankrupt or
| becomes obscure tomorrow, pdf will go out of use as a failed
| technology.
| LordDragonfang wrote:
| Basically any _HTML_ document from 20-30 years ago (can't
| go any further because it didn't exist 50 years ago)
| will be completely readable and usable. The only issue is
| people creating _content_ (not styling) in formats
| besides HTML.
|
| As far as annotations, you can use the native <ruby>[1]
| tag, or strikethrough, but if you mean "literally drawing
| on the text" then, yeah, you're looking for an image
| format at that point (which is fundamentally what PDF
| _is_), but _we shouldn't default to storing text in
| image formats_ just because of one specific use case.
| (Also, as I said above, the only reason tools exist to
| easily do that in PDFs is because everyone insists
| on using a format that's hard to edit.)
|
| Also, note that the context I was responding to was _US
| legal documents_, not something more presentation-heavy.
|
| [1]https://twitter.com/antumbral/status/1730829756013375875
| jpeloquin wrote:
| I just tested saving
| https://browse.arxiv.org/html/2312.12451v1 to disk using
| Chrome, transferring it to my Android phone, and opening
| it on the phone. Results:
|
| 1. Saving as "Webpage, Single File" (.mhtml): Neither
| Firefox nor Chrome even showed up in the list of
| available apps to open it.
|
| 2. Saving as "Webpage, Complete": Opened in Chrome but
| images were broken. Also very difficult to open with the
| default file browser because it uses a flat folder view
| and the sidecar folder pollutes the file list.
|
| I was hoping this would work, perhaps you will have
| different findings. I agree that HTML is the superior
| format in theory but usability in practice is often
| lacking. I'm resigned to using both depending on context.
| wolverine876 wrote:
| Yes, that's the kind of issue I was talking about. I wish
| it were otherwise. As a nearby comment pointed out, epub
| is a potential solution (and I wish arXiv embraced it -
| without my knowing their other requirements or epub's
| accessibility features). It's essentially packaged html.
| znpy wrote:
| Of course, they're "just text files" only in theory...
| but theory and practice diverge very very often.
| nonethewiser wrote:
| > There's no straightforward, standard way to read an html
| document on another platform.
|
| Such as? What doesn't have a browser but can render pdfs?
| wolverine876 wrote:
| I mean, how do I save it locally on one platform and read
| it on any platform? Or share it with someone else to read
| (without them downloading software)? I.e., we don't have
| a standard, local, single-file html format.
| thfuran wrote:
| Print it to a pdf
| Zuiii wrote:
| You're right.
|
| We could have such a format if browser and os vendors
| were interested in supporting such a use case.
| Unfortunately, they aren't.
|
| On the browser side, supporting all-in-one html files can
| be as simple as reading a single multipart-encoded page.
| Heck, if they support automatically serializing all
| external resources as datauris when saving pages, then
| most browsers will be able to open them without any
| modification.
|
| On the OS side, operating systems can treat html files as
| first class citizens; execute them in an offline sandbox
| (most operating systems have embedded webviews), then
| extract icon, title, description and other metadata to
| present to the user. An icon that consists of a blank page
| with a small browser icon in the corner doesn't tell me
| anything about what the page is about. This needs to
| change.
|
| In short, html can be easily made nicer to deal with
| locally thanks to all the parts already being in place.
| The problem is that no one (tech giants, os vendors) is
| interested in doing this.
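
The data-URI idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any browser actually saves pages: the function name `inline_images` is hypothetical, only `<img src="...">` attributes are handled (no CSS, fonts or scripts), and a regular expression stands in for a real HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of <img> tags. A regex is enough for a sketch;
# a real tool would use a proper HTML parser.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data URIs."""

    def replace(match):
        src = match.group(2)
        # Leave remote and already-inlined images untouched.
        if src.startswith(("http://", "https://", "//", "data:")):
            return match.group(0)
        path = Path(base_dir) / src
        # Keep the original reference if the file is not on disk.
        if not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Calling `inline_images(Path("paper.html").read_text(), "paper_files")` on a page saved as "Webpage, Complete" would return one self-contained string that can be written back out as a single `.html` file; the file and folder names here are placeholders.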
| GoblinSlayer wrote:
| There's epub as one file html document.
| CaptainOfCoit wrote:
| > I mean, how do I save it locally on one platform and
| read it on any platform?
|
| Ctrl/Meta/Cmd + S should do the trick, or "File > Save
| page", and you get a HTML file you can open in any
| browser. If there are images, they'll most likely be
| loaded remotely, or worst case not load at all. But the
| rest of the structure is there.
| blackoil wrote:
| > If there are images, they'll most likely be loaded
| remotely
|
| Most sites have images as a relative path which won't
| work with saved html and there is also CSS.
| wolverine876 wrote:
| A web page is much more than one file. Also, I'm looking
| for something with end-user control, where they can save
| the current document statically and long-term.
| 8organicbits wrote:
| If both devices have internet, you share the URL. If not,
| see other replies.
| jll29 wrote:
| It's a cool feature because it makes the papers more findable,
| more easily navigable, easier to read online and faster to
| scroll through. I am also happy for blind people that they can
| more easily use arXiv with Braille readers now.
|
| (I'm still a fan of printing the PDFs, because I annotate on
| paper and refer to page numbers, but the HTML feature is in
| addition to PDF download, not a replacement.)
|
| One thing that still sucks (not ArXiv related though) is reading
| mathematical formulae on the Kindle - wonder if someone with
| rendering expertise could have a look into the MOBI format.
| isaacfung wrote:
| This would never happen but in an ideal world, we should be
| able to click on a citation to jump to the part of the paper
| that is being referenced and each paper page should have a
| discussion board so we can easily communicate with the authors
| and group the discussion in one place instead of us having to
| google to see if there is relevant discussion on
| twitter/reddit. We can even put links to talks, tutorials,
| blogs, github repo, demo, paperswithcode/google scholar/open
| review, background material, a timeline of citations in tree
| form on the same page (actually I am seeing more machine
| learning papers that have a project page that does some of
| these) or even turn it into a mini wiki. I just think html has
| so much more potential (especially now with LLM we can do
| semantic search). I wonder if there would be interest in such a
| Chrome extension overlay.
|
| Related projects:
|
| https://github.com/ahrm/sioyek
|
| https://github.com/arxiv-vanity/engrafo
|
| https://github.com/dginev/ar5iv
|
| https://academ.us/article/2111.15588/ (powered by
| https://github.com/jgm/pandoc I believe)
| me_jumper wrote:
| I think https://web.hypothes.is/ would be of interest to you.
| golol wrote:
| IMO pdf and HTML optimize for different things. pdf is easy and
| pretty. HTML is easy and responsive. But making pdf responsive is
| impossible and making HTML pretty is not easy. I think arXiv
| should be for well-polished pretty documents, not responsive ugly
| documents. Most researchers don't have time to make an HTML
| responsive and pretty.
| querez wrote:
| Am researcher, care about responsiveness way more than pretty.
| I am super glad for the option. Downloading PDFs is super
| annoying. I'm stoked.
| mmis1000 wrote:
| Well... downloading html is even harder nowadays, because many
| pages are dynamically generated. Although there are surely
| some browser extensions that can help you to finish it in a
| few clicks..
| radicalriddler wrote:
| FUCK YES (excuse my profanity). I have a tool that converts HTML
| to Neural Speech and I always wanted to push arXiv papers through
| it, but couldn't be bothered with a PDF implementation.
| topicseed wrote:
| What do they use to convert a PDF document to a clean, correct
| HTML document? It's a difficult space, especially with the
| variety of layouts you may find in PDF documents...
| blackbear_ wrote:
| Arxiv encourages users to submit the latex source of their
| papers rather than the PDF
| SushiHippie wrote:
| > The tool that it's being used for this offering is this one,
| https://github.com/arXiv/arxiv-readability, just to save a few
| clicks :)
|
| https://news.ycombinator.com/item?id=38726582
| vegabook wrote:
| PDF is objectively much better than HTML at rendering text
| documents. And it's not even close. This could easily have been
| done 10, even 15-20 years ago. That it didn't happen is not just
| inertia. Latex and PDF have enormously better text rendering, and
| the static format locks a state-commit in time that is much
| easier to go back to and reference/critique. Unlike the
| intrinsically fluid nature of HTML. For academic work, milestone-
| like formats, that lock state in time, are useful for those who
| later build on them. And again, the rendering just doesn't
| compare and that imparts [sub]conscious quality signals.
| imranq wrote:
| At this point are academic papers simply peer-reviewed blog
| posts?
| acjohnson55 wrote:
| This is great! I browse papers on mobile, and PDF is so bad for
| that use case.
| alecsm wrote:
| I don't read many papers but this makes it easier for me to save
| them in Joplin.
| wolverine876 wrote:
| Many here say they prefer html documents. How do you annotate
| them? How do you make local copies? Also, how will you read them
| in the decades to come?
|
| I love PDF.
| hollerith wrote:
| I'm sad that the best they can do is HTML format. HTML is a mess.
| nojvek wrote:
| OMG. This is amazing. I legit hated reading two column pdfs on a
| smartphone.
| wildpeaks wrote:
| Very good decision, always bet on the web.
| sicariusnoctis wrote:
| Personally, I would prefer the conventional Latin Modern math
| font instead of Palatino math.
|
| Latin Modern is used by:
|
| - Wikipedia.
| - Math.StackExchange.
| - Nearly all papers, including the ones hosted on arxiv in PDF
|   format.
| - Nearly any math videos, slides/presentations, notes.
| - Almost everything, really.
|
| Palatino just looks weird.
|
| Also, I imagine that authors might do math formatting hacks that
| were only tested on Latin Modern, and might end up breaking on
| Palatino.
|
| TL;DR:
|
| Palatino :(
|
| Latin Modern :)
| IHLayman wrote:
| Fun fact: it seems that if you use Lockdown mode on Apple devices
| you can't open PDFs from a browser (no official documentation
| says it but there is anecdotal evidence). This would allow people
| with Lockdown mode to open Arxiv papers more easily.
| matrix2596 wrote:
| That's great news. I was using arxiv vanity to read on mobile
| phones. I am not seeing it on all articles, is it only for new
| papers?
| therealmarv wrote:
| This is the reason I've never liked LaTeX from a data point of view.
| It's made to be printed out or get to look beautiful on a PDF but
| was never designed to get you to a HTML file or a Word file.
|
| I've written my thesis in Markdown in the past because of this
| (best for humans) which can be easily transformed to HTML, Word,
| PDF and even LaTeX
| https://github.com/tompollard/phd_thesis_markdown
|
| And I think that XML is the best format for machines.
| delhanty wrote:
| > If you are familiar with ar5iv, an arXivLabs collaboration, our
| HTML offering is essentially bringing this impactful project
| fully "in-house". Our ultimate goal is to backfill arXiv's entire
| corpus so that every paper will have an HTML version, but for now
| this feature is reserved for new papers.
|
| IIRC, ar5iv was created on his own initiative by Deynan Ginev
|
| https://twitter.com/dginev/status/1736792316675825981
|
| and it seems that he has worked tirelessly to fix nearly all of
| the edge cases during the collaboration.
|
| This project creates huge value to humanity so Deynan is to be
| heartily thanked.
| dginev wrote:
| Thanks for the kind words, but some corrections:
|
| 1. My name is Deyan (hi!)
|
| 2. ar5iv was the latest frontend incarnation, but our actual
| work on converting LaTeX to HTML goes back nearly 20 years
| behind the scenes.
|
| 3. I was an undergraduate student when I was introduced to the
| project back in 2007. It was started "in spirit" by 3 senior
| co-conspirators back then: Michael Kohlhase, Bruce Miller and
| Robert Miner. And I am by no means a solitary actor today, even
| if I may be the chief online presence of the people involved.
| Bruce is doing the bulk of the hard work on LaTeXML to this
| day.
|
| I documented some of the history in an invited talk for CICM
| 2022, which you can find on youtube, or see the slides at:
|
| https://prodg.org/talks/welcome_to_ar5iv
|
| It's really great that the HTML has now reached "home base" in
| arXiv, and I hope their team gets a lot more of the positive
| attention going forward - today's achievement is entirely
| theirs!
| indrora wrote:
| I remember stumbling upon your work long ago when I was
| working on a project to have "e-zines" that consumed a series
| of `article` class files and rendered them out into PDF and
| HTML as a series package.
|
| I had come across latex2html, Dan Gildea's project, and found
| myself unpleasantly dissatisfied with how it worked. As I
| understand it, it's more a "half implementation of lots of
| packages" rather than what ar5iv seems to be, which is
| "enough of the core LaTeX engine producing HTML instead of
| DVI"? I'd love to know more about the nitty gritty of how the
| engine does its thing.
|
| I'm curious: How has modern web tech (e.g. WebAssembly,
| Canvas, etc) helped or gotten in the way of getting _good_
| LaTeX rendering in the browser?
| dginev wrote:
| Right, that's LaTeXML - it tries to emulate as much as
| possible of the TeX typesetting system, while retaining
| enough control to emit structured markup.
|
| Which also allows us (and generally all contributors of
| latexml package support) to conveniently maintain various
| parallel data structures and metadata needed along the way.
|
| Modern HTML is very often helpful to produce higher quality
| article renderings. Examples:
|
| 1. we recently started using flexbox for subfigures,
| allowing them to reflow.
|
| 2. we have started emitting ARIA accessibility annotations
| (there is now an "alt" key for \includegraphics)
|
| 3. MathML Core allowed us to have native web rendering for
| math expressions in every browser.
|
| As to LaTeX rendering _in_ the browser, there are various
| other projects out there you could look up with partial
| support. For latexml the WebAssembly route seems most
| realistic, as we are undergoing a rewrite in Rust. But
| there are quite a number of pieces to flesh out before we
| get there.
| trostaft wrote:
| Taking a look at a paper I have that went up this month and
| another that went up before the dec cutoff on ar5iv, they look
| 90% OK! Figures with side-by-side plots and algorithm
| environments are the common culprits for being broken though.
| Particularly in figures, it seems like the width argument isn't
| being interpreted correctly.
|
| Interestingly this review paper seems to have their side by side
| figures intact (e.g. fig 2, fig 4). Maybe it's because he used a
| subfigure-like environment (judging by the subcaptions)?
|
| https://ar5iv.labs.arxiv.org/html/1609.04747
| dginev wrote:
| For the image widths, there is some CSS fine-tuning that is
| still needed on the arXiv HTML side. I think that will get
| fixed soon, just needs the right height directive set.
|
| Getting subfigures emulated via flexbox is one of our more
| recent LaTeXML enhancements, and still has some ongoing work
| (working on it today actually). It can be a bit finicky to test
| - there are easily 20 different ways people can write LaTeX for
| subfigures in arXiv.
| blackoil wrote:
| > Didn't see a toggle
|
| you can run toggleColorScheme() twice in console to switch to
| light theme or dark theme.
| charleshan wrote:
| This is awesome! Push to Kindle (HTML to EPUB) isn't converting
| the page properly but I'm sure it's coming soon
| zerop wrote:
| They should also add commenting capabilities under the paper.. a
| good discussion will lead to more research and information
| discovery
| krick wrote:
| Curious to see how well it will work. Does anybody here know a
| robust and not crazy computationally expensive solution to
| extract tables from fairly clean PDF files (especially non-
| English)?
| llamaInSouth wrote:
| Nice.... a website that offers even more web pages.
| happyyalda wrote:
| Unfortunately, I am from Iran so I can't use this new feature. I
| got '403 Forbidden' message from the arXiv server. Worse than
| that, I totally lost my access to arXiv since they changed their
| CDN to fastly, because fucking mullahs don't like fastly!
| forgingahead wrote:
| What I would like is for ArXiv to have an LLM to rewrite all
| papers away from the stodgy, stilted language prevalent in every
| paper. Just write clearly gang, use proper paragraph breaks and
| stop with the run-on sentences.
| creatonez wrote:
| I am glad to see a sans font being used, rather than trying to
| replicate the serif font from the original papers. It's a bit
| narrow and fuzzy on low resolutions, but a massive improvement
| just by switching to sans.
| dang wrote:
| We detached this comment from
| https://news.ycombinator.com/item?id=38724925.
| SallyThinks wrote:
| Saw it last night ! I was sooo happy ! Reading papers on phone is
| a nightmare. Well done guys !
| quickthrower2 wrote:
| Reading papers on mobile now considered sane!
| astrolx wrote:
| This is excellent news. Their HTML formatting is also more
| pleasant than the HTML articles offered by most journals in my
| field (e.g. arXiv HTML footnotes displayed as sidenotes on large
| displays!)
| amai wrote:
| This will be one of the most popular applications written in Perl,
| because this is based on 20 year old
| https://en.wikipedia.org/wiki/LaTeXML.
| jcq3 wrote:
| It will ease data scraping, automated meta analysis...
| alexmolas wrote:
| This makes downloading and parsing paper data easy, which is
| pretty handy in the LLM era.
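
As a rough illustration of why HTML is easier to parse than PDF, here is a minimal sketch that pulls headings and paragraph text out of a page using only the Python standard library. The class and function names (`PaperTextExtractor`, `extract_blocks`) are hypothetical, and the sketch assumes nothing about arXiv's actual markup beyond ordinary `<h1>`-`<h6>` and `<p>` tags.

```python
from html.parser import HTMLParser


class PaperTextExtractor(HTMLParser):
    """Collect the text of headings and paragraphs from an HTML page."""

    BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []          # list of (tag, text) pairs, in page order
        self._current_tag = None  # block tag currently being collected
        self._buffer = []         # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Start collecting when a heading or paragraph opens.
        if tag in self.BLOCK_TAGS and self._current_tag is None:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Inline tags such as <em> are ignored, but their text is kept.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Collapse runs of whitespace and skip empty blocks.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current_tag = None


def extract_blocks(html):
    """Return (tag, text) pairs for every heading and paragraph in html."""
    parser = PaperTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Feeding it the text of a saved paper gives a flat, ordered list of sections and paragraphs that can be handed to whatever analysis comes next; math, tables and figures would need extra handling that this sketch leaves out.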
| HeavyStorm wrote:
| Thank God. Maybe we can now adapt those for mobile?
| injuly wrote:
| For anyone who needs it, arxiv-vanity is amazing:
| https://www.arxiv-vanity.com/
| westurner wrote:
| arxiv-sanity-_lite_:
| https://github.com/karpathy/arxiv-sanity-lite
| killjoywashere wrote:
| So, I'm seeing a lot of chatter in the thread about LaTeX and
| converting that to HTML and PDF, so LaTeX should be the superior
| single source of truth. Please keep in mind that many areas of
| science think of latex as an allergy. I even have a colleague, a
| plasma physicist, who strongly encourages his team to not use
| LaTeX because a) collaborators get confused and b) it can be a
| massive time suck.
| clircle wrote:
| I agree with your colleague.
|
| At my institution, all of the lowest quality drafts I read are
| made with latex. I think it's because the programs people use
| to write latex do not have spelling and grammar checking. Also,
| the people that prefer latex, are the same types of people that
| are more interested in technical things, than spelling and
| grammar.
| 101008 wrote:
| Is there an open source tool to convert any PDF to something like
| this?
| mcpherrinm wrote:
| It sounds like (from the shout-out in the post) they're using
| https://math.nist.gov/~BMiller/LaTeXML/ to convert the paper's
| LaTeX into HTML, not from PDF.
|
| The most versatile tool I know of for converting various
| document formats, including PDF to HTML, is the oss ebook tool
| Calibre: https://manual.calibre-ebook.com/conversion.html
|
| I have seen https://pdfbox.apache.org/ used for extracting text
| from PDFs for analysis, but you won't get HTML output.
___________________________________________________________________
(page generated 2023-12-22 23:01 UTC)