[HN Gopher] ArXiv now offers papers in HTML format
___________________________________________________________________
ArXiv now offers papers in HTML format
Author : programd
Score : 454 points
Date : 2023-12-21 18:34 UTC (4 hours ago)
(HTM) web link (blog.arxiv.org)
(TXT) w3m dump (blog.arxiv.org)
| shrimpx wrote:
| Since the article doesn't link to any example HTML article,
| here's a random link:
|
| https://browse.arxiv.org/html/2312.12451v1
|
| It's cool that it has a dark mode. Didn't see a toggle but
| renders in the system mode.
|
| Overall will make arXiv a lot more accessible on mobile.
| burkaman wrote:
| And here's the PDF of the same paper for comparison:
| https://arxiv.org/pdf/2312.12451.pdf
| FredPret wrote:
| The contrast is massive. I'm much more likely to read the
| html version; that PDF is deeply off-putting in some hard to
| define way. Maybe it's the two columns, or the font, or the
| fact that the format doesn't adjust to fit different screen
| sizes.
| ForkMeOnTinder wrote:
| Definitely the two columns for me. It's super annoying
| skimming a paper and having to scroll down and back up
| again in a zig-zag pattern.
| mmis1000 wrote:
| I think the consuming device matters. An iPad or computer
| has a much wider screen. A one-column layout is too wide
| there for the average person to scan lines of text
| quickly.
|
| On a phone, though, it looks perfectly fine. A two-column
| layout looks terrible on a smartphone; the text is too
| tiny to read comfortably.
|
| It would probably be even better if you could flip pages
| left and right like an ebook instead of scrolling, to locate
| content faster. But the current design is good enough
| IMO (compared to reading a PDF on a cellphone).
| kjkjadksj wrote:
| Just zoom the smartphone into one column. Problem solved.
| mmis1000 wrote:
| And then you will have to scroll both top to bottom and left
| to right, an even worse experience.
| tobias2014 wrote:
| This is very interesting, because for me it's just the
| opposite. In particular the two column layout is just more
| readable and approachable for me. The PDF version also
| allows for a presentation just as the authors intended. I
| guess it's good that they offer both now.
| kjkjadksj wrote:
| The authors don't format the pdf, the editor does.
| Authors probably sent a double spaced word document with
| figures and tables on another file.
| tonyg wrote:
| In computer science, the usual case is that the author
| fully formats the paper.
| z2h-a6n wrote:
| Not on arXiv (unless I'm much mistaken), which is a
| preprint server, not a conventional journal.
|
| arXiv accepts various flavors of TeX, or PDFs not
| produced by TeX [0], and automatically produces PDFs and
| HTML where possible (e.g. if TeX is submitted). In the
| case of the example paper under discussion, the authors
| submitted TeX with PDF figures [1], and the PDF version
| of the paper was produced by arXiv. The formatting was
| mainly set by using REVTeX, which is a set of macros for
| LaTeX intended for American Physical Society journals.
|
| [0] https://info.arxiv.org/help/submit/index.html#formats-for-te...
| [1] https://arxiv.org/format/2312.12451
| smartmic wrote:
| FWIW, I recently learned that it is also possible to
| produce nice PDF papers with GNU roff (groff), have a
| look at this example:
| https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....
| frocmlol wrote:
| You are very confidently wrong.
|
| In the arxiv you use latex and do everything yourself.
| There is no editor.
| cozzyd wrote:
| You typically send a .tar.gz of tex files (and figures,
| .bbl, etc.) to the journal. And then you typically upload
| something very similar to the arxiv (I have an arxivify
| Makefile target for my papers that handles some arxiv
| idiosyncrasies like requiring all figures to be in the
| same folder as the .tex file, and it also clears all the
| comments; sometimes you can find amusing things in source
| file comments for some papers).
|
| Some fields may use Word files, but in most of physics
| you would get laughed at...
|
| It is true that most journals will typically reformat
| your .tex in a different way than is displayed on the
| arXiv.
| eigenket wrote:
| You are completely wrong. ArXiv doesn't work like that.
| JumpCrisscross wrote:
| Do you work extensively with LaTeX?
|
| Two columns is good, albeit annoying on mobile. But the
| font. The typeface kills me, and almost every LaTeX-
| generated document sports it.
| saurik wrote:
| Hilariously, I would probably tolerate the HTML version a
| lot better if it had the font from the PDF (and FWIW, the
| answer for me is "no: I don't work with LaTeX at all... I
| just read a lot of papers").
| cozzyd wrote:
| Hating on Computer Modern (ok, probably now Latin Modern)
| is something close to blasphemy.
| kjkjadksj wrote:
| If you read a lot of papers in your line of work you will
| quickly appreciate the two columns and justification.
| FredPret wrote:
| Admittedly, I don't read research papers. But with HTML,
| surely the choice between one or two columns is a
| checkbox away.
| IlliOnato wrote:
| Which checkbox?
|
| I cannot find anything relevant in any of the 3 browsers
| I use (Vivaldi, Firefox, Chrome). Would really
| appreciate this option.
|
| A quick search gave some apparently unmaintained browser
| extensions, and that's it.
| FredPret wrote:
| No, I'm saying there _should_ be a checkbox. That way,
| you can switch between two columns formatted like LaTeX
| and that font they always use, and one column with
| Helvetica / Arial.
| jabroni_salad wrote:
| Only problem is jagoffs like me who need the text to be
| bigger. On PDFs you now get to experience a horizontal
| scrollbar. HTML has text reflow and I can set the line
| length by resizing the window. I'm willing to make a lot
| of sacrifices for that experience.
| z2h-a6n wrote:
| For what it's worth, two column layouts are very common in
| the physical sciences, or at least in physics, which I'm
| more familiar with. I have a feeling that the reason is at
| least partly to save page space when using displayed math
| (e.g. equations that are formatted in a break between
| blocks of text), which use the full text width (i.e. the
| width of one column) to display what may be much less than
| half a page wide.
| FredPret wrote:
| It makes sense - for paper. But pixels are infinite -
| HTML is far better for screen display, which is how
| people read things nowadays.
|
| The extra column next to the one I'm reading introduces a
| lot of visual noise, and the content is hard enough as it
| is. I'm sure physicists have all gotten used to it, but
| it certainly trips me up.
| nyssos wrote:
| > The extra column next to the one I'm reading introduces
| a lot of visual noise
|
| Papers are generally not read start to finish in one go:
| there's lots of rereading and jumping back and forth
| between key parts, and anything that moves them further
| apart makes this harder.
| FredPret wrote:
| Ah, that makes more sense. I imagined scientists just
| reading the whole thing start-to-finish.
|
| I still think a flexible layout is best. If you like
| multi-columns and have a wide screen, why not display 12
| columns next to each other?
|
| With PDF this is not possible. With HTML the content can
| in principle be sliced and diced how you like it.
| shusaku wrote:
| Seems like the references aren't working very well.
|
| I really want journals to have two way links in a paper. I get
| google scholar alerts about certain papers being cited, and I
| want to skip to "why did they cite this? Did they use it,
| improve it, or just mention it?"
| r3trohack3r wrote:
| I'd never considered setting up citation alerts like this.
|
| Thank you for the idea!
| shrimpx wrote:
| Looks like clicking a reference adds the hash to the URL but
| doesn't scroll to the reference. If you load the hash URL
| directly in the browser you get a 404 page...
| burkaman wrote:
| https://browse.arxiv.org/html/2312.12451v1#bib.bib1 works,
| but https://browse.arxiv.org/html/2312.12451v1/#bib.bib1
| doesn't.
| IlliOnato wrote:
| Yeah, it seems like a bug in HTML generator...
| winwang wrote:
| Probably more accessible in general. (PDF) Papers are
| psychologically scary.
| mmis1000 wrote:
| PDF is by design an image format that can also embed text. It
| just doesn't have the primitives to properly retain the article
| structure.
| PaulHoule wrote:
| Nah, it's a super-complex system that creates a graph of
| components, can draw vectors like PostScript, can embed 3-d
| models, etc. The spec is here
|
| https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
|
| if you look at sections 14.6 through 14.10 you will find
| quite baroque facilities for representing the structure of
| documents in great detail, making documents with
| accessibility data, making documents that can reflow with
| HTML, etc. Not to mention the 14.11 stuff which addresses
| problems with high end printing (say you want to make litho
| plates for a book.)
|
| For that matter sections 14.4 and 14.5 describe facilities
| that can be used to add additional private data to PDF
| files for particular applications. For instance Adobe
| Illustrator's files are PDF files with some extra private
| data, and https://en.wikipedia.org/wiki/GeoPDF
|
| I like to complain that PDF has no facility to draw a
| circle but instead makes you approximate a circle with
| (accursed) Bezier curves but other than that the main
| complaint people make about PDF is that it is too
| complicated not that it is lacking this feature or that
| feature.
|
| Contrast that to a highly opinionated document format like
| DjVu
|
| https://en.wikipedia.org/wiki/DjVu
|
| which came out around the same time as PDF and is
| specialized for the problem of scanned documents and works
| by decomposing the document into three layers, one of which
| is a bilevel layer intended to represent text. All three
| layers have specialized coding schemes, the text layer in
| particular tries to identify that every copy of (say) the
| letter "e" or the character "Han " is the same and reuses
| the same bitmap for them.
| anonimo37 wrote:
| You would normally use a library to create the PDF so you
| don't need deal with the complexity of the format. A
| library would likely provide a function for drawing
| circles that translates the circle into Bezier curves.
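The circle-to-Bezier translation mentioned above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function names (circle_as_beziers, bezier_point, pdf_path) are hypothetical, and the constant is the commonly used value 4/3 * (sqrt(2) - 1) for a quarter-circle arc. The path operators m, c, h and S are the PDF content-stream operators for move-to, cubic curve-to, close-path and stroke.

```python
import math

# Commonly used control-point distance for one cubic Bezier per quarter circle.
KAPPA = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)  # about 0.5523


def circle_as_beziers(cx, cy, r):
    """Return four cubic Bezier segments (p0, p1, p2, p3) approximating a circle."""
    k = KAPPA * r
    return [
        ((cx + r, cy), (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ((cx, cy + r), (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ((cx - r, cy), (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ((cx, cy - r), (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
    ]


def bezier_point(seg, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1]."""
    p0, p1, p2, p3 = seg
    u = 1.0 - t
    return tuple(
        u**3 * a + 3 * u**2 * t * b + 3 * u * t**2 * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def max_radial_error(r=1.0, samples=200):
    """Largest distance between the approximation and the true circle."""
    worst = 0.0
    for seg in circle_as_beziers(0.0, 0.0, r):
        for i in range(samples + 1):
            x, y = bezier_point(seg, i / samples)
            worst = max(worst, abs(math.hypot(x, y) - r))
    return worst


def pdf_path(cx, cy, r):
    """Emit the path as PDF content-stream operators (m, c, h, S)."""
    segs = circle_as_beziers(cx, cy, r)
    lines = ["%.4f %.4f m" % segs[0][0]]
    for _, p1, p2, p3 in segs:
        lines.append("%.4f %.4f %.4f %.4f %.4f %.4f c" % (p1 + p2 + p3))
    lines.append("h S")
    return "\n".join(lines)
```

Calling max_radial_error() on a unit circle gives a deviation well under a tenth of a percent of the radius, which is why four segments are usually considered enough for display and print.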
| tarboreus wrote:
| One of the reasons is to make the papers more accessible to
| people with disabilities, especially the blind. I participated
| in a conference they hosted on this a few months ago, I
| recommend taking a look at the recordings if you're interested
| in thinking on this.
|
| https://accessibility2023.arxiv.org/
| miki123211 wrote:
| Blind person here, can confirm this. Reading PDFs with a
| screen reader is bad, reading PDFs that come from LaTeX is
| worse, reading LaTeX math is pretty much impossible. All the
| semantic info you need is just thrown away.
|
| You _can_ make decently accessible PDFs but it's lots of
| work, you need Acrobat on the producer's side and might also
| need it on the consumer's side. Free tools don't even come
| close. There's also the fact that the process of making
| accessible PDFs in Acrobat isn't itself accessible.
|
| With that said, the way screen readers treat HTML math
| certainly isn't perfect, it's geared more towards school
| children than anything above calculus. I'm probably going to
| stay with my LaTeX source files for now. At least ArXiv
| offers those, not many sites do. To be fair, that approach
| also has its own set of problems (particularly when people
| use some extra fancy formatting in their math equations,
| making the markup hard to read), but I find this to be the
| best approach for me so far, at least on AI/ML papers.
| saurik wrote:
| Huh. It would seem like, of all the things which should
| make it easy to generate the correct accessibility
| information, the pipeline of compiling a paper from source
| code in LaTeX should nail it... maybe we should all pitch
| in to some pool to pay someone to put in the required
| effort to connect all the dots?
| semi-extrinsic wrote:
| Kind of tangential, but it's also kind of surprising how
| difficult it is in LaTeX to make a plot of an equation.
|
| Say I have Equation \ref{eq}. Why can't I just say "plot
| \ref{eq} for x from -6 to 11" and get my graph?
|
| And yes, I know about pgfplots, PSTricks, TikZ etc. But
| in all those cases, I need to define the same equation
| twice, in different syntax to boot. It's kind of
| unsatisfying.
| ldenoue wrote:
| I wrote an app called PDF Reflow that reflows the original
| PDF using image processing to cut out words into tiles so
| you see the reflowed version of the text in their original
| look.
|
| https://www.appblit.com/pdfreflow
| jakderrida wrote:
| Hold on... Are you telling me that all these complex
| sentences are being typed out based on your voice alone?
| That's insane.
| ehPReth wrote:
| ? blind people can use keyboards
| kzrdude wrote:
| Hm tangential question but shouldn't touch typing be well
| accessible for many blind computer users?
| topato wrote:
| I'd say it would be simple to talk type these using
| windows 11's redux of voice typing. Pretty damn accurate
| and easy to modify/variate text/options. I use it all the
| time to make tech/engineering blog posts, faster and more
| organic than typing, typically, and it learns your
| technoacronyms. Combined with voice access, it makes it
| trivial to fully operate your computer (well, at least,
| browse the web, email, and media apps) from across the
| room. For anyone who hasn't tried the updated version,
| highly suggest hitting windowskey+h and giving it a shot.
| anthk wrote:
| Emacs with Emacspeak has a math reading module.
| codethief wrote:
| Ugh. I don't belong to the target audience (people with
| disabilities) but the typesetting doesn't exactly look pleasant
| on my machine (Chrome on Linux).
| jll29 wrote:
| It's a cool feature because it makes the papers more findable,
| more easily navigable, easier to read online and faster to
| scroll through. I am also happy for blind people that they can
| more easily use arXiv with Braille readers now.
|
| (I'm still a fan of printing the PDFs, because I annotate on
| paper and refer to page numbers, but the HTML feature is in
| addition to PDF download, not a replacement.)
|
| One thing that still sucks (not ArXiv related though) is
| reading mathematical formulae on the Kindle - wonder if someone
| with rendering expertise could have a look into the MOBI
| format.
| alephnerd wrote:
| This is a great UX addition. Why did it take them so long?
| gwern wrote:
| The conversion is still very error-prone. It can't convert a
| lot of packages, and the last paper I read, StarVector, half
| the HTML version is just missing. (I think it hit an error at a
| figure of some sort.) I reported an error, but I've been
| reporting errors against the ar5iv and abstracts for years now
| and the long tail of problems just seems like an incredible
| slog.
| KRAKRISMOTT wrote:
| Where are the computer vision people? This is the perfect
| type of problem for multi modal LLMs
| IlliOnato wrote:
| Except that the errors made by an LLM might be harder to
| spot than converter errors that typically are very blatant,
| and don't usually alter text (perhaps just drop parts of
| it).
|
| Also, a bug in a converter is conceptually much easier to
| fix than to re-train your LLM.
|
| I am not sure that AI in its current state is useful when
| "high fidelity" is required.
| dginev wrote:
| Can confirm. From an ar5iv standpoint, 2.56% articles
| currently fail to convert entirely, and 22.9% have known
| errors to the converter. That leaves 74.5% of nominally
| usable articles. This success rate is noticeably _lower_ for
| the newest batches of arXiv submissions, as the converter
| hasn't caught up with the most recent package innovations.
|
| We have a plan in place to meaningfully fall back for unknown
| packages, but that will take at least another year to put in
| place, and likely another couple of years to stabilize.
|
| Meanwhile, there is some hope that with arXiv launching the
| HTML Beta we will get more contributions for package support
| (LaTeXML is an open source project, with public domain
| licensing, everybody benefits).
|
| But again the original point is spot on. Coverage will be
| hit-or-miss for a while longer yet, for an arbitrary arXiv
| submission. The good news is that authors _could_ work
| towards better support for their articles, if they wanted to.
| eviks wrote:
| Because this is a rather conservative field with little
| dependency on the general public, so without much interest in
| helping disseminate the knowledge broadly & accessibly
| (relative to other priorities, not absolute)
| Strilanc wrote:
| How would you do it quickly?
|
| For example, HTML isn't divided into numbered pages while
| PDFs are. A lot of latex interacts with page boundaries.
| Figures tend towards the tops of pages. And there's \clearpage.
| And the reference list might say which page each citation
| appeared on. All that stuff needs someone to decide how to
| handle it and then to implement that handling. Like... what
| value does \pageheight return? Sometimes I resize things to fit
| the page height, and if it was doubled then I should have
| resized to fit the width instead.
| lynndotpy wrote:
| Almost universally, we prepare conference papers as LaTeX files
| made to export to PDFs which fit within the conference's
| template.
|
| It's nontrivial to export this to HTML in all cases, and even
| then, nobody is asking for HTML from us even though we all want
| it. I'm guessing Arxiv is using some kind of converter which
| _usually_ but not _always_ works.
|
| That said, this is a long time coming and PDF as the standard
| should've died a decade ago. I wish I had this when I was in my
| PhD program.
| alright2565 wrote:
| Latex is a very complicated programming language for creating
| documents. It is not easy to create a new backend for it.
|
| As a glimpse into the very tip of the iceberg, this diagram
| https://tex.stackexchange.com/a/158740/ is generated with 100%
| LaTeX code.
| binarymax wrote:
| Nice! Now I don't need to manually replace arxiv with ar5iv.
| Congrats to the team.
| imjonse wrote:
| "Our ultimate goal is to backfill arXiv's entire corpus so that
| every paper will have an HTML version, but for now this feature
| is reserved for new papers."
|
| For now it only works for papers submitted this month. But it's
| great to have this feature, makes it so much easier to read on
| phones.
| eviks wrote:
| Finally a modern format you can copy&paste from and read on one
| of the most popular computing platforms!!!
| pushfoo wrote:
| Previously discussed:
| https://news.ycombinator.com/item?id=38713215
| carlosjobim wrote:
| With the 2024 browser update, this means I can read these
| articles on my ancient Kindle perfectly fine.
| ChrisArchitect wrote:
| [dupe] from yesterday
|
| More here: https://news.ycombinator.com/item?id=38713215
| ZeroCool2u wrote:
| Wow, this is _so_ much better!
| choppaface wrote:
| Hope they benefit from CDN caching now too.
|
| Edit: aaaand they got Fastly
| https://news.ycombinator.com/item?id=38723373
| cozzyd wrote:
| doesn't work great with long author lists...
|
| https://browse.arxiv.org/html/2312.12907v1
| degenerate wrote:
| The PDF is worse, so there is no simple answer to this:
| https://arxiv.org/pdf/2312.12907v1.pdf
|
| At least the HTML version pairs each author with their
| affiliations, instead of the PDF which has all the names on
| page 1, and all the affiliations on page 2. That's completely
| unreadable.
| cozzyd wrote:
| The PDF is better because I'm trained to scroll past the
| author list. That takes forever on the html version .
| mattigames wrote:
| You can click the "Introduction" anchor on the left side
| and it scrolls for you past the author list
| cozzyd wrote:
| well it skips the abstract too, but yes, you can scroll
| back up to see it.
| mattigames wrote:
| Yeah, it's a bit weird that the abstract doesn't have a
| link on the left
| cozzyd wrote:
| Probably because \abstract{ } is treated differently than
| \section{ }, I guess...
| IlliOnato wrote:
| For me the PDF is much better. It's compact and clean, if I
| really need to see an affiliation for a particular author,
| it's really easy to do so in the PDF, not so in the HTML.
|
| It's highly unlikely anybody will read an entire author list
| this long; typically you would read the first two or three
| names, or check if some particular name is on the list. So
| the compactness of the list and being able to quickly get to
| the article contents is important.
| Al-Khwarizmi wrote:
| Nice! It would be even better if they offered authors of previous
| papers the option of converting to HTML, as the latex sources are
| already in the system.
| fprog wrote:
| The article states they're going to backfill all, or nearly
| all, previously submitted papers!
| FredPret wrote:
| This is brilliant. I don't share academia's love of LaTeX
| multi-column PDFs.
| tiagod wrote:
| I like multi-column text on paper (literally), but it's awkward
| in digital where you can just shape text on the fly to whatever
| column size you want
| leoncaet wrote:
| I just hope they don't stop offering the papers in PDF. Even when
| I'm on a computer, I still prefer to read PDFs.
| sylware wrote:
| Like the maths noscript/basic (x)html wikipedia generator:
|
| The magic of inline images at a known DPI, of course you can
| provide images for different DPIs.
|
| Reading maths/science noscript/basic (x)html documents on my 100
| DPI monitor, on wikipedia. Not yet fully ready on arxiv.
| gms7777 wrote:
| About time. Biorxiv and medrxiv have been doing this for probably
| half a decade at this point?
| jez wrote:
| It would be neat if they offered submitters the chance to upload
| their own HTML version alongside the PDF version, instead of
| always relying on an automatic conversion process.
|
| - I can imagine authors feeling frustrated if someone reaches out
| about a problem in the HTML version of their paper, but they have
| no way to correct it except by hoping that a change to the PDF
| fixes a change to the generated HTML. Easier to just fix the
| formatting problem in the PDF outright.
|
| - It would be neat to allow people to experiment with alternative
| formatting for their papers. For example, imagine a paper about a
| programming language that embeds a sandbox you can use to play
| around with the language under discussion. Or a paper about
| multivariable calculus and you can interact with a three
| dimensional plot of some function.
| layer8 wrote:
| They'd have to define and document a "safe" subset of HTML, and
| implement a filter/checker for it. Otherwise we'd end up with
| papers containing ads and tracking and XSS vulnerabilities and
| whatnot.
| digging wrote:
| Those are issues with JavaScript, not HTML. Wouldn't
| filtering out iframes pretty much keep us in the clear?
| layer8 wrote:
| The parent wanted interactive 3D plots, which means
| JavaScript embedded in or linked from the HTML. Then
| there's stuff like JavaScript embedded in SVG.
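The "safe subset plus filter" idea discussed above can be sketched with Python's standard html.parser module. This is only an illustration of the allowlist approach, not a production sanitizer and not anything arXiv is known to use: the tag and attribute sets are hypothetical, and the names AllowlistFilter and sanitize are made up for this example. Elements such as script, iframe and svg are dropped together with their content, which also covers JavaScript embedded in SVG.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical "safe subset": anything not listed here is dropped.
ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li",
                "h1", "h2", "h3", "code", "pre"}
ALLOWED_ATTRS = {"a": {"href"}}
# Elements whose entire content is discarded, not just the tags.
DROP_WITH_CONTENT = {"script", "style", "iframe", "svg", "object", "embed"}
SAFE_URL_PREFIXES = ("http://", "https://", "#")


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick= and every other unlisted attribute
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript: and other unlisted URL schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup reduced to the allowlisted tags and attributes."""
    parser = AllowlistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

An allowlist fails closed: a tag or attribute nobody thought about is removed by default, which is the property a submission checker would want. Interactive content such as 3D plots would need a separate, much harder policy for scripts.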
| diffeomorphism wrote:
| > It would be neat if they offered submitters the chance to
| upload their own HTML version alongside the PDF version,
| instead of always relying on an automatic conversion process.
|
| Please don't. Then you will have a mismatch between the source
| and the "own html" which ruins the point of uploading the
| source.
| eviks wrote:
| Pdf isn't the source
| IlliOnato wrote:
| But the PDF is also generated. LaTeX is the single source
| of truth.
| kjkjadksj wrote:
| Most authors probably have no interest in learning html. Also
| most authors want nothing to do with the work by the time its
| submitted. It was probably hell getting the project to that
| point of publishing, they want to be done with it and move on
| to the next thing going on in their career asap.
| jez wrote:
| I think this is an argument in favor of doing automatic PDF
| -> HTML conversion for the authors that don't want to touch
| it, but I don't think it's an argument against letting those
| who are fine with HTML provide their own.
| tiagod wrote:
| I was under the impression the source authors publish to arxiv
| was a latex file
| jraph wrote:
| It is.
| jez wrote:
| Ah, thanks for clarifying!
|
| I looked up the submission formats, and it looks like if you
| authored the paper in TeX/LaTeX, they do not accept pre-
| rendered versions of the document.
|
| https://info.arxiv.org/help/submit/index.html#formats-for-te...
|
| But if you did not author it in TeX/LaTeX (e.g., Word, Google
| Docs, etc.) it appears you can upload a PDF or HTML yourself.
| IlliOnato wrote:
| No, it would not. It's critically important that there is only
| one "logical" article, albeit with different representations.
| In other words, a single "source of truth".
|
| With "sideloading" of HTML there is no way in general to make
| sure that the _contents_ of LaTeX (and PDF) on one side and
| HTML on the other side is the same.
| thomasahle wrote:
| > It would be neat if they offered submitters the chance to
| upload their own HTML version alongside the PDF version,
| instead of always relying on an automatic conversion process.
|
| Can you recommend a system I can use to compile my latex, while
| also making sure the html is going to look good? I'd like some
| kinds of css style @media queries to switch between certain
| parts of the layout, while keeping a single latex file.
| endergen wrote:
| I was hoping this meant that html native submissions would be
| possible, so that people made interactive explanations.
| lucidrains wrote:
| nice! will make reading papers on the phone so much more
| pleasant!
| odyssey7 wrote:
| article { text-justify: Knuth-Plass; }
| matt1 wrote:
| For anyone interested in staying informed about important new
| AI/ML papers on arXiv, check out https://www.emergentmind.com, a
| site I'm building that should help.
|
| Emergent Mind works by checking social media for arXiv paper
| mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks
| the papers based on how much social media activity there has been
| and how long since the paper was published (similar to how HN and
| Reddit work, except using social media activity, not upvotes, for
| the ranking). Then, for each paper, it summarizes it using GPT-4,
| links to the social media discussions, paper references, and
| related papers.
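A ranking that weighs activity against age, as described above, could look roughly like the sketch below. This is not Emergent Mind's actual formula, which the comment does not give: the Paper fields, the gravity exponent of 1.8 and the two-hour offset are all assumptions borrowed from the time-decay shape often attributed to HN-style rankings, and the paper IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    paper_id: str      # placeholder identifier, not a real arXiv ID
    mentions: int      # assumed: total social media mentions found so far
    age_hours: float   # hours since the paper was published


def score(paper, gravity=1.8, offset=2.0):
    """Activity divided by a power of age: newer, busier papers score higher."""
    return paper.mentions / (paper.age_hours + offset) ** gravity


def rank(papers):
    """Return papers sorted from highest to lowest score."""
    return sorted(papers, key=score, reverse=True)


if __name__ == "__main__":
    demo = [
        Paper("paper-a", mentions=100, age_hours=48.0),
        Paper("paper-b", mentions=20, age_hours=2.0),
    ]
    for p in rank(demo):
        print(p.paper_id, round(score(p), 4))
```

With these assumed parameters a two-hour-old paper with 20 mentions outranks a two-day-old paper with 100, which is the behavior a "what is being discussed right now" list needs.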
|
| It's a fairly new site and I haven't shared it much yet. Would
| love any feedback or requests you all have for improving it.
| raccoonDivider wrote:
| That looks great. No real feedback yet, but it's the kind of
| thing I've always been looking for as a better alternative to
| Twitter.
| matt1 wrote:
| Thanks! I've got a lot more planned for it too. If anyone has
| any feedback that doesn't make sense to share here, or if
| you're a researcher who is open to some questions about how
| you currently follow arXiv papers, drop me a note at
| matt@emergentmind.com.
| CodeCube wrote:
| Love to see Emergent Mind continuing to innovate!
| sureglymop wrote:
| Love the clean design of the website! Looks amazing on mobile.
| jakderrida wrote:
| This is exactly what I was using HN for. But, yeah, it kinda
| sucked compared to yours. Another thing I was trying to create
| was some sort of NN model that could use the semanticscholar
| h-index of authors along with the abstract text and T5 to
| estimate the one-year out citations. Just for personal use,
| though. That whole thing fell apart because semanticscholar is
| kinda crap for associating author links to the same author. I
| frequently ended up with the wrong professors, which I'd think
| would be easily fixable for them.
| carlossouza wrote:
| I did that (used other features). This is how new papers are
| ranked here:
|
| https://trendingpapers.com
| apstats wrote:
| I wonder if this could be used to train an LLM to convert PDFs
| with rich charts into HTML?
| reqo wrote:
| A lot of AI/ML papers these days have an accompanying interactive
| page like [0], will we see anything like these now directly in
| arXiv?
|
| [0] https://voyager.minedojo.org/
| z2h-a6n wrote:
| I think then arXiv would have to deal with maintaining the tech
| stack and providing the presumably much higher server capacity
| to serve the more varied web pages that would result, so it
| seems like a tall order. arXiv already has an experimental
| integration with Papers with Code [0], which I guess provides
| similar results for the reader, though the authors have to
| figure out their own web hosting.
|
| [0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...
| ansk wrote:
| When I open a large pdf on arxiv (100+ MB, not uncommon for ML
| papers focused on hi-res image generation), there is a
| significant load time (10+ seconds) before anything is rendered
| at all other than a loading bar. Does anyone know what the source
| of this delay is? Is it network-bound or is Chrome just really
| slow to render large PDFs? Do PDFs have to be fully downloaded to
| begin rendering? In any case, this delay is my only gripe with
| arxiv and a progressively rendered HTML doc that instantly loads
| the document text would be a huge improvement.
| IlliOnato wrote:
| It may be even that the time is taken to _generate_ a PDF.
|
| The format in which articles are submitted and stored in arXiv
| is LaTeX. PDF is automatically generated from it.
|
| Probably arXiv does some caching of PDFs so they don't have to
| be generated anew every time they are requested, but I don't
| know how this caching works.
| upbeat_general wrote:
| I have the same issue. From what I can tell it's just network-
| bound and the Arxiv servers are slow. They theoretically allow
| for you to set up a caching server but after spending a while
| trying to get it set up, I haven't been able to get it to work.
|
| https://info.arxiv.org/help/faq/cache.html
| arccy wrote:
| maybe it'll be faster now with fastly
|
| https://news.ycombinator.com/item?id=38723373
| ww520 wrote:
| That's great. Now I can read the papers on my phone.
| svag wrote:
| The tool that's being used for this offering is this one,
| https://github.com/arXiv/arxiv-readability, just to save a few
| clicks :)
| IshKebab wrote:
| Wow I did not know they have the LaTeX for all the papers and
| compile it themselves! That's pretty crazy. What if they don't
| have packages you need? What if your paper isn't written with
| LaTeX?
| WendyTheWillow wrote:
| I'm so far left wanting for an app that gives me a way to easily
| track and consume newly published work of a given topic. The
| existing apps are not great, and maybe this change will make it
| easier to provide better "reader" views, and possibly even tts (I
| like to listen+read).
| aragonite wrote:
| A lot of academic journals (say from Springer) also offer HTML
| formats for papers published in the past decade or so, which I
| personally often find more convenient for reading purposes than
| PDFs. For example, I parse text a lot faster if I use a regex to
| split each paragraph into sentences and place a linebreak after
| each sentence, or if I do natural language "syntax highlighting"
| by assigning a distinctive color to functional words indicating
| logical structure like 'if/then', 'and', 'or', 'not', 'because',
| and 'is'. And sometimes it really improves readability to be able
| to do "semantic highlighting", in the sense of say assigning a
| different hashed color to each proper name (or each labeled
| thesis, etc) that occurs in the paper. Such manipulations are
| basically impossible with PDFs. It makes me wish sci-hub would
| start archiving HTML versions in addition to PDFs!
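The text manipulations described above are easy to prototype once the paper is available as HTML or plain text. The sketch below is one possible reading of them, not the commenter's actual setup: the regex is a naive sentence splitter that will mis-split on abbreviations such as "e.g.", and the function names and the "fw" CSS class are invented for this example.

```python
import hashlib
import html
import re

# Function words that signal logical structure, taken from the comment above.
FUNCTION_WORDS = ("if", "then", "and", "or", "not", "because", "is")

# Naive rule: sentence punctuation, whitespace, then a capital, quote or paren.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"(])')
_FUNCTION_RE = re.compile(
    r"\b(" + "|".join(FUNCTION_WORDS) + r")\b", re.IGNORECASE
)


def split_sentences(paragraph):
    """Split a paragraph into sentences, one list entry per sentence."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


def highlight(sentence):
    """Wrap each function word in a tag that a stylesheet can color."""
    escaped = html.escape(sentence, quote=False)
    return _FUNCTION_RE.sub(r'<b class="fw">\1</b>', escaped)


def hashed_color(label):
    """Stable pseudo-random hex color for a proper name or labeled thesis."""
    digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return "#" + digest[:6]


def render(paragraph):
    """One highlighted sentence per line, joined with HTML line breaks."""
    return "<br>\n".join(highlight(s) for s in split_sentences(paragraph))
```

Because hashed_color derives the color from the label itself, the same name gets the same color everywhere in the paper without keeping any lookup table.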
| johnsillings wrote:
| https://www.arxiv-vanity.com/
| jakderrida wrote:
| And, of course, https://ar5iv.labs.arxiv.org/html
|
| However, ar5iv isn't a la carte like arxiv-vanity. They pretty
| much do last month's papers every month or so. Something like
| that.
| dginev wrote:
| Hi, ar5iv creator here.
|
| You can think of both arxiv-vanity and ar5iv as the "alpha"
| experiments that led into the official arXiv "beta" HTML
| announced today.
|
| Once a few rounds of feedback and improvements are
| integrated, and the full collection of articles acquires HTML
| in the main arXiv site, ar5iv will be decommissioned.
|
| The plan is to turn all existing ar5iv links into redirects
| to the official HTML, and free up the resources for
| maintaining it. I am not sure what the plans are for
| maintaining arxiv-vanity, but I suspect they may head down a
| similar path some time later.
| philipashlock wrote:
| 30 years after HTML was invented to support accessibility and
| collaboration for research and academia, and the same day the
| White House released their new accessibility guidance, which
| happens to be the first time they've published formal new policy
| natively as HTML rather than PDF -
| https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...
| murphyslab wrote:
| I feel surprised by how succinct, easy-to-understand, and
| sensible the policy (M-23-22) is:
|
| > Default to HTML: HyperText Markup Language (HTML) is the
| standard for publishing documents designed to be displayed in a
| web browser. HTML provides numerous advantages (e.g., easier to
| make accessible, friendlier to assistive technology, more
| dynamic and responsive, easier to maintain). When developing
| information for the web, agencies should default to creating
| and publishing content in an HTML format in lieu of publishing
| content in other electronic document formats that are designed
| for printing or preserving and protecting the content and
| layout of the document (e.g., PDF and DOCX formats). An agency
| should develop online content in a non-HTML format only if
| necessitated by a specific user need.
|
| https://www.whitehouse.gov/omb/management/ofcio/delivering-a...
| golol wrote:
| IMO PDF and HTML optimize for different things. PDF is easy and
| pretty. HTML is easy and responsive. But making PDF responsive is
| impossible and making HTML pretty is not easy. I prefer having
| arXiv for well-polished pretty documents, not responsive ugly
| documents. Most researchers don't have time to make an HTML
| document responsive and pretty.
| querez wrote:
| Am researcher, care about responsiveness way more than pretty.
| I am super glad for the option. Downloading PDFs is super
| annoying. I'm stoked.
| radicalriddler wrote:
| FUCK YES (excuse my profanity). I have a tool that converts HTML
| to Neural Speech and I always wanted to push arXiv papers through
| it, but couldn't be bothered with a PDF implementation.
| topicseed wrote:
| What do they use to convert a PDF document to a clean, correct
| HTML document? It's a difficult space, especially with the
| variety of layouts you may find in PDF documents...
| blackbear_ wrote:
| arXiv encourages users to submit the LaTeX source of their
| papers rather than the PDF.
| vegabook wrote:
| PDF is objectively much better than HTML at rendering text
| documents. And it's not even close. This could easily have been
| done 10, even 15-20 years ago. That it wasn't is not just
| inertia. LaTeX and PDF have enormously better text rendering, and
| the static format locks a state-commit in time that is much
| easier to go back to and reference/critique. Unlike the
| intrinsically fluid nature of HTML. For academic work, milestone-
| like formats, that lock state in time, are useful for those who
| later build on them. And again, the rendering just doesn't
| compare and that imparts [sub]conscious quality signals.
| imranq wrote:
| At this point are academic papers simply peer-reviewed blog
| posts?
| acjohnson55 wrote:
| This is great! I browse papers on mobile, and PDF is so bad for
| that use case.
___________________________________________________________________
(page generated 2023-12-21 23:00 UTC)