[HN Gopher] Classifying all of the pdfs on the internet
___________________________________________________________________
Classifying all of the pdfs on the internet
Author : Nydhal
Score : 234 points
Date : 2024-08-19 12:23 UTC (10 hours ago)
(HTM) web link (snats.xyz)
(TXT) w3m dump (snats.xyz)
| afh1 wrote:
| Interesting read, I did not know about Common Crawl. I feel like
| RTBF is kind of a lost battle these days with more and more
| crawlers for AI and whatnot. Once on the internet there is no way
| back, for better or for worse. This tangent aside, 8TB is really
| not a lot of data, it's just 8 consumer-grade 1TB hard drives. I
| find it hard to believe this is "the largest corpus of PDFs
| online", maybe the largest public one. Not sure how
| representative it is of "the whole internet".
| Propelloni wrote:
| Doesn't sound like a lot, but where I am now we routinely work
| on very large infrastructure projects and the plans, documents
| and stuff mostly come as PDF. We are talking of thousands of
| documents, often with thousands of pages, per project and even
| very big projects almost never break 20 GB.
|
| If you like, you could say PDFs are information-dense but
| data-sparse. After all, it is mostly white space ;)
| IggleSniggle wrote:
| They often aren't like you're describing, though. For
| example, PDFs with high-res images embedded that are drafts
| of future book or pamphlet prints. These can be hundreds of
| MB for a single PDF with fewer than 100 pages, and are so
| common in marketing departments that it's hard to imagine
| that you could fit anywhere close to all the PDFs in 8TB.
| Propelloni wrote:
| True, we get plenty of high-res pictures of film in PDF
| here and some of them are ridiculously large, easily
| approaching gigabyte sizes, like you said. But that's more
| a problem of the user creating the PDF than inherent to
| PDFs. A raw 36-megapixel reproduction (our fancy 4K displays
| are only 8.3 megapixels, for comparison) of an ISO 400 film
| frame takes only about 70 MB, which tells us that something
| went wrong in the conversion if a PDF containing 10 pages of
| them cracks 1 GB.
|
| So, yeah, there are these monsters that send even beefy
| computers thrashing. But in my experience something in the
| creation process went wrong, and that is appallingly common
| in a trade where PDFs are the go-to transfer format (I'm
| looking at you, AutoCAD users!). I'd guess the archive does
| the same thing we do: reprocess them into sensible sizes and
| store the result. I assume you think the archive does not,
| and in that case I'd agree with you. One determined civil
| engineer with AutoCAD can fill 8 TB in a week ;)
| yfontana wrote:
| I've been doing some work for an infrastructure company as
| well. They have a total of about 1 billion pages of PDF
| documents in their archives. If we assume even just 30 KB per
| page (which is quite low; all the PDFs I just randomly
| checked were higher, sometimes by quite a bit), that's
| already 30 TB of PDFs, just for that one company with 1B in
| annual sales.
| daemonologist wrote:
| I'm doing some work for a company that handles scanned
| documents (PDFs which are purely images) and they accumulate
| about 15 TB / year. Of course the actual amount of
| information is relatively small, just inflated by being
| scanned. Probably 80% of them were typed up, printed, and
| then scanned or faxed, and of course the first thing we do is
| OCR them to try to recover the original text and
| formatting...
| moralestapia wrote:
| Libgen size is ~33TB so, no, it's not "the largest corpus of
| PDFs online".
|
| (Although you could argue libgen is not really "public" in the
| legal sense of the word, lol).
|
| Disregarding that, the article is great!
|
| (edit: why would someone downvote this, HN is becoming quite
| hostile lately)
| simonw wrote:
| 8TB - ~8,000GB - is more than 33GB.
| moralestapia wrote:
| Whoops, typo!
|
| But that's what the comments are for, not the downvotes.
| matthewaveryusa wrote:
| It's being down voted because your number is really off.
| Libgen's corpus is 100+ TB
| mellosouls wrote:
| I haven't downvoted you but it is presumably because of your
| hasty typing or lack of proofreading/research.
|
| 33TB (first google result from 5 years ago) not 33GB. Larger
| figures from more recently.
| moralestapia wrote:
| >hasty typing or lack of proofreading/research
|
| This is exactly what I meant with "HN is becoming quite
| hostile"
|
| * I brought up something I looked up to support GP's
| argument.
|
| * The argument is correct.
|
| * I do it in good faith.
|
| * G is literally next to T.
|
| * I even praise the article, while at it.
|
| "Oh, but you made a typo!".
|
| Good luck, guys. I'm out.
|
| PS. I will give my whole 7 figure net worth, no questions
| asked, transferred immediately to any account of their
| choice, to anyone here who has not ever made a typo in
| their life.
| ozr wrote:
| Don't take it too personally. Downvoting/flagging it
| makes it clear to people who come across it in the future
| that it's wrong.
| dotancohen wrote:
| > I will give all my 7 figure net worth, no questions
| asked, transferred immediately to any account of their
| choice, to anyone here who has not ever made a typo in
| their life.
|
| My greatest typo was saying "I Do" when it should have
| been "I Go".
| llm_trw wrote:
| > I will give my whole 7 figure net worth
|
| You sound deeply unpleasant to talk to.
|
| Imaginary internet points are just that.
| mellosouls wrote:
| Like I said, I didn't downvote and took the time to
| answer your question. I didn't take the time to sugarcoat
| it.
|
| You are interpreting bluntness as hostility; that's
| ultimately an issue for you to resolve.
| moralestapia wrote:
| You don't have to sugarcoat it.
|
| You just have to read this site's guidelines and follow
| them.
|
| Ez pz.
| mellosouls wrote:
| Have been throughout. Anyway, I hope you are able to
| reconsider and move on within HN.
| samatman wrote:
| > _Please don't comment about the voting on comments. It
| never does any good, and it makes boring reading._
|
| https://news.ycombinator.com/newsguidelines.html
| Kerb_ wrote:
| I haven't ever made a typo, all of my mispelings are
| intended and therefore not mistakes
| tecleandor wrote:
| I think Libgen is ~100TB, and the full Anna's Archive is near
| a PB.
|
| They all probably contain lots of duplicates but...
|
| https://annas-archive.se/datasets
| dotancohen wrote:
| I upvoted this comment because, though the number is wrong,
| it proves the point. The fact that the correct number proves
| the point even more is a reason _not_ to downvote the
| comment.
| reaperducer wrote:
| _(edit: why would someone downvote this, HN is becoming quite
| hostile lately)_
|
| Also, there are browser extensions that will automatically
| downvote and/or hide HN comments that use words like "lol,"
| or start with "So..." or include any of a number of words
| that the user considers indicative of low-grade content.
| tokai wrote:
| Yeah 8TB is really tiny. Google Scholar was estimated to index
| 160,000,000 PDFs in 2015.[0] If we assume that a third of those
| are not behind paywalls, and the average PDF size is 1 MB, it
| ends up as something above 50TB of documents. Almost ten years
| later the number of available PDFs of just scholarly
| communication should be substantially higher.
|
| [0] https://link.springer.com/article/10.1007/s11192-015-1614-6
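|
| The back-of-envelope math, with the assumptions above spelled
| out (the one-third open-access share and 1 MB average are
| rough guesses, not measured figures):
|
|     open_pdfs = 160_000_000 / 3     # assume a third are open
|     avg_size_mb = 1                 # assumed average PDF size
|     total_tb = open_pdfs * avg_size_mb / 1_000_000
|     print(f"~{total_tb:.0f} TB")    # ~53 TB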
| elorant wrote:
| Anna's Archive has some 300M PDFs.
| tokai wrote:
| We're talking about the open web here. But yeah that's the
| point, the dataset is unreasonably small.
| deweller wrote:
| Is it possible that the 8 TB is just the extracted text?
| tokai wrote:
| No, the SafeDocs dataset is unprocessed PDFs.
| ziddoap wrote:
| > _I feel like RTBF is kind of a lost battle these days_
|
| For those of us who aren't familiar with this random acronym, I
| _think_ RTBF = right to be forgotten.
| ronsor wrote:
| RTBF was a ludicrous concept before AI and these new crawlers.
|
| Only EU bureaucrats would have the hubris to believe you could
| actually, comprehensively _remove_ information from the
| Internet. Once something is spread, it is there, forever.
| oersted wrote:
| Correct me if I'm wrong, but I always took RTBF to mean you
| have the right to be forgotten by any specific service
| provider: that you can request they delete the data they have
| that relates to you, and that they forward the request to any
| subprocessors. That's fairly reasonable and doable, it is
| enforced by GDPR and a number of other wide-reaching laws
| already, and it is a relatively common practice nowadays to
| allow users to make such requests with certain guarantees.
|
| It never meant that you have the right to ask "the Internet"
| as a whole to scrub you from all possible records, that's
| indeed ludicrous. And if someone took it to mean that and
| they were pushing for it, they were just confused, no serious
| law ever proposed that.
| miohtama wrote:
| There is a whole business sector for "Online reputation
| fixers"
|
| https://www.mycleanslate.co.uk/
|
| What they usually do:
|
| - Spam Google with the name to bury content
|
| - Send legal threats and use the GDPR
|
| They have legit use cases, but are often used by convicted
| or shady businessmen, politicians, and scammers to hide
| their earlier misdeeds.
| gsck wrote:
| RTBF isn't about having your information wiped from the
| internet. It's a safe assumption that any public information
| about you is completely out of your control as soon as it's
| public.
|
| RTBF is about getting companies to get rid of any trace of
| you so they cannot use that data, not removing all traces
| about you across the internet.
| fsckboy wrote:
| > _RTBF isn't about having your information wiped from the
| internet._
|
| Your take is misleading enough to be considered wrong. It's
| "don't use public information about me in search engines, I
| don't want people to find that information about me", not
| simply "don't use my information for marketing purposes"
|
| https://en.wikipedia.org/wiki/Right_to_be_forgotten
|
| first paragraph of the article: _The right to be forgotten
| (RTBF) is the right to have private information about a
| person be removed from Internet searches and other
| directories in some circumstances. The issue has arisen
| from desires of individuals to "determine the development
| of their life in an autonomous way, without being
| perpetually or periodically stigmatized as a consequence of
| a specific action performed in the past". The right
| entitles a person to have data about them deleted so that
| it can no longer be discovered by third parties,
| particularly through search engines._
| specialist wrote:
| Once demographic data cannot be crawled or cached by 3rd
| parties, we get RTBF for free.
| gwervc wrote:
| > Once something is spread, it is there, forever.
|
| Really depends on the content. Tons of websites go down every
| day; link rot is a real thing. The Internet Archive and
| individual people don't save nearly everything.
|
| Something I should do more often is saving mhtml copies of
| webpages I find interesting.
| PaulHoule wrote:
| As a neurodivergent person I feel very much discriminated
| against when a whole continent weaponizes the law to protect
| scam artists who weaponize their social skills to steal from
| people. It makes me feel unwelcome going to Europe, and for
| all the hand-wringing about Europe's poor economic performance
| it is yet another explanation of why Europe is falling behind
| -- their wealth is being stolen by people who can't be held
| accountable.
| jononor wrote:
| Which scam artists are you referring to?
| PaulHoule wrote:
| The ones who have filed lawsuits to try to get people in
| Europe to forget about their crimes.
| jononor wrote:
| Do you have some examples? I was not aware that this was
| a thing. And are we talking about sentences fully served,
| or before that time?
| tivert wrote:
| > RTBF
|
| Right to be forgotten, not the Belgian public service
| broadcaster (https://en.wikipedia.org/wiki/RTBF)?
| lkuty wrote:
| Living in Belgium, I first thought that it was about the
| TV/radio service. Never saw the acronym R.T.B.F.
| seanw265 wrote:
| Tangentially related, I was once handed a _single_ PDF between
| 2 and 5 GB in size and asked to run inference on it. This was
| the result of a miscommunication with the data provider, but I
| think it's funny and almost impressive that this file even
| exists.
| SnowflakeOnIce wrote:
| Common Crawl only pulls documents below a small size limit
| (1 MiB last I checked). Without special handling in this
| project, documents bigger than that would be missing.
|
| So indeed, not representative of the whole Internet.
| ziddoap wrote:
| From the article:
|
| > _Specifically, when Common Crawl gets to a pdf, it just
| stores the first megabyte of information and truncates the
| rest.
|
| This is where SafeDocs or CC-MAIN-2021-31-PDF-UNTRUNCATED
| enters the picture. This corpus was originally created by the
| DARPA SafeDocs program and what it did was refetch all the
| different pdfs from a snapshot of Common Crawl to have
| untruncated versions of them._
| buildbot wrote:
| I have 20-40TB (pre-dedup) of PDFs - 8TB is a lot, but not even
| close to the total amount of PDF data available.
| sporedro wrote:
| Just wondering what do you collect? Is it mainly mirroring
| things like libgen?
|
| I have a decent collection of ebooks/pdfs/manga from reading.
| But I can't imagine how large a 20TB library is.
| buildbot wrote:
| No torrents at all in this data, all publicly available/open
| access. Mostly scientific pdfs, and a good portion of those
| are scans not just text. So the actual text amount is
| probably pretty low compared to the total. But still, a lot
| more than 8TB of raw data out there. I bet the total volume
| of PDFs is close to a petabyte if not more.
| tylerflick wrote:
| > I bet the total volume of PDFs is close to a petabyte if
| not more.
|
| That's a safe bet. I've seen PDFs in the GBs from users
| treating it like a container format (which it is).
| Maxion wrote:
| It's probably tens of petabytes if not more, if you count
| PDFs that'd be private. Invoices, order confirmations,
| contracts. There's just so so much.
| reaperducer wrote:
| _Just wondering what do you collect?_
|
| I can't speak for the OP, but you can buy optical media of
| old out-of-print magazines scanned as PDFs.
|
| I bought the entirety of _Desert Magazine_ from 1937-1985. It
| arrived on something like 15 CD-ROMS.
|
| I drag-and-dropped the entire collection into iBooks, and
| read them when I'm on the train.
|
| (Yes, they're probably on archive.org for free, but this is
| far easier and more convenient, and I prefer to support
| publishers rather than undermine their efforts.)
| buildbot wrote:
| Yep, a good bit of them are from sources like this :)
| mehulashah wrote:
| Care to make it publicly available? Or is that not permitted on
| your dataset? Certainly, there's a lot more than 8TB of PDFs
| out there. I bet there's a lot of redundancy in yours, but it
| doesn't dedup well because of all the images.
| buildbot wrote:
| I think that would be legally iffy for the stuff like
| collections of old magazines that were purchased on CD/DVD
| and such :/
| Thaxll wrote:
| First you need a good PDF library :/
| llm_trw wrote:
| Back in 2006 there were multiple 1TB collections of textbooks as
| torrents. I imagine the size and number has only grown since
| then.
| namrog84 wrote:
| That was before hoarding and building questionable businesses
| around them became a thing. I remember it being really easy to
| find textbooks, solution manuals, and related PDFs as late as
| 2008, far easier than 6-8 years later.
|
| The main difference was that sites like Chegg and many other
| sites started slurping them up to resell in some way.
| loa_in_ wrote:
| It doesn't take away the torrents, no?
| TuringNYC wrote:
| I've been playing with https://www.aryn.ai/ for Partitioning.
| Curious if anyone has tried these tools for better data
| extraction from PDFs. Any other suggestions?
|
| (I'm a bit disappointed that most of the discussion is about
| estimating the size of PDFs on the internet, I'd love to hear
| more about different approaches to extracting better data from
| the PDFs.)
| dwynings wrote:
| https://www.sensible.so/
|
| Full disclosure: I'm an employee
| ned_at_codomain wrote:
| This is a really cool idea, thanks for sharing. I don't have that
| much free time these days, but I was thinking of trying a
| similar-but-different project not too long ago.
|
| I wanted to make a bit of an open source tool to pull down useful
| time series data for the social sciences (e.g. time series of
| social media comments about grocery prices). Seems like LLMs have
| unlocked all kinds of new research angles that people aren't
| using yet.
|
| I may steal some of your good ideas if I ever get to work on that
| side project :)
| whistle650 wrote:
| Interesting read with lots of good detail, thank you. A comment:
| if you are balancing the classes when you do one-vs-all binary
| training, and then use the max probability for inference, your
| probabilities might not be calibrated well, which could be a
| problem. Do you correct the probabilities before taking the
| argmax?
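|
| Roughly what I have in mind, as a minimal sketch (assuming
| scikit-learn, with synthetic data standing in for the real
| embedding features; this is illustrative, not the author's
| pipeline):
|
|     from sklearn.calibration import CalibratedClassifierCV
|     from sklearn.datasets import make_classification
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.model_selection import train_test_split
|     from sklearn.multiclass import OneVsRestClassifier
|
|     # Synthetic stand-in for embedding vectors + class labels.
|     X, y = make_classification(n_samples=600, n_features=64,
|                                n_informative=16, n_classes=3,
|                                random_state=0)
|     X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
|
|     base = LogisticRegression(max_iter=1000,
|                               class_weight="balanced")
|     # Calibrate each one-vs-rest binary model before comparing.
|     clf = OneVsRestClassifier(
|         CalibratedClassifierCV(base, method="sigmoid", cv=3))
|     clf.fit(X_tr, y_tr)
|     probs = clf.predict_proba(X_te)  # per-class probabilities
|     preds = probs.argmax(axis=1)     # argmax over calibrated scores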
| minimaxir wrote:
| One of the now-underdiscussed features of embeddings is that you
| can indeed use any existing statistical modeling techniques on
| them out of the box, and as a bonus avoid the common NLP
| preprocessing nuances and pitfalls (e.g. stemming) entirely.
|
| This post is a good example of why going straight to LLM
| embeddings for NLP is a pragmatic first step, especially for
| long documents.
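|
| As a concrete sketch of "out of the box" (assuming
| sentence-transformers and scikit-learn; the model name, toy
| documents, and labels are all illustrative, not from the
| post):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.linear_model import LogisticRegression
|
|     docs = ["Quarterly revenue grew 12% year over year.",
|             "Install the driver before connecting the device.",
|             "The defendant filed a motion to dismiss.",
|             "Net income and operating margin both improved."]
|     labels = ["finance", "manual", "legal", "finance"]
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     X = model.encode(docs)   # no stemming or stopword fiddling
|     clf = LogisticRegression(max_iter=1000).fit(X, labels)
|     print(clf.predict(model.encode(["Gross margin fell in Q3."])))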
| guiomie wrote:
| Interesting and fun article! I've been experimenting with various
| LLMs/GenAI solutions to extract tabular data from PDFs with
| underwhelming results. It seems like they are good at extracting
| strings of text and summarizing (e.g. what was the total price?
| when was this printed?) but extracting reliably into a CSV has a
| decent margin of error.
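|
| For born-digital PDFs (not scans), a non-LLM baseline is
| sometimes more reliable for tables. A minimal sketch, assuming
| pdfplumber and pandas; the input file name is made up:
|
|     import pdfplumber
|     import pandas as pd
|
|     tables = []
|     with pdfplumber.open("invoice.pdf") as pdf:  # hypothetical file
|         for page in pdf.pages:
|             # Each table is a list of rows (lists of cell strings).
|             for rows in page.extract_tables():
|                 df = pd.DataFrame(rows[1:], columns=rows[0])
|                 tables.append(df)
|
|     for i, df in enumerate(tables):
|         df.to_csv(f"table_{i}.csv", index=False)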
| abhi_p wrote:
| Disclosure: I'm an employee.
|
| Give the Aryn partitioning service a shot:
| https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
|
| We recently released it and we've a few examples here:
| https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta...
| that show you how to turn the tabular data from the pdf into a
| pandas dataframe(which you can then turn into csv).
| layer8 wrote:
| I would have expected categories for product brochures and
| product manuals.
| autokad wrote:
| Would be interesting to see if they tried LDA (latent Dirichlet
| allocation) topics.
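|
| A quick way to try it, as a toy sketch with scikit-learn (the
| documents are placeholders; on the real corpus you would feed
| in the extracted PDF text):
|
|     from sklearn.decomposition import LatentDirichletAllocation
|     from sklearn.feature_extraction.text import CountVectorizer
|
|     docs = ["quarterly revenue and operating income statement",
|             "install the driver and reboot the device",
|             "the court granted the motion to dismiss",
|             "net income margin and free cash flow"]
|
|     vec = CountVectorizer(stop_words="english").fit(docs)
|     X = vec.transform(docs)
|     lda = LatentDirichletAllocation(n_components=2,
|                                     random_state=0).fit(X)
|
|     terms = vec.get_feature_names_out()
|     for k, topic in enumerate(lda.components_):
|         top = [terms[i] for i in topic.argsort()[-5:][::-1]]
|         print(f"topic {k}: {top}")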
| snats wrote:
| Hi! Author here, I wasn't expecting this to be at the top of HN,
| AMA
| ks2048 wrote:
| > _I don't have 8TB laying around, but we can be a bit more
| clever.... In particular I cared about a specific column
| called url. I really care about the urls because they
| essentially tell us a lot more from a website than what meats
| the eye._
|
| Am I correct that it is only using the URL of the PDF to do
| classification? Maybe still useful, but that's quite a
| different story than "classifying all the pdfs".
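|
| If so, conceptually it is something like this toy sketch,
| which never opens a single PDF (scikit-learn; the URLs and
| labels are made up, and this is not the author's actual
| pipeline):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     urls = ["example.org/10-k-annual-report-2020.pdf",
|             "example.com/router-ac1200-user-manual.pdf",
|             "example.net/court-filing-motion-to-dismiss.pdf",
|             "example.org/q3-earnings-press-release.pdf"]
|     labels = ["finance", "manual", "legal", "finance"]
|
|     # Character n-grams handle the run-together tokens in URLs.
|     clf = make_pipeline(
|         TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
|         LogisticRegression(max_iter=1000))
|     clf.fit(urls, labels)
|     print(clf.predict(["example.com/annual-report-fy2021.pdf"]))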
| xattt wrote:
| It's just classifying the URLs if that's the case.
|
| The legwork to classify PDFs is already done, and the
| authorship of the article can go to anyone who can get a grant
| for a $400 NewEgg order for an 8TB drive.
| gnewton77 wrote:
| Did some similar work with similar visualizations ~2009, on ~5.7M
| research articles (PDFs, private corpus) from scientific
| publishers Elsevier, Springer:
|
| Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal
| Mapping for Search Visualization in a Large Scale Article Digital
| Library. Second Workshop on Very Large Digital Libraries at the
| European Conference on Digital Libraries (ECDL) 2009.
| https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
|
| I am the first author.
| byteknight wrote:
| This seems like cool work but with a ton of "marketing hype
| speak" that immediately gets watered down by the first paragraph.
|
| Ordering of statements.
|
| 1. (Title) Classifying _all_ of the pdfs on the internet
|
| 2. (First Paragraph) Well not all, but all the PDFs in Common
| Crawl
|
| 3. (First Image) Well not all of them, but 500k of them.
|
| I am not knocking the project, but while categorizing 500k PDFs
| is something we couldn't necessarily do well a few years ago,
| this is far from "The internet's PDFs".
| 1-6 wrote:
| Overpromise with headline, underdeliver on details.
| muratsu wrote:
| I would have expected the finetuned model to perform much better.
| Would be curious to see the performance with other models
| mehulashah wrote:
| Classification is just a start. Wondering if it's worth doing
| something more -- like turning all of the text into Markdown or
| HTML? Would anyone find that interesting?
| Treesrule14 wrote:
| There are a lot of webcrawlers whose chief feature is turning
| the website into markdown. I don't quite understand what they
| are doing for me that's useful, since I can just do something
| like `markdownify(my_html)` or whatever. All this to say, I
| wouldn't find this useful, but clearly people do think this is
| a useful feature as part of an LLM pipeline.
| loa_in_ wrote:
| You don't want the footer or navigation in the output.
| Ideally you want the main content of the page, if it exists.
| How do you assign header level if they're only differentiated
| by CSS left-margin in a variety of units? How do you
| interpret documents that render properly but are hardly
| correct HTML?
| Treesrule14 wrote:
| Thanks. I guess none of that stuff seemed super useful to
| cut systematically, but I'm gonna run some tests.
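|
| A rough starting point for those tests, as a minimal sketch
| (assuming beautifulsoup4 and the markdownify package; the tag
| list to strip is a heuristic, not a standard):
|
|     from bs4 import BeautifulSoup
|     from markdownify import markdownify as md
|
|     html = ("<html><body><nav>Home / About</nav><main><h1>Title"
|             "</h1><p>Body text.</p></main><footer>(c) 2024"
|             "</footer></body></html>")
|
|     soup = BeautifulSoup(html, "html.parser")
|     # Drop navigation, footer, and other boilerplate elements.
|     for tag in soup.find_all(["nav", "footer", "aside",
|                               "script", "style"]):
|         tag.decompose()
|     main = soup.find("main") or soup.body  # prefer <main> if present
|     print(md(str(main)))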
| excalibur wrote:
| > How would you classify all the pdfs in the internet?
|
| Definitely as 'hot dog' or 'not a hot dog'.
___________________________________________________________________
(page generated 2024-08-19 23:00 UTC)