[HN Gopher] Classifying all of the pdfs on the internet
___________________________________________________________________
Classifying all of the pdfs on the internet
Author : Nydhal
Score : 234 points
Date : 2024-08-19 12:23 UTC (10 hours ago)
(HTM) web link (snats.xyz)
(TXT) w3m dump (snats.xyz)
| afh1 wrote:
| Interesting read, I did not know about Common Crawl. I feel like
| RTBF is kind of a lost battle these days with more and more
| crawlers for AI and whatnot. Once on the internet there is no way
| back, for better or for worse. This tangent aside, 8TB is really
| not a lot of data, it's just 8 consumer-grade 1TB hard drives. I
| find it hard to believe this is "the largest corpus of PDFs
| online", maybe the largest public one. Not sure how
| representative it is of "the whole internet".
| Propelloni wrote:
| Doesn't sound like a lot, but where I am now we routinely work
| on very large infrastructure projects and the plans, documents
| and stuff mostly come as PDF. We are talking of thousands of
| documents, often with thousands of pages, per project and even
| very big projects almost never break 20 GB.
|
| If you like, you could say PDFs are information-dense but
| data-sparse. After all, it is mostly white space ;)
| IggleSniggle wrote:
| They often aren't like you're describing, though. For
| example, PDFs with high-res images embedded that are drafts
| of future book or pamphlet prints. These can be hundreds of
| MB for a single PDF with fewer than 100 pages, and are so
| common in marketing departments that it's hard to imagine
| that you could fit anywhere close to all the PDFs in 8TB.
| Propelloni wrote:
| True, we get plenty of high-res pictures of film in PDF
| here and some of them are ridiculously large, easily
| approaching gigabyte sizes, like you said. But that's more
| a problem of the user creating the PDF than inherent to
| PDFs. A raw 36-megapixel reproduction (our fancy 4K displays
| are only 8.3 megapixels, for comparison) of an ISO 400 film
| frame takes only about 70 MB, which tells us that something
| went wrong in the conversion if a PDF containing 10 pages of
| them cracks 1 GB.
|
| So, yeah, there are these monsters that send even beefy
| computers thrashing. But in my experience something in the
| creation process went wrong, and that is appallingly common
| in a trade where PDFs are the go-to transfer format (I'm
| looking at you, AutoCAD users!). I'd guess the archive does
| the same thing we do: reprocess them into sensible sizes and
| store the result. I assume you think the archive does not,
| and in that case I'd agree with you. One determined civil
| engineer with AutoCAD can fill 8 TB in a week ;)
| yfontana wrote:
| I've been doing some work for an infrastructure company as
| well. They have a total of about 1 billion pages of PDF
| documents in their archives. If we assume even just 30 KB per
| page (which is quite low; all the PDFs I just randomly
| checked were higher, sometimes by quite a bit), that's
| already 30 TB of PDFs, just for that one company with 1B in
| annual sales.
| daemonologist wrote:
| I'm doing some work for a company that handles scanned
| documents (PDFs which are purely images) and they accumulate
| about 15 TB / year. Of course the actual amount of
| information is relatively small, just inflated by being
| scanned. Probably 80% of them were typed up, printed, and
| then scanned or faxed, and of course the first thing we do is
| OCR them to try to recover the original text and
| formatting...
| moralestapia wrote:
| Libgen size is ~33TB so, no, it's not "the largest corpus of
| PDFs online".
|
| (Although you could argue libgen is not really "public" in the
| legal sense of the word, lol).
|
| Disregarding that, the article is great!
|
| (edit: why would someone downvote this, HN is becoming quite
| hostile lately)
| simonw wrote:
| 8TB - ~8,000GB - is more than 33GB.
| moralestapia wrote:
| Whoops, typo!
|
| But that's what the comments are for, not the downvotes.
| matthewaveryusa wrote:
| It's being down voted because your number is really off.
| Libgen's corpus is 100+ TB
| mellosouls wrote:
| I haven't downvoted you but it is presumably because of your
| hasty typing or lack of proofreading/research.
|
| 33TB (first google result from 5 years ago) not 33GB. Larger
| figures from more recently.
| moralestapia wrote:
| >hasty typing or lack of proofreading/research
|
| This is exactly what I meant with "HN is becoming quite
| hostile"
|
| * I brought up something I looked up to support GP's
| argument.
|
| * The argument is correct.
|
| * I do it in good faith.
|
| * G is literally next to T.
|
| * I even praise the article, while at it.
|
| "Oh, but you made a typo!".
|
| Good luck, guys. I'm out.
|
| PS. I will give my whole 7 figure net worth, no questions
| asked, transferred immediately to any account of their
| choice, to anyone here who has not ever made a typo in
| their life.
| ozr wrote:
| Don't take it too personally. Downvoting/flagging it
| makes it clear to people who come across it in the future
| that it's wrong.
| dotancohen wrote:
| > I will give all my 7 figure net worth, no questions
| asked, transferred immediately to any account of their
| choice, to anyone here who has not ever made a typo in
| their life.
|
| My greatest typo was saying "I Do" when it should have
| been "I Go".
| llm_trw wrote:
| > I will give my whole 7 figure net worth
|
| You sound deeply unpleasant to talk to.
|
| Imaginary internet points are just that.
| mellosouls wrote:
| Like I said, I didn't downvote and took the time to
| answer your question. I didn't take the time to sugarcoat
| it.
|
| You are interpreting bluntness as hostility; that's
| ultimately an issue for you to resolve.
| moralestapia wrote:
| You don't have to sugarcoat it.
|
| You just have to read this site's guidelines and follow
| them.
|
| Ez pz.
| mellosouls wrote:
| Have been throughout. Anyway, I hope you are able to
| reconsider and move on within HN.
| samatman wrote:
| > _Please don't comment about the voting on comments. It
| never does any good, and it makes boring reading._
|
| https://news.ycombinator.com/newsguidelines.html
| Kerb_ wrote:
| I haven't ever made a typo, all of my mispelings are
| intended and therefore not mistakes
| tecleandor wrote:
| I think Libgen is ~100TB, and the full Anna's Archive is near
| a PB.
|
| They all probably contain lots of duplicates but...
|
| https://annas-archive.se/datasets
| dotancohen wrote:
| I upvoted this comment because, though the number is wrong,
| it proves the point. The fact that the correct number proves
| the point even more is a reason _not_ to downvote the
| comment.
| reaperducer wrote:
| _(edit: why would someone downvote this, HN is becoming quite
| hostile lately)_
|
| Also, there are browser extensions that will automatically
| downvote and/or hide HN comments that use words like "lol,"
| or start with "So..." or include any of a number of words
| that the user considers indicative of low-grade content.
| tokai wrote:
| Yeah 8TB is really tiny. Google Scholar was estimated to index
| 160,000,000 PDFs in 2015.[0] If we assume that a third of those
| are not behind paywalls, and the average PDF size is 1 MB, it
| ends up as something above 50TB of documents. Almost ten years
| later the number of available PDFs of just scholarly
| communication should be substantially higher.
|
| [0] https://link.springer.com/article/10.1007/s11192-015-1614-6
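|
| The back-of-envelope math, with the assumptions above spelled
| out (the one-third open-access share and 1 MB average are
| rough guesses, not measured figures):
|
|     open_pdfs = 160_000_000 / 3     # assume a third are open
|     avg_size_mb = 1                 # assumed average PDF size
|     total_tb = open_pdfs * avg_size_mb / 1_000_000
|     print(f"~{total_tb:.0f} TB")    # ~53 TB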
| elorant wrote:
| Anna's Archive has some 300M PDFs.
| tokai wrote:
| We're talking about the open web here. But yeah that's the
| point, the dataset is unreasonably small.
| deweller wrote:
| Is it possible that the 8 TB is just the extracted text?
| tokai wrote:
| No, the SafeDocs dataset is unprocessed PDFs.
| ziddoap wrote:
| > _I feel like RTBF is kind of a lost battle these days_
|
| For those of us who aren't familiar with this random acronym, I
| _think_ RTBF = right to be forgotten.
| ronsor wrote:
| RTBF was a ludicrous concept before AI and these new crawlers.
|
| Only EU bureaucrats would have the hubris to believe you could
| actually, comprehensively _remove_ information from the
| Internet. Once something is spread, it is there, forever.
| oersted wrote:
| Correct me if I'm wrong, but I always took RTBF to mean you
| have the right to be forgotten by any specific service
| provider: that you can request they delete the data they have
| that relates to you, and that they forward the request to any
| subprocessors. That's fairly reasonable and doable, it is
| enforced by GDPR and a number of other wide-reaching laws
| already, and it is a relatively common practice nowadays to
| allow users to make such requests with certain guarantees.
|
| It never meant that you have the right to ask "the Internet"
| as a whole to scrub you from all possible records, that's
| indeed ludicrous. And if someone took it to mean that and
| they were pushing for it, they were just confused, no serious
| law ever proposed that.
| miohtama wrote:
| There is a whole business sector for "Online reputation
| fixers"
|
| https://www.mycleanslate.co.uk/
|
| What they usually do:
|
| - Spam Google with the name to bury content
|
| - Send legal threats and use the GDPR
|
| They have legit use cases, but are often used by convicted
| or shady businessmen, politicians, and scammers to hide
| their earlier misdeeds.
| gsck wrote:
| RTBF isn't about having your information wiped from the
| internet. It's a safe assumption that any public information
| about you is completely out of your control as soon as it's
| public.
|
| RTBF is about getting companies to get rid of any trace of
| you so they cannot use that data, not removing all traces
| about you across the internet.
| fsckboy wrote:
| > _RTBF isn't about having your information wiped from the
| internet._
|
| Your take is misleading enough to be considered wrong. It's
| "don't use public information about me in search engines, I
| don't want people to find that information about me", not
| simply "don't use my information for marketing purposes"
|
| https://en.wikipedia.org/wiki/Right_to_be_forgotten
|
| first paragraph of the article: _The right to be forgotten
| (RTBF) is the right to have private information about a
| person be removed from Internet searches and other
| directories in some circumstances. The issue has arisen
| from desires of individuals to "determine the development
| of their life in an autonomous way, without being
| perpetually or periodically stigmatized as a consequence of
| a specific action performed in the past". The right
| entitles a person to have data about them deleted so that
| it can no longer be discovered by third parties,
| particularly through search engines._
| specialist wrote:
| Once demographic data cannot be crawled or cached by 3rd
| parties, we get RTBF for free.
| gwervc wrote:
| > Once something is spread, it is there, forever.
|
| Really depends on the content. Tons of websites go down every
| day; link rot is a real thing. The Internet Archive and
| individual people don't save nearly everything.
|
| Something I should do more often is saving mhtml copies of
| webpages I find interesting.
| PaulHoule wrote:
| As a neurodivergent person I feel very much discriminated
| against when a whole continent weaponizes the law to protect
| scam artists who weaponize their social skills to steal from
| people. It makes me feel unwelcome going to Europe, and for
| all the hand-wringing about Europe's poor economic performance
| it is yet another explanation of why Europe is falling behind
| -- their wealth is being stolen by people who can't be held
| accountable.
| jononor wrote:
| Which scam artists are you referring to?
| PaulHoule wrote:
| The ones who have filed lawsuits to try to get people in
| Europe to forget about their crimes.
| jononor wrote:
| Do you have some examples? I was not aware that this was
| a thing. And are we talking about sentences fully served,
| or before that time?
| tivert wrote:
| > RTBF
|
| Right to be forgotten, not the Belgian public service
| broadcaster (https://en.wikipedia.org/wiki/RTBF)?
| lkuty wrote:
| Living in Belgium, I first thought that it was about the
| TV/radio service. Never saw the acronym R.T.B.F.
| seanw265 wrote:
| Tangentially related, I was once handed a _single_ PDF between
| 2 and 5 GB in size and asked to run inference on it. This was
| the result of a miscommunication with the data provider, but I
| think it's funny and almost impressive that this file even
| exists.
| SnowflakeOnIce wrote:
| Common Crawl only pulls documents below a small size limit
| (1 MiB last I checked). Without special handling in this
| project, documents bigger than that would be missing.
|
| So indeed, not representative of the whole Internet.
| ziddoap wrote:
| From the article:
|
| > _Specifically, when Common Crawl gets to a pdf, it just
| stores the first megabyte of information and truncates the
| rest.
|
| This is where SafeDocs or CC-MAIN-2021-31-PDF-UNTRUNCATED
| enters the picture. This corpus was originally created by the
| DARPA SafeDocs program and what it did was refetch all the
| different pdfs from a snapshot of Common Crawl to have
| untruncated versions of them._
| buildbot wrote:
| I have 20-40TB (pre-dedup) of PDFs - 8TB is a lot, but not even
| close to the total amount of PDF data available.
| sporedro wrote:
| Just wondering what do you collect? Is it mainly mirroring
| things like libgen?
|
| I have a decent collection of ebooks/pdfs/manga from reading.
| But I can't imagine how large a 20TB library is.
| buildbot wrote:
| No torrents at all in this data, all publicly available/open
| access. Mostly scientific pdfs, and a good portion of those
| are scans not just text. So the actual text amount is
| probably pretty low compared to the total. But still, a lot
| more than 8TB of raw data out there. I bet the total volume
| of PDFs is close to a petabyte if not more.
| tylerflick wrote:
| > I bet the total volume of PDFs is close to a petabyte if
| not more.
|
| That's a safe bet. I've seen PDFs in the GBs from users
| treating it like a container format (which it is).
| Maxion wrote:
| It's probably tens of petabytes if not more, if you count
| PDFs that'd be private. Invoices, order confirmations,
| contracts. There's just so so much.
| reaperducer wrote:
| _Just wondering what do you collect?_
|
| I can't speak for the OP, but you can buy optical media of
| old out-of-print magazines scanned as PDFs.
|
| I bought the entirety of _Desert Magazine_ from 1937-1985. It
| arrived on something like 15 CD-ROMS.
|
| I drag-and-dropped the entire collection into iBooks, and
| read them when I'm on the train.
|
| (Yes, they're probably on archive.org for free, but this is
| far easier and more convenient, and I prefer to support
| publishers rather than undermine their efforts.)
| buildbot wrote:
| Yep, a good bit of them are from sources like this :)
| mehulashah wrote:
| Care to make it publicly available? Or is that not permitted on
| your dataset? Certainly, there's a lot more than 8TB of PDFs
| out there. I bet there's a lot of redundancy in yours, but it
| doesn't dedup well because of all the images.
| buildbot wrote:
| I think that would be legally iffy for the stuff like
| collections of old magazines that were purchased on CD/DVD
| and such :/
| Thaxll wrote:
| First you need a good PDF library :/
| llm_trw wrote:
| Back in 2006 there were multiple 1TB collections of textbooks as
| torrents. I imagine the size and number has only grown since
| then.
| namrog84 wrote:
| That was before hoarding and building questionable businesses
| around them became a thing. I remember it being really easy to
| find textbooks, solution manuals, and related PDFs as late as
| 2008, far easier than 6-8 years later.
|
| The main difference was that sites like Chegg and many other
| sites started slurping them up to resell in some way.
| loa_in_ wrote:
| It doesn't take away the torrents, no?
| TuringNYC wrote:
| I've been playing with https://www.aryn.ai/ for Partitioning.
| Curious if anyone has tried these tools for better data
| extraction from PDFs. Any other suggestions?
|
| (I'm a bit disappointed that most of the discussion is about
| estimating the size of PDFs on the internet, I'd love to hear
| more about different approaches to extracting better data from
| the PDFs.)
| dwynings wrote:
| https://www.sensible.so/
|
| Full disclosure: I'm an employee
| ned_at_codomain wrote:
| This is a really cool idea, thanks for sharing. I don't have that
| much free time these days, but I was thinking of trying a
| similar-but-different project not too long ago.
|
| I wanted to make a bit of an open source tool to pull down useful
| time series data for the social sciences (e.g. time series of
| social media comments about grocery prices). Seems like LLMs have
| unlocked all kinds of new research angles that people aren't
| using yet.
|
| I may steal some of your good ideas if I ever get to work on that
| side project :)
| whistle650 wrote:
| Interesting read with lots of good detail, thank you. A comment:
| if you are balancing the classes when you do one-vs-all binary
| training, and then use the max probability for inference, your
| probabilities might not be calibrated well, which could be a
| problem. Do you correct the probabilities before taking the
| argmax?
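|
| Roughly what I have in mind, as a minimal sketch (assuming
| scikit-learn, with synthetic data standing in for the real
| embedding features; this is illustrative, not the author's
| pipeline):
|
|     from sklearn.calibration import CalibratedClassifierCV
|     from sklearn.datasets import make_classification
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.model_selection import train_test_split
|     from sklearn.multiclass import OneVsRestClassifier
|
|     # Synthetic stand-in for embedding vectors + class labels.
|     X, y = make_classification(n_samples=600, n_features=64,
|                                n_informative=16, n_classes=3,
|                                random_state=0)
|     X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
|
|     base = LogisticRegression(max_iter=1000,
|                               class_weight="balanced")
|     # Calibrate each one-vs-rest binary model before comparing.
|     clf = OneVsRestClassifier(
|         CalibratedClassifierCV(base, method="sigmoid", cv=3))
|     clf.fit(X_tr, y_tr)
|     probs = clf.predict_proba(X_te)  # per-class probabilities
|     preds = probs.argmax(axis=1)     # argmax over calibrated scores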
| minimaxir wrote:
| One of the now-underdiscussed features of embeddings is that you
| can indeed use any existing statistical modeling techniques on
| them out of the box, and as a bonus avoid the common NLP
| preprocessing nuances and pitfalls (e.g. stemming) entirely.
|
| This post is a good example of why going straight to LLM
| embeddings for NLP is a pragmatic first step, especially for
| long documents.
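|
| As a concrete sketch of "out of the box" (assuming
| sentence-transformers and scikit-learn; the model name, toy
| documents, and labels are all illustrative, not from the
| post):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.linear_model import LogisticRegression
|
|     docs = ["Quarterly revenue grew 12% year over year.",
|             "Install the driver before connecting the device.",
|             "The defendant filed a motion to dismiss.",
|             "Net income and operating margin both improved."]
|     labels = ["finance", "manual", "legal", "finance"]
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     X = model.encode(docs)   # no stemming or stopword fiddling
|     clf = LogisticRegression(max_iter=1000).fit(X, labels)
|     print(clf.predict(model.encode(["Gross margin fell in Q3."])))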
| guiomie wrote:
| Interesting and fun article! I've been experimenting with various
| LLMs/GenAI solutions to extract tabular data from PDFs with
| underwhelming results. It seems like they are good at extracting
| strings of text and summarizing (e.g. what was the total price?
| when was this printed?) but extracting reliably into a CSV has a
| decent margin of error.
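|
| For born-digital PDFs (not scans), a non-LLM baseline is
| sometimes more reliable for tables. A minimal sketch, assuming
| pdfplumber and pandas; the input file name is made up:
|
|     import pdfplumber
|     import pandas as pd
|
|     tables = []
|     with pdfplumber.open("invoice.pdf") as pdf:  # hypothetical file
|         for page in pdf.pages:
|             # Each table is a list of rows (lists of cell strings).
|             for rows in page.extract_tables():
|                 df = pd.DataFrame(rows[1:], columns=rows[0])
|                 tables.append(df)
|
|     for i, df in enumerate(tables):
|         df.to_csv(f"table_{i}.csv", index=False)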
| abhi_p wrote:
| Disclosure: I'm an employee.
|
| Give the Aryn partitioning service a shot:
| https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
|
| We recently released it and we've a few examples here:
| https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta...
| that show you how to turn the tabular data from the pdf into a
| pandas dataframe(which you can then turn into csv).
| layer8 wrote:
| I would have expected categories for product brochures and
| product manuals.
| autokad wrote:
| Would be interesting to see if they tried LDA (latent Dirichlet
| allocation) topics.
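|
| A quick way to try it, as a toy sketch with scikit-learn (the
| documents are placeholders; on the real corpus you would feed
| in the extracted PDF text):
|
|     from sklearn.decomposition import LatentDirichletAllocation
|     from sklearn.feature_extraction.text import CountVectorizer
|
|     docs = ["quarterly revenue and operating income statement",
|             "install the driver and reboot the device",
|             "the court granted the motion to dismiss",
|             "net income margin and free cash flow"]
|
|     vec = CountVectorizer(stop_words="english").fit(docs)
|     X = vec.transform(docs)
|     lda = LatentDirichletAllocation(n_components=2,
|                                     random_state=0).fit(X)
|
|     terms = vec.get_feature_names_out()
|     for k, topic in enumerate(lda.components_):
|         top = [terms[i] for i in topic.argsort()[-5:][::-1]]
|         print(f"topic {k}: {top}")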
| snats wrote:
| Hi! Author here, I wasn't expecting this to be at the top of HN,
| AMA
| ks2048 wrote:
| > _I don't have 8TB laying around, but we can be a bit more
| clever.... In particular I cared about a specific column
| called url. I really care about the urls because they
| essentially tell us a lot more from a website than what meats
| the eye._
|
| Am I correct that it is only using the URL of the PDF to do
| classification? Maybe still useful, but that's quite a
| different story than "classifying all the pdfs".
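|
| If so, conceptually it is something like this toy sketch,
| which never opens a single PDF (scikit-learn; the URLs and
| labels are made up, and this is not the author's actual
| pipeline):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     urls = ["example.org/10-k-annual-report-2020.pdf",
|             "example.com/router-ac1200-user-manual.pdf",
|             "example.net/court-filing-motion-to-dismiss.pdf",
|             "example.org/q3-earnings-press-release.pdf"]
|     labels = ["finance", "manual", "legal", "finance"]
|
|     # Character n-grams handle the run-together tokens in URLs.
|     clf = make_pipeline(
|         TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
|         LogisticRegression(max_iter=1000))
|     clf.fit(urls, labels)
|     print(clf.predict(["example.com/annual-report-fy2021.pdf"]))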
| xattt wrote:
| It's just classifying the URLs if that's the case.
|
| The legwork to classify PDFs is already done, and the
| authorship of the article can go to anyone who can get a grant
| for a $400 NewEgg order for an 8TB drive.
| gnewton77 wrote:
| Did some similar work with similar visualizations ~2009, on ~5.7M
| research articles (PDFs, private corpus) from scientific
| publishers Elsevier, Springer:
|
| Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal
| Mapping for Search Visualization in a Large Scale Article Digital
| Library. Second Workshop on Very Large Digital Libraries at the
| European Conference on Digital Libraries (ECDL) 2009.
| https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
|
| I am the first author.
| byteknight wrote:
| This seems like cool work but with a ton of "marketing hype
| speak" that immediately gets watered down by the first paragraph.
|
| Ordering of statements.
|
| 1. (Title) Classifying _all_ of the pdfs on the internet
|
| 2. (First Paragraph) Well not all, but all the PDFs in Common
| Crawl
|
| 3. (First Image) Well not all of them, but 500k of them.
|
| I am not knocking the project, but while categorizing 500k PDFs
| is something we couldn't necessarily do well a few years ago,
| this is far from "The internet's PDFs".
| 1-6 wrote:
| Overpromise with headline, underdeliver on details.
| muratsu wrote:
| I would have expected the finetuned model to perform much better.
| Would be curious to see the performance with other models
| mehulashah wrote:
| Classification is just a start. Wondering if it's worth doing
| something more -- like turning all of the text into Markdown or
| HTML? Would anyone find that interesting?
| Treesrule14 wrote:
| There are a lot of webcrawlers whose chief feature is turning
| the website into markdown. I don't quite understand what they
| are doing for me that's useful, since I can just do something
| like `markdownify(my_html)` or whatever. All this to say, I
| wouldn't find this useful, but clearly people do think this is
| a useful feature as part of an LLM pipeline.
| loa_in_ wrote:
| You don't want the footer or navigation in the output.
| Ideally you want the main content of the page, if it exists.
| How do you assign header level if they're only differentiated
| by CSS left-margin in a variety of units? How do you
| interpret documents that render properly but are hardly
| correct HTML?
| Treesrule14 wrote:
| Thanks. I guess none of that stuff seemed super useful to
| cut systematically, but I'm gonna run some tests.
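|
| A rough starting point for those tests, as a minimal sketch
| (assuming beautifulsoup4 and the markdownify package; the tag
| list to strip is a heuristic, not a standard):
|
|     from bs4 import BeautifulSoup
|     from markdownify import markdownify as md
|
|     html = ("<html><body><nav>Home / About</nav><main><h1>Title"
|             "</h1><p>Body text.</p></main><footer>(c) 2024"
|             "</footer></body></html>")
|
|     soup = BeautifulSoup(html, "html.parser")
|     # Drop navigation, footer, and other boilerplate elements.
|     for tag in soup.find_all(["nav", "footer", "aside",
|                               "script", "style"]):
|         tag.decompose()
|     main = soup.find("main") or soup.body  # prefer <main> if present
|     print(md(str(main)))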
| excalibur wrote:
| > How would you classify all the pdfs in the internet?
|
| Definitely as 'hot dog' or 'not a hot dog'.
___________________________________________________________________
(page generated 2024-08-19 23:00 UTC)