[HN Gopher] Digital Archivists: Protecting Public Data from Erasure
       ___________________________________________________________________
        
       Digital Archivists: Protecting Public Data from Erasure
        
       Author : rbanffy
       Score  : 127 points
       Date   : 2025-04-02 16:03 UTC (6 hours ago)
        
 (HTM) web link (spectrum.ieee.org)
 (TXT) w3m dump (spectrum.ieee.org)
        
       | Teever wrote:
       | I made this related submission[0] recently but it was flagged.
       | 
       | This stuff is very important to talk about so I hope that this
       | submission by rbanffy isn't also flagged.
       | 
       | [0] https://news.ycombinator.com/item?id=43543075
        
         | hsuduebc2 wrote:
         | I agree. I do not understand how this is perceived as a
         | political issue and thus got flagged.
         | 
         | Climate change is, for some reason, perceived as political too,
         | yet it does not get flagged as often.
        
         | donnachangstein wrote:
         | No it isn't. It's merely a cause du jour for data hoarders to
         | justify their hobby in light of this Chicken Little hysteria.
         | 
         | 30 years ago it was thought collecting every issue of magazines
         | like TV Guide was important. No one even knows what that is
         | anymore.
         | 
         | No one is ever going to look at 99% of this data. In the
         | meantime, send more hard drives for my NAS!!
        
           | dreamworld wrote:
           | It might be of some interest to cultural historians in the
           | future. But I think it makes more sense to take
           | sample+curated data. But in any case if we can afford it, eh
           | why not.
        
             | rbanffy wrote:
             | We don't know now what to curate for the future. We should
             | preserve as much of everything we can - we don't know what
             | will be important in 50, or 500 years.
             | 
             | Case in point: retrocomputing is my hobby. I buy, restore,
             | preserve, and use old computers. Most of them are home
             | computers, because business computers go directly from the
             | office to the recycling facility or the landfill. Unless
             | someone deliberately preserved, say, a Burroughs B-25
             | desktop, or a similar machine from Data General, they are gone.
        
               | Suppafly wrote:
               | My son is into retrocomputing, mostly using older
               | hardware I have from when I was younger, and we have a
               | stack of old Compaq desktops where you can't access the
               | BIOS because it requires a specific floppy that is nearly
               | impossible to find online. This is 486/pentium era stuff,
               | the older stuff is even harder to find.
        
           | peppermill wrote:
           | I think the data being discussed is quite a bit different
           | than old TV Guides...
        
             | NoMoreNicksLeft wrote:
             | I was, believe it if you wish, thinking about old TV guides
             | just this morning and wondering how one would even go about
             | archiving those. Most of the stumbling blocks for taking
             | apart the glued binding for scanning have been figured out,
             | of course, but for any given week there may have been as
             | many as 60 or 70 editions (for each television market, I
             | think). None of these have proper ISSN numbers as far as
             | I'm aware, and other than the listings they can be visually
             | indistinguishable. Then there is the challenge of finding
             | those, and not knowing whether this or that edition is
             | missing (from time to time, the company would create new
             | editions for new regions, or fold old ones back into some
             | other area), along with even parsing the content. Many of
             | these tv shows aren't on themoviedb or thetvdb, and if the
             | shows are, then there won't be episode listings (there were
             | 6000 Donahue talk show episodes, after all). On top of all
             | of that, you can't necessarily know what was on tv at a
             | given time and day, with federal government preemptions,
             | commercials, unreported last-minute rescheduling, etc.
             | 
             | But I can also see why people might want to keep more
             | interesting data, like when the Federal Cheese-Sniffing
             | Agency moved offices back in 1982 and they have meticulous
             | records of the 483 filing cabinets that had to be moved
             | from the original location to their new home in Furrytown,
             | Pennsylvania.
        
             | zorpner wrote:
             | I wonder if those would be useful in identifying the
             | potential contents of specific Marion Stokes tapes (my
             | understanding is that they're sorted, but are only labeled
             | with channel and date/time and are being archived slowly):
             | https://libwww.freelibrary.org/blog/post/5393
        
           | hermannj314 wrote:
           | My wife takes thousands of photos every year; when my
           | daughter was young she took even more.
           | 
           | When we were moving out of our apartment there was damage to
           | a door hinge that we never noticed when we moved in but that
           | had definitely been there from the onset of our two years of
           | living in that apartment.
           | 
           | Guess what? I had a photo from the day after we moved in of
           | that door hinge in a state of damage! Not because we took the
           | photo for that intention, but because my daughter was playing
           | in the hallway and my wife snapped a photo and it just
           | happened to capture the damage. Saved me several hundreds of
           | dollars in repair costs from my landlord.
           | 
           | You are right, 99% of the data will never be looked at. But
           | do you know what the 1% is today? I'm guessing you don't.
        
             | donnachangstein wrote:
             | Your example of personal family photos is in no way
             | comparable to storing terabytes of essentially unindexed
             | data about which one has no detailed knowledge, under
             | the notion that the government is somehow lighting a match
             | to everything, and they're going to save it.
             | 
             | The government doesn't delete _anything_. It might be moved
             | or inaccessible to the public but that data is _somewhere_
             | in perpetuity.
             | 
             | It's one of the most deranged larps I've ever seen, then
             | they pat each other on the back on BlueSky, desperately
             | wanting to be a part of something.
             | 
             | These people envision themselves as folk heroes when what
             | they really need to do is go outside and touch grass.
        
               | nancyminusone wrote:
               | If it's inaccessible to the public, it might as well be
               | deleted. What's the difference? If you can't get it, you
               | don't have it.
        
               | alnwlsn wrote:
               | Patently false.
               | https://www.archives.gov/personnel-records-center/fire-1973
        
           | squarefoot wrote:
           | Among the deleted data there was the police accountability
           | database. You probably won't have to deal with thugs now
           | feeling omnipotent and immune from prosecution because of
           | this.
           | 
           | https://www.police1.com/federal-law-enforcement/national-
           | law...
        
           | thowawatp302 wrote:
           | I've had the idea of recreating TV channels on my Plex server
           | by using TV Guide data from the late 90s and early 00s.
           | 
           | The insurmountable part of that project would be getting the
           | guide data.
           | 
           | You don't know what other people will want in the future.
        
       | badlibrarian wrote:
       | There's a lot of panic and overlap in the space; a way to
       | coordinate these efforts would be helpful.
       | 
       | Internet Archive et al. made noise and promises but told
       | volunteers to stop because they couldn't actually handle the
       | ingest.
       | 
       | https://www.reddit.com/r/Archiveteam/comments/1jbgycm/us_gov...
       | 
       | These folks made a notable effort.
       | 
       | https://webrecorder.net/blog/2025-03-25-govarchive-us-and-mi...
        
       | nla wrote:
       | Best thing I ever heard from the head of archives at the BBC:
       | 
       | Once you format shift, you will always be format shifting.
       | 
       | Keep your originals whenever you can.
        
       | dmillar wrote:
       | Many criminal records, petty or otherwise, are public record.
       | When archived, expunged or dismissed infractions never truly
       | become that. A traffic violation or other petty misdemeanor from
       | 20 years ago, that has been expunged from official record, can
       | show up on a background check because companies archive public
       | data. So, there is a flip side to this.
        
         | overfeed wrote:
         | Public data is incompatible with secrecy. Expunged records
         | still appear in newspaper archives if the local reporter on
         | the crime beat captured the proceedings. IMO, "expunged" means
         | removed from _official court records_ - not from the public
         | memory, including newspapers, archived websites, police
         | blotters and prosecutors' files.
        
       | Damogran6 wrote:
       | Hypothetically:
       | - Government leader says they're nuking data
       | - Mad rush to back up data through other means
       | - Government leader declares they've 'transferred the cost of
       |   maintaining data out of government, thus making for a smaller,
       |   more efficient, government'
       | 
       | I hate everything about this.
        
         | krunck wrote:
         | There is inherent inefficiency in government accountability
         | efforts. I'm ok with that.
        
         | riku_iki wrote:
         | In general it makes sense to shift this part to business: if
         | the data is valuable, there will be a market and services for
         | it. The problem is probably how fast they nuked it, without a
         | grace period.
        
           | tehjoker wrote:
           | I'm okay with data being hosted for free or cheap by the
           | government and not being price gouged for access to public
           | data.
        
       | mikrl wrote:
       | How does this relate to dox?
       | 
       | Let's say an individual posted identifying or incriminating
       | information online, inadvertently or intentionally, in a public
       | place.
       | 
       | Then a third party decides to store it, and possibly make it
       | accessible to others.
       | 
       | If the original self-doxxing user then pulled the original dox,
       | but was unable to scrub the rest, would that information still be
       | considered public, or would it be private? Was it ever truly
       | public? Or private for that matter?
        
         | sixothree wrote:
         | Which data set are you thinking this might apply to?
        
         | calebio wrote:
         | That's a really good question.
         | 
         | In my head, I'm imagining someone early in the morning posting
         | a flyer up on a bulletin board downtown.
         | 
         | Throughout the day many folks walked by and took photos of the
         | flyer with their cell phone.
         | 
         | At the end of the day, the original person came back and
         | removed the flyer.
         | 
         | IMO, at the time that the folks took the photo of the flyer,
         | that flyer was public information. It remains public
         | information even after the flyer is removed[0].
         | 
         | This isn't a great analogy of mine, and has plenty of holes,
         | but was interesting to me after I read your comment. I know it
         | was in the context of doxxing, but I think it's pretty
         | interesting philosophically.
         | 
         | I think something similar applies to photos taken of other
         | people in public spaces. Both the person who took the photo and
         | the subject of the photo are no longer in that physical public
         | space, but the actions took place within that space.
         | 
         | I think something similar applies to digital "public spaces".
         | But what does a public space even mean in the context of walled
         | gardens[1], etc.?
         | 
         | [0] you then run into the question of what happens if someone
         | posts non-public information, publicly?
         | 
         | [1] are digital walled garden communities that different from
         | physical communities that gate access, whether free or paid?
         | Whether information shared within those contexts is public or
         | private is an interesting thread as well.
        
         | ziddoap wrote:
         | If you intentionally post something publicly, it's public. Full
         | stop.
         | 
         | The tricky part is dealing with inadvertent or malicious (i.e.
         | by some other party) posting of private information to a public
         | space. That's really hard to deal with on multiple levels.
         | 
         | For one, the archives would retain the information and
         | scrubbing it is effectively impossible.
         | 
         | Secondly, legitimate things which _should_ remain public (i.e.
         | were posted publicly, are of public interest, etc.) can be
         | argued to have been inadvertently or maliciously posted. So you
         | need some way to moderate and create rulings for each
         | individual case, which quickly becomes untenable due to the
         | sheer volume of information being posted and the inordinate
         | amount of time required to investigate vs. post.
        
       | hsuduebc2 wrote:
       | I wonder: maybe blockchain would actually be a useful technology
       | for this?
        
       ___________________________________________________________________
       (page generated 2025-04-02 23:00 UTC)