[HN Gopher] Download the Entire Wikimedia Database
___________________________________________________________________
Download the Entire Wikimedia Database
Author : surround
Score : 49 points
Date : 2021-03-06 20:46 UTC (2 hours ago)
(HTM) web link (dumps.wikimedia.org)
(TXT) w3m dump (dumps.wikimedia.org)
| orblivion wrote:
| You can also get it in a user-friendly format with the
| application Kiwix (https://www.kiwix.org/) if that's your use
| case. PC, phone, or server. You get subsets of the data, and
| images are smaller to save space.
| MeinBlutIstBlau wrote:
| Kiwix use is still somewhat hit or miss when browsing. I'm not
| sure how it handles text parsing, but it either takes forever
| or doesn't return results, making it sort of unusable.
|
| But it's still a fantastic and incredible piece of software.
| When it gets to the point where I can portably keep the full
| 60 GB ZIM file seamlessly, it will change simple computing for
| low-bandwidth areas. Imagine the uses as a portable and
| versatile database that could accept JSON, HTML/CSS, and other
| data to make your own offline encyclopedias!
| kregasaurusrex wrote:
| It's useful to browse on a phone if you have limited mobile
| data, and the text-only English Wikipedia fits onto a modern
| micro SD card.
| karlicoss wrote:
| Wouldn't it be cool if Steam supported distributing offline
| Wikipedia database? It's just a few gigs (depending on
| languages/images/etc, but it fits the DLC model perfectly), and
| it already uses bittorrent.
| dwheeler wrote:
| I'm so glad the download-entire-wikipedia function continues to
| exist. That will help counter the "lost the entire library
| problem" from the city of Alexandria. To be fair, Wikipedia only
| has summaries, not the detailed material, but it's still
| important.
| tablespoon wrote:
| > I'm so glad the download-entire-wikipedia function continues
| to exist. That will help counter the "lost the entire library
| problem" from the city of Alexandria. To be fair, Wikipedia
| only has summaries, not the detailed material, but it's still
| important.
|
| Personally, I think Wikipedia's quality is too poor for that.
| Plus, it's digital, so when our civilization is at risk of
| "[losing] the entire library" it probably would have already
| lost the ability to maintain the computer systems to access
| Wikipedia dumps.
| smoldesu wrote:
| The content on Wikipedia is really not that bad. Obviously a
| Wikipedia article will never be the final say on any specific
| subject, but it tends to do a pretty good job of aggregating
| sources and condensing them into a reader-friendly synopsis.
| This data is super valuable, if only for the sources alone.
| buzzerbetrayed wrote:
| > so when our civilization is at risk of "[losing] the entire
| library" it probably would have already lost the ability to
| maintain the computer systems to access Wikipedia dumps
|
| But as long as it continues to exist, some future
| civilization could figure out how to read the data again,
| eventually. Just like we eventually discovered how to read
| ancient languages that were once forgotten.
| tablespoon wrote:
| > But as long as it continues to exist, some future
| civilization could figure out how to read the data again,
| eventually. Just like we eventually discovered how to read
| ancient languages that were once forgotten.
|
| Eh, I think you're vastly underestimating how difficult
| that would be.
|
| 1. The media would have to last hundreds of years at least,
| when it's _hoped_ modern archival media can last _maybe_
| fifty.
|
| 2. Even assuming the media did last, the new civilization
| would have to reverse engineer encoding on top of encoding
| on top of encoding (e.g. disk encoding, complex
| filesystems, file formats, character encodings). Our
| civilization _already_ has trouble reading some old file
| formats.
|
| It took the Rosetta Stone to figure out how to read the
| encoding of Egyptian hieroglyphics, even though that language
| was still alive in the form of Coptic.
|
| 3. Then you're dealing with the probability that the hard
| disks the future archeologists find will even have a
| Wikipedia dump on them. That probability will be very
| small, given very few people will download these dumps.
| LeoPanthera wrote:
| Kiwix offers downloadable material (including the _full_
| Wikipedia), in a format specifically designed for offline
| browsing.
|
| https://wiki.kiwix.org/wiki/Content
|
| Their Wikipedia bundle wasn't being updated for a while and had
| fallen out of date, but that seems to have been fixed now.
| qwertywert_ wrote:
| Can it also download sources, if available? That would be cool.
| porphyra wrote:
| It is pretty awesome that there are people like /r/datahoarder
| that are obsessed with backing up the collective knowledge of
| humanity.
| aarchi wrote:
| There's also Archive Team, focused on preserving at-risk
| sites before being taken offline.
|
| https://wiki.archiveteam.org/
| capableweb wrote:
| I'm not familiar with r/datahoarder, but if the name bears
| any significance, it seems they are mostly centered on
| hoarding data, which means just digital, I guess? If so, I
| would much rather promote efforts like the Internet Archive
| that back up all kinds of things, not just digital data.
| KMnO4 wrote:
| What does the Internet Archive back up that isn't
| represented by 1s and 0s?
| LeoPanthera wrote:
| The Physical Archive.
| https://en.wikipedia.org/wiki/Internet_Archive#Physical_medi...
| cguess wrote:
| A ton of 35mm and 16mm film reels, vinyl, and even wax
| recordings, physical books, and a lot more. They make
| digital copies of them, but they also archive the
| physical versions. Here's a selection of the movies:
| https://archive.org/details/moviesandfilms?tab=about
| bawolff wrote:
| Well not the entire db, just the public parts. User passwords are
| not included ;)
| dudus wrote:
| Wikipedia is always bugging me about donations, and yet here
| is a feature they could charge for, or at least use to hint at
| donating. It would be perfectly acceptable to charge here since
| abuse of this can rack up quite a bill. Maybe they don't pay as
| much as I do for outbound traffic on AWS, but still.
| capableweb wrote:
| > would be perfectly acceptable to charge here since abuse of
| this can rack up quite a bill
|
| Not according to Wikipedia. Wikipedia would much rather beg
| people from all corners of the world to donate than restrict
| access to their data. That's what a good, honest, and
| well-meaning foundation does.
|
| And yes, no sane person shuffling a lot of data around is using
| AWS because of their awful bandwidth pricing, Wikipedia
| included.
| morsch wrote:
| I guess a hint would be fine, but charging for access, even
| bulk access, feels quite contrary to the spirit of the project.
| It excludes huge ranges of people who cannot afford it or don't
| have access to Internet payment methods.
|
| I suspect the traffic caused by this is minuscule compared to
| the overall traffic, anyway. But that's just a guess.
| libraryofalex wrote:
| One of the only things you can do to ensure lasting democracy
| today is to download the pages, with complete history, put them
| on a properly labelled USB drive or microSD card that you keep
| offline, and just forget about it. You can do this as a consumer,
| it's easy. There's no harm in it, it's not some kind of private
| data such as personal photos or documents. If you end up
| forgetting or losing track of it, it really is no big deal. You
| just decided to download it when you saw it on hacker news back
| in 2021, right?
|
| I say this is one of the only things you can do to ensure
| lasting democracy because it is physically possible that at
| some point, through some mechanism, the online version simply
| stops informing the public on some important public issue,
| whereas the history as you can download it today still does. I
| won't speculate about what the mechanism or the subject might
| be, though.
|
| At that point in a physical sense you could consult your offline
| copy on an airgapped PC or future equivalent and I think it would
| be impossible for any group of any kind to even know you were
| doing that let alone stop it.
|
| How you might get the word out is another question, but having
| this personal capability is easy for the people here, as
| technical users and simple consumers. Indeed, the entire
| Internet was set up as a distributed network in case of nuclear
| attack, so its topology is set up for you to do this easily
| today.
|
| It's a click and a cheap flash drive or slightly more expensive
| microSD card away. You can take this step in less than 20 active
| minutes of your time and for less than $50 if you go with an
| external spinning disk drive (such as 1 terabyte) or $200 or so
| if you go with a microSD card. It doesn't really matter if the
| file ultimately fails; this is not a critical backup, just a
| nice-to-have. You could write the file's checksum onto the drive
| in marker so you can tell whether it's still correct later (as
| opposed to having bit errors).
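
A minimal sketch of the checksum step described above, assuming SHA-256 as the hash (the comment does not name one) and Python's standard `hashlib`. The function names are illustrative only, not part of any Wikimedia tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte dump never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_hex):
    """Compare the file's current digest with the one written down earlier."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

Run `sha256_of_file` once right after the download and copy the result onto the drive's label. Years later, `is_intact` with that recorded value tells you whether any bit has changed, though not which one.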
|
| Maybe there is some file type that has a bit of redundancy
| (checksums) for long-term storage, since due to the large amount
| (several hundred gigabytes) I wouldn't be all that surprised if
| a few bits flipped over the course of several years in cold
| storage. But I don't know what kind of file type has any sort of
| redundancy or parity built into it that is supposed to protect
| against this. (Does anyone know?) Most likely the hash just
| wouldn't match what you wrote in pen on it, but it would still
| be usable.
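
To illustrate what "redundancy or parity" could mean here, below is a toy sketch of XOR parity over equal-length blocks. It can rebuild exactly one damaged block, and only if you already know which block is bad (for example from per-block checksums). It is a simplified stand-in, not a real archival format; formats such as PAR2 use Reed-Solomon codes, which can repair more than one block.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(blocks):
    """XOR all equal-length blocks together into a single parity block."""
    parity = bytes(len(blocks[0]))  # starts as all zero bytes
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity


def recover_block(blocks, parity, missing_index):
    """Rebuild the block at missing_index from the other blocks and the parity."""
    rebuilt = parity
    for i, block in enumerate(blocks):
        if i != missing_index:
            rebuilt = xor_blocks(rebuilt, block)
    return rebuilt
```

The cost is one extra block of storage per group of blocks, which is why parity data is usually kept as a small side file next to the archive.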
|
| Regarding choice of spinning disk or microsd card: I guess it's
| in the realm of what's possible in a physical sense that at some
| point people would have their personal property rummaged through
| by some group and a hard drive is pretty obvious and could be
| stolen or removed for that reason. (In a physical sense, not
| speculating about social or political developments that might
| lead to that.)
|
| So for this reason it would perhaps be best to put it on a
| microSD card, even though it is quite a bit more expensive. I
| guess that once written, a microSD card decays from bit rot
| within a few years if not used at all.[1] I don't know about
| spinning media, but I guess it lasts about 5-10 years at
| least.[2]
|
| You could put the microsd card under a postage stamp for example
| and put an important unrelated document into the envelope, which
| you would expect to keep for many years. Of course you could
| always end up accidentally discarding your envelope (while
| retaining its contents) but that risk shouldn't matter too much.
| In a physical sense it is possible for groups to x-ray all
| paperwork (such as envelopes as I just suggested) and a microsd
| card's electrical contacts are pretty obvious in an x-ray. (It
| looks like this [3]). I don't have any suggestion that works
| against this attack, which is within the realm of what's possible
| according to the laws of physics.
|
| I'm not speculating on what social or political developments
| might possibly make anything like this necessary at some point in
| the future, but we still live in a world governed by the laws of
| physics so as technical professionals you have a huge leg up on
| most of the world. Spending $50 doing this today might save
| democracy tomorrow. You could also leave it as a time capsule;
| however, the storage longevity is not that long (between 5 and
| 20 years, I guess), and in a physical sense a time capsule is
| not particularly secure and would require instructions for
| someone else to figure out, so it's not great in that sense.
|
| So in terms of what you can do today, I would suggest just
| getting an external 1 terabyte USB drive ($50), downloading the
| dump together with history (20 active minutes), writing the
| checksum onto it in marker, and just putting it somewhere.
| Obviously this small $50 investment is one you would hope never
| to have to use, but who knows, you might go down in history as
| the one who saved some small part of the world. Though,
| obviously, not in Wikipedia history.
|
| [1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-c...
|
| [2] https://serverfault.com/questions/986911/how-long-will-unuse...
|
| [3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd...
| nayuki wrote:
| Back in 2014 I computed the PageRanks within English Wikipedia,
| thanks to their database dump.
| https://www.nayuki.io/page/computing-wikipedias-internal-pag...
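
For readers unfamiliar with the technique, here is a minimal sketch of PageRank by power iteration on a tiny link graph. It is not the linked article's implementation. The damping factor of 0.85, the iteration count, and the page names are illustrative assumptions, and a real run over a Wikipedia dump would first need to extract the link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank; links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page wiki, made up for illustration.
toy_links = {
    "Main_Page": ["Physics", "History"],
    "Physics": ["Main_Page"],
    "History": ["Main_Page", "Physics"],
    "Stub": ["Main_Page"],
}
toy_ranks = pagerank(toy_links)
```

In this toy graph every other page links to `Main_Page`, so it ends up with the highest rank, while `Stub`, which nothing links to, ends up with the lowest.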
| crazygringo wrote:
| That's intriguing.
|
| Curious if you ever compared how PageRanks correlate to
| traffic? (They make their per-page traffic available too.)
|
| It would be interesting to see the largest disparities --
| super-popular pages in visits but which don't have nearly as
| many internal Wikipedia links to them, versus unpopular pages
| but that have tons of internal Wikipedia links to them.
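
One way to look for such disparities, sketched under the assumption that you already have a PageRank score and a view count per page: compare each page's position in the two orderings. The function names are hypothetical, and any input data would have to come from your own PageRank run and the published traffic statistics.

```python
def rank_positions(scores):
    """Map each page to its position in descending score order (0 = highest)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {page: pos for pos, page in enumerate(ordered)}


def largest_disparities(pagerank_scores, view_counts, top=3):
    """Return (gap, page) pairs with the largest position gaps first.

    gap = PageRank position - traffic position, so a positive gap means
    the page is more popular with readers than its internal links suggest,
    and a negative gap means it is heavily linked but rarely visited.
    """
    common = set(pagerank_scores) & set(view_counts)
    pr_pos = rank_positions({p: pagerank_scores[p] for p in common})
    tv_pos = rank_positions({p: view_counts[p] for p in common})
    gaps = [(pr_pos[p] - tv_pos[p], p) for p in common]
    # Largest absolute gap first; page name breaks ties between equal gaps.
    gaps.sort(key=lambda g: (-abs(g[0]), g[1]))
    return gaps[:top]
```

Pages with tied scores get an arbitrary relative order in this sketch, so a real analysis might prefer a rank correlation measure that handles ties explicitly.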
| tomaszs wrote:
| What page had the highest PageRank?
| vinger wrote:
| The homepage.
| tomaszs wrote:
| Cool. Were there any surprising results among the top ranks?
| codezero wrote:
| Would be interesting to see the results if you compared that
| rank to the most viewed page (maybe that is the homepage
| though).
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)