[HN Gopher] Data-Mining Wikipedia for Fun and Profit
___________________________________________________________________
Data-Mining Wikipedia for Fun and Profit
Author : billpg
Score : 150 points
Date : 2021-08-19 14:17 UTC (8 hours ago)
(HTM) web link (billpg.com)
(TXT) w3m dump (billpg.com)
| permo-w wrote:
| >The graph was interesting but this wasn't the primary objective
| of this exercise. I wanted to write "He is the n-times great-
| father of his current successor Queen Elizabeth." on King
| Alfred's Wikipedia page.
|
| Wouldn't that contravene Wikipedia's rules on original research?
| barbinbrad wrote:
| For anyone who is interested: I used to work with a guy named
| Richard Wang, who indexed Wikipedia as his training data set in
| order to do named entity recognition. He'd be a good person to
 | talk to for anyone pursuing this.
|
| Here's a demo: https://www.youtube.com/watch?v=SyhaxCjrZFw
| mrVentures wrote:
 | Lol I love that it was edited out right away. I probably agree
 | with that decision, but this was a really cool project. I have
 | followed a few YouTube channels which make graphic visualizations
 | of generations, and I think they would really appreciate it if
 | you shared this tech with them.
| lizhang wrote:
| Where's the profit?
| ad404b8a372f2b9 wrote:
 | Good question. As a data hoarder, mining has always been fun,
 | but I was interested in seeing the profit. I've yet to find
 | any monetary value in the troves of data I accumulate.
| billpg wrote:
| For three minutes, something I wrote was cited on Wikipedia.
| ygmelnikova wrote:
| If the task is to answer the question "Does a royal lineage exist
| between Alfred the Great and Queen Elizabeth?" then this works
| fine, but the results only show 1 of possibly 100s of such paths.
| If you're of European descent, it's 100% possible for you to find
| a similar connection to Alfred the Great as well :)
| [deleted]
| papercrane wrote:
 | For this particular problem I wonder if wikidata would be
 | better than scraping the HTML.
| billpg wrote:
| (Author here.)
|
| Perhaps, but I already know how to scrape HTML and I know the
| data I wanted to pull out was in there. I have no idea how to
| query wikidata and it could have ended up being a blind alley.
|
| Also, it was only my reading your comment just now that told me
| wikidata was even a thing.
| papercrane wrote:
| Hard to use it if you don't know about it!
|
| I only thought of it myself because you mentioned the problem
| with deducing which parent is the mother and which is the
| father, and I remember in wikidata those are separate fields.
| mistrial9 wrote:
| earth calling ivory tower, earth calling ivory tower
|
| superior RDF triples are like martian language to millions of
| humans
|
| over
| bawolff wrote:
 | Are you saying parsing HTML is easier than parsing RDF
 | triples? Because I don't know about that.
|
| In real life you use tools for both.
| udp wrote:
| The ivory tower is working on it:
| https://github.com/w3c/EasierRDF
| liminal wrote:
| I just looked at the linked repo. Have they made any
| progress? It looks like _very_ early days.
| tasogare wrote:
| RDF has to be the best and saddest example of sunk cost
| fallacy. Instead of redirecting their efforts to a more
| general graph model which has actual hype and use by
 | developers, its cultists are doubling down on their
 | abstruse technology stack, making it ever more
| complicated while still not addressing any of its
| fundamental problems.
| shock-value wrote:
| Generally Wikidata would definitely be the way to go here,
| though I just now tried to retrace your graph in Wikidata and
| it seems to be missing at least one relation (Ada of Holland
| has no children listed --
| https://www.wikidata.org/wiki/Q16156475).
| hutzlibu wrote:
 | Yup, I had the same situation some months ago, even though
 | I knew wikidata was a thing.
 |
 | I know javascript and had the pages at hand.
 |
 | I looked at wikidata and some pages about it, but still had no
 | clear idea how to use it and no motivation to dig into it,
 | because JS just worked with a small custom script to
 | retrieve some pages and data.
| 3pt14159 wrote:
| Don't worry about the haters. You needed a paltry amount of
| data and you got it with the tools you had and knew.
|
 | When I was analyzing Wikipedia about 10 years ago for fun
 | and, later, actual profit, I did the responsible thing and
| downloaded one of their megadumps because I needed every
| English page. That's what people here are concerned about,
| but it doesn't matter for your use case.
| tcgv wrote:
| > Don't worry about the haters
|
| To be fair, the original comment just made a valid
| observation in a casual way, he didn't criticize the
| approach of the OP, nor was he impolite.
|
| But I know it's pretty common to see haters nitpicking
| things all around ;)
| powera wrote:
| Definitely. There's even a public query service (
| https://query.wikidata.org/ ) which can do a lot of this
| (though SQL is not good with searching for chains).
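 |
 | As a rough illustration (the Wikidata item IDs below are
 | assumptions worth double-checking on wikidata.org), a SPARQL
 | property path can ask whether any chain of parents connects the
 | two people, here sent to that endpoint from Python:
 |
 |     import requests
 |
 |     # Assumed item IDs: Q9682 = Elizabeth II, Q83476 = Alfred the Great.
 |     # Properties: P22 = father, P25 = mother.
 |     query = """
 |     ASK {
 |       wd:Q9682 (wdt:P22|wdt:P25)+ wd:Q83476 .
 |     }
 |     """
 |
 |     resp = requests.get(
 |         "https://query.wikidata.org/sparql",
 |         params={"query": query, "format": "json"},
 |         headers={"User-Agent": "lineage-demo/0.1 (example@example.com)"},
 |         timeout=60,
 |     )
 |     resp.raise_for_status()
 |     print(resp.json()["boolean"])  # True if a parent chain exists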
| [deleted]
| brodo wrote:
| It's so sad that almost nobody knows or uses SPARQL...
| guidovranken wrote:
 | At least the last few times I checked, the WikiData SPARQL
 | server was extremely slow, frequently timing out.
| lacksconfidence wrote:
 | Seems to depend on the query. I can issue straightforward
 | queries that visit a few hundred thousand triples easily.
 | But when I write a query that visits tens of millions of
 | triples, it times out.
| epaulson wrote:
| There's some mix between "it's slow" and "it sets its
| timeout threshold too low" - a lot of queries would be OK
| if they just had a bit more time to run. And unfortunately,
| the time wasted on the badput of the killed queries just
| slows down everyone else. (They really need a batch queue)
|
| The Wikidata folks are well aware of the limits on their
| SPARQL service. They just posted an update the other day:
|
| https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.
| w...
| devbas wrote:
| Because the syntax is relatively complex and it is difficult
| to judge which endpoints and definitions to use.
| ZeroGravitas wrote:
 | I learned SPARQL recently, and would agree it's complicated
 | to get info out of Wikidata.
 |
 | However, having read the article, they didn't have an easy
 | time with scraping Wikipedia either.
|
| So I'd probably still recommend people look into wikidata
| and SPARQL if they want to do this kind of thing.
|
 | There are a few tools that generate queries for you, and some
| cli tools as well:
|
| https://github.com/maxlath/wikibase-cli#readme
|
| It makes Wikipedia better too, in a virtuous cycle, with
| some infoboxes like those that he scraped being converted
| to be automatically populated from wikidata.
| matkoniecz wrote:
 | In my experience SPARQL is really hard to use, and Wikidata
 | data quality is really low, to the point that one of my
 | larger projects is just trying to filter the data to make it
 | usable for my use case.
 |
 | Yes, I made some improvements ( https://www.wikidata.org/wiki
 | /Special:Contributions/Mateusz_... ).
 |
 | But overall I would not encourage using it; if I had known
 | how much work it takes to get usable data, I would not have
 | bothered with it.
 |
 | Queries as simple as "is this entry describing an event, a
 | bridge, or neither" require extreme effort to get right in a
 | reliable way, including maintaining a private list of patches
 | and exemptions.
 |
 | And bots creating millions of known duplicate entries and
 | expecting people to resolve this manually is quite
 | discouraging. Creating Wikidata entries for Cebuano Wikipedia
 | 'articles' was accepted, despite the Cebuano botpedia being
 | nearly completely bot-generated.
 |
 | And that is without even getting into the unclear legal
 | status. Yes, they can legally import databases covered by
 | database rights - but they should either make clear that
 | Wikidata is a legal quagmire in the EU or forbid such imports.
 | But the Wikidata community did neither.
| pphysch wrote:
| Why should more people know SPARQL?
| squeaky-clean wrote:
 | Like people in the other comments have said, if you've actually
 | tried getting data from wikidata/wikipedia, you very quickly
| learn the HTML is much easier to parse than the results
| wikidata gives you.
| dalf wrote:
| Yes:
|
| * http://www.entitree.com/en/family_tree/Elizabeth_II
|
| * https://family.toolforge.org/ancestors.php?q=Q187114
|
| Tools found on this page:
| https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data/...
|
| ---
|
| Some SPARQL queries:
| https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
|
| ---
|
 | Off topic: I wish Wikipedia would provide an API to get the
| infoboxes (made using Lua or wikidata).
| zaik wrote:
| There is a tool to get infobox data from Wikipedia into
| Wikidata: https://pltools.toolforge.org/harvesttemplates/
|
| The easily parsable Infobox data can probably already be
| found in Wikidata (assuming there is a property).
| ahP7Deew wrote:
| Has anyone found an easy way to expand their templates without
| using their whole stack? I tried getting Lua templates working
| from Python but didn't get very far...
| kemayo wrote:
| Parsoid[1] is what you'd want for that, most likely. It's the
| new wikitext parser that MediaWiki is gradually switching over
| to, but it has the virtue of being usable entirely outside of
| MediaWiki if you need to.
|
| [1]: https://www.mediawiki.org/wiki/Parsoid
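 |
 | If running Parsoid itself is more than you need, Wikimedia also
 | serves Parsoid-rendered HTML (templates already expanded) over
 | its public REST API; a minimal sketch in Python:
 |
 |     import requests
 |
 |     title = "Alfred_the_Great"  # example page
 |     resp = requests.get(
 |         f"https://en.wikipedia.org/api/rest_v1/page/html/{title}",
 |         headers={"User-Agent": "template-demo/0.1 (example@example.com)"},
 |         timeout=30,
 |     )
 |     resp.raise_for_status()
 |     html = resp.text  # rendered HTML, no template expansion needed locally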
| billpg wrote:
| (Author here)
|
| In an act of divine justice, my website is down.
|
| https://web.archive.org/web/20210711201037/https://billpg.co...
|
| (I'll send you a donation. Thank you!)
| humanistbot wrote:
| Please don't scrape raw HTML from Wikipedia. They do a lot of
| work to make their content accessible in so many machine-readable
| formats, from the raw XML dumps (https://dumps.wikimedia.org) to
| the fully-featured API with a nice sandbox
| (https://en.wikipedia.org/wiki/Special:ApiSandbox) and Wikidata
| (https://wikidata.org).
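 |
 | For reference, pulling a page's wikitext through that API takes
 | only a few lines; a minimal sketch (the response shape assumes
 | formatversion=2):
 |
 |     import requests
 |
 |     resp = requests.get(
 |         "https://en.wikipedia.org/w/api.php",
 |         params={
 |             "action": "query",
 |             "prop": "revisions",
 |             "rvprop": "content",
 |             "rvslots": "main",
 |             "titles": "Alfred the Great",
 |             "format": "json",
 |             "formatversion": "2",
 |         },
 |         headers={"User-Agent": "api-demo/0.1 (example@example.com)"},
 |         timeout=30,
 |     )
 |     resp.raise_for_status()
 |     page = resp.json()["query"]["pages"][0]
 |     wikitext = page["revisions"][0]["slots"]["main"]["content"]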
| whall6 wrote:
| Genuine question from a non-programmer: why? Is it because the
| volume of requests increases load on the servers/costs?
| michaelbuckbee wrote:
| That's part of it, but also it's typically much more
| difficult and there's an element of "why are you making this
| so much harder on yourself".
| billpg wrote:
| (Author of original article here.)
|
 | That's the great thing about HtmlAgilityPack: extracting
 | data from HTML is really easy. I might even say easier
 | than if I had the page in some table-based data system.
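 |
 | (For comparison, a rough Python sketch of the same idea,
 | assuming the usual "infobox" CSS class on the table:)
 |
 |     import requests
 |     from bs4 import BeautifulSoup
 |
 |     html = requests.get(
 |         "https://en.wikipedia.org/wiki/Alfred_the_Great",
 |         headers={"User-Agent": "infobox-demo/0.1 (example@example.com)"},
 |         timeout=30,
 |     ).text
 |
 |     infobox = BeautifulSoup(html, "html.parser").select_one("table.infobox")
 |
 |     # Print label/value pairs from rows with both a header and a data cell.
 |     for row in infobox.select("tr"):
 |         label, value = row.find("th"), row.find("td")
 |         if label and value:
 |             print(label.get_text(" ", strip=True), "=",
 |                   value.get_text(" ", strip=True))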
| SlimyHog wrote:
| The HTML is more volatile and subject to change than
| other sources though
| FalconSensei wrote:
| Don't remember the last time wikipedia changed the
| infobox though
| jcun4128 wrote:
 | Can make it even harder: use Puppeteer to take screenshots,
 | then pass them to an OCR engine to get the text.
| thechao wrote:
| https://xkcd.com/378/
| matkoniecz wrote:
 | Scraping, especially on a large scale, can put a noticeable
 | strain on servers.
 |
 | Bulk downloads (database dumps) are much cheaper to serve for
 | someone crawling millions of pages.
 |
 | It gets even more significant if generating a reply is
 | resource-intensive (not sure whether Wikipedia qualifies for
 | that, but complex templates may cause it).
| nonameiguess wrote:
| Unlike APIs, html class/tag names or whatever provide no
| stability guarantees. The site owner can break your parser
| whenever they want for any reason. They _can_ do that with an
 | API, but usually won't since some guarantee of stability is
| the point of an API.
| billpg wrote:
| True, but the analysis was done on files downloaded over
| the span of two or three days. If someone had decided to
| change the CSS class of an infobox during that time, I'd
| have noticed, investigated and adjusted my code
| appropriately.
| traceroute66 wrote:
| IANAL but since the pages are published under Creative Commons
| Attribution-ShareAlike, if someone wishes to collect the text
| on the basis of the HTML version then there's not much you can
| do about it.
|
| Wikimedia no doubt have caching, CDNs and all that jazz in
 | place, so the likely impact on infrastructure is probably de
 | minimis in the grand scheme of things (compared with the
 | thousands or millions of humans who visit the site every second).
| learc83 wrote:
| >IANAL but since the pages are published under Creative
| Commons Attribution-ShareAlike, if someone wishes to collect
| the text on the basis of the HTML version then there's not
| much you can do about it.
|
 | They said _please_ don't, not "don't do it or they'll sue
 | you."
|
| But content license and site terms of use are different
| things.
|
| From their terms of use you aren't allowed to
|
| > [Disrupt] the services by placing an undue burden on a
| Project website or the networks or servers connected with a
| Project website;
|
| Wikipedia is also well within their rights to implement
| scraping countermeasures.
| traceroute66 wrote:
| > [Disrupt] the services by placing an undue burden on a
| Project website or the networks or servers connected with a
| Project website;
|
 | Two things:
 |
 | 1) The English Wikipedia *alone* gets 250 million page views
 | per day! So you would have to be doing an awful lot to cause
 | "undue burden".
 |
 | 2) The Wikipedia robots.txt page openly implies that crawling
 | (and therefore scraping) is acceptable *as long as* you do it
 | in a rate-limited fashion, e.g.:
 |
 | > Friendly, low-speed bots are welcome viewing article
 | pages, but not dynamically-generated pages please.
 |
 | > There are a lot of pages on this site, and there are some
 | misbehaved spiders out there that go _way_ too fast.
 |
 | > Sorry, wget in its recursive mode is a frequent problem.
 | Please read the man page and use it properly; there is a
 | --wait option you can use to set the delay between hits,
 | for instance.
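 |
 | (For what it's worth, the "friendly, low-speed" behaviour it
 | asks for is only a few lines of code; a minimal sketch with a
 | made-up User-Agent and page list:)
 |
 |     import time
 |     import requests
 |
 |     session = requests.Session()
 |     session.headers["User-Agent"] = "polite-crawler/0.1 (example@example.com)"
 |
 |     for title in ["Alfred_the_Great", "Elizabeth_II"]:
 |         resp = session.get(f"https://en.wikipedia.org/wiki/{title}",
 |                            timeout=30)
 |         resp.raise_for_status()
 |         # ... save resp.text to a local cache so no page is fetched twice ...
 |         time.sleep(1)  # the same courtesy wget's --wait option provides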
| learc83 wrote:
| 1. You'd be surprised what kind of traffic scrapers can
| generate. I've seen scraping companies employing botnets
| to get around rate limiting that could easily cost enough
| in extra server fees to cause an "undue burden".
|
 | At a previous company we had the exact problem: we
 | published all of our content as machine-readable XML, but
 | we had scrapers costing us money by insisting on using
 | our search interface to access our content.
 |
 | 2. No one is going to jail for scraping a few thousand or
 | even a few million pages, but just because low-speed web
 | crawlers are allowed to index the site doesn't mean
 | scraping for every possible use is permitted.
| _jal wrote:
| "Who's gonna stop me" is kind of a crappy attitude to take
| with a cooperative project like Wikipedia.
|
| I mean, sure, you can do a lot of things you shouldn't with
| freely available services. There's even an economics term
| that describes this: the Tragedy of the Commons.
|
 | Individual fish poachers' hauls are also, individually, de
 | minimis.
| cldellow wrote:
| The infoboxes, which is what this guy is scraping, are much
| easier to scrape from the HTML than from the XML dumps.
|
| The reason is that the dumps just have pointers to templates,
| and you need to understand quite a bit about Wikipedia's
| bespoke rendering system to know how to fully realize them (or
| use a constantly-evolving library like wtf_wikipedia [1] to
| parse them).
|
| The rendered HTML, on the other hand, is designed for humans,
| and so what you see is what you get.
|
| [1]: https://github.com/spencermountain/wtf_wikipedia
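 |
 | (On the Python side, mwparserfromhell plays a similar role for
 | the dumps: it pulls template parameters out of raw wikitext,
 | though it leaves nested templates unexpanded. A minimal sketch
 | with abridged example wikitext:)
 |
 |     import mwparserfromhell
 |
 |     wikitext = """
 |     {{Infobox royalty
 |     | name   = Alfred the Great
 |     | father = [[Aethelwulf, King of Wessex|Aethelwulf]]
 |     | mother = [[Osburh]]
 |     }}
 |     """
 |
 |     for template in mwparserfromhell.parse(wikitext).filter_templates():
 |         if str(template.name).strip().lower().startswith("infobox"):
 |             for param in template.params:
 |                 print(str(param.name).strip(), "=",
 |                       param.value.strip_code().strip())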
| lou1306 wrote:
| Still, I guess you could get the dumps and do a local
| Wikimedia setup based on them, and then crawl _that_ instead?
| cldellow wrote:
| You could, and if he was doing this on the entire corpus
| that'd be the responsible thing to do.
|
| But, his project really was very reasonable:
|
| - it fetched ~2,400 pages
|
| - he cached them after first fetch
|
| - Wikipedia aggressively caches anonymous page views (eg
| the Queen Elizabeth page has a cache age of 82,000 seconds)
|
| English Wikipedia does about 250,000,000 pageviews/day.
| This guy's use was 0.001% of traffic on that day.
|
| I get the slippery slope arguments, but to me, it just
| doesn't apply. As someone who has donated $1,000 to
| Wikipedia in the past, I'm totally happy to have those
| funds spent supporting use cases like this, rather than
| demanding that people who want to benefit from Wikipedia be
| able to set up a MySQL server, spend hours doing the
| import, install and configure a PHP server, etc, etc.
| habibur wrote:
| > This guy's use was 0.001% of traffic on that day
|
 | For one person pulling from one of the most popular sites
 | on the web, that is really quite a lot.
| cldellow wrote:
| He was probably one of the biggest users that day, so
| that makes sense.
|
| The 2,400 pages, assuming a 50 KB average gzipped size,
| equate to 120 MB of transfer. I'm assuming CPU usage is
| negligible due to CDN caching, and so bandwidth is the
| main cost. 120 MB is orders of magnitude less transfer
| than the 18.5 GB dump.
|
| Instead of the dumps, he could have used the API -- but
| would that have significantly changed the costs to the
| Wikimedia foundation? I think probably not. In my
| experience, the happy path (serving anonymous HTML) is
| going to be aggressively optimized for costs, eg caching,
| CDNs, negotiated bandwidth discounts.
|
| If we accept that these kinds of projects are permissible
| (which no one seems to be debating, just the manner in
| which he did the project!), I think the way this guy went
| about doing it was not actually as bad as people are
| making it out to be.
| NicoJuicy wrote:
| I don't think I agree. Cache has a cost too.
|
| In theory, you'd want to cache more popular pages and let
| the rarely visited ones go through the uncached flow.
|
 | Crawling isn't user behavior, so the odds are that a
 | large percentage of the crawled pages were not cached.
| cldellow wrote:
| That's true. On the other hand, pages with infoboxes are
| likely well-linked and will end up in the cache either
| due to legitimate popularity or due to crawler visits.
|
| Checking a random sample of 50 pages from this guy's
| dataset, 70% of them were cached.
| bawolff wrote:
 | Note - there are several levels of caching at Wikipedia.
 | Even if those pages aren't in the CDN (Varnish) cache, they
 | may be in the parser cache (an application-level cache of
 | most of the page).
 |
 | This amount of activity really isn't something to worry
 | about, especially when taking the fast path of a logged-out
 | user viewing a likely-to-be-cached page.
| jancsika wrote:
| > The infoboxes, which is what this guy is scraping, are much
| easier to scrape from the HTML than from the XML dumps.
|
| How is it possible that "give me all the infoboxes, please"
| is more than a single query, download, or even URL at this
| point?
| ceejayoz wrote:
| The problem lies in _parsing_ them.
|
| Look at the template for a subway line infobox, for
| example.
| https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT
|
| It's a whole little clever language (https://en.wikipedia.o
| rg/wiki/Wikipedia:Route_diagram_templa...) for making
| complex diagrams out of rather simple pictograms
| (https://commons.wikimedia.org/wiki/Template:Bsicon).
| jancsika wrote:
| Oh wow.
|
| But every other infobox I've seen has key/value pairs
| where the key was always a string.
|
 | So what's the spec for an infobox? Is it simply to have
| a starting `<table class="hello_i_am_infobox">` and an
| ending `</table>`?
| bawolff wrote:
 | En Wikipedia has some standards. Generally, though, they
 | are user-created tables and it's up to the users to make
 | them consistent (if they so desire). En Wikipedia
 | generally does, but it's not exactly a hard guarantee.
 |
 | If you want machine-readable data, use Wikidata (if you hate
 | RDF you can still scrape the HTML preview of the data).
| squeaky-clean wrote:
| The infoboxes aren't standardized at all. The HTML they
| generate is.
| jancsika wrote:
| Hehe-- I am going to rankly speculate nearly all of them
| follow an obvious standard of key/value pairs where the
| key is a string. And then there are like two or three
| subcultures on Wikipedia that put rando stuff in there
| and would troll to the death before being forced to
 | change their infobox class to "rando_box" or whatever
| negligible effort it would take them if a standard were
| to be enforced.
|
| Am I anywhere close to being correct?
| bawolff wrote:
| I think you'll have to more clearly define what you mean
| by "key-value" pairs.
| cldellow wrote:
| Even just being able to download a tarball of the HTML of
| the infoboxes would be really powerful, setting aside the
| difficulty of, say, translating them into a consistent JSON
| format.
|
| That plus a few other key things (categories, opening
| paragraph, redirects, pageview data) enable a lot of
| powerful analysis.
|
| That actually might be kind of a neat thing to publish.
| Hmmmm.
| jancsika wrote:
| Better yet-- what is the set of wikipedia articles which
| have an info box that cannot be sensibly interpreted as
| key/value pairs where the key is a simple string?
| spoonjim wrote:
| How does scraping raw HTML from Wikipedia hurt them? I'd think
| they could serve the HTML from cache more likely than the API
| call.
___________________________________________________________________
(page generated 2021-08-19 23:00 UTC)