[HN Gopher] Data-Mining Wikipedia for Fun and Profit
       ___________________________________________________________________
        
       Data-Mining Wikipedia for Fun and Profit
        
       Author : billpg
       Score  : 150 points
       Date   : 2021-08-19 14:17 UTC (8 hours ago)
        
 (HTM) web link (billpg.com)
 (TXT) w3m dump (billpg.com)
        
       | permo-w wrote:
       | >The graph was interesting but this wasn't the primary objective
       | of this exercise. I wanted to write "He is the n-times great-
       | father of his current successor Queen Elizabeth." on King
       | Alfred's Wikipedia page.
       | 
       | Wouldn't that contravene Wikipedia's rules on original research?
        
       | barbinbrad wrote:
       | For anyone who is interested: I used to work with a guy named
       | Richard Wang, who indexed Wikipedia as his training data set in
       | order to do named entity recognition. He'd be a good person to
        | talk to for anyone pursuing this.
       | 
       | Here's a demo: https://www.youtube.com/watch?v=SyhaxCjrZFw
        
       | mrVentures wrote:
        | Lol, I love that it was edited out right away. I probably
        | agree with that decision, but this was a really cool project.
        | I have followed a few YouTube channels that make graphical
        | visualizations of generations, and I think they would really
        | appreciate it if you shared this tech with them.
        
       | lizhang wrote:
       | Where's the profit?
        
         | ad404b8a372f2b9 wrote:
          | Good question. As a data hoarder, mining has always been
          | fun, but I was interested in seeing the profit part. I've
          | yet to find any monetary value in the troves of data I
          | accumulate.
        
         | billpg wrote:
         | For three minutes, something I wrote was cited on Wikipedia.
        
       | ygmelnikova wrote:
        | If the task is to answer the question "Does a royal lineage
        | exist between Alfred the Great and Queen Elizabeth?" then this
        | works fine, but the results show only one of possibly hundreds
        | of such paths. If you're of European descent, it's entirely
        | possible for you to find a similar connection to Alfred the
        | Great as well :)
        
       | [deleted]
        
       | papercrane wrote:
        | For this particular problem I wonder if Wikidata would be a
        | better fit than scraping the HTML.
        
         | billpg wrote:
         | (Author here.)
         | 
         | Perhaps, but I already know how to scrape HTML and I know the
         | data I wanted to pull out was in there. I have no idea how to
         | query wikidata and it could have ended up being a blind alley.
         | 
         | Also, it was only my reading your comment just now that told me
         | wikidata was even a thing.
        
           | papercrane wrote:
           | Hard to use it if you don't know about it!
           | 
            | I only thought of it myself because you mentioned the
            | problem of deducing which parent is the mother and which
            | is the father, and I remembered that in Wikidata those are
            | separate fields.
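            | 
            | For example, a rough Python sketch (assuming Q9682 is the
            | Wikidata item for Elizabeth II; P22 and P25 are the father
            | and mother properties):
            | 
            |     import requests
            | 
            |     # Special:EntityData serves the whole item as JSON.
            |     url = ("https://www.wikidata.org/wiki/"
            |            "Special:EntityData/Q9682.json")
            |     item = requests.get(url).json()["entities"]["Q9682"]
            | 
            |     for prop, label in (("P22", "father"), ("P25", "mother")):
            |         for claim in item["claims"].get(prop, []):
            |             value = claim["mainsnak"]["datavalue"]["value"]
            |             print(label, value["id"])  # the parent's Q-id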
        
           | mistrial9 wrote:
           | earth calling ivory tower, earth calling ivory tower
           | 
           | superior RDF triples are like martian language to millions of
           | humans
           | 
           | over
        
             | bawolff wrote:
              | Are you saying parsing HTML is easier than parsing RDF
              | triples? Because I don't know about that.
             | 
             | In real life you use tools for both.
        
             | udp wrote:
             | The ivory tower is working on it:
             | https://github.com/w3c/EasierRDF
        
               | liminal wrote:
               | I just looked at the linked repo. Have they made any
               | progress? It looks like _very_ early days.
        
               | tasogare wrote:
                | RDF has to be the best and saddest example of the
                | sunk cost fallacy. Instead of redirecting their
                | efforts to a more general graph model that has actual
                | hype and use among developers, its cultists are
                | doubling down on their abstruse technology stack,
                | making it ever more complicated while still not
                | addressing any of its fundamental problems.
        
           | shock-value wrote:
           | Generally Wikidata would definitely be the way to go here,
           | though I just now tried to retrace your graph in Wikidata and
           | it seems to be missing at least one relation (Ada of Holland
           | has no children listed --
           | https://www.wikidata.org/wiki/Q16156475).
        
           | hutzlibu wrote:
            | Yup, I was in the same situation some months ago, even
            | though I knew Wikidata was a thing.
            | 
            | I know JavaScript and had the pages at hand.
            | 
            | I looked at Wikidata and some pages about it, but still
            | had no clear idea how to use it and no motivation to dig
            | into it, because JS just worked with a small custom script
            | to retrieve some pages and data.
        
           | 3pt14159 wrote:
           | Don't worry about the haters. You needed a paltry amount of
           | data and you got it with the tools you had and knew.
           | 
            | When I was analyzing Wikipedia about 10 years ago for fun
            | and, later, actual profit, I did the responsible thing and
            | downloaded one of their megadumps because I needed every
            | English page. That's what people here are concerned about,
            | but it doesn't matter for your use case.
        
             | tcgv wrote:
             | > Don't worry about the haters
             | 
              | To be fair, the original comment just made a valid
              | observation in a casual way; he didn't criticize the
              | OP's approach, nor was he impolite.
             | 
             | But I know it's pretty common to see haters nitpicking
             | things all around ;)
        
         | powera wrote:
          | Definitely. There's even a public query service
          | ( https://query.wikidata.org/ ) which can do a lot of this
          | (though SPARQL is not great at searching for chains).
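          | 
          | Property paths can walk chains, though long ones tend to hit
          | the query timeout. A minimal sketch against the public
          | endpoint (Python; assuming Q9682 is Elizabeth II, with P22 =
          | father and P25 = mother):
          | 
          |     import requests
          | 
          |     query = """
          |     SELECT ?ancestor ?ancestorLabel WHERE {
          |       wd:Q9682 (wdt:P22|wdt:P25)+ ?ancestor .
          |       SERVICE wikibase:label {
          |         bd:serviceParam wikibase:language "en". }
          |     }
          |     LIMIT 200
          |     """
          |     resp = requests.get(
          |         "https://query.wikidata.org/sparql",
          |         params={"query": query, "format": "json"},
          |         headers={"User-Agent": "hn-demo/0.1"},
          |     )
          |     for row in resp.json()["results"]["bindings"]:
          |         print(row["ancestorLabel"]["value"])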
        
           | [deleted]
        
         | brodo wrote:
         | It's so sad that almost nobody knows or uses SPARQL...
        
           | guidovranken wrote:
            | At least the last few times I checked, the WikiData SPARQL
            | server was extremely slow, frequently timing out.
        
             | lacksconfidence wrote:
              | It seems to depend on the query. I can issue
              | straightforward queries that visit a few hundred
              | thousand triples easily, but when I write a query that
              | visits tens of millions of triples it times out.
        
             | epaulson wrote:
             | There's some mix between "it's slow" and "it sets its
             | timeout threshold too low" - a lot of queries would be OK
             | if they just had a bit more time to run. And unfortunately,
             | the time wasted on the badput of the killed queries just
             | slows down everyone else. (They really need a batch queue)
             | 
             | The Wikidata folks are well aware of the limits on their
             | SPARQL service. They just posted an update the other day:
             | 
             | https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.
             | w...
        
           | devbas wrote:
           | Because the syntax is relatively complex and it is difficult
           | to judge which endpoints and definitions to use.
        
             | ZeroGravitas wrote:
              | I learned SPARQL recently, and would agree it's
              | complicated to get info out of Wikidata.
              | 
              | However, having read the article, they didn't have an
              | easy time with scraping Wikipedia either.
              | 
              | So I'd probably still recommend people look into
              | Wikidata and SPARQL if they want to do this kind of
              | thing.
              | 
              | There are a few tools that generate queries for you, and
              | some CLI tools as well:
              | 
              | https://github.com/maxlath/wikibase-cli#readme
              | 
              | It makes Wikipedia better too, in a virtuous cycle, with
              | some infoboxes, like those that he scraped, being
              | converted to be populated automatically from Wikidata.
        
           | matkoniecz wrote:
            | In my experience SPARQL is really hard to use, and
            | Wikidata's data quality is really low, to the point that
            | one of my larger projects is just trying to filter the
            | data to make it usable for my use case.
           | 
           | Yes, I made some improvements ( https://www.wikidata.org/wiki
           | /Special:Contributions/Mateusz_... ).
           | 
            | But overall I would not encourage using it; had I known
            | how much work it takes to get usable data, I would not
            | have bothered with it.
            | 
            | Queries as simple as "is this entry describing an event, a
            | bridge, or neither" require extreme effort to get right in
            | a reliable way, including maintaining a private list of
            | patches and exemptions.
            | 
            | And bots creating millions of known duplicate entries and
            | expecting people to resolve this manually is quite
            | discouraging. Creating Wikidata entries for Cebuano
            | Wikipedia 'articles' was accepted, despite the Cebuano
            | Wikipedia being almost entirely bot-generated.
           | 
            | And that is before getting into the unclear legal status.
            | Yes, they can legally import databases covered by database
            | rights, but they should either make clear that Wikidata is
            | a legal quagmire in the EU or forbid such imports. The
            | Wikidata community did neither.
        
           | pphysch wrote:
           | Why should more people know SPARQL?
        
         | squeaky-clean wrote:
          | Like people in the other comments have said, if you've
          | actually tried getting data from Wikidata/Wikipedia, you
          | very quickly learn that the HTML is much easier to parse
          | than the results Wikidata gives you.
        
         | dalf wrote:
         | Yes:
         | 
         | * http://www.entitree.com/en/family_tree/Elizabeth_II
         | 
         | * https://family.toolforge.org/ancestors.php?q=Q187114
         | 
         | Tools found on this page:
         | https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data/...
         | 
         | ---
         | 
         | Some SPARQL queries:
         | https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
         | 
         | ---
         | 
          | Off topic: I wish Wikipedia would provide an API to get the
          | infoboxes (made using Lua or Wikidata).
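          | 
          | The closest thing today is probably the action API's parse
          | module, which returns the rendered page HTML with the
          | infobox in it; a minimal Python sketch (you still have to
          | cut the infobox out yourself):
          | 
          |     import requests
          | 
          |     # action=parse returns the rendered HTML of a page.
          |     resp = requests.get(
          |         "https://en.wikipedia.org/w/api.php",
          |         params={
          |             "action": "parse",
          |             "page": "Alfred the Great",
          |             "prop": "text",
          |             "format": "json",
          |             "formatversion": "2",
          |         },
          |     )
          |     html = resp.json()["parse"]["text"]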
        
           | zaik wrote:
           | There is a tool to get infobox data from Wikipedia into
           | Wikidata: https://pltools.toolforge.org/harvesttemplates/
           | 
           | The easily parsable Infobox data can probably already be
           | found in Wikidata (assuming there is a property).
        
       | ahP7Deew wrote:
       | Has anyone found an easy way to expand their templates without
       | using their whole stack? I tried getting Lua templates working
       | from Python but didn't get very far...
        
         | kemayo wrote:
         | Parsoid[1] is what you'd want for that, most likely. It's the
         | new wikitext parser that MediaWiki is gradually switching over
         | to, but it has the virtue of being usable entirely outside of
         | MediaWiki if you need to.
         | 
         | [1]: https://www.mediawiki.org/wiki/Parsoid
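          | 
          | If you only need the expanded output rather than running
          | Parsoid yourself, Wikimedia also serves Parsoid-generated
          | HTML over its REST API; a small Python sketch:
          | 
          |     import requests
          | 
          |     # Templates are already expanded in this HTML; elements
          |     # with typeof="mw:Transclusion" keep the original
          |     # template call in their data-mw attribute.
          |     resp = requests.get(
          |         "https://en.wikipedia.org/api/rest_v1/page/html/"
          |         "Alfred_the_Great",
          |         headers={"User-Agent": "hn-demo/0.1"},
          |     )
          |     parsoid_html = resp.text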
        
       | billpg wrote:
       | (Author here)
       | 
       | In an act of divine justice, my website is down.
       | 
       | https://web.archive.org/web/20210711201037/https://billpg.co...
       | 
       | (I'll send you a donation. Thank you!)
        
       | humanistbot wrote:
       | Please don't scrape raw HTML from Wikipedia. They do a lot of
       | work to make their content accessible in so many machine-readable
       | formats, from the raw XML dumps (https://dumps.wikimedia.org) to
       | the fully-featured API with a nice sandbox
       | (https://en.wikipedia.org/wiki/Special:ApiSandbox) and Wikidata
       | (https://wikidata.org).
        
         | whall6 wrote:
         | Genuine question from a non-programmer: why? Is it because the
         | volume of requests increases load on the servers/costs?
        
           | michaelbuckbee wrote:
           | That's part of it, but also it's typically much more
           | difficult and there's an element of "why are you making this
           | so much harder on yourself".
        
             | billpg wrote:
             | (Author of original article here.)
             | 
              | That's the great thing about HtmlAgilityPack: extracting
              | data from HTML is really easy. I might even say easier
              | than if I had the page in some table-based data system.
        
               | SlimyHog wrote:
               | The HTML is more volatile and subject to change than
               | other sources though
        
               | FalconSensei wrote:
                | I don't remember the last time Wikipedia changed the
                | infobox, though.
        
             | jcun4128 wrote:
              | You could make it even harder: use Puppeteer to take
              | screenshots, then pass them to an OCR engine to get the
              | text.
        
               | thechao wrote:
               | https://xkcd.com/378/
        
           | matkoniecz wrote:
            | Scraping, especially on a large scale, can put a
            | noticeable strain on servers.
            | 
            | Bulk downloads (database dumps) are much cheaper to serve
            | for someone crawling millions of pages.
            | 
            | It gets even more significant if generating a reply is
            | resource-intensive (not sure whether Wikipedia qualifies,
            | but complex templates may cause this).
        
           | nonameiguess wrote:
           | Unlike APIs, html class/tag names or whatever provide no
           | stability guarantees. The site owner can break your parser
            | whenever they want for any reason. They _can_ do that with
            | an API, but usually won't, since some guarantee of
            | stability is the point of an API.
        
             | billpg wrote:
             | True, but the analysis was done on files downloaded over
             | the span of two or three days. If someone had decided to
             | change the CSS class of an infobox during that time, I'd
             | have noticed, investigated and adjusted my code
             | appropriately.
        
         | traceroute66 wrote:
         | IANAL but since the pages are published under Creative Commons
         | Attribution-ShareAlike, if someone wishes to collect the text
         | on the basis of the HTML version then there's not much you can
         | do about it.
         | 
          | Wikimedia no doubt have caching, CDNs and all that jazz in
          | place, so the likely impact on infrastructure is probably de
          | minimis in the grand scheme of things (compared with the
          | thousands or millions of humans who visit the site every
          | second).
        
           | learc83 wrote:
           | >IANAL but since the pages are published under Creative
           | Commons Attribution-ShareAlike, if someone wishes to collect
           | the text on the basis of the HTML version then there's not
           | much you can do about it.
           | 
            | They said _please_ don't, not "don't do it or they'll sue
            | you".
           | 
           | But content license and site terms of use are different
           | things.
           | 
           | From their terms of use you aren't allowed to
           | 
           | > [Disrupt] the services by placing an undue burden on a
           | Project website or the networks or servers connected with a
           | Project website;
           | 
           | Wikipedia is also well within their rights to implement
           | scraping countermeasures.
        
             | traceroute66 wrote:
             | > [Disrupt] the services by placing an undue burden on a
             | Project website or the networks or servers connected with a
             | Project website;
             | 
              | Two things:
              | 
              | 1) The English Wikipedia *alone* gets 250 million page
              | views per day! So you would have to be doing an awful
              | lot to cause "undue burden".
              | 
              | 2) The Wikipedia robots.txt page openly implies that
              | crawling (and therefore scraping) is acceptable *as long
              | as* you do it in a rate-limited fashion, e.g.:
              | 
              | > Friendly, low-speed bots are welcome viewing article
              | > pages, but not dynamically-generated pages please.
              | > There are a lot of pages on this site, and there are
              | > some misbehaved spiders out there that go _way_ too
              | > fast.
              | > Sorry, wget in its recursive mode is a frequent
              | > problem. Please read the man page and use it properly;
              | > there is a --wait option you can use to set the delay
              | > between hits, for instance.
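              | 
              | In script form, the "friendly, low-speed" behaviour it
              | asks for is just a delay between requests and an
              | identifiable User-Agent; a rough Python sketch:
              | 
              |     import time
              |     import requests
              | 
              |     session = requests.Session()
              |     # Wikimedia asks bots to identify themselves.
              |     session.headers["User-Agent"] = (
              |         "family-tree-demo/0.1 (example@example.org)")
              | 
              |     for title in ["Alfred_the_Great", "Elizabeth_II"]:
              |         resp = session.get(
              |             "https://en.wikipedia.org/wiki/" + title)
              |         resp.raise_for_status()
              |         # ... cache resp.text to disk so each page is
              |         # fetched only once ...
              |         time.sleep(1)  # stay well below any rate limit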
        
               | learc83 wrote:
               | 1. You'd be surprised what kind of traffic scrapers can
               | generate. I've seen scraping companies employing botnets
               | to get around rate limiting that could easily cost enough
               | in extra server fees to cause an "undue burden".
               | 
               | At a previous company we had the exact problem that we
               | published all of our content as machine readable xml, but
               | we had scrapers costing us money by insisting on using
               | our search interface to access our content.
               | 
                | 2. No one is going to jail for scraping a few thousand
                | or even a few million pages, but just because
                | low-speed web crawlers are allowed to index the site
                | doesn't mean scraping for every possible use is
                | permitted.
        
           | _jal wrote:
           | "Who's gonna stop me" is kind of a crappy attitude to take
           | with a cooperative project like Wikipedia.
           | 
           | I mean, sure, you can do a lot of things you shouldn't with
           | freely available services. There's even an economics term
           | that describes this: the Tragedy of the Commons.
           | 
            | Individual fish poachers' hauls are also de minimis.
        
         | cldellow wrote:
         | The infoboxes, which is what this guy is scraping, are much
         | easier to scrape from the HTML than from the XML dumps.
         | 
         | The reason is that the dumps just have pointers to templates,
         | and you need to understand quite a bit about Wikipedia's
         | bespoke rendering system to know how to fully realize them (or
         | use a constantly-evolving library like wtf_wikipedia [1] to
         | parse them).
         | 
         | The rendered HTML, on the other hand, is designed for humans,
         | and so what you see is what you get.
         | 
         | [1]: https://github.com/spencermountain/wtf_wikipedia
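          | 
          | As a rough Python analogue of what the article does with
          | HtmlAgilityPack (the "infobox" class name is what English
          | Wikipedia currently uses, but nothing guarantees it):
          | 
          |     import requests
          |     from bs4 import BeautifulSoup
          | 
          |     html = requests.get(
          |         "https://en.wikipedia.org/wiki/Alfred_the_Great").text
          |     soup = BeautifulSoup(html, "html.parser")
          | 
          |     # Most infobox rows are one <th> label and one <td> value.
          |     infobox = soup.find("table", class_="infobox")
          |     for row in infobox.find_all("tr"):
          |         th, td = row.find("th"), row.find("td")
          |         if th and td:
          |             print(th.get_text(" ", strip=True), "=",
          |                   td.get_text(" ", strip=True))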
        
           | lou1306 wrote:
           | Still, I guess you could get the dumps and do a local
           | Wikimedia setup based on them, and then crawl _that_ instead?
        
             | cldellow wrote:
             | You could, and if he was doing this on the entire corpus
             | that'd be the responsible thing to do.
             | 
             | But, his project really was very reasonable:
             | 
             | - it fetched ~2,400 pages
             | 
             | - he cached them after first fetch
             | 
             | - Wikipedia aggressively caches anonymous page views (eg
             | the Queen Elizabeth page has a cache age of 82,000 seconds)
             | 
             | English Wikipedia does about 250,000,000 pageviews/day.
             | This guy's use was 0.001% of traffic on that day.
             | 
             | I get the slippery slope arguments, but to me, it just
             | doesn't apply. As someone who has donated $1,000 to
             | Wikipedia in the past, I'm totally happy to have those
             | funds spent supporting use cases like this, rather than
             | demanding that people who want to benefit from Wikipedia be
             | able to set up a MySQL server, spend hours doing the
             | import, install and configure a PHP server, etc, etc.
        
               | habibur wrote:
               | > This guy's use was 0.001% of traffic on that day
               | 
                | For one person consuming from one of the most popular
                | sites on the web, that really reads as big.
        
               | cldellow wrote:
               | He was probably one of the biggest users that day, so
               | that makes sense.
               | 
               | The 2,400 pages, assuming a 50 KB average gzipped size,
               | equate to 120 MB of transfer. I'm assuming CPU usage is
               | negligible due to CDN caching, and so bandwidth is the
               | main cost. 120 MB is orders of magnitude less transfer
               | than the 18.5 GB dump.
               | 
               | Instead of the dumps, he could have used the API -- but
               | would that have significantly changed the costs to the
               | Wikimedia foundation? I think probably not. In my
               | experience, the happy path (serving anonymous HTML) is
               | going to be aggressively optimized for costs, eg caching,
               | CDNs, negotiated bandwidth discounts.
               | 
               | If we accept that these kinds of projects are permissible
               | (which no one seems to be debating, just the manner in
               | which he did the project!), I think the way this guy went
               | about doing it was not actually as bad as people are
               | making it out to be.
        
               | NicoJuicy wrote:
               | I don't think I agree. Cache has a cost too.
               | 
               | In theory, you'd want to cache more popular pages and let
               | the rarely visited ones go through the uncached flow.
               | 
                | Crawling isn't user behavior, so the odds are that a
                | large percentage of the crawled pages were not cached.
        
               | cldellow wrote:
               | That's true. On the other hand, pages with infoboxes are
               | likely well-linked and will end up in the cache either
               | due to legitimate popularity or due to crawler visits.
               | 
               | Checking a random sample of 50 pages from this guy's
               | dataset, 70% of them were cached.
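                | 
                | The check itself is just a look at the response
                | headers; assuming the CDN exposes the usual Age and
                | x-cache headers (which Wikipedia's responses appear to
                | include), a rough Python sketch:
                | 
                |     import requests
                | 
                |     resp = requests.head(
                |         "https://en.wikipedia.org/wiki/Alfred_the_Great",
                |         headers={"User-Agent": "hn-demo/0.1"},
                |     )
                |     # A nonzero Age (and an x-cache "hit") means the
                |     # CDN served the page from cache.
                |     print(resp.headers.get("age"),
                |           resp.headers.get("x-cache"))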
        
               | bawolff wrote:
                | Note: there are several levels of caching at
                | Wikipedia. Even if those pages aren't in the CDN
                | (Varnish) cache, they may be in the parser cache (an
                | application-level cache of most of the page).
                | 
                | This amount of activity really isn't something to
                | worry about, especially when taking the fast path of a
                | logged-out user viewing a page that is likely to be
                | cached.
        
           | jancsika wrote:
           | > The infoboxes, which is what this guy is scraping, are much
           | easier to scrape from the HTML than from the XML dumps.
           | 
           | How is it possible that "give me all the infoboxes, please"
           | is more than a single query, download, or even URL at this
           | point?
        
             | ceejayoz wrote:
             | The problem lies in _parsing_ them.
             | 
             | Look at the template for a subway line infobox, for
             | example.
             | https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT
             | 
              | It's a clever little language (https://en.wikipedia.o
             | rg/wiki/Wikipedia:Route_diagram_templa...) for making
             | complex diagrams out of rather simple pictograms
             | (https://commons.wikimedia.org/wiki/Template:Bsicon).
        
               | jancsika wrote:
               | Oh wow.
               | 
               | But every other infobox I've seen has key/value pairs
               | where the key was always a string.
               | 
               | So what's the spec for an info box? Is it simply to have
               | a starting `<table class="hello_i_am_infobox">` and an
               | ending `</table>`?
        
               | bawolff wrote:
                | English Wikipedia has some standards. Generally,
                | though, they are user-created tables and it's up to
                | the users to make them consistent (if they so desire).
                | English Wikipedia generally does, but it's not exactly
                | a hard guarantee.
                | 
                | If you want machine-readable data, use Wikidata (if
                | you hate RDF you can still scrape the HTML preview of
                | the data).
        
             | squeaky-clean wrote:
             | The infoboxes aren't standardized at all. The HTML they
             | generate is.
        
               | jancsika wrote:
               | Hehe-- I am going to rankly speculate nearly all of them
               | follow an obvious standard of key/value pairs where the
               | key is a string. And then there are like two or three
               | subcultures on Wikipedia that put rando stuff in there
               | and would troll to the death before being forced to
               | change the their infobox class to "rando_box" or whatever
               | negligible effort it would take them if a standard were
               | to be enforced.
               | 
               | Am I anywhere close to being correct?
        
               | bawolff wrote:
               | I think you'll have to more clearly define what you mean
               | by "key-value" pairs.
        
             | cldellow wrote:
             | Even just being able to download a tarball of the HTML of
             | the infoboxes would be really powerful, setting aside the
             | difficulty of, say, translating them into a consistent JSON
             | format.
             | 
             | That plus a few other key things (categories, opening
             | paragraph, redirects, pageview data) enable a lot of
             | powerful analysis.
             | 
             | That actually might be kind of a neat thing to publish.
             | Hmmmm.
        
               | jancsika wrote:
                | Better yet -- what is the set of Wikipedia articles
                | that have an infobox that cannot be sensibly
                | interpreted as key/value pairs where the key is a
                | simple string?
        
         | spoonjim wrote:
          | How does scraping raw HTML from Wikipedia hurt them? I'd
          | think they could serve the HTML from cache more readily than
          | an API call.
        
       ___________________________________________________________________
       (page generated 2021-08-19 23:00 UTC)