[HN Gopher] Wikidata or Scraping Wikipedia
       ___________________________________________________________________
        
       Wikidata or Scraping Wikipedia
        
       Author : Lockal
       Score  : 119 points
       Date   : 2021-08-23 15:53 UTC (7 hours ago)
        
 (HTM) web link (simia.net)
 (TXT) w3m dump (simia.net)
        
       | dheera wrote:
        | I'd highly recommend parsing the mobile edition of Wikipedia; it
        | is much easier to parse.
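        | 
        | Something like this (a minimal sketch; the article and the
        | selector are just illustrative) is usually enough to get started:
        | 
        |     import requests
        |     from bs4 import BeautifulSoup
        | 
        |     # Fetch the mobile rendering of an article and print the
        |     # first non-empty paragraph. The article is just an example.
        |     html = requests.get(
        |         "https://en.m.wikipedia.org/wiki/Turing_Award",
        |         headers={"User-Agent": "wiki-demo/0.1 (contact@example.org)"},
        |     ).text
        |     soup = BeautifulSoup(html, "html.parser")
        |     for p in soup.find_all("p"):
        |         text = p.get_text(strip=True)
        |         if text:
        |             print(text)
        |             break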
        
       | svat wrote:
       | This is a great post, which also happens to serve as a good
       | illustration of the "curse of knowledge" and the typical blind-
       | spots of enthusiasts. Consider the timeline of events:
       | 
        | * The blog post on scraping Wikipedia (https://billpg.com/data-
        | mining-wikipedia/ , HN discussion 4 days ago:
        | https://news.ycombinator.com/item?id=28234122 ) which mentions
        | Wikidata as an alternative, etc.
       | 
       | * The author of this post, a Wikidata person, finds this an
       | "extremely surprising discussion", and posts a Twitter thread (
       | https://web.archive.org/web/20210820105621/https://twitter.c... )
       | ending with
       | 
        | > _I don't want to argue or disagree, I am just completely
       | surprised by that statement. Are the docs so bad? Is the API
       | design of Wikidata so weird or undiscoverable? There are plenty
       | of libraries for getting Wikidata data, are they all so hard to
       | use? I am really curious._
       | 
       | This curiosity is a great attitude! (But...)
       | 
       | * After seeing the HN discussion and responses on
       | Twitter/Facebook, he writes this post linked here. In this post,
       | he does mention what he learned from potential users:
       | 
       | > _And there were some very interesting stories about the pain of
       | using Wikidata, and I very much expect us to learn from them and
       | hopefully make things easier. The number of API queries one has
       | to make in order to get data [...], the learning curve about
       | SPARQL and RDF (although, you can ignore both, unless you want to
       | use them explicitly - you can just use JSON and the Wikidata
       | API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead
       | of "mother" and "Queen Elizabeth II") were just a few. The
       | documentation seems hard to find, there seem to be a lack of
       | libraries and APIs that are easy to use. And yet, comments like
       | "if you've actually tried getting data from wikidata/wikipedia
       | you very quickly learn the HTML is much easier to parse than the
       | results wikidata gives you" surprised me a lot. [...] I am not
       | here to fight. I am here to listen and to learn, in order to help
       | figuring out what needs to be made better._
       | 
       | Again, very commendable! Almost an opening to really
       | understanding the perspective of casual potential users. But
       | then: the entire rest of the post does not really address "the
       | other side", and instead completely focuses on the kinds of
       | things Wikidata enthusiasts care about: comparing Wikipedia and
       | Wikidata quality in this example, etc.
       | 
        | I mean, sure, this query he presents is short:
        | 
        |     select * {
        |       wd:Q9682 (wdt:P25|wdt:P22)* ?p .
        |       ?p wdt:P25|wdt:P22 ?q
        |     }
       | 
       | but when he says:
       | 
       | > _I would claim that I invested far less work than Bill in
       | creating my graph data. No data cleansing, no scraping, no
       | crawling, no entity reconciliation, no manual checking._
       | 
       | he's ignoring the work he invested in learning that query
       | language (and where to query it), for instance. And this post
       | would have been a perfect opportunity to teach readers about how
       | to go from the question "all ancestors of Queen Elizabeth" to
       | that query (and in trying to teach it, he may have better
       | discovered exactly what is hard about it), but he just squanders
       | the opportunity (just as when he says "plenty of libraries"
       | without inviting exploration by linking to the easiest one): this
       | is a typical thing enthusiasts do, which is unfortunate IMO.
       | 
       | When scraping HTML from Wikipedia, one is using general-purpose
       | well-known tools. You'll get slightly better at whatever general-
       | purpose programming language and libraries you were using, learn
       | something that may be useful the next time you need to scrape
       | something else. And most importantly, you know that you'll
       | finish, you can see a path to success. When exploring something
       | "alternative" like Wikidata, you aren't sure if it will work, so
       | the alternative path needs to work harder to convince potential
       | users of success.
       | 
       | ---
       | 
       | Personal story: I actually _know_ about the existence of
        | Wikidata. Yet the one time I tried to use it, I couldn't figure
       | out how. This is what I was trying to do: plot a graph of the
       | average age of Turing Award winners by year. (Reproduce the first
       | figure from here: http://hagiograffiti.blogspot.com/2009/01/when-
       | will-singular... just for fun) One would think this is a perfect
       | use-case for Wikidata: presumably it has a way of going from
       | Turing Award - list of winners - each winner's date of birth. But
       | I was stymied at the very first step: despite knowing of the
       | existence of Wikidata, and being able to go from the Wikipedia
       | page that lists all recipients (current version:
       | https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi... )
       | to the Wikidata item for "Turing Award" (look for "Wikidata item"
       | in the sidebar on the left) https://www.wikidata.org/wiki/Q185667
       | I could not quickly find a way of getting a list of recipients
        | from there. Tantalizingly, the data may well exist: e.g. if I go
        | to one of the recipients like Leslie Valiant
        | (https://www.wikidata.org/wiki/Q93154) I see a "statement" award
        | received - Turing Award, with "property" point in time - 2010.
       | Even after coming so close, and being interested in using
       | Wikidata, it was not easy enough for me to get to the next step
       | (which I still imagine is possible, maybe with tens of minutes of
       | effort), until I just decided "screw this, I'll just scrape the
       | Wikipedia page" (I scraped the wikisource rather than html). And
       | if one is going to have to scrape anyway, then might as well do
       | the rest too (dates of birth) with scraping.
        
         | bawolff wrote:
         | > When scraping HTML from Wikipedia, one is using general-
         | purpose well-known tools. You'll get slightly better at
         | whatever general-purpose programming language and libraries you
         | were using, learn something that may be useful the next time
         | you need to scrape something else. And most importantly, you
         | know that you'll finish, you can see a path to success. When
         | exploring something "alternative" like Wikidata, you aren't
         | sure if it will work, so the alternative path needs to work
         | harder to convince potential users of success.
         | 
          | I'm not sure it's that clear. Scrapping is pretty generic, but
         | SPARQL is hardly a proprietary query language - other things
         | use it. If what you're into is obtaining data, sparql might
         | more generically apply than scrapping would. It really depends
         | on what you are doing in the future. At the very least if you
         | do scrapping a lot, you're probably going to reinvent the
         | parsing wheel a lot. To each their own.
         | 
         | > he's ignoring the work he invested in learning that query
         | language (and where to query it), for instance
         | 
         | And Bill is ignoring the work of learning how to program. None
          | of us start from nothing, and it's not like any of this is
         | trivial to learn if you've never touched a computer before.
         | 
          | And to be clear, I'm not objecting - there is nothing wrong with
         | using the skills you currently have to solve the problem you
         | currently have. Whatever gets you the solution. If you're
         | querying wikidata (or similar things) everyday, learning sparql
         | is probably a good investment. If you're interested in sparql,
          | then by all means learn it. But if those don't apply, then
         | scrapping makes sense if you already know how to do that.
        
           | bryanrasmussen wrote:
           | > he's ignoring the work he invested in learning that query
           | language (and where to query it), for instance
           | 
           | >And Bill is ignoring the work of learning how to program.
           | 
           | I suppose if you didn't know how to program you wouldn't
           | learn Sparql. So the investment in learning how to program
           | has already been made.
        
             | bawolff wrote:
             | Why not? People sometimes learn SQL without learning to
             | program, why not sparql?
        
               | bryanrasmussen wrote:
               | Well one reason why someone might learn SQL without
               | learning how to program is that you can get jobs for it.
               | 
               | Ah, but the response might go, lots of people learned SQL
               | when there weren't a lot of jobs for people who knew SQL.
               | 
               | Yes, my response would be, but that was a long time ago
               | and the incentives for people to learn technologies have
                | changed, and I do not think a significant number of
                | people will learn SQL without learning to program
                | henceforth; at least not in numbers significant enough
                | that anyone will say "Well look at that trend!".
               | 
                | Here there can be several responses, so I won't go through
               | all the branches, but in the end I don't think there is
               | going to be an interest in learning Sparql in people who
               | are not programmers or at least programming adjacent
               | professions, and from what I see there hasn't been that
               | much interest from people who are programmers.
        
           | svat wrote:
           | > [Scraping] is pretty generic, but SPARQL is hardly a
           | proprietary query language - other things use it. If what
           | you're into is obtaining data, sparql might more generically
           | apply than [scraping] would. It really depends on what you
           | are doing in the future.
           | 
            | Yes, my point exactly! My point was that _even_ when trying
            | to consider the perspective of people different from us, we
            | can end up writing for (and from the perspective of) people
            | who are "into" the same things as us. Casual users like in
            | the original scraping post are not really that "into"
            | obtaining data, which can be a blind spot for enthusiasts
            | who are. The challenge and opportunity in such cases is
            | really communication with the outside of the field, rather
            | than competition within the field.
        
             | TechBro8615 wrote:
             | "Scrapping" is like nails scrapping on a chalkboard for me.
        
         | Minor49er wrote:
         | Absolutely spot-on. It makes me think of my own experience.
         | 
         | I've worked for a few niche search engines. Some sites have
         | APIs available so that you don't have to scrape their data. But
          | oftentimes, since we were already used to scraping sites, we
          | wouldn't even notice that an API was available. In a number of
          | cases, an API _was_ available, but it was more restrictive or
          | complicated than just scraping the page. That's not to say that
          | we never used them, because we certainly did. Just that we were
          | often simply not aware that they were an option, since they
          | were not very common in our cases.
        
         | dalf wrote:
          | About the Turing Award, after some trial and error, I _think_
          | this is the query: https://w.wiki/3wmY
         | 
         | Disclaimer: I follow
         | https://www.youtube.com/channel/UCp2i8QpLDnWge8wZGKizVVw /
         | https://www.twitch.tv/belett (mostly in French, sometimes in
         | English).
         | 
          | Without these courses, I wouldn't have been able to write this
          | query.
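          | 
          | Something of this shape (the query behind the short link may
          | differ in details) pulls the winners with award year and birth
          | date - assuming the usual property IDs P166 ("award received"),
          | its P585 ("point in time") qualifier, and P569 ("date of
          | birth"):
          | 
          |     import requests
          | 
          |     QUERY = """
          |     SELECT ?winnerLabel ?awarded ?born WHERE {
          |       ?winner p:P166 ?statement .
          |       ?statement ps:P166 wd:Q185667 ;   # Turing Award
          |                  pq:P585 ?awarded .
          |       ?winner wdt:P569 ?born .
          |       SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
          |     }
          |     ORDER BY ?awarded
          |     """
          | 
          |     rows = requests.get(
          |         "https://query.wikidata.org/sparql",
          |         params={"query": QUERY, "format": "json"},
          |         headers={"User-Agent": "turing-ages/0.1 (contact@example.org)"},
          |     ).json()["results"]["bindings"]
          | 
          |     # award year, laureate, date of birth
          |     for row in rows:
          |         print(row["awarded"]["value"][:4],
          |               row["winnerLabel"]["value"],
          |               row["born"]["value"][:10])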
        
           | bawolff wrote:
            | I think this is an interesting case because scraping this is
            | easy (just one page), whereas the wikidata query requires
            | dealing with qualifiers, which is a bit more complex.
        
             | dalf wrote:
             | (It requires the birth dates, so it is more than one page)
             | 
              | The HTML structure may change over time: if the extraction
              | is run a few times over a long period, the scraper may/will
              | require more maintenance than the SPARQL query.
              | 
              | For example, the same Wikipedia page from 3 years ago is
              | slightly different: https://en.wikipedia.org/w/index.php?ti
              | tle=Turing_Award&oldi...
        
               | 1vuio0pswjnm7 wrote:
               | "The HTML structure may change over time..."
               | 
               | A very common argument in HN comments that discuss the
               | merits of so-called web APIs.
               | 
               | Fair balance:
               | 
               | Web APIs can change (e.g., v1 -> v2), they can be
               | discontinued, their terms of use can change, quotas can
               | be enforced, etc.
               | 
               | A public web page does not suffer from those drawbacks.
               | Changes that require me to rewrite scripts are generally
               | infrequent. What happens more often is websites that
               | provide good data/information sources simply go offline.
               | 
                | There is nothing wrong with web APIs per se, I welcome
                | them (I use the same custom HTTP generator and TCP/TLS
                | clients for both), but the way "APIs" are presented, as
                | some sort of "special privilege", requiring "sign up", an
                | email address and often more personal information, maybe
                | even payment, is, for the _user_ (as opposed to the
                | developer), inferior to a public webpage, IMHO. As a
                | user, not a developer, HTTP pipelining works better for
                | me than many web APIs. I can get large quantities of
                | data/information in one or a small number of TCP
                | connections (I never have to use proxies nor do I ever
                | get banned); it requires no disclosure of personal
                | details and is not subject to arbitrary limits.
               | 
               | What's interesting about this Wikidata/Wikipedia case is
               | that the term chosen was "user" not "developer". It
               | appears we cannot assume that the only persons who will
               | use this "API" are ones who intend to insert the
               | retrieved data/information into some other webpage or
               | "app" that probably contains advertising and/or tracking.
               | It is for everyone, not just "developers".
        
           | beprogrammed wrote:
           | And today I learned Donald Knuth was the youngest Turing
           | award winner at the age of 36. I'm going to have to go learn
           | SPARQL.
        
         | ak217 wrote:
         | I also had a horrible experience using the recommended SPARQL
         | interface to query Wikidata. The queries were inscrutable, the
         | documentation was poor and even after writing the correct
         | queries, they timed out after scanning a tiny fraction of the
         | data I needed, making the query engine useless to me.
         | 
         | However, I had great success querying Wikidata via the "plain
         | old" MediaWiki Query API:
         | https://www.mediawiki.org/wiki/API:Query. That API was a joy to
         | work with.
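          | 
          | For example (a sketch using the wbgetentities module, which
          | lives alongside action=query on wikidata.org), pulling one
          | item's label and claims is a single GET:
          | 
          |     import requests
          | 
          |     # Fetch an item's English label and claims as plain JSON,
          |     # no SPARQL involved. Q185667 is the Turing Award.
          |     resp = requests.get(
          |         "https://www.wikidata.org/w/api.php",
          |         params={
          |             "action": "wbgetentities",
          |             "ids": "Q185667",
          |             "props": "labels|claims",
          |             "languages": "en",
          |             "format": "json",
          |         },
          |         headers={"User-Agent": "wd-api-demo/0.1 (contact@example.org)"},
          |     )
          |     entity = resp.json()["entities"]["Q185667"]
          |     print(entity["labels"]["en"]["value"])
          |     print(sorted(entity["claims"]))  # property IDs on the item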
         | 
         | Wikidata (as a backing store for Wikipedia and a knowledge
         | graph engine) is a very powerful concept. It's a key platform
         | technology for Wikipedia and hopefully they'll prioritize its
         | usability going forward.
        
       | ComputerGuru wrote:
       | Site is down. Archive.org link:
       | https://web.archive.org/web/20210821004307/http://simia.net/...
        
       | nine_k wrote:
       | I think this project is not getting enough attention:
       | https://github.com/zverok/wikipedia_ql
       | 
        | It allows you to query Wikipedia (not wikidata, but the actual human-
       | readable text) more or less directly, mixing the way you describe
       | a scraper with some nicer higher-level constructs.
       | 
       | Can't vouch for its performance, but the API is interesting and
       | nice.
        
       | dhosek wrote:
       | On a related point, while doing some Unicode research, I
       | discovered that the Unicode project itself uses wikidata as an
        | (untrusted) source for some data (translations of names, if I
        | recall correctly), cf.
       | https://www.unicode.org/review/pri408/pri408-tr51-QID.html
       | although that's not the reference I encountered earlier today.
       | Their system is set up so that if the Unicode organization
       | corrects something previously read, it takes precedence over what
       | was pulled from wikidata, but otherwise the wikidata value will
       | be used.
        
         | dhosek wrote:
         | Ah, this is what I read earlier today, yay Google+color-
         | changing links http://cldr.unicode.org/implementers-faq
        
       | simonw wrote:
        | Herein lies the problem for me:
        | 
        |     select * {
        |       wd:Q9682 (wdt:P25|wdt:P22)* ?p .
        |       ?p wdt:P25|wdt:P22 ?q
        |     }
       | 
       | I am extremely motivated to learn how to use this: I have a deep
       | desire to extract data from Wikipedia, and I'm fascinated by
       | graph databases.
       | 
       | And yet, despite trying on several previous occasions, SPARQL has
       | completely failed to stick in my brain.
       | 
       | This is partly my own failing: I'm confident that if I really
       | dedicated myself to it I could get over this hump.
       | 
       | But it's also a sign that the learning curve here really is
       | tremendously steep, which I think indicates a problem with the
       | design of the technology.
        
         | dmitriid wrote:
          | Or use something like https://github.com/zverok/wikipedia_ql
          | which uses the MediaWiki API.
        
         | bawolff wrote:
         | I find it helps to translate the syntax into english:
         | 
         | > select *
         | 
         | Show all variables starting with ?
         | 
         | > wd:Q9682
         | 
          | Find item Q9682 (Queen Elizabeth II)
         | 
         | > (wdt:P25|wdt:P22)*
         | 
          | Follow edges that are either P22 (father) or P25 (mother) zero
          | or more times
         | 
          | Every time you follow one of those edges, add the new item to
         | ?p. Keep following these edges until you can't anymore.
         | 
         | > ?p wdt:P25|wdt:P22 ?q
         | 
          | For every ?p, follow a mother/father edge precisely once and
          | call the item it points to ?q (if there is no such edge, we
          | drop that ?p).
         | 
          | The end result is that we have a list of rows containing pairs
          | of (an ancestor of Elizabeth, one of that ancestor's direct
          | parents).
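          | 
          | To see it end to end, here is a sketch of running that exact
          | query, with the standard label service bolted on so the Q-ids
          | come back with names:
          | 
          |     import requests
          | 
          |     QUERY = """
          |     SELECT ?pLabel ?qLabel WHERE {
          |       wd:Q9682 (wdt:P25|wdt:P22)* ?p .
          |       ?p wdt:P25|wdt:P22 ?q .
          |       SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
          |     }
          |     """
          | 
          |     rows = requests.get(
          |         "https://query.wikidata.org/sparql",
          |         params={"query": QUERY, "format": "json"},
          |         headers={"User-Agent": "ancestors-demo/0.1 (contact@example.org)"},
          |     ).json()["results"]["bindings"]
          | 
          |     # each row: (an ancestor, one of that ancestor's parents)
          |     for row in rows:
          |         print(row["pLabel"]["value"], "->", row["qLabel"]["value"])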
         | 
         | ----
         | 
          | I feel like one of the reasons sparql is confusing is that
          | people bring their intuitions from SQL, which are wrong here -
          | the underlying data model is different, but the syntax looks
          | vaguely SQL-like, which leads to misunderstandings.
        
           | freeone3000 wrote:
            | Where do you find the translations from wdt:P25 to
           | "mother"? That's the most incomprehensible part. It feels
           | like I need a reverse dictionary lookup to write a single
           | query.
        
         | mmarx wrote:
         | A few days ago, the Wikidata Query Builder[0] was deployed. It
         | provides a visual interface to generate simple SPARQL queries,
         | and you can show the generated queries. Maybe this can help you
         | in understanding how SPARQL patterns work?
         | 
         | [0] https://query.wikidata.org/querybuilder/
        
           | simonw wrote:
           | That does look like a big step forward.
           | 
           | It could really benefit from some linked examples on that
           | page though - I stared at the interface for quite a while,
           | unable to figure out how to use it for anything - then I dug
           | around for an example link and it started to make sense to
           | me.
           | 
           | https://query.wikidata.org/querybuilder/?query=%7B%22conditi.
           | ..
        
       | exogen wrote:
       | I missed the original HN and twitter threads referenced in the
       | post, so I might just be repeating something that was already
       | said there...
       | 
       | But, in nearly all cases I would trust a bespoke Wikipedia
       | scraper over using the output of Wikidata or DBpedia. Not to
       | disparage either project, because they're great ideas and good
       | efforts. I have a firm grasp of RDF and SPARQL queries (used to
       | work with them professionally), which also makes them tempting to
       | use.
       | 
       | One issue is that Wikidata tends to only report facts whose
       | subjects or objects themselves have articles (and thus Wikidata
       | entities).
       | 
       | For example, compare the "Track listing" section of Carly Rae
       | Jepsen's Curiosity EP on Wikipedia vs. the "track listing"
       | property on Wikidata.
       | 
        | Wikipedia has:
        | 
        |     1. Call Me Maybe (link)
        |     2. Curiosity (link)
        |     3. Picture
        |     4. Talk to Me
        |     5. Just a Step Away
        |     6. Both Sides Now (link)
        | 
        | while Wikidata has:
        | 
        |     1. Call Me Maybe
        |     2. Curiosity
       | 
       | So not only has it ignored any tracks that aren't deserving of
       | their own articles, but it also missed one that actually does
       | have an article (track 6, a cover of "Both Sides, Now").
       | 
       | > Others asked about the data quality of Wikidata, and complained
       | about the huge amount of bad data, duplicates, and the bad
       | ontology in Wikidata (as if Wikipedia wouldn't have these
       | problems. I mean how do you figure out what a Wikipedia article
       | is about? How do you get a list of all bridges or events from
       | Wikipedia?)
       | 
       | Often the problem isn't that Wikipedia is wrong, it's that
       | Wikidata's own parser (however it works) doesn't account for the
       | many ways people format things on Wikipedia. With a bespoke
       | parser, you can improve it over time as you encounter edge cases.
       | With Wikidata, you can't really fix anything... the data is
       | already extracted (right or wrong) and all the original context
        | is lost.
        
         | mistrial9 wrote:
          | Acute observation! There is something here to be teased out..
          | about.. the final product has been a human-readable page all
          | these years, and that human-readable page got better in ad hoc
          | ways, and almost all of those improvements stuck..
          | 
          | Compare to the RDF efforts, which ride a rigorous math-y
          | perspective and had a far, far smaller development crowd right
          | from the start..
        
       | nathell wrote:
       | Wikipedia asks people not to crawl it. There are database dumps
       | that you can instead import into your local MySQL and work from
       | there.
       | 
       | https://en.wikipedia.org/wiki/Wikipedia:Database_download#Pl...
        
         | bawolff wrote:
         | Wikipedia has no objection to crawling a couple thousand pages
         | if you do so at a reasonable speed and set a user-agent with a
         | contact email.
         | 
         | If you want to crawl millions of pages please use a dump.
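          | 
          | In practice that looks something like this (titles are just
          | examples; the important parts are the descriptive User-Agent
          | and the pause between requests):
          | 
          |     import time
          |     import requests
          | 
          |     # Identify yourself with a contact address and keep the
          |     # request rate low.
          |     HEADERS = {"User-Agent": "my-research-bot/0.1 (contact@example.org)"}
          | 
          |     for title in ["Turing_Award", "Alan_Turing"]:
          |         resp = requests.get(
          |             "https://en.wikipedia.org/wiki/" + title,
          |             headers=HEADERS, timeout=30)
          |         resp.raise_for_status()
          |         print(title, len(resp.text), "bytes of HTML")
          |         time.sleep(1)  # roughly one request per second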
        
       | marto1 wrote:
        | For me, I ended up using both, opting for wikidata wherever it
        | made sense, but a lot of things felt half built/broken.
        
       | drej wrote:
       | It's fairly difficult from the other side as well - contributing.
       | I've been trying to complete wikidata from a few open source
        | datasets I am intensely familiar with... and it's been rather
       | painful. WD is the sole place I have ever interacted with that
       | uses RDF, so I always forget the little syntax I learned last
       | time around. I have some pre-existing queries versioned, because
       | I'll never be able to write them again. I even went to a local
       | Wikimedia training to get acquainted with some necessary tooling,
       | but I'm still super unproductive compared to e.g. SQL.
       | 
       | It's sad, really, I'd love to contribute more, but the whole data
       | model is so clunky to work with.
        
         | drej wrote:
         | That being said, I now remember I stopped contributing for a
         | slightly different reason. While I tried to fill WD with
          | complete information about a given subject, this was never
          | leveraged by a Wikimedia project - there is a certain
          | resistance to generating Wikipedia articles/infoboxes from
          | Wikidata, so you're fighting on two fronts: you always have to
          | edit things in two places, and it's just a waste of everyone's
          | time.
         | 
         | Unless the attitude becomes "all facts in infoboxes and most
         | tables come from WD", the two "datasets" will continue
          | diverging. That is obviously easier said than done,
         | because relying on WD makes Wikipedia contribution a lot more
         | difficult... and that pretty much defeats its purpose.
        
           | dane-pgp wrote:
           | > the two "datasets" will continue diverging.
           | 
           | You may be pleased to learn that there is a project underway
           | that aims to largely solve that problem:
           | 
           | https://www.mediawiki.org/wiki/Wikidata_Bridge
           | 
           | The last piece of news I can immediately find is that it was
           | deployed to the Catalan Wikipedia in August 2020, but I'm not
           | sure what progress there has been since.
        
       | elcapitan wrote:
        | How do other people use Wikidata dumps if they are not using the
        | "official" way of querying it (SPARQL and the like)? I have done
        | some pretty raw extraction from it (e.g. download the already
        | pretty large zipped JSON dump, then unzip it on the fly, parse
        | the JSON, and extract triples and entities). Not sure that is
        | really all that efficient, but the dumps are hard to work with,
        | and I really just needed the entities in one language and the
        | triples/graph of them.
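        | 
        | What I do looks roughly like this (assuming the latest-all.json.gz
        | layout: one entity per line inside one big JSON array):
        | 
        |     import gzip
        |     import json
        | 
        |     # Stream the dump without unpacking it; each line minus its
        |     # trailing comma is a complete entity document.
        |     with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as dump:
        |         for line in dump:
        |             line = line.strip().rstrip(",")
        |             if line in ("[", "]", ""):
        |                 continue
        |             entity = json.loads(line)
        |             label = entity.get("labels", {}).get("en", {}).get("value")
        |             if not label:
        |                 continue  # keep only entities with an English label
        |             # each claim is roughly a (property, value) edge
        |             print(entity["id"], label, len(entity.get("claims", {})))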
        
       | bob229 wrote:
       | Wikipedia is garbage.
        
       | JamesCoyne wrote:
       | https://web.archive.org/web/20210822151254/http://simia.net/...
        
       ___________________________________________________________________
       (page generated 2021-08-23 23:01 UTC)