[HN Gopher] Wikidata or Scraping Wikipedia
___________________________________________________________________
Wikidata or Scraping Wikipedia
Author : Lockal
Score : 119 points
Date : 2021-08-23 15:53 UTC (7 hours ago)
(HTM) web link (simia.net)
(TXT) w3m dump (simia.net)
| dheera wrote:
| I'd highly recommend parsing the mobile edition of Wikipedia;
| it is much easier to parse.
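|
| For example, a minimal sketch (the trick is just swapping
| en.wikipedia.org for en.m.wikipedia.org; the bot name and
| contact address are placeholders):
|
|     import requests
|
|     # Same article, much simpler markup on the mobile host.
|     html = requests.get(
|         "https://en.m.wikipedia.org/wiki/Elizabeth_II",
|         headers={"User-Agent":
|                  "example-bot/0.1 (contact@example.com)"},
|     ).text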
| svat wrote:
| This is a great post, which also happens to serve as a good
| illustration of the "curse of knowledge" and the typical blind-
| spots of enthusiasts. Consider the timeline of events:
|
| * The blog post on scraping Wikipedia
| (https://billpg.com/data-mining-wikipedia/ ; HN discussion 4
| days ago: https://news.ycombinator.com/item?id=28234122 ),
| which mentions Wikidata as an alternative, etc.
|
| * The author of this post, a Wikidata person, finds this an
| "extremely surprising discussion", and posts a Twitter thread (
| https://web.archive.org/web/20210820105621/https://twitter.c... )
| ending with
|
| > _I don't want to argue or disagree, I am just completely
| surprised by that statement. Are the docs so bad? Is the API
| design of Wikidata so weird or undiscoverable? There are plenty
| of libraries for getting Wikidata data, are they all so hard to
| use? I am really curious._
|
| This curiosity is a great attitude! (But...)
|
| * After seeing the HN discussion and responses on
| Twitter/Facebook, he writes this post linked here. In this post,
| he does mention what he learned from potential users:
|
| > _And there were some very interesting stories about the pain of
| using Wikidata, and I very much expect us to learn from them and
| hopefully make things easier. The number of API queries one has
| to make in order to get data [...], the learning curve about
| SPARQL and RDF (although, you can ignore both, unless you want to
| use them explicitly - you can just use JSON and the Wikidata
| API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead
| of "mother" and "Queen Elizabeth II") were just a few. The
| documentation seems hard to find, there seem to be a lack of
| libraries and APIs that are easy to use. And yet, comments like
| "if you've actually tried getting data from wikidata/wikipedia
| you very quickly learn the HTML is much easier to parse than the
| results wikidata gives you" surprised me a lot. [...] I am not
| here to fight. I am here to listen and to learn, in order to help
| figuring out what needs to be made better._
|
| Again, very commendable! Almost an opening to really
| understanding the perspective of casual potential users. But
| then: the entire rest of the post does not really address "the
| other side", and instead completely focuses on the kinds of
| things Wikidata enthusiasts care about: comparing Wikipedia and
| Wikidata quality in this example, etc.
|
| I mean, sure, this query he presents is short:
|     select * {
|       wd:Q9682 (wdt:P25|wdt:P22)* ?p .
|       ?p wdt:P25|wdt:P22 ?q
|     }
|
| but when he says:
|
| > _I would claim that I invested far less work than Bill in
| creating my graph data. No data cleansing, no scraping, no
| crawling, no entity reconciliation, no manual checking._
|
| he's ignoring the work he invested in learning that query
| language (and where to query it), for instance. And this post
| would have been a perfect opportunity to teach readers about how
| to go from the question "all ancestors of Queen Elizabeth" to
| that query (and in trying to teach it, he may have better
| discovered exactly what is hard about it), but he just squanders
| the opportunity (just as when he says "plenty of libraries"
| without inviting exploration by linking to the easiest one): this
| is a typical thing enthusiasts do, which is unfortunate IMO.
|
| When scraping HTML from Wikipedia, one is using general-purpose
| well-known tools. You'll get slightly better at whatever general-
| purpose programming language and libraries you were using, learn
| something that may be useful the next time you need to scrape
| something else. And most importantly, you know that you'll
| finish, you can see a path to success. When exploring something
| "alternative" like Wikidata, you aren't sure if it will work, so
| the alternative path needs to work harder to convince potential
| users of success.
|
| ---
|
| Personal story: I actually _know_ about the existence of
| Wikidata. Yet the one time I tried to use it, I couldn't figure
| out how. This is what I was trying to do: plot a graph of the
| average age of Turing Award winners by year. (Reproduce the first
| figure from here: http://hagiograffiti.blogspot.com/2009/01/when-
| will-singular... just for fun) One would think this is a perfect
| use-case for Wikidata: presumably it has a way of going from
| Turing Award - list of winners - each winner's date of birth. But
| I was stymied at the very first step: despite knowing of the
| existence of Wikidata, and being able to go from the Wikipedia
| page that lists all recipients (current version:
| https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi... )
| to the Wikidata item for "Turing Award" (look for "Wikidata item"
| in the sidebar on the left) https://www.wikidata.org/wiki/Q185667
| I could not quickly find a way of getting a list of recipients
| from there. Tantalizingly, the data does seem to exist: e.g. if
| I go to one of the recipients like Leslie Valiant
| https://www.wikidata.org/wiki/Q93154 I see a "statement" award
| received - Turing Award, with "property" point in time - 2010.
| Even after coming so close, and being interested in using
| Wikidata, it was not easy enough for me to get to the next step
| (which I still imagine is possible, maybe with tens of minutes of
| effort), until I just decided "screw this, I'll just scrape the
| Wikipedia page" (I scraped the wikisource rather than html). And
| if one is going to have to scrape anyway, then might as well do
| the rest too (dates of birth) with scraping.
| bawolff wrote:
| > When scraping HTML from Wikipedia, one is using general-
| purpose well-known tools. You'll get slightly better at
| whatever general-purpose programming language and libraries you
| were using, learn something that may be useful the next time
| you need to scrape something else. And most importantly, you
| know that you'll finish, you can see a path to success. When
| exploring something "alternative" like Wikidata, you aren't
| sure if it will work, so the alternative path needs to work
| harder to convince potential users of success.
|
| I'm not sure it's that clear. Scrapping is pretty generic, but
| SPARQL is hardly a proprietary query language - other things
| use it. If what you're into is obtaining data, sparql might
| more generically apply than scrapping would. It really depends
| on what you are doing in the future. At the very least if you
| do scrapping a lot, you're probably going to reinvent the
| parsing wheel a lot. To each their own.
|
| > he's ignoring the work he invested in learning that query
| language (and where to query it), for instance
|
| And Bill is ignoring the work of learning how to program. None
| of us start from nothing, and it's not like any of this is
| trivial to learn if you've never touched a computer before.
|
| And to be clear, I'm not objecting - there is nothing wrong
| with using the skills you currently have to solve the problem
| you currently have. Whatever gets you the solution. If you're
| querying wikidata (or similar things) every day, learning
| sparql is probably a good investment. If you're interested in
| sparql, then by all means learn it. But if those don't apply,
| then scrapping makes sense if you already know how to do that.
| bryanrasmussen wrote:
| > he's ignoring the work he invested in learning that query
| language (and where to query it), for instance
|
| >And Bill is ignoring the work of learning how to program.
|
| I suppose if you didn't know how to program you wouldn't
| learn Sparql. So the investment in learning how to program
| has already been made.
| bawolff wrote:
| Why not? People sometimes learn SQL without learning to
| program, why not sparql?
| bryanrasmussen wrote:
| Well one reason why someone might learn SQL without
| learning how to program is that you can get jobs for it.
|
| Ah, but the response might go, lots of people learned SQL
| when there weren't a lot of jobs for people who knew SQL.
|
| Yes, my response would be, but that was a long time ago
| and the incentives for people to learn technologies have
| changed, and I do not think a significant number of
| people will learn SQL without learning to program
| henceforth; at least not numbers significant enough that
| anyone will say "Well, look at that trend!".
|
| Here there can be several responses, so I won't go through
| all the branches, but in the end I don't think there is
| going to be interest in learning Sparql among people who
| are not programmers or at least in programming-adjacent
| professions, and from what I see there hasn't been that
| much interest from people who are programmers.
| svat wrote:
| > [Scraping] is pretty generic, but SPARQL is hardly a
| proprietary query language - other things use it. If what
| you're into is obtaining data, sparql might more generically
| apply than [scraping] would. It really depends on what you
| are doing in the future.
|
| Yes, my point exactly! My point was that _even_ when trying to
| consider the perspective of people different from us, we can
| end up writing for (and from the perspective of) people who
| are "into" the same things as us. Casual users like the one in
| the original scraping post are not really "into" obtaining
| data, which can be a blind spot for enthusiasts who are. The
| challenge and opportunity in such cases is really
| communication with the outside of the field, rather than
| competition within the field.
| TechBro8615 wrote:
| "Scrapping" is like nails scrapping on a chalkboard for me.
| Minor49er wrote:
| Absolutely spot-on. It makes me think of my own experience.
|
| I've worked for a few niche search engines. Some sites have
| APIs available so that you don't have to scrape their data. But
| oftentimes, since we were already used to scraping sites, we
| wouldn't even notice that an API was available. In a number of
| cases, an API _was_ available, but it was more restrictive or
| complicated than just scraping the page. That's not to say we
| never used APIs, because we certainly did. Just that we often
| weren't aware they were an option, since they were not very
| common in our cases.
| dalf wrote:
| About the Turing Award, after some trial and error, I _think_
| this is the query: https://w.wiki/3wmY
|
| Disclaimer: I follow
| https://www.youtube.com/channel/UCp2i8QpLDnWge8wZGKizVVw /
| https://www.twitch.tv/belett (mostly in French, sometimes in
| English).
|
| Without these courses, I wouldn't have been able to write this
| query.
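|
| Roughly, it has this shape (a from-memory sketch, not
| necessarily the exact linked query; P166 = award received,
| P585 = point in time, P569 = date of birth, Q185667 = Turing
| Award):
|
|     import requests
|
|     QUERY = """
|     SELECT ?winnerLabel ?dob ?when WHERE {
|       ?winner p:P166 ?award .
|       ?award ps:P166 wd:Q185667 ;
|              pq:P585 ?when .
|       ?winner wdt:P569 ?dob .
|       SERVICE wikibase:label {
|         bd:serviceParam wikibase:language "en" .
|       }
|     }
|     """
|     # The public endpoint; format=json returns plain JSON rows.
|     r = requests.get(
|         "https://query.wikidata.org/sparql",
|         params={"query": QUERY, "format": "json"},
|         headers={"User-Agent":
|                  "example-bot/0.1 (contact@example.com)"},
|     )
|     for row in r.json()["results"]["bindings"]:
|         print(row["winnerLabel"]["value"],
|               row["dob"]["value"], row["when"]["value"])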
| bawolff wrote:
| I think this is an interesting case because scraping this is
| easy (just one page), whereas the wikidata query requires
| dealing with qualifiers, which is a bit more complex.
| dalf wrote:
| (It requires the birth dates, so it is more than one page.)
|
| The HTML structure may change over time: if the request is
| executed a few times over a long period, the scraper may (or
| will) require more maintenance than the SPARQL query.
|
| For example, the same Wikipedia page 3 years ago was slightly
| different:
| https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi...
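|
| For comparison, the scraping route looks something like this
| sketch. The selectors are guesses, and they are exactly the
| part that breaks when the HTML shifts:
|
|     import requests
|     from bs4 import BeautifulSoup
|
|     HEADERS = {"User-Agent":
|                "example-bot/0.1 (contact@example.com)"}
|     BASE = "https://en.wikipedia.org"
|
|     # Page 1: the list of laureates (guess: first wikitable).
|     soup = BeautifulSoup(
|         requests.get(BASE + "/wiki/Turing_Award",
|                      headers=HEADERS).text, "html.parser")
|     table = soup.find("table", class_="wikitable")
|     links = {a["title"]: a["href"]
|              for a in table.select("a[href^='/wiki/']")
|              if a.has_attr("title")}  # over-collects a bit
|
|     # Pages 2..n: one request per laureate, for the birth date.
|     for name, href in links.items():
|         page = BeautifulSoup(
|             requests.get(BASE + href, headers=HEADERS).text,
|             "html.parser")
|         bday = page.find("span", class_="bday")  # infobox date
|         print(name, bday.text if bday else "?")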
| 1vuio0pswjnm7 wrote:
| "The HTML structure may change over time..."
|
| A very common argument in HN comments that discuss the
| merits of so-called web APIs.
|
| Fair balance:
|
| Web APIs can change (e.g., v1 -> v2), they can be
| discontinued, their terms of use can change, quotas can
| be enforced, etc.
|
| A public web page does not suffer from those drawbacks.
| Changes that require me to rewrite scripts are generally
| infrequent. What happens more often is websites that
| provide good data/information sources simply go offline.
|
| There is nothing wrong with web APIs per se, I welcome
| them (I use the same custom HTTP generator and TCP/TLS
| clients for both), but the way "APIs" are presented, as
| some sort of "special privilege" requiring "sign up", an
| email address and often more personal information, maybe
| even payment, is, for the _user_ as opposed to the
| developer, inferior to a public webpage, IMHO. As a user,
| not a developer, HTTP pipelining works better for me than
| many web APIs. I can get large quantities of
| data/information in one or a small number of TCP
| connections (I never have to use proxies, nor do I ever
| get banned); it requires no disclosure of personal details
| and is not subject to arbitrary limits.
|
| What's interesting about this Wikidata/Wikipedia case is
| that the term chosen was "user" not "developer". It
| appears we cannot assume that the only persons who will
| use this "API" are ones who intend to insert the
| retrieved data/information into some other webpage or
| "app" that probably contains advertising and/or tracking.
| It is for everyone, not just "developers".
| beprogrammed wrote:
| And today I learned Donald Knuth was the youngest Turing
| award winner at the age of 36. I'm going to have to go learn
| SPARQL.
| ak217 wrote:
| I also had a horrible experience using the recommended SPARQL
| interface to query Wikidata. The queries were inscrutable, the
| documentation was poor, and even after writing the correct
| queries, they timed out after scanning a tiny fraction of the
| data I needed, making the query engine useless to me.
|
| However, I had great success querying Wikidata via the "plain
| old" MediaWiki Query API:
| https://www.mediawiki.org/wiki/API:Query. That API was a joy to
| work with.
|
| Wikidata (as a backing store for Wikipedia and a knowledge
| graph engine) is a very powerful concept. It's a key platform
| technology for Wikipedia and hopefully they'll prioritize its
| usability going forward.
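|
| For a flavor of it, a sketch (wbgetentities is the Wikibase
| module I have in mind, alongside the plain query modules):
|
|     import requests
|
|     # One item's labels and statements as plain JSON, no
|     # SPARQL involved.
|     r = requests.get(
|         "https://www.wikidata.org/w/api.php",
|         params={
|             "action": "wbgetentities",
|             "ids": "Q185667",  # Turing Award
|             "props": "labels|claims",
|             "languages": "en",
|             "format": "json",
|         },
|     )
|     entity = r.json()["entities"]["Q185667"]
|     print(entity["labels"]["en"]["value"])
|     print(len(entity["claims"]), "properties with statements")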
| ComputerGuru wrote:
| Site is down. Archive.org link:
| https://web.archive.org/web/20210821004307/http://simia.net/...
| nine_k wrote:
| I think this project is not getting enough attention:
| https://github.com/zverok/wikipedia_ql
|
| It allows you to query Wikipedia (not Wikidata, but the actual
| human-readable text) more or less directly, mixing the way you
| describe a scraper with some nicer higher-level constructs.
|
| Can't vouch for its performance, but the API is interesting and
| nice.
| dhosek wrote:
| On a related point, while doing some Unicode research, I
| discovered that the Unicode project itself uses wikidata as an
| (untrusted) source for some data, translations of names, if I
| recall correctly, cf.
| https://www.unicode.org/review/pri408/pri408-tr51-QID.html
| although that's not the reference I encountered earlier today.
| Their system is set up so that if the Unicode organization
| corrects something previously read, it takes precedence over what
| was pulled from wikidata, but otherwise the wikidata value will
| be used.
| dhosek wrote:
| Ah, this is what I read earlier today, yay Google+color-
| changing links http://cldr.unicode.org/implementers-faq
| simonw wrote:
| Herein lies the problem for me:
|
|     select * {
|       wd:Q9682 (wdt:P25|wdt:P22)* ?p .
|       ?p wdt:P25|wdt:P22 ?q
|     }
|
| I am extremely motivated to learn how to use this: I have a deep
| desire to extract data from Wikipedia, and I'm fascinated by
| graph databases.
|
| And yet, despite trying on several previous occasions, SPARQL has
| completely failed to stick in my brain.
|
| This is partly my own failing: I'm confident that if I really
| dedicated myself to it I could get over this hump.
|
| But it's also a sign that the learning curve here really is
| tremendously steep, which I think indicates a problem with the
| design of the technology.
| dmitriid wrote:
| Or use something like https://github.com/zverok/wikipedia_ql,
| which uses the MediaWiki API.
| bawolff wrote:
| I find it helps to translate the syntax into English:
|
| > select *
|
| Show all variables starting with ?
|
| > wd:Q9682
|
| Find item Q9682 (Queen Elizabeth II)
|
| > (wdt:P25|wdt:P22)*
|
| Follow edges that are either P22 (father) or P25 (mother) zero
| or more times
|
| Every time you follow one of those edges, add the new item to
| ?p. Keep following these edges until you can't anymore.
|
| > ?p wdt:P25|wdt:P22 ?q
|
| For every ?p, follow a mother/father edge precisely once and
| call the item it points to ?q (if there is no such edge, we
| drop that ?p)
|
| The end result is a list of rows containing pairs of (an
| ancestor of Elizabeth, one of that ancestor's direct
| parents).
|
| ----
|
| I feel like one of the reasons sparql is confusing is that
| people bring over their intuitions from SQL, which mislead
| here: the underlying data model is different, but the syntax
| looks vaguely SQL-like, which leads to misunderstandings.
| freeone3000 wrote:
| Where do you get the translation from wdt:P25 to
| "mother"? That's the most incomprehensible part. It feels
| like I need a reverse dictionary lookup to write a single
| query.
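|
| (The closest thing I've found to that reverse dictionary is
| the entity-search API; a sketch, assuming the standard
| wbsearchentities module:
|
|     import requests
|
|     # Look up property IDs by label, e.g. "mother" -> P25.
|     r = requests.get(
|         "https://www.wikidata.org/w/api.php",
|         params={
|             "action": "wbsearchentities",
|             "search": "mother",
|             "type": "property",
|             "language": "en",
|             "format": "json",
|         },
|     )
|     for hit in r.json()["search"]:
|         print(hit["id"], hit["label"],
|               hit.get("description", ""))
|
| but discovering that this exists is its own hurdle.)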
| mmarx wrote:
| A few days ago, the Wikidata Query Builder[0] was deployed. It
| provides a visual interface for generating simple SPARQL
| queries, and it can show you the generated query. Maybe this
| can help you understand how SPARQL patterns work?
|
| [0] https://query.wikidata.org/querybuilder/
| simonw wrote:
| That does look like a big step forward.
|
| It could really benefit from some linked examples on that
| page though - I stared at the interface for quite a while,
| unable to figure out how to use it for anything - then I dug
| around for an example link and it started to make sense to
| me.
|
| https://query.wikidata.org/querybuilder/?query=%7B%22conditi...
| exogen wrote:
| I missed the original HN and twitter threads referenced in the
| post, so I might just be repeating something that was already
| said there...
|
| But, in nearly all cases I would trust a bespoke Wikipedia
| scraper over using the output of Wikidata or DBpedia. Not to
| disparage either project, because they're great ideas and good
| efforts. I have a firm grasp of RDF and SPARQL queries (used to
| work with them professionally), which also makes them tempting to
| use.
|
| One issue is that Wikidata tends to only report facts whose
| subjects or objects themselves have articles (and thus Wikidata
| entities).
|
| For example, compare the "Track listing" section of Carly Rae
| Jepsen's Curiosity EP on Wikipedia vs. the "track listing"
| property on Wikidata.
|
| Wikipedia has:
|     1. Call Me Maybe (link)
|     2. Curiosity (link)
|     3. Picture
|     4. Talk to Me
|     5. Just a Step Away
|     6. Both Sides Now (link)
|
| while Wikidata has:
|     1. Call Me Maybe
|     2. Curiosity
|
| So not only has it ignored any tracks that aren't deserving of
| their own articles, but it also missed one that actually does
| have an article (track 6, a cover of "Both Sides, Now").
|
| > Others asked about the data quality of Wikidata, and complained
| about the huge amount of bad data, duplicates, and the bad
| ontology in Wikidata (as if Wikipedia wouldn't have these
| problems. I mean how do you figure out what a Wikipedia article
| is about? How do you get a list of all bridges or events from
| Wikipedia?)
|
| Often the problem isn't that Wikipedia is wrong, it's that
| Wikidata's own parser (however it works) doesn't account for the
| many ways people format things on Wikipedia. With a bespoke
| parser, you can improve it over time as you encounter edge cases.
| With Wikidata, you can't really fix anything... the data is
| already extracted (right or wrong) and all the original
| context is lost.
| mistrial9 wrote:
| Acute observation! There is something here to be teased out:
| the final product has been a human-readable page all these
| years, and that human-readable page got better in ad hoc ways,
| and almost all of those improvements stuck.
|
| Compare that to the RDF efforts, which ride a rigorous math-y
| perspective and have had a far, far smaller development crowd
| right from the start.
| nathell wrote:
| Wikipedia asks people not to crawl it. There are database dumps
| that you can import into your local MySQL instead and work from
| there.
|
| https://en.wikipedia.org/wiki/Wikipedia:Database_download#Pl...
| bawolff wrote:
| Wikipedia has no objection to crawling a couple thousand pages
| if you do so at a reasonable speed and set a user-agent with a
| contact email.
|
| If you want to crawl millions of pages, please use a dump.
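|
| Something like this is fine, in other words (a sketch of the
| polite version; the bot name and address are placeholders):
|
|     import time
|     import requests
|
|     session = requests.Session()
|     # A descriptive user-agent with a contact address.
|     session.headers["User-Agent"] = \
|         "my-research-bot/0.1 (mailto:me@example.com)"
|
|     for title in ["Turing_Award", "Leslie_Valiant"]:
|         html = session.get(
|             "https://en.wikipedia.org/wiki/" + title).text
|         time.sleep(1)  # keep the request rate reasonable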
| marto1 wrote:
| For me, I ended up using both: opting for Wikidata wherever it
| made sense, but a lot of things felt half-built or broken.
| drej wrote:
| It's fairly difficult from the other side as well - contributing.
| I've been trying to complete Wikidata from a few open source
| datasets I am intensely familiar with... and it's been rather
| painful. WD is the sole place I have ever interacted with that
| uses RDF, so I always forget the little syntax I learned last
| time around. I have some pre-existing queries versioned, because
| I'll never be able to write them again. I even went to a local
| Wikimedia training to get acquainted with some necessary tooling,
| but I'm still super unproductive compared to e.g. SQL.
|
| It's sad, really, I'd love to contribute more, but the whole data
| model is so clunky to work with.
| drej wrote:
| That being said, I now remember I stopped contributing for a
| slightly different reason. While I tried to fill WD with
| complete information about a given subject, this was never
| leveraged by a Wikimedia project - there is a certain
| resistance to generating Wikipedia articles/infoboxes from
| Wikidata, so you're fighting on two fronts: you always have to
| edit things in two places, and it's just a waste of everyone's
| time.
|
| Unless the attitude becomes "all facts in infoboxes and most
| tables come from WD", the two "datasets" will continue
| diverging. That is obviously easier said than done,
| because relying on WD makes Wikipedia contribution a lot more
| difficult... and that pretty much defeats its purpose.
| dane-pgp wrote:
| > the two "datasets" will continue diverging.
|
| You may be pleased to learn that there is a project underway
| that aims to largely solve that problem:
|
| https://www.mediawiki.org/wiki/Wikidata_Bridge
|
| The last piece of news I can immediately find is that it was
| deployed to the Catalan Wikipedia in August 2020, but I'm not
| sure what progress there has been since.
| elcapitan wrote:
| How do other people use Wikidata dumps if they are not using
| the "official" way (SPARQL and so on) of querying it? I have
| done some pretty raw extraction from them (e.g. download the
| already pretty large zipped JSON dump, then unzip it on the
| fly, parse the JSON, and extract triples and entities). Not
| sure that's particularly efficient, but the dumps are hard to
| work with, and I really just needed the entities in one
| language and the triples/graph of them.
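|
| Concretely, the loop is roughly this (a sketch; it leans on
| the dump being one entity per line inside one big JSON array,
| and the filename is whatever dump you fetched):
|
|     import gzip
|     import json
|
|     with gzip.open("latest-all.json.gz", "rt",
|                    encoding="utf-8") as f:
|         for line in f:
|             line = line.strip().rstrip(",")
|             if not line or line in ("[", "]"):
|                 continue
|             entity = json.loads(line)
|             label = (entity.get("labels", {})
|                      .get("en", {}).get("value"))
|             if label:
|                 print(entity["id"], label)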
| bob229 wrote:
| Wikipedia is garbage.
| JamesCoyne wrote:
| https://web.archive.org/web/20210822151254/http://simia.net/...
___________________________________________________________________
(page generated 2021-08-23 23:01 UTC)