...

https://billpg.com/data-mining-wikipedia/

 billpg industries(tm)

Bill P. Godfrey

Posted on July 6, 2021 (updated July 7, 2021) by billpg

Data-Mining Wikipedia for Fun and Profit

It all started after watching one too many videos narrating the
English monarchy, all starting from King William I in 1066 as if he's
the first king of England. This annoys me as it completely disregards
the handful of Anglo-Saxon kings of England who reigned before the
Normans.

They're Kings of England. If you're going to make a list of the Kings
of England, then you should include the Kings of England.

It was this that made me want to make a particular edit to both the
King Alfred and Queen Elizabeth pages on Wikipedia, acknowledging
each as related to the other. But what is their relationship and
through who?

I went to the page for Queen Elizabeth II and started following the
Mother/Father links until I found my way to King Alfred, mostly going
through the other kings of England. I counted 36 generations, but was
there a shorter or even longer route?

Sounds like a job for some software!

[AnotherBatchOfKlutz_byMakeshiftlove_cropped]Gateau Brule.

Scanning Wikipedia

We have the technology.

  * Visual Studio 2019 and C#.
  * RestSharp, a library for downloading HTML.
  * HtmlAgilityPack, a library for parsing and extracting data from
    HTML.

With these libraries downloaded from NuGet, I was able to write some
very quick and dirty code that would download the HTML for the
Wikipedia page of Queen Elizabeth II, storing the HTML in a cache
folder to save downloading it again.
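
A minimal sketch of that cache-first download, not the code actually
used for this article. The class name, the file-naming scheme and the
"download" delegate are all hypothetical; in the real program the
delegate would wrap whatever RestSharp call fetches the page.

```csharp
using System;
using System.IO;

public static class PageCache
{
    // Returns the HTML for a page, downloading it only on a cache miss.
    // "download" is a stand-in for the real HTTP call (an assumption).
    public static string GetHtml(
        string pageName, string cacheFolder, Func<string, string> download)
    {
        Directory.CreateDirectory(cacheFolder);

        // Escape characters such as '/' that cannot appear in a file name.
        string safeName = Uri.EscapeDataString(pageName);
        string cachePath = Path.Combine(cacheFolder, safeName + ".html");

        if (File.Exists(cachePath))
            return File.ReadAllText(cachePath);

        string html = download(pageName);
        File.WriteAllText(cachePath, html);
        return html;
    }
}
```

Passing the downloader in as a delegate keeps the caching logic
testable without touching the network.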

Once the HTML is downloaded (or read from the cache), HtmlAgilityPack
can be called upon for the task of pulling items of data from the
HTML. For example, the person's full name, which is always the page's
only <H1>...</H1> element, can be extracted using one line of code:

[KingAlfred]

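// "html" is an HtmlAgilityPack HtmlDocument holding the parsed page.
// Requires "using System.Linq;" for Where and Single.
string personName =
    html
    .DocumentNode
    .Descendants()
    .Where(h => h.Name == "h1")
    .Single()
    .InnerText;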

I used HtmlAgilityPack and LINQ in a similar way to pull out the
Mother and Father for each person. The code would look for the
info-box <TABLE>, then look inside for a <TH> with the text "Mother"
or "Father". It would then take a few steps backwards to look for the
<TR> that the text is a part of and finally pull out all the links it
can find inside.
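
The steps in that paragraph might be sketched roughly like this. It is
an illustration under assumptions rather than the original code: the
class and method names are made up, and it assumes the info-box is a
<TABLE> whose class attribute contains "infobox" and that links to
other articles start with "/wiki/".

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class InfoBox
{
    // Returns the article links found in the info-box row whose
    // header cell reads exactly "label" (e.g. "Mother" or "Father").
    public static List<string> ParentLinks(HtmlDocument html, string label)
    {
        return html.DocumentNode
            .Descendants("table")
            // Assumption: the info-box is marked with an "infobox" class.
            .Where(t => t.GetAttributeValue("class", "").Contains("infobox"))
            .SelectMany(t => t.Descendants("th"))
            .Where(th => th.InnerText.Trim() == label)
            // Step back up to the row that contains the header cell.
            .Select(th => th.Ancestors("tr").FirstOrDefault())
            .Where(tr => tr != null)
            .SelectMany(tr => tr.Descendants("a"))
            .Select(a => a.GetAttributeValue("href", ""))
            // Assumption: article links are relative and start "/wiki/".
            .Where(href => href.StartsWith("/wiki/"))
            .ToList();
    }
}
```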

With the links to Queen Elizabeth's mother and father, the code
would add those links to a queue and the top-level loop would pull the
next link and continue until the links run out.
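
As a rough sketch of that queue-driven loop (the names are
hypothetical, and "getParents" stands in for the whole
download-and-parse step):

```csharp
using System;
using System.Collections.Generic;

public static class Crawler
{
    // Visits every page reachable through parent links, each page once,
    // and returns the parent links recorded for each page.
    public static Dictionary<string, List<string>> Crawl(
        string startPage, Func<string, List<string>> getParents)
    {
        var parentsByPage = new Dictionary<string, List<string>>();
        var queue = new Queue<string>();
        queue.Enqueue(startPage);

        while (queue.Count > 0)
        {
            string page = queue.Dequeue();
            if (parentsByPage.ContainsKey(page))
                continue; // Already processed via another child.

            List<string> parents = getParents(page);
            parentsByPage[page] = parents;
            foreach (string parent in parents)
                queue.Enqueue(parent);
        }
        return parentsByPage;
    }
}
```

The "already processed" check matters here because royal family trees
loop back on themselves: the same ancestor is reached through many
different children.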

"She's just a girl who says that I am the one..."

I made the decision here to only include people with an info-box.
Extracting someone's parents from free English text was a step too
far. If you're not notable enough to have an info-box with your
parents listed, you're not notable enough for this project. (Although
I did find a couple of people who didn't have a suitable info-box
surprisingly early in the process. Rather than hack in an exception,
I edited Wikipedia to include those people's parents in their
info-box, copying the link from elsewhere in the text.)

While that got me out of a small hole, more annoying was when the
info-box listed "Parents" or "Parent(s)" instead of Mother and
Father. I wanted to track matrilineal and patrilineal lines, so it
was a little annoying to just have an individual's parents with no
clear indication of which one is which. I coded it so that if there's
only one link, assume it is the father. If there are two links,
assume the father is the first one.
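
That rule is small enough to show in full. This is a sketch with
hypothetical names; treating the second of two links as the mother is
my reading of the rule, and any other number of links is left
unguessed.

```csharp
using System.Collections.Generic;

public static class ParentGuess
{
    // Applies the rule above to the links found in a "Parents" row.
    // A role that cannot be guessed is returned as null.
    public static (string Father, string Mother) FromParentsRow(
        IReadOnlyList<string> links)
    {
        if (links.Count == 1)
            return (links[0], null);     // lone link: assume the father
        if (links.Count == 2)
            return (links[0], links[1]); // father first, then mother
        return (null, null);             // none or 3+: rule does not apply
    }
}
```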

[SlashPatriarchy_by_gaelx]Because patriarchy.

"Also known as..."

Another issue was that some of the pages changed names. RestSharp
would dutifully follow HTTP redirects, but I'd end up storing a page
with one name but having a different name internally. This happened
right away as the page for Queen Elizabeth links to her mother as
"Elizabeth_Bowes-Lyon", but once you follow the link, you end up at
"Queen_Elizabeth_The_Queen_Mother".

The HTML included a <LINK> tag named the "canonical reference", so I
could pull that out and use it as the primary key in my data
structure. To keep the link between child and parent, it collects the
aliases when they are detected and a quick reconciliation loop
corrects the links after the initial loop completes.
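
The reconciliation pass could look something like the sketch below.
The two dictionaries are hypothetical stand-ins for my data structure:
one maps each page to its parent links as originally written, the
other maps each alias to the canonical name it redirected to.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Aliases
{
    // Rewrites every parent link to its canonical page name, where one
    // is known. Links with no recorded alias are left unchanged.
    public static Dictionary<string, List<string>> Reconcile(
        Dictionary<string, List<string>> parentsByPage,
        Dictionary<string, string> canonicalByAlias)
    {
        return parentsByPage.ToDictionary(
            entry => entry.Key,
            entry => entry.Value
                .Select(link =>
                    canonicalByAlias.TryGetValue(link, out string canonical)
                        ? canonical
                        : link)
                .ToList());
    }
}
```

With an alias entry from "Elizabeth_Bowes-Lyon" to
"Queen_Elizabeth_The_Queen_Mother", the Queen's mother link ends up
pointing at the key the page was actually stored under.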

[BananaMuffins_by_RichardLewis]King Alfred, also known as The Muffin
Man.

From Alfred to Elizabeth.

Once I had a complete set of Wikipedia pages cached, the next step
was to build a tree with all of the parental connections that lead
from King Alfred to Queen Elizabeth. I knew that some non-people had
crept in because someone's parents would be listed as "(name) of
(town)", but that didn't bother me as those towns wouldn't have a
mother or father listed and those loose ends would be discarded.

I wrote some code to walk the tree of connections. It started from
Queen Elizabeth and recursively walked to each of the mother and
father nodes. If a walk ended on King Alfred, the complete chain would
be added to the list of chains.
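
A sketch of that recursive walk, with hypothetical names. It assumes
the connections are held as a dictionary from each page to its parent
links, and it adds a guard against loops in case of bad data.

```csharp
using System.Collections.Generic;

public static class Chains
{
    // Returns every parent-by-parent chain from "start" up to "target".
    // Each chain lists the descendant first and the ancestor last.
    public static List<List<string>> FindAll(
        Dictionary<string, List<string>> parentsByPage,
        string start, string target)
    {
        var found = new List<List<string>>();
        Walk(parentsByPage, start, target, new List<string>(), found);
        return found;
    }

    private static void Walk(
        Dictionary<string, List<string>> parentsByPage,
        string page, string target,
        List<string> path, List<List<string>> found)
    {
        if (path.Contains(page))
            return; // Bad data could make someone their own ancestor.

        path.Add(page);
        if (page == target)
            found.Add(new List<string>(path)); // Copy; path is reused.
        else if (parentsByPage.TryGetValue(page, out List<string> parents))
            foreach (string parent in parents)
                Walk(parentsByPage, parent, target, path, found);
        path.RemoveAt(path.Count - 1); // Backtrack.
    }
}
```

Walks that reach a page with no parents simply fall out of the
recursion without recording anything, which is how the loose ends get
discarded.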

With this reduced set in place, I churned through the nodes and
generated a GraphViz file. For those who don't know about it, this is
an app for producing graphs of connected bubbles. You tell it what
bubbles you want and how they are connected and it automatically lays
them out.
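
For illustration, generating that file can be as simple as the sketch
below. The names are hypothetical, and it assumes each chain is a list
of page names running from descendant to ancestor. Names containing a
double-quote would need escaping, which this sketch skips.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class DotWriter
{
    // Builds GraphViz DOT text with one arrow per parent-to-child step.
    // Steps shared by several chains are written only once.
    public static string Write(IEnumerable<List<string>> chains)
    {
        var edges = new HashSet<(string Parent, string Child)>();
        foreach (List<string> chain in chains)
            for (int i = 0; i + 1 < chain.Count; i++)
                edges.Add((chain[i + 1], chain[i])); // chains run child-first

        var dot = new StringBuilder("digraph FamilyTree {\n");
        foreach (var edge in edges.OrderBy(e => e.Parent).ThenBy(e => e.Child))
            dot.Append($"  \"{edge.Parent}\" -> \"{edge.Child}\";\n");
        dot.Append("}\n");
        return dot.ToString();
    }
}
```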

At this point, I was expecting a graph that would be mainly tall and
thin and it would appear right here in this article. While family
trees do grow exponentially, I wasn't including every single
relationship, only those that connect the two individuals. If I
were graphing the relationships between myself and a distant ancestor,
I'd expect a single line, each parent handing over to their child.
There would be a few bulges when third-or-so cousins marry. There, an
individual's two children would split off into separate lines,
eventually reuniting with one ever-so-slightly inbred individual.

Yeah, that's not what I got. This is the SVG file GraphViz generated
for me. If you follow this link and are faced with a blank screen,
scroll right until you find the King Alfred node. Then zoom out.

Aristocrats...

(The bubbles are all clickable, by the way.)

Count the Generations.

The graph was interesting but this wasn't the primary objective of
this exercise. I wanted to write "He is the n-times great-grandfather
of his current successor Queen Elizabeth." on King Alfred's Wikipedia
page.

But what's the n? I already had a collection of all the chains
between them, so I just had to loop through them to find the longest
and shortest chain. The longest chain has 45 links and the shortest
chain has 31 links.

King Alfred is a 42-times great-grandfather of Queen Elizabeth II.

(And also a 28-times great-grandfather. And everything in between.)
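
Those numbers fit together if each "link" is one person's page in the
chain, both ends included. That reading is an assumption, but it makes
the arithmetic work: 45 people means 44 parent-to-child steps, and the
first two steps (parent, then grandparent) carry no "great". As a
sketch, with hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Generations
{
    // Number of "greats" for a chain of people that includes both ends.
    // 2 people = parent, 3 = grandparent, 4 = great-grandparent, etc.
    public static int Greats(int peopleInChain) => peopleInChain - 3;

    // The fewest and most "greats" across all the chains found.
    public static (int Fewest, int Most) GreatsRange(
        IEnumerable<List<string>> chains)
    {
        var lengths = chains.Select(c => c.Count).ToList();
        return (Greats(lengths.Min()), Greats(lengths.Max()));
    }
}
```

So Greats(45) gives 42 and Greats(31) gives 28, matching the figures
above.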

Here's the simplified graph showing only those lines with exactly 45
links.

[AlfredToElizabethExactly45]All the parental chains from Alfred to
Elizabeth that have exactly 45 links.

"Let's talk about sex."

Earlier, I mentioned being annoyed that some info-boxes listed two
parents instead of a mother and a father, requiring me to make
assumptions that fathers are more likely to be included and put
first, because these are aristocrats and society is quite
patriarchal.

I still wanted to data-mine into matrilineal lines, so to check on
those assumptions, I pulled out all of the people linked only in a
"Parents" line of the info-box and checked they were all in order.
The fathers all had manly names and the mothers all had womanly
names. Seemed fine. But just to be sure, I queried my data structure
for any individual that was listed as both a mother and a father,
expecting that to happen from two different children's pages.
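
That sanity check is a natural fit for LINQ. A sketch, using a
hypothetical Person class rather than my actual data structure:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public string Name;
    public string Father; // null when unknown
    public string Mother; // null when unknown
}

public static class SanityCheck
{
    // Names that appear as somebody's father and as somebody's mother.
    public static List<string> ListedAsBothRoles(IEnumerable<Person> people)
    {
        var all = people.ToList();
        var fathers = all.Select(p => p.Father).Where(n => n != null);
        var mothers = all.Select(p => p.Mother).Where(n => n != null);
        return fathers.Intersect(mothers).OrderBy(n => n).ToList();
    }
}
```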

There were several. Not only that, the contradicting links came from
the same page. Someone apparently had the same individual as both his
father and mother. Expecting to see the same person linked twice or a
similar variety of quirk, I was surprised to see what should have
been a very simple info-box to process.

[Duke_Charles_Louis_Frederick_of_Mecklen]Screen-shot of info-box for
Duke Charles Louis Frederick of Mecklenburg

This person has an info-box with two individuals, each unambiguously
listed as Father and Mother. Why was my code somehow interpreting the
mother as the same individual as the father?

Investigating, I discovered that not only was Adolphus listed as
someone's mother, his actual mother was skipped over entirely. My
data-structure simply didn't have an entry for her.

To try and work out what was going on, I added a conditional
breakpoint and watched as my code dutifully added her name to the
queue of work, and again later on when it was taken off the queue.
The code downloaded her page and it disappeared into the parser. Yet
the response that came back was that she was already accounted for. I
beg to differ!

What I hadn't done was click on her link. She didn't have her own
page, only a redirect to her husband's page. Apparently, the only
notable thing she had done, according to history, was marry her
husband.

I later found a significant number of these links where a woman's
name is just a redirect to her husband. If the patriarchy isn't going
to allow me to rely on Mother/Father links as a sign of an
individual's parental role, investigating matrilineal lines will have
to wait.

[RiverSeine_by_IreneSteeves]"We call our show, The Aristocrats!"

Acknowledgements and Notes

If you'd like to do your own analysis, I've saved the data I
extracted into a JSON file you can download. I make no promises about
its accuracy or completeness or indeed anything about the file. I've
even hidden the word "Rutabaga" in there, just to make it clear how
potentially inaccurate it is.

I showed a friend an earlier version of the chart and he wondered if
I could do it better in Python. Maybe, but equally maybe not. This
isn't the C# of the early 2000s we're dealing with. HtmlAgilityPack
and LINQ combined can do very clever queries to extract data from web
pages, often in single lines of code. Maybe there's a Python
component to do the same, I don't know.

Rather than install GraphViz myself, I found online GraphViz did the
job admirably and I'm grateful to them for hosting it. I'm also
grateful to my friend Richard Heathfield for telling me about it
several decades ago, back when I was thinking about building my own
version control system. (Ah, to be young.)

RestSharp is a very nice component for downloading web content for
processing. It flattens all the quirks of using the dot-net standard
library directly and wraps it all up in a simple and consistent
interface.

Oh, and here's that Wikipedia edit, in all its glory. It was reverted
around three minutes later by another editor but never mind.

Picture Credits:
 "Another batch of klutz" by "makeshiftlove".
 "King Arthur statue in Winchester " by "foundin_a_attic".
 "</patriarchy>" by "Gaelx".
 "Banana Muffins" by Richard Lewis.
 "River Seine" by Irene Steeves.

Categories: Data, Technology
