Creating a Neo4j graph of Wikipedia links

I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.

To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.

So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on GitHub as the Graphipedia project.
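As a rough illustration of the link-extraction step, here is a minimal Python sketch. It is not Graphipedia's actual code: the `WIKI_LINK` pattern and the `extract_links` helper are hypothetical names, and the rule of skipping any title containing a colon (to drop `File:` or `Category:` style links) is an assumption made for this example.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures only the target.
# Deliberately simple: it ignores templates, nested links and <nowiki> sections.
WIKI_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext):
    """Return the distinct page titles linked from a piece of wikitext, in order."""
    seen = set()
    links = []
    for match in WIKI_LINK.finditer(wikitext):
        title = match.group(1).strip()
        # Assumption: treat any title with a colon as a namespaced link and skip it.
        # This is crude, since it also drops article titles that contain a colon.
        if not title or ":" in title:
            continue
        if title not in seen:
            seen.add(title)
            links.append(title)
    return links

sample = "[[Neo4j]] is a [[graph database|graph DB]], see [[NoSQL#History]] and [[File:Logo.png]]."
print(extract_links(sample))
```

Each returned title would become the target of one relationship from the page being parsed.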

I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8G on disc, of which 650M is a Lucene index (on page titles).

With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.

Another fun thing is calculating the shortest path between two wiki pages; this requires only a few lines of code thanks to Neo4j’s graph algorithms. For example:

From Neo4j to Kevin Bacon:
Neo4j > Structured storage > NoSQL > InfiniteGraph > Kevin Bacon

From Kevin Bacon to Neo4j:
Kevin Bacon > Internet Movie Database > SQL > NoSQL > Neo4j
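To show the idea behind such a query, here is a small Python sketch of an unweighted shortest-path search using plain breadth-first search over an in-memory dictionary. It does not use the Neo4j API and is not necessarily the algorithm Neo4j runs internally. The `shortest_path` function and the toy `links` graph are hypothetical; the graph contains only the links named in the two example paths above.

```python
from collections import deque

def shortest_path(links, start, goal):
    """Breadth-first search over directed links; returns a list of titles or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # page -> page we reached it from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in previous:
                continue  # already reached by a path that is at least as short
            previous[target] = page
            if target == goal:
                # Walk back from goal to start, then reverse.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(target)
    return None

# Toy directed graph: only the links that appear in the two example paths.
links = {
    "Neo4j": ["Structured storage"],
    "Structured storage": ["NoSQL"],
    "NoSQL": ["InfiniteGraph", "Neo4j"],
    "InfiniteGraph": ["Kevin Bacon"],
    "Kevin Bacon": ["Internet Movie Database"],
    "Internet Movie Database": ["SQL"],
    "SQL": ["NoSQL"],
}

print(" > ".join(shortest_path(links, "Neo4j", "Kevin Bacon")))
print(" > ".join(shortest_path(links, "Kevin Bacon", "Neo4j")))
```

Because links are directed, the path from A to B generally differs from the path from B to A, which is why the two examples do not mirror each other.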

(I’m certainly not the first person to apply the six degrees of separation hypothesis to Wikipedia: there is even a Six degrees of Wikipedia page already.)

Where does all this leave me in my evaluation of Neo4j? I’m not quite sure, as I haven’t done any benchmarks yet. But it certainly shows that graph databases allow you to do some interesting stuff that would be much harder to achieve with a relational database.

16 thoughts on “Creating a Neo4j graph of Wikipedia links”

  1. Pingback: Creating a Neo4j graph of Wikipedia links « Another Word For It

  2. Pingback: Importing Wikipedia into Neo4j with Graphipedia « Max De Marzi

    • @Mirko…Thanks for your time. Mine is a Core 2 Duo with 4GB of RAM and a 64-bit OS. I was unable to finish because my platform is too slow.

      Is it possible to send a link to the graph database you created from the English dataset?

      I thank you for your time in advance.
      Nadeem

  3. This is amazing. Thanks for sharing. Your code really helped me learn Neo4j better. Please post more cool stuff on your blog!

    • How long depends heavily on your disk – in fact it’s an interesting I/O benchmark.

      Took me anywhere from 10-15 minutes with an SSD to “way too long” (killed the process after a few hours) with a 5,400 rpm disk.

  4. Hi – Interesting work. Could you tell us how fast Neo4j calculates the two examples?

    From Neo4j to Kevin Bacon
    and
    From Kevin Bacon to Neo4j:

    Also – Are category pages (and links from articles to categories) included in the graph?

    Thx.
    Rune

  5. Hi there,

    I recently decided to begin a similar project, only later finding your code. I have been unable to get anywhere near your speeds due to the massive number of links.

    You said that your version “contains almost 10M pages, resulting in over 92M links to be extracted.”
    However, I am finding an average of about 85 links per page. I know that it has been some time, and maybe Wikipedia has changed, but are you sure that number is accurate? A 9:1 link-to-article ratio seems very low…

  6. Hello,

    First of all, great work. It really helped me. I wonder if it is possible to construct links in the form of LinksTo and LinksFrom. How can I do that?

    Thx, Deniz

  7. Pingback: Graphipedia, Context and knowledge | SoulFireMage's Code Ramblings
