I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.
To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.
So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on GitHub as the Graphipedia project.
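To give an idea of what the link-extraction step involves, here is a minimal sketch in Python. It is not the code from Graphipedia; the regular expression, the `extract_links` name and the decision to skip namespaced links (such as `File:` or `Category:`) are assumptions made for illustration only.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Only the target title (before any '#' or '|') is captured.
LINK_PATTERN = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the set of page titles linked from a piece of wiki markup."""
    titles = set()
    for match in LINK_PATTERN.finditer(wikitext):
        title = match.group(1).strip()
        # Assumption: skip namespaced links such as File: or Category:.
        # This also drops real articles with a colon in their title.
        if title and ":" not in title:
            titles.add(title)
    return titles


sample = (
    "[[Neo4j]] is a [[Graph database|graph db]] written in "
    "[[Java (programming language)#History|Java]]. [[File:Logo.png]]"
)
print(sorted(extract_links(sample)))
```

Each extracted title would then become the target of a relationship from the page being parsed.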
I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8G on disc, of which 650M is a Lucene index (on page titles).
With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.
Another fun thing is calculating the shortest path between two wiki pages - this requires only a few lines of code thanks to Neo4j’s graph algorithms. For example:
From Neo4j to Kevin Bacon:
Neo4j > Structured storage > NoSQL > InfiniteGraph > Kevin Bacon
From Kevin Bacon to Neo4j:
Kevin Bacon > Internet Movie Database > SQL > NoSQL > Neo4j
(I’m certainly not the first person to apply the six degrees of separation hypothesis to Wikipedia: there is even a Six degrees of Wikipedia page already.)
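For anyone curious about what a shortest-path search does under the hood, here is a rough sketch in Python using a plain breadth-first search over an in-memory dictionary. It does not use Neo4j or its graph algorithms; the `shortest_path` function and the toy `links` graph (built only from the links that appear in the two paths above) are assumptions for illustration.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Return one shortest path from start to goal as a list of titles, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # predecessor map, also serves as the visited set
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in previous:
                continue
            previous[target] = page
            if target == goal:
                # Walk back through the predecessors to rebuild the path.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(target)
    return None


# Toy directed graph: each key links to the pages in its list.
links = {
    "Neo4j": ["Structured storage"],
    "Structured storage": ["NoSQL"],
    "NoSQL": ["InfiniteGraph", "Neo4j"],
    "InfiniteGraph": ["Kevin Bacon"],
    "Kevin Bacon": ["Internet Movie Database"],
    "Internet Movie Database": ["SQL"],
    "SQL": ["NoSQL"],
}

print(" > ".join(shortest_path(links, "Neo4j", "Kevin Bacon")))
print(" > ".join(shortest_path(links, "Kevin Bacon", "Neo4j")))
```

Because links are directed, the path in one direction is generally different from the path in the other, which is why the two examples above do not mirror each other.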
Where does all this leave me in my evaluation of Neo4j? I’m not quite sure, as I haven’t done any benchmarks yet. But it certainly shows that graph databases allow you to do some interesting stuff that would be much harder to achieve with a relational database.
Awesome! Looking forward to hearing more about your “experiments!”
Patrick
Great Work…and thanks for sharing.
Interested in knowing the size of your hardware platform.
Nadeem.
@Nadeem my “hardware platform” is simply my laptop, although it does have an i7 processor and 8G RAM.
@Mirko…Thanks for your time. Mine is a Core 2 Duo with 4GB RAM and a 64-bit OS. I was unable to finish the import because my platform is too slow.
Would it be possible to send a link to the graph db you created from the English dataset?
I thank you for your time in advance.
Nadeem
The db is ~3.8G so unfortunately it’s difficult to make it available somewhere.
This is amazing. Thanks for sharing. Your code really helped me learn neo4j better. Please post more cool stuff on your blog!