The road so far….

September 5, 2010

Inserting and Indexing Bulk data in Neo4J

Filed under: java — Tags: — Rahul Sharma @ 10:23 pm

In my previous blog post I tried to import some 151K names into Neo4j using transactions, but the database threw all sorts of exceptions while importing that much data. I was advised to use the BatchInserter API instead for bulk imports, since the Transaction interface keeps data in memory until everything has been committed, whereas the BatchInserter writes directly to the store without any transaction mechanism.

Data import with BatchInserter

Taking the same use case as my previous post, Exploring Neo4j, I started importing the 151K names using the BatchInserter. The BatchInserter works quite differently from the GraphDatabaseService. These are the differences you notice when you start using the API:

  • Transactions are not used while importing data, so the data gets inserted a whole lot quicker.
  • Nodes and relationships can only be created using a Map of all the properties that are part of the node/relationship.
  • You do not get an explicit Node object; all updates to a node can only be accomplished using the node id and an updated Map of properties.
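To make the Map-based style concrete, here is a minimal sketch against the Neo4j 1.x batch-insert API used in this post; the store directory name and the property values are placeholders of my own, not from the import code below.

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public class MapBasedInsert {
  public static void main(String[] args) {
    // The store directory is a placeholder; point it at your own db dir.
    BatchInserter inserter = new BatchInserterImpl("target/batch-db");

    // Properties travel as a Map; there is no Node object to set them on.
    Map<String, Object> props = new HashMap<String, Object>();
    props.put("NAME", "SMITH");
    long nodeId = inserter.createNode(props);

    props.put("NAME", "JOHNSON");
    long otherId = inserter.createNode(props);

    // Relationships are created from node ids, again with a property Map (or null).
    inserter.createRelationship(nodeId, otherId,
        DynamicRelationshipType.withName("KNOWS"), null);

    inserter.shutdown();
  }
}
```

Note that `createNode` hands back a `long` id, and that id is the only handle you keep for later updates or relationship creation.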

Since you work with a Map rather than a Node object while importing, you need to know all the keys that your POJO stores on the Node. The following is my Person POJO, which stores values against the keys NAME and ID.

class Person implements Name {
  private final Node dataNode;

  static enum NameKeys {
    ID, NAME
  }
  ......
  @Override
  public int getId() {
    return (Integer) dataNode.getProperty(NameKeys.ID.name());
  }

  @Override
  public String getName() {
    return (String) dataNode.getProperty(NameKeys.NAME.name());
  }

  @Override
  public void setId(int id) {
    dataNode.setProperty(NameKeys.ID.name(), id);
  }

  @Override
  public void setName(String name) {
    dataNode.setProperty(NameKeys.NAME.name(), name);
  }
  ...
}

Importing data in bulk is quite straightforward:

  • Create a BatchInserter for the import
  • Import the data in the form of Maps
  • Shut down the BatchInserter

In the following piece of code I import names from a file; while importing the data I also create relationships from a node to the subsequent 9 nodes. This method successfully imports the 151K-names file I downloaded from the census site in around 7 seconds.

public class BatchGraphImport {
  public void createGraph(String fileName, String dbdir) throws Exception {
    BatchInserter batchInserter = new BatchInserterImpl(dbdir);
    File file = new File(fileName);
    importNamesWithRelations(file, 10, batchInserter);
    batchInserter.shutdown();
  }

  void importNamesWithRelations(File file, long perNodeRelationshipCount, BatchInserter batchInserter) throws Exception {
    System.out.println("importing data");
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String data;
    int id = 0;
    long lastUsedKey = 0;
    long relationshipNode = 0;
    while ((data = reader.readLine()) != null) {
      lastUsedKey = createNode(data, id++, batchInserter);
      if (lastUsedKey % perNodeRelationshipCount == 0) {
        relationshipNode = lastUsedKey;
      } else {
        batchInserter.createRelationship(relationshipNode, lastUsedKey, Relations.Knows, null);
      }
    }
    reader.close();
    System.out.println("count of data inserted: " + id + " last used key: " + lastUsedKey);
  }

  private long createNode(String data, int id, BatchInserter batchInserter) {
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put(Person.NameKeys.ID.name(), id);
    properties.put(Person.NameKeys.NAME.name(), data);
    return batchInserter.createNode(properties);
  }
}

Indexing Data

Now that we have successfully imported the data, we would like to index it so that it can be searched.

Approach One

One way to index the data is to use a LuceneIndexService with the GraphDatabaseService provided by the BatchInserter.

@Test
public void indexGraphUsingLuceneIndexService() {
  GraphDatabaseService databaseService = new BatchInserterImpl(batchdbdir).getGraphDbService();
  IndexService indexService = new LuceneIndexService(databaseService);
  for (Node node : databaseService.getAllNodes()) {
    if (node.hasRelationship()) {
      String key = Person.NameKeys.NAME.name();
      Object property = node.getProperty(key, "Not-Found");
      indexService.index(node, key, property);
    }
  }
  indexService.shutdown();
  databaseService.shutdown();
}

But this approach fails with a ClassCastException. Even though LuceneIndexService takes a GraphDatabaseService as an input parameter, it works only on an EmbeddedGraphDatabase. Here I have the BatchGraphDatabase implementation of GraphDatabaseService, which it tries to cast to EmbeddedGraphDatabase, and this fails.

java.lang.ClassCastException: org.neo4j.kernel.impl.batchinsert.BatchGraphDatabaseImpl cannot be cast to org.neo4j.kernel.EmbeddedGraphDatabase
 at org.neo4j.index.lucene.LuceneIndexService.<init>(LuceneIndexService.java:99)
 at com.nosql.neo4j.BatchGraphTest.indexGraphThatWouldfail(BatchGraphTest.java:47)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
 at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
 at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:79)
..............

Approach Two

Another way I found to index the data is to use a LuceneIndexBatchInserter. I tried to index the data I had already imported into the system using the BatchInserter along with a LuceneIndexBatchInserter.

@Test
public void indexGraphWithLuceneIndexBatchInserter() {
  BatchInserter inserter = new BatchInserterImpl(batchdbdir);
  LuceneIndexBatchInserter indexBatchInserter = new LuceneIndexBatchInserterImpl(inserter);
  GraphDatabaseService databaseService = inserter.getGraphDbService();
  IndexService indexService = indexBatchInserter.getIndexService();
  for (Node node : databaseService.getAllNodes()) {
    if (node.hasRelationship()) {
      String key = Person.NameKeys.NAME.name();
      Object property = node.getProperty(key, "Not-Found");
      indexService.index(node, key, property);
    }
  }
  indexService.shutdown();
  databaseService.shutdown();
}

But to my disappointment this method of indexing also failed, as the approach adopted here is not correct. The BatchGraphDatabase does not support the getAllNodes method and throws an UnsupportedOperationException.

java.lang.UnsupportedOperationException: Batch inserter mode
 at org.neo4j.kernel.impl.batchinsert.BatchGraphDatabaseImpl.getAllNodes(BatchGraphDatabaseImpl.java:116)
 at com.nosql.neo4j.BatchGraphTest.indexGraphWithBatchInserted(BatchGraphTest.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
 at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
...........

Approach Three

The third approach I adopted for indexing was quite different from the previous ones. Rather than indexing the data after it has been inserted, I indexed it while it was being inserted, using the LuceneIndexBatchInserter. To do so I added the following lines to the importNamesWithRelations function of BatchGraphImport.

void importNamesWithRelations(File file, long perNodeRelationShipCount, BatchInserter batchInserter) throws Exception {
  LuceneIndexBatchInserter batchIndexer = new LuceneIndexBatchInserterImpl(batchInserter);
  System.out.println("importing data");
  .........
  if (lastUsedKey % perNodeRelationShipCount == 0) {
    relationshipNode = lastUsedKey;
    String nameKey = Person.NameKeys.NAME.name();
    System.out.println("adding: " + data + " to index");
    batchIndexer.index(lastUsedKey, nameKey, data);
  }
  .........
}

Using this approach I was able to successfully index the data so that it can be searched. We can search this indexed database using the IndexService, as described in my previous post.
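For completeness, here is a sketch of such a lookup against the 1.x IndexService, once the batch inserter has been shut down and the store reopened normally; the directory and the searched name are placeholder values of my own.

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.index.IndexService;
import org.neo4j.index.lucene.LuceneIndexService;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class SearchNames {
  public static void main(String[] args) {
    // Open the store normally; the batch inserter must already be shut down.
    GraphDatabaseService db = new EmbeddedGraphDatabase("target/batch-db");
    IndexService index = new LuceneIndexService(db);

    // Look up a node by the NAME key we indexed during the import.
    Node node = index.getSingleNode(Person.NameKeys.NAME.name(), "SMITH");
    if (node != null) {
      System.out.println("found node " + node.getId());
    }

    index.shutdown();
    db.shutdown();
  }
}
```

Since LuceneIndexService works on an EmbeddedGraphDatabase, this read path avoids the ClassCastException seen in Approach One.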

Conclusion

If the requirement is to import data in bulk, we should use the BatchInserter and its related APIs. It is a comparatively quicker mechanism of data import than importing with transactions. Also, if your database already holds a large amount of non-indexed data and the requirement is to index it, there does not seem to be an easy way to accomplish this; the practical way to index such a large amount of data is to perform the indexing while importing it.

3 Comments »

  1. Hi Rahul!

    I’m glad you're testing neo4j. I’ve been looking at LuceneIndexProvider, and as you point out in Approach #1, it casts the graph database object to a concrete class when it shouldn’t.

    Would it be possible to see the test ‘com.nosql.neo4j.BatchGraphTest.indexGraphWithBatchInserted’? With a good test case, this shouldn’t be too difficult to resolve.

    Thanks so much for helping us make neo4j even better!

    Andrés

    Comment by Andrés Taylor — September 9, 2010 @ 12:11 am

    • Hi Andres, I had earlier mailed you the failing test cases of the blog. I think the ClassCastException is a bug, can you tell me if the same would be fixed in some upcoming release of Neo4j ?

      Comment by Rahul Sharma — September 17, 2010 @ 9:25 am

  2. […] Devlearnings very useful BatchInserter tutorial […]

    Pingback by Neo4j Spatial – introduction and installation | Shaping Knowledge — August 14, 2014 @ 9:37 pm
