Doc2Graph

The Doc2Graph project, the evolution of the neo4j-couchbase connector, enables graph analysis over JSON data from any kind of source.

Abstract

Our goal is to enable graph analysis over data originally stored in a document-oriented database. To make this possible we need a real graph database, neo4j, into which the documents are “copied” and kept synchronized. We built the project around two tenets: JSON based and customization.

Doc2Graph can import JSON data from any kind of source because it is neo4j centric. The entire transformation logic is embedded in a neo4j plugin and exposed through cypher procedures. We also created two connectors for the most popular document-oriented databases: Couchbase and MongoDB. These connectors automatically synchronize the contents of the document-oriented database with neo4j, avoiding manual data exports.

The way to transform JSON into a graph depends on the kind of analysis you need, which means the resulting graph model can be customized. You can develop your own graph model rules and enable them just by editing the configuration. We think a data model is so specific that we cannot offer an absolute solution: our proposal is a well-structured model released as the default configuration.

For now there is only the JSON import feature, but we are going to develop the reverse direction: Graph2Doc.

JSON based tenet

JSON is the most popular serialization format in document-oriented databases and in most web APIs. It describes a tree representation of data, where a parent document contains child documents (sub-documents). This kind of structure, the tree, can always be converted into a graph because a tree is a special kind of graph.

JSON converted to a graph

Doc2Graph manages JSON through cypher, using two procedures: json.upsert and json.delete. We chose to implement the logic in a neo4j plugin so that it is “source-free”, without external dependencies. You can analyze JSON data simply stored in files or created at runtime, because all you need is a neo4j client to compose the CALL to the procedures.

CALL json.upsert('dockey','{ "type":"foo", "id": 1}')

Abstracting from the JSON source means that you can provide documents from anywhere, but you must have a key for every JSON document. Document-oriented databases are, first of all, key-value databases, so for them this is no problem. If your source doesn’t have keys, you must create them (a UUID, for example), but be careful about the meaning of “key”: for the system, two JSON strings are the same document only if they have the same key. Furthermore, you can delete a document using just its key.

CALL json.upsert('dockey','{ "type":"foo", "id": 1}'); //create a node
CALL json.upsert('dockey','{ "type":"foo", "id": 1, "other":"update"}'); //update the same node
CALL json.delete('dockey'); //remove the node
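For a source without natural keys, the key can be generated on the client before the procedure is called. The Python sketch below is only an illustration: the helper names `ensure_key` and `upsert_statement` are hypothetical, and the `$key` / `$json` placeholders assume a neo4j version that accepts `$`-style parameters (older releases use the `{key}` form).

```python
import json
import uuid


def ensure_key(key=None):
    """Return the given key, or generate a random UUID string when the source has none."""
    return key if key is not None else str(uuid.uuid4())


def upsert_statement(key, document):
    """Build the parameterised CALL for json.upsert: (statement text, parameters)."""
    # The procedure receives the document as a JSON string, so serialise it here.
    return "CALL json.upsert($key, $json)", {"key": key, "json": json.dumps(document)}


key = ensure_key()  # no natural key available, so a UUID is generated
statement, parameters = upsert_statement(key, {"type": "foo", "id": 1})
```

A generated key has to be stored next to the source document: a later update or delete of the same document must send the same key, otherwise the system treats it as a new document.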

Customization tenet

A tree can always be converted into a graph, but the way to do this is not always the same. Node reuse is very important for graph analysis. A node is a plain document (a set of properties) with relationships to other nodes (sub-documents). But how should relationships be built? How do you recognise a node that already exists? What kind of graph model would you like to get? Everybody has different needs.

We made a default configuration that creates an all-purpose graph, but we know there are many business models and many JSON representations of them. The way to find an existing node may not be, for example, the “type/id” combo, so you can customize that. We have created a set of java interfaces that you can implement to define your own behaviour. You can mix the existing implementations with your own and change only the rules you need.

How it works

There are two entry points: upsert and delete. Upsert means update or insert a document. When a JSON document is upserted, it is first removed using its key and then converted into a graph. The removal algorithm lives in a customizable part, which returns the nodes that have become orphans and will be deleted. A document can be deleted using only its key, which means the key needs to be stored in the resulting graph.
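The removal step can be pictured with a small in-memory model. This Python sketch is not the plugin's code (which is java and customizable): the dictionaries, the function name `delete_document` and the orphan rule used here (a node with no remaining relationship and no document key) are assumptions made for illustration. `_document_key` is the property name used by the default configuration.

```python
KEY_PROPERTY = "_document_key"  # property name from the default configuration


def delete_document(nodes, relationships, doc_key):
    """Remove one document by key and return the ids of the orphan nodes deleted.

    nodes:         {node_id: properties}
    relationships: {(parent_id, child_id): set of document keys using that link}
    """
    candidates = set()
    # The root node stops being the root of this document.
    for node_id, props in nodes.items():
        if props.get(KEY_PROPERTY) == doc_key:
            del props[KEY_PROPERTY]
            candidates.add(node_id)
    # Detach the key from every relationship; drop relationships nobody uses.
    for pair in list(relationships):
        keys = relationships[pair]
        if doc_key in keys:
            keys.discard(doc_key)
            candidates.update(pair)
            if not keys:
                del relationships[pair]
    # Assumed orphan rule: touched nodes with no relationship left and no document key.
    still_linked = {node_id for pair in relationships for node_id in pair}
    orphans = {
        node_id for node_id in candidates
        if node_id not in still_linked and KEY_PROPERTY not in nodes[node_id]
    }
    for node_id in orphans:
        del nodes[node_id]
    return orphans
```

In this model an upsert would call `delete_document` first and then run the insertion, which matches the "remove, then convert" order described above.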

The insertion algorithm is more complex. The insert procedure can receive only one document. Every tree has a root document that becomes a root node. This is the only special node in the graph: it carries an attribute whose value is the key of the document.

The JSON conversion follows these rules:

  • each primitive type attribute becomes a node property;
  • each primitive type array attribute becomes a node property (a list of primitive type values);
  • each complex type attribute becomes a node with a parent relationship;
  • each complex type array attribute becomes nodes with parent relationships;
  • each document can be identified by some of its attribute values, that is, by the property values of the node. The value(s) used to identify the node are called the “id”;
  • each node has a label;
  • to seek an existing node, label and “id” are used;
  • documents without the attributes that compose the “id” are discarded;
  • the logic is recursive through the tree.
Transform rules
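The rules above can be sketched as a recursive function over a parsed JSON document. This is a simplified Python model, not the plugin's java implementation: plain dictionaries stand in for neo4j, the names `convert` and `is_primitive` are hypothetical, and the “type”/“id” identifier, the “DocNode” label and the `_document_key` property are taken from the default behaviour.

```python
ID_FIELDS = ("type", "id")       # attributes that compose the "id" (default behaviour)
LABEL = "DocNode"                # every node gets a label
KEY_PROPERTY = "_document_key"   # set on the root node only


def is_primitive(value):
    # JSON null is treated as a primitive here (an assumption of this sketch).
    return value is None or isinstance(value, (str, int, float, bool))


def convert(document, graph, doc_key=None):
    """Merge one (sub-)document into graph and return its node id, or None if discarded."""
    if not all(field in document for field in ID_FIELDS):
        return None  # documents without the "id" attributes are discarded
    # Label and "id" together identify the node, so an existing node is reused.
    node_id = (LABEL,) + tuple(document[field] for field in ID_FIELDS)
    node = graph["nodes"].setdefault(node_id, {})
    if doc_key is not None:
        node[KEY_PROPERTY] = doc_key  # only the root call passes the document key
    for name, value in document.items():
        if is_primitive(value):
            node[name] = value  # primitive attribute -> node property
        elif isinstance(value, list) and all(is_primitive(v) for v in value):
            node[name] = list(value)  # primitive array -> list property
        else:
            # Complex attribute or array of complex values -> child nodes.
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict):
                    child_id = convert(child, graph)  # recursion through the tree
                    if child_id is not None:
                        graph["relationships"].add((node_id, child_id))
    return node_id


graph = {"nodes": {}, "relationships": set()}
album = {"type": "album", "id": 1, "tags": ["a", "b"],
         "artist": {"type": "artist", "id": 7, "name": "N"}}
root_id = convert(album, graph, doc_key="dockey")
```

Relationships are kept as bare (parent, child) pairs here; how they are named and annotated depends on the configured relation builder.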

Default behaviour

The default graph model uses only one relation between two nodes and re-uses nodes by their “type/id” combo value. All nodes have the same label.
If no configuration is provided, this is the default behaviour:

  • document id: “type” (string) + “id” (number);
  • property to store the key in a root node: “_document_key”;
  • label for all nodes: “DocNode”;
  • there is only one relation between two nodes, with a label built from the “type” of parent and child, e.g. “album_artist”;
  • each relationship has a property called “docKeys”: an array of document keys. Each document that contains this relationship adds its key to the array.

This configuration creates a graph model that emphasises the document structure, hiding the redundancy of the same relationship between two nodes. You can therefore find which parts of a document are duplicated in the document storage (the origin database) and whether there are cross references among the documents.

Two documents with the same children
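The case of two documents sharing the same children can be sketched as follows. This Python model only illustrates the default relationship rule (one relation per pair of nodes, named from the two types, with a “docKeys” array); the function name `link` and the dictionary layout are hypothetical.

```python
def link(relationships, parent, child, doc_key):
    """Record that the document doc_key contains child under parent.

    parent and child are sub-documents with at least "type" and "id".
    """
    rel_type = f'{parent["type"]}_{child["type"]}'  # e.g. "album_artist"
    pair = ((parent["type"], parent["id"]), (child["type"], child["id"]))
    # Only one relation exists between two nodes: reuse it if already present.
    rel = relationships.setdefault(pair, {"type": rel_type, "docKeys": []})
    if doc_key not in rel["docKeys"]:
        rel["docKeys"].append(doc_key)
    return rel


relationships = {}
artist = {"type": "artist", "id": 7}
label = {"type": "label", "id": 3}
link(relationships, artist, label, "doc-1")  # first document containing artist -> label
link(relationships, artist, label, "doc-2")  # second document with the same children
```

After both calls there is still a single `artist_label` relation, and its `docKeys` array lists both documents, which is how duplicated sub-documents show up in the graph.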

How to customize

Neo4j-json is released with a default configuration, but it can be changed by creating a special node. The default configuration is like this:

 CREATE (n:JSON_CONFIG {
 configuration: 'byNode'
 ,root_node_key_property:'_document_key'
 ,document_default_label:'DocNode'
 ,document_id_builder:'org.neo4j.helpers.json.document.impl.DocumentIdBuilderTypeId'
 ,document_relation_builder:'org.neo4j.helpers.json.document.impl.DocumentRelationBuilderTypeArrayKey'
 ,document_label_builder:'org.neo4j.helpers.json.document.impl.DocumentLabelBuilderConstant'
 })

If there is no JSON_CONFIG node in the database, the default configuration is used, but no configuration node is created.

WARNING: when you change the configuration you must restart the neo4j server.

configuration

The configuration node is active only if it contains “configuration: 'byNode'”; otherwise the node is ignored.
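The selection rule can be summarised in a few lines of Python. This is a sketch of the behaviour described here, not the plugin's code: `effective_config` is a hypothetical name, the configuration node is represented as a plain dictionary of its properties, and an active node is assumed to carry every setting.

```python
DEFAULT_CONFIG = {
    "configuration": "byNode",
    "root_node_key_property": "_document_key",
    "document_default_label": "DocNode",
    "document_id_builder":
        "org.neo4j.helpers.json.document.impl.DocumentIdBuilderTypeId",
    "document_relation_builder":
        "org.neo4j.helpers.json.document.impl.DocumentRelationBuilderTypeArrayKey",
    "document_label_builder":
        "org.neo4j.helpers.json.document.impl.DocumentLabelBuilderConstant",
}


def effective_config(config_node=None):
    """Return the settings in force, given the JSON_CONFIG node properties (or None)."""
    if config_node is not None and config_node.get("configuration") == "byNode":
        return dict(config_node)   # active configuration node
    return dict(DEFAULT_CONFIG)    # no node, or node ignored: defaults apply
```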

root_node_key_property

When you insert a new document, a new root node is created. The upsert procedure sets a property on it, using the value of root_node_key_property as the property name and the document key as the property value.

document_default_label

All nodes are created with a label. This property is the default label value used by document_label_builder if no other label can be applied.

document_id_builder

It’s the name of the class that implements the org.neo4j.helpers.json.document.DocumentIdBuilder interface. It builds an org.neo4j.helpers.json.document.DocumentId that represents the primary key value of the sub-document (node). This “id” is used to check whether the node already exists in the database, so it’s essential for node reuse.

You can choose from:

  • org.neo4j.helpers.json.document.impl.DocumentIdBuilderTypeId
  • org.neo4j.helpers.json.document.impl.DocumentIdBuilderId
  • your own implementation

WARNING: the node lookup also uses the label, so pay attention to the “label-id” combo.
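To see why the “label-id” combo matters, here is a small Python analogue of two identifier strategies and of the lookup. It is a sketch under assumptions: the real builders are java classes whose exact behaviour is not shown here, the function names are hypothetical, and the “id only” variant is inferred from the class name DocumentIdBuilderId.

```python
def id_by_type_and_id(document):
    """Identify a sub-document by its "type" and "id" attributes (the default combo)."""
    if "type" in document and "id" in document:
        return {"type": document["type"], "id": document["id"]}
    return None  # no identifier can be built


def id_by_id(document):
    """Identify a sub-document by its "id" attribute only (assumed behaviour)."""
    if "id" in document:
        return {"id": document["id"]}
    return None


def find_node(nodes, label, doc_id):
    """Look up an existing node by label AND id; nodes maps (label, id items) to properties."""
    return nodes.get((label, tuple(sorted(doc_id.items()))))


nodes = {("DocNode", (("id", 1), ("type", "foo"))): {"name": "existing"}}
doc = {"type": "foo", "id": 1}
same_label = find_node(nodes, "DocNode", id_by_type_and_id(doc))    # node is found
other_label = find_node(nodes, "Foo", id_by_type_and_id(doc))       # not found
```

With the same “id” but a different label the lookup misses, so a new node would be created instead of reusing the existing one.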

document_relation_builder

It’s the name of the class that implements the org.neo4j.helpers.json.document.DocumentRelationBuilder interface. It manages the relationships, adding and removing them between nodes and deciding which nodes are orphans. Orphan nodes are deleted from the database.

You can choose from:

  • org.neo4j.helpers.json.document.impl.DocumentRelationBuilderTypeArrayKey
  • org.neo4j.helpers.json.document.impl.DocumentRelationBuilderByKey
  • your own implementation

document_label_builder

It’s the name of the class that implements the org.neo4j.helpers.json.document.DocumentLabelBuilder interface. It builds the label that is applied to a node. When it cannot build one from the sub-document data, it uses the default value (document_default_label). Labels are also used to look up nodes that already exist in the database, so choose them carefully.

You can choose from:

  • org.neo4j.helpers.json.document.impl.DocumentLabelBuilderConstant
  • org.neo4j.helpers.json.document.impl.DocumentLabelBuilderById
  • your own implementation

Setup

The core of the Doc2Graph project is the neo4j-json plugin. This plugin provides the procedures that build the graph. At the moment you can only download the code from GitHub and follow the README to build and install it.

There are two connectors, for MongoDB and Couchbase, that integrate these databases with neo4j. You can find them on GitHub and install them from scratch following the README.

Cypher integration

It’s the basic way to use Doc2Graph. Once the neo4j-json plugin is installed, two procedures are available:

  • json.upsert({key},{json})
  • json.delete({key})

CALL json.upsert('dockey','{ "type":"foo", "id": 1, "other":"update"}'); //create or update the document
CALL json.delete('dockey'); //remove the node

Neither procedure returns a value.

Couchbase connector

couchbase-neo4j-connector is a standalone java program that transfers mutations from a Couchbase bucket to a neo4j database. It’s a DCP client and uses the BOLT protocol to call the cypher procedures. It was built for Couchbase 4.5 and works on a specific bucket. It catches the “mutation” and “delete” events of documents. It has been tested on Java 8.
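The mapping from events to procedures can be sketched like this. The real connector is a java DCP client; the Python below is only a model of the dispatch, with a hypothetical event layout (`kind`, `key`, `content`) and a `run` callable standing in for whatever executes a statement over BOLT. The `$key` / `$json` placeholders assume a neo4j version that accepts `$`-style parameters.

```python
import json


def handle_event(event, run):
    """Forward one change event to neo4j.

    event: hypothetical dict, e.g. {"kind": "mutation", "key": "k", "content": {...}}
    run:   callable taking (statement, parameters), e.g. a driver session's run method
    """
    if event["kind"] == "mutation":
        # A created or changed document is upserted under its document key.
        run("CALL json.upsert($key, $json)",
            {"key": event["key"], "json": json.dumps(event["content"])})
    elif event["kind"] == "delete":
        # A removed document is deleted by key only.
        run("CALL json.delete($key)", {"key": event["key"]})
    # Any other kind of event is ignored in this sketch.


sent = []
handle_event({"kind": "mutation", "key": "dockey", "content": {"type": "foo", "id": 1}},
             lambda statement, parameters: sent.append((statement, parameters)))
```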

To configure the source and destination addresses you have to provide a configuration properties file.

DCP client schema

MongoDB connector

mongodb-neo4j-connector is a doc_manager module for mongo-connector that calls the neo4j-json procedures via cypher.
It’s written in Python (version 3 recommended). It can be installed via “pip” or run in develop mode. See the README to install and run it.

mongodb doc_manager schema

Resources

You can contribute to the project on GitHub.