
A GraphChi Cluster

Major Technical Project [Aug 2014 - May 2015]

TEAM:
Fatehjeet Sra, CSE, IIT Mandi
Khushpreet Singh, CSE, IIT Mandi
Ritish Rana, CSE, IIT Mandi

Aim
Build a prototype of a cost-effective cluster of distributed GraphChi nodes that
can perform large-scale graph computations within performance and latency
bounds.

Objectives
Configure Neo4j (also disk-based) on a machine and run a multi-hop query on
a large dataset.
Run the same query on a configured GraphChi node.
Compare the performance of Neo4j and GraphChi.
Develop a cluster of GraphChi nodes and port it to small-form-factor nodes such
as Raspberry Pis (or ARM processor chips). (The Raspberry Pi is a low-cost,
credit-card-sized computer.)
Finally, test the performance of the cluster by running the same multi-hop query
and report the best results achieved.

Configuring Neo4j
Download the latest release from http://neo4j.com/download
Linux Service:

sudo ./bin/neo4j-installer install
sudo service neo4j-service status
sudo service neo4j-service start
sudo service neo4j-service stop

Server Start:
./bin/neo4j console

Neo4j Shell:
./bin/neo4j-shell -readonly -path path/to/neo4j-db

Configuring GraphChi
Header-only library (no installation required).
The applications were built by running make apps.
Compiled executables are placed in bin/apps/.

Graph Datasets

1. Twitter Dataset (Directed): Nodes (81,306), Edges (1,768,149)
2. California Road NW (Undirected): Nodes (1,965,206), Edges (5,533,214)
3. Friendster Communities (Undirected): Nodes (65,608,366), Edges (1,806,067,135)

Reference: http://snap.stanford.edu/data/

Format Conversion
GraphChi reads graphs in two formats:
EdgeListFormat: <src> <dst> <value>
AdjListFormat: <src> <listcount> <d1> <d2> <d3>

Parsers were written in Python for conversion.
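
The original parsers are not reproduced here. As a rough illustration of what such a conversion might look like, the sketch below turns edge-list lines into AdjListFormat lines. The function name edgelist_to_adjlist is hypothetical, and the sketch assumes integer vertex IDs, whitespace-separated fields, and '#' comment lines as found in SNAP downloads.

```python
from collections import defaultdict


def edgelist_to_adjlist(lines):
    """Yield '<src> <listcount> <d1> <d2> ...' lines from '<src> <dst> [<value>]' lines.

    Hypothetical helper, not the project's original parser.
    Any edge value in a third column is ignored.
    """
    adjacency = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment headers.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        src, dst = parts[0], parts[1]
        adjacency[src].append(dst)
    # Assumption: vertex IDs are integers, so sort numerically.
    for src in sorted(adjacency, key=int):
        dsts = adjacency[src]
        yield f"{src} {len(dsts)} {' '.join(dsts)}"


if __name__ == "__main__":
    sample = ["# toy graph", "1 2", "1 3", "2 3 5"]
    for out_line in edgelist_to_adjlist(sample):
        print(out_line)
```

The whole adjacency map is held in memory, which is fine for the smaller datasets; a graph the size of Friendster would likely need a sorted, streaming approach instead.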


Neo4j reads the graph from a CSV file with a header row of the form
specifier1,specifier2

Cypher query for adding data:

LOAD CSV WITH HEADERS FROM 'file:/path/to/file.csv' AS line CREATE (:Node
{specifier1: line.specifier1, specifier2: line.specifier2})
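
For illustration only, a CSV file with a header row could be produced from an edge list along the following lines. The function name edgelist_to_csv and the column names src and dst are assumptions standing in for the specifiers above; they do not come from the original project.

```python
import csv
import io


def edgelist_to_csv(lines, out, header=("src", "dst")):
    """Write '<src> <dst>' lines to `out` as CSV with a header row.

    Hypothetical helper; the header names are placeholders for the
    column specifiers referenced in the LOAD CSV query.
    Returns the number of data rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    rows = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment headers.
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        writer.writerow([src, dst])
        rows += 1
    return rows


if __name__ == "__main__":
    buffer = io.StringIO()
    edgelist_to_csv(["1 2", "2 3"], buffer)
    print(buffer.getvalue(), end="")
```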

Execution on GraphChi
Build and Run
bin/apps/pagerank file GRAPH-NAME
If the graph has not been preprocessed, the program asks for the
format of the graph (edgelist or adjlist).

Algorithms Run
PageRank
The application prints the IDs of the 20 vertices with the highest PageRank.

Connected Components
The application writes its output to GRAPHNAME_components.txt, in which each line has
<Component ID>, <No_of_Vertices>
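
A components file in the format described above can be summarised with a few lines of Python. This is a minimal sketch, assuming each line holds exactly two comma-separated fields; the function name summarize_components is hypothetical.

```python
def summarize_components(lines):
    """Return (component_count, largest_component_id, largest_size).

    Assumes each non-empty line is '<Component ID>, <No_of_Vertices>'.
    """
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comp_id, count = (field.strip() for field in line.split(","))
        sizes[comp_id] = int(count)
    # Pick the component with the most vertices, if any were read.
    largest = max(sizes, key=sizes.get) if sizes else None
    return len(sizes), largest, sizes.get(largest, 0)


if __name__ == "__main__":
    print(summarize_components(["0, 5", "7, 2"]))
```

In practice the lines would come from the open output file, for example `with open(path) as f: summarize_components(f)`.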

Execution on Neo4j
Interaction with the database is done using Cypher, which can be
used from the Neo4j Shell or the browser-based interface.

Sample Query:
MATCH (n) RETURN (n) LIMIT 500 ;

Performance Comparison
Data: Twitter Social NW [81,306 nodes, 1,768,149 edges] <Directed>
Query: PageRank
Format: EdgeList

[Figure: results on GraphChi and on Neo4j]

Performance Comparison
Data: Twitter Social NW [81,306 nodes, 1,768,149 edges] <Directed>
Query: Connected Components
Format: EdgeList

[Figure: results on GraphChi and on Neo4j]

Neo4j query:

MATCH (n) WITH COLLECT(n) AS nodes
RETURN REDUCE(graphs = [], n IN nodes |
  CASE WHEN ANY (g IN graphs WHERE shortestPath((n)-[*]-(head(g))))
  THEN graphs
  ELSE graphs + [[p IN (n)-[*0..]-() | nodes(p)[length(p)-1]]]
  END)
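
To sanity-check component counts on a small graph, a reference computation can be written with a union-find structure. The sketch below is neither GraphChi's nor Neo4j's implementation. It treats edges as undirected, as the undirected path pattern in the Cypher query does, and the function name connected_components is hypothetical.

```python
def connected_components(edges):
    """Return {root_id: component_size} for an iterable of (u, v) edges.

    Edges are treated as undirected. The root of each component is
    its smallest vertex ID, because unions always keep the smaller root.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving: point x at its grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[max(root_u, root_v)] = min(root_u, root_v)

    sizes = {}
    for node in parent:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sizes


if __name__ == "__main__":
    print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

Isolated vertices that appear in no edge are not counted, so totals may differ from tools that are also given a vertex list.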

Performance Comparison
Data: CA Road NW [1,965,206 nodes, 5,533,214 edges] <Undirected>
Query: Connected Components
Format: AdjList

[Figure: results on GraphChi and on Neo4j]

Observations (small dataset on GraphChi)
Dataset: YouTube Subscribers (AdjListFormat)
No. of Communities: 8,385

Pretty Fast!

Observations (large dataset on GraphChi)
Dataset: Friendster Communities (AdjListFormat)
No. of Communities: 957,154

Fast Enough!

Observations (very large dataset on GraphChi)
Dataset: Friendster (EdgeListFormat)
Nodes: 65,608,366
Edges: 1,806,067,135

Crashed!

Inference
GraphChi has a slight edge over Neo4j in terms of performance.
The Neo4j GUI can't handle a large number of vertices (>1000).
Computation is inefficient for relatively large graphs.
Cypher queries are relatively slower.

The Connected Components query is well suited to AdjListFormat graphs.
Also, directed graphs show significantly larger execution times than
undirected graphs.

GraphChi performed exceedingly well for small and large graphs.
We can go ahead with the cluster prototype.

GraphChi crashed on the very large graph due to inadequate system
resources.

Future Plan
Even Semester (Feb-May 2015)
Develop a cluster of networked GraphChi nodes.
Port the cluster to Raspberry Pis.
Run the same query on the cluster.
Analyze and report the results.
