A GraphChi Cluster: Major Technical Project (Aug'14 - May'15)

Fatehjeet Sra
CSE, IIT Mandi
TEAM:
Khushpreet Singh
CSE, IIT Mandi
Ritish Rana
CSE, IIT Mandi
Aim
Build a prototype of a cost-effective cluster of distributed GraphChi nodes that
can perform large-scale graph computations within performance and latency
bounds.
Objectives
Configure Neo4j (also disk-based) on a machine and run a multi-hop query on
a large dataset.
Run the same query on a configured GraphChi node.
Compare the performance of Neo4j and GraphChi.
Develop a cluster of GraphChi nodes and port it to small-form-factor nodes such as
Raspberry Pis (or ARM processor chips). (A Raspberry Pi is a low-cost, credit-card-sized
computer.)
Finally, test the performance of the cluster by running the same multi-hop query
and report the best results obtained.
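To make "multi-hop query" concrete, the following is a minimal Python sketch of a k-hop neighbourhood lookup using breadth-first search. It is only an illustration: the function name `k_hop_neighbors`, the dictionary-based adjacency structure, and the toy graph are assumptions, not the query actually used in the project.

```python
from collections import deque


def k_hop_neighbors(adjacency, start, max_hops):
    """Return the vertices reachable from `start` in at most `max_hops` edges.

    `adjacency` maps a vertex id to a list of its out-neighbours
    (a hypothetical in-memory representation used only for this sketch).
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    seen.discard(start)  # the start vertex is not its own neighbour
    return seen


# Toy graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5
toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(sorted(k_hop_neighbors(toy_graph, 1, 2)))  # [2, 3, 4]
```

On a disk-based system the same traversal has to be expressed in the engine's own terms (a Cypher pattern in Neo4j, a vertex update function in GraphChi), which is what the comparison in this project exercises.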
Configuring Neo4j
Download the latest release from http://neo4j.com/download
Linux Service:
sudo ./bin/neo4j-installer install
sudo service neo4j-service status
sudo service neo4j-service start
sudo service neo4j-service stop
Server Start:
./bin/neo4j console
Neo4j Shell:
./bin/neo4j-shell -readonly -path path/to/neo4j-db
Configuring GraphChi
Header-only library (no installation required)
The Makefile was run using make apps
Compiled executables are placed in bin/apps/
Graph Datasets
1. Twitter Dataset (Directed): Nodes (81,306), Edges (1,768,149)
2. California Road NW (Undirected): Nodes (1,965,206), Edges (5,533,214)
3. Friendster Communities (Undirected): Nodes (65,608,366), Edges (1,806,067,135)
Reference: http://snap.stanford.edu/data/
Format Conversion
GraphChi reads graphs in:
EdgeListFormat: <src> <dst> <value>
AdjListFormat: <src> <listcount> <d1> <d2> <d3>
Parsers were written in Python for conversion.
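The project's own parsers are not shown in these slides. As an illustration of the kind of conversion involved, here is a minimal sketch that turns edge-list lines into AdjListFormat lines. The function name `edge_list_to_adj_list` is hypothetical, and the sketch assumes whitespace-separated integer ids, `#` comment lines as found in SNAP downloads, and that any edge value can be dropped.

```python
from collections import defaultdict


def edge_list_to_adj_list(edge_lines):
    """Convert '<src> <dst> [<value>]' lines to '<src> <listcount> <d1> <d2> ...' lines."""
    neighbors = defaultdict(list)
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split()
        src, dst = int(fields[0]), int(fields[1])  # any edge value is ignored
        neighbors[src].append(dst)
    return [
        f"{src} {len(dsts)} {' '.join(str(d) for d in dsts)}"
        for src, dsts in sorted(neighbors.items())
    ]


sample = ["# comment line", "1 2", "1 3", "2 3"]
for adj_line in edge_list_to_adj_list(sample):
    print(adj_line)
# 1 2 2 3
# 2 1 3
```

This version holds the whole graph in memory, which is fine for the smaller datasets; a graph on the scale of Friendster would more likely need a sort-based or streaming conversion.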

Neo4j reads graphs from a CSV file with header columns in the format:
specifier1,specifier2
Cypher query for adding data:

LOAD CSV WITH HEADERS FROM 'file:/path/to/file.csv' AS line
CREATE (:Node {specifier1: line.specifier1, specifier2: line.specifier2})
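A CSV file with a header row can be produced from an edge list along the following lines. This is a sketch under stated assumptions rather than the project's actual parser: the function name `edge_list_to_csv` and the column names `src` and `dst` are placeholders standing in for the specifiers above.

```python
import csv
import io


def edge_list_to_csv(edge_lines, header=("src", "dst")):
    """Return CSV text with a header row, one '<src>,<dst>' row per edge."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)  # LOAD CSV WITH HEADERS reads column names from this row
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        src, dst = line.split()[:2]
        writer.writerow([src, dst])
    return buffer.getvalue()


print(edge_list_to_csv(["# comment line", "1 2", "2 3"]), end="")
# src,dst
# 1,2
# 2,3
```

In practice the returned text would be written to the file referenced by the LOAD CSV statement.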
Execution on GraphChi
Build and Run
bin/apps/pagerank file GRAPH-NAME
If the graph has not been preprocessed, the program will ask for the
format of the graph (edgelist or adjlist)
Algorithms Run
PageRank
The application prints the IDs of the top 20 vertices with the highest PageRank.
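For reference, the computation can be sketched in a few lines of in-memory Python. This is not GraphChi's implementation (which processes the graph from disk); it is a plain power-iteration sketch, and the damping factor of 0.85, the iteration count, and the handling of vertices without out-edges are assumptions made for the illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


def top_k(rank, k=20):
    """Return the ids of the k vertices with the highest rank."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


toy_edges = [(1, 3), (2, 3), (3, 1)]
print(top_k(pagerank(toy_edges), k=2))  # [3, 1]
```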
Connected Components
The app produces the output file GRAPHNAME_components.txt, each line of which has
<Component ID>, <No_of_Vertices>
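An output file in this shape can be summarised with a short script. The sketch below assumes each line is exactly a component id and a vertex count separated by a comma, as described above; the function name `summarize_components` and the sample lines are made up for the illustration.

```python
def summarize_components(lines):
    """Summarise '<Component ID>, <No_of_Vertices>' lines."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        component_id, count = (field.strip() for field in line.split(","))
        sizes[int(component_id)] = int(count)
    largest = max(sizes, key=sizes.get) if sizes else None
    return {
        "components": len(sizes),          # number of distinct components
        "vertices": sum(sizes.values()),   # total vertices across components
        "largest": largest,                # id of the biggest component
    }


sample_output = ["0, 5", "7, 2", "9, 1"]
print(summarize_components(sample_output))
# {'components': 3, 'vertices': 8, 'largest': 0}
```

With a real run, the lines would come from opening GRAPHNAME_components.txt and iterating over the file object.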
Execution on Neo4j
Interaction with the database is done using Cypher, which can be
used from the Neo4j Shell or the browser-based interface.
Sample Query:
MATCH (n) RETURN n LIMIT 500;
Performance Comparison
Data: Twitter Social NW [81,306 nodes, 1,768,149 edges] <Directed>
Query: PageRank
Format: EdgeList
[Chart comparing GraphChi and Neo4j]
Data: Twitter Social NW [81,306 nodes, 1,768,149 edges] <Directed>
Query: Connected Components
Format: EdgeList
[Chart comparing GraphChi and Neo4j]
Neo4j Query:
MATCH (n) WITH COLLECT(n) AS nodes
RETURN REDUCE(graphs = [], n IN nodes |
CASE WHEN ANY (g IN graphs WHERE shortestPath((n)-[*]-(head(g))))
THEN graphs
ELSE graphs + [[p IN (n)-[*0..]-() | nodes(p)[length(p)-1]]]
END)
Data: CA Road NW [1,965,206 nodes, 5,533,214 edges] <Undirected>
Query: Connected Components
Format: AdjList
[Chart comparing GraphChi and Neo4j]
Observations (small dataset on GraphChi)
Dataset: YouTube Subscribers (AdjListFormat)
No. of Communities: 8,385
Pretty Fast!
Observations (large dataset on GraphChi)
Dataset: Friendster Communities (AdjListFormat)
No. of Communities: 957,154
Fast Enough!
Observations (very large dataset on GraphChi)
Dataset: Friendster (EdgeListFormat)
Nodes: 65,608,366
Edges: 1,806,067,135
Crashed!
Inference
GraphChi has a slight edge over Neo4j in terms of performance.
The Neo4j GUI can't handle a large number of vertices (>1000).
Computation is inefficient for relatively large graphs.
Cypher queries are relatively slower.
The Connected Components query is suitable for AdjListFormat graphs.
Also, directed graphs show significantly larger execution times than
undirected graphs.
GraphChi performed exceedingly well for small and large graphs.
We can go ahead with the cluster prototype.
GraphChi crashed on the very large graph due to inadequate system
resources.
Future Plan
Even Semester (Feb-May 2015)
Develop a cluster of networked GraphChi nodes.
Port the cluster to Raspberry Pis.
Run the same query on the cluster.
Analyze and report results.
