Vertex centric programming model getting popular for large scale graph processing due to simplicity of the programming model and public availability of software frameworks like Apache Giraph. But the vertex centric programming model has its own down sides mainly due to the slow convergence of vertex centric algorithms and high communication overhead.
Vertex centric programming model forces programmer to think like a vertex when developing algorithms. Logic is executed at each vertex in the graph in parallel and vertices communicate with each other by using edges and vertex IDs.
Subgraph centric programming model can be thought as an extension of the vertex centric programming model where the programming logic is executed at subgraph level. In this model the graph is initially partitioned and distributed across multiple computing nodes and subgraphs (or weakly connected components) were identified. Computation within the subgraph takes place in memory where subgraphs can communicate with each other using messages.
This document walks you through the basic steps to run a simple subgraph centric algorithm to get total number of vertices in the graph on your local machine.
A pre installed virtual machine is used in this guide and can be easily installed on your local machine. Following installation instructions assume a Linux operating system environment.
Download the pre installed virtual machine from here
Install Oracle Virtual Box 4.3.8. You can download it from here.
Install and configure Vagrant:
sudo apt-get install vagrant
Add Vagrant plugin that keeps Virtual Box Guest Additions in sync:
vagrant plugin install vagrant-vbguest
Extract the downloaded virtual machine
Go to the virtual machine directory and start the environment:
vagrant up
After it boots, log in to the VM:
vagrant ssh
Other useful vagrant commands:
vagrant suspend
Saves the current running state of the machine and stops it. vagrant up will resume.
vagrant halt
gracefully shuts down the VM.
vagrant up
will boot it back up.
vagrant destroy
destroys the VM (and all the cruft inside of it). Running vagrant up
at this point will reprovision and run the deploy scripts again.
More info at http://www.vagrantup.com/
Log in to the virtual machine:
vagrant ssh
Change the password of root user and changed to root user:
$sudo passwd root
$su root
Start the namenode. A namenode is an interface that tracks all the metadata for a particular GoFS installation. At the moment GoFS ships with a REST server based namenode, which can be run through the GoFSNameNode script. This script accepts a URI argument and attempts to start a name node REST server at this URI. Optionally a file may also be specified which is used to save name node state. It is recommended you always specify a save file or name node state may be lost if the process is killed. Once a name node exists it is referenced through a Java class type and a URI. The default type is edu.usc.goffish.gofs.namenode.RemoteNameNode which can communicate with a remote REST server based name node.
cd $GOFFISH_HOME/gofs-2.0/bin
./GoFSNameNode http://localhost:9998
Login to vm from another terminal:
$vagrant ssh
Change to root:
$su root
Create directory named gofs-data in $GOFFISH_HOME. This will be used to store the graph data in binary format:
$mkdir $GOFFISH_HOME/gofs-data
Change in to GoFS binary directory and and format the graph file system:
$cd $GOFFISH_HOME/gofs-2.0/bin
$./GoFSFormat
A sample graph which can be used is located at /home/vagrant/deployment/samples/gofs-samples directory. To deploy this graph in the GoFS file system we need to create a list of graph and graph instances. In this example we will be only using graph template which is equivalent to a static graph.
File named list.xml in the /home/vagrant/deployment/samples/gofs-samples directory lists the graph data locations
Deploy the graph. This will partition the graph convert partitions into binary format and copy the files into graph storage directory. In this case we will only have a single graph partition:
$cd $GOFFISH_HOME/gofs-2.0/bin
$./GoFSDeployGraph edu.usc.goffish.gofs.namenode.RemoteNameNode http://localhost:9998 "graph1" 1 /home/vagrant/deployment/samples/gofs-samples/list.xml
Now the sample graph is deployed in the GoFS file system.
To start and run Gopher go to Gopher client directory and deploy the vertex count application:
$cd /home/vagrant/deployment/gopher-client-2.0/bin
$./setup-gopher.sh
This will install the user applications and setup computation nodes.
Now everything is setup and ready to run gopher applications. You can run gopher applications using the GopherClient command:
$./GopherClient.sh /home/vagrant/deployment/gopher-client-2.0/gopher-config.xml /home/vagrant/deployment/goffish_home/gofs-2.0/conf/gofs.config graph1 vert-count-2.0.jar edu.usc.pgroup.goffish.gopher.sample.VertCounter NILL
Executing this algorithm will output the vertex count of the input graph. You can find the results in $GOFFISH_HOME/gopher-server-2.0/vert-count.txt
Note: If executing ./GopherClient will not return to console. You need to forcefully return to console by pressing ctrl + c. (This is a known bug)
Mode detailed informations regarding general deployment of GoFS and Gopher is available in GitHub.
Publications:GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics, Yogesh Simmhan, Alok Kumbhare, Charith Wickramaarachchi, Soonil Nagarkar, Santosh Ravi, Cauligi Raghavendra and Viktor Prasanna, European Conference on Parallel Processing (Euro-Par) , 2014