It's been a long time since my last blog post. I wanted to write this blog for some time, but never got a chance to compete. Last year I was a part of a team worked on building a platform to perform large scale distributed graph processing: GoFFish. In this blog, I am trying to give a small overview about GoFFish and its programming model.

After the Google MapReduce paper and induction of Hadoop there was a search for simple programming models and analytics tools for big data processing. Graph structured data takes over a significant portion of large scale data we are seeing present day. I would say any big data problem worth looking at are graph related. :)

Map Reduce model only works well with data with minimal interdependencies. Where graph structured data occupies the complete opposite end of spectrum. Google pregel paper introduced a new simple programming model for graph processing, addressing this shortcomings of Hadoop. It's generally known as the vertex centric programming model in which programmer is forced to think as a vertex in the graph. The program logic is written thinking its inside a vertex and executed in parallel at each vertex.

The vertex centric programming model is a message passing based programming model. Each vertex can send messages to its neighbors or to any known vertex id. Each vertex will receive messages sent to its vertex it. The vertex centric programming model is an extension to Bulk Synchronous Parallel (BSP) parallel programming model. BSP can be thought as an abstract computer with multiple processors which does work in parallel. The execution of a program in done in iterations. In each iteration, each processor will

- Receive messages sent to it in the previous iteration
- Process the messages and execute the user logic
- Sent messages to other processors if needed

- Wait for all other processors to finish the iteration

In the BSP model these iterations are called supersteps. Even though the vertex centric programming model is very simple given the fact that, at vertex level the work we can do is so minimal the communication to computation ratio of this model is high. This makes it not suitable for some classes of algorithms like betweenness centrality.

In GoFFish we introduced a subgraph centric programming model instead of a vertex centric model and demonstrated that it can give huge performance improvement over vertex centric model for some class of graph algorithms.

I did a talk at Southern California Fall workshop 2013 at UCLA on this work and following are the slides

You can find a more detailed set of slides on subgraph centric single source shortest path below.

This work was accepted at EuroPar 2014 and More detailed technical report can be downloaded from here.

GoFFish is a free and open source project and can be downloaded from here.