Cranic Computing

High Performance Graph Algorithms

We were among the first to develop a parallel and distributed BFS traversal (the first multi-GPU graph traversal implementation). The first version of our BFS code ranked 19th on the November 2011 list of the Graph500 benchmark. Since then, several optimizations have been added, but the underlying method and the implementation remain simple compared to other state-of-the-art implementations. We rely on a novel data mapping that achieves perfect load balancing. Our latest version, along with its performance analysis, is available here: Bfs2D. As far as we know, this is still one of the fastest codes, with a peak performance of 5.5 TTEPS on 1,024 GPUs and 203.1 GTEPS on a single GPU.

Since then, we have been involved in the study and development of more complex graph algorithms such as ST-connectivity, the critical node detection problem (CNDP), and betweenness centrality (BC). Our BC implementation (MGBC) supports 2-D and 3-D decompositions of the graph as well as multi-level parallelism. More information is available at: MGBC.


Available codes:

Bfs2D

MGBC


Multilevel Parallelism for the Exploration of Large-Scale Graphs

We present the most recent release of our parallel implementation of the BFS and BC algorithms for the study of large-scale graphs. Although our reference platform is a high-end cluster of new-generation NVIDIA GPUs and some of our optimizations are CUDA-specific, most of our ideas can be applied to other platforms offering multiple levels of parallelism. We exploit multi-level parallel processing through a hybrid programming paradigm that combines highly tuned CUDA kernels, for the computations performed by each node, with explicit data exchange through the Message Passing Interface (MPI), for the communications among nodes. The results of the numerical experiments show that the performance of our code is comparable to or better than that of other state-of-the-art solutions. For the BFS, for instance, we reach a peak performance of 200 GTEPS on a single GPU and 5.5 TTEPS on 1,024 Pascal GPUs. We release our source code both to allow the results to be reproduced and to facilitate its use as a building block for the implementation of other algorithms. (Paper)
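To make the hybrid scheme concrete, the sketch below shows a minimal, self-contained MPI+CUDA program that follows the same level-synchronous pattern: on each rank a CUDA kernel expands the local frontier, MPI collectives deliver the newly discovered vertices to their owners, and an all-reduce detects termination. It is an illustration written for this page, not an excerpt from Bfs2D or MGBC: the 1-D block partition, the replicated toy graph, the one-thread-per-vertex kernel and the all-gather exchange are simplifying assumptions, whereas the actual codes use a 2-D decomposition, load-balanced kernels and pruned exchanges.

// Hedged sketch of the hybrid CUDA + MPI structure (not the Bfs2D/MGBC code):
// 1-D vertex partition, one CUDA kernel per BFS level on each rank, and MPI
// collectives for exchanging newly discovered vertices and testing termination.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>

// One thread per frontier vertex (simplest mapping, kept short on purpose):
// append every neighbor to the candidate buffer that will be exchanged via MPI.
__global__ void expand(const int *row_ptr, const int *col_idx,
                       const int *frontier, int nfront,
                       int *cand, int *ncand) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nfront) return;
    int u = frontier[i];
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e)
        cand[atomicAdd(ncand, 1)] = col_idx[e];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Toy undirected graph, replicated on every rank for brevity:
    // edges 0-1, 0-2, 1-3, 2-3, 3-4 in CSR form.
    std::vector<int> row_ptr = {0, 2, 4, 6, 9, 10};
    std::vector<int> col_idx = {1, 2, 0, 3, 0, 3, 1, 2, 4, 3};
    int n = 5, source = 0, nedges = (int)col_idx.size();

    // 1-D block partition: this rank owns vertices in [lo, hi).
    int chunk = (n + nprocs - 1) / nprocs;
    int lo = rank * chunk, hi = std::min(n, lo + chunk);

    int *d_row, *d_col, *d_front, *d_cand, *d_ncand;
    cudaMalloc(&d_row, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_col, col_idx.size() * sizeof(int));
    cudaMalloc(&d_front, n * sizeof(int));
    cudaMalloc(&d_cand, nedges * sizeof(int));
    cudaMalloc(&d_ncand, sizeof(int));
    cudaMemcpy(d_row, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, col_idx.data(), col_idx.size() * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<int> dist(n, -1), frontier;
    if (source >= lo && source < hi) frontier.push_back(source);
    dist[source] = 0;

    for (int level = 0; ; ++level) {
        // (1) Node-local computation: expand the local frontier on the GPU.
        int ncand = 0;
        if (!frontier.empty()) {
            cudaMemcpy(d_front, frontier.data(), frontier.size() * sizeof(int), cudaMemcpyHostToDevice);
            cudaMemset(d_ncand, 0, sizeof(int));
            expand<<<1, 128>>>(d_row, d_col, d_front, (int)frontier.size(), d_cand, d_ncand);
            cudaMemcpy(&ncand, d_ncand, sizeof(int), cudaMemcpyDeviceToHost);
        }
        std::vector<int> cand(ncand);
        if (ncand) cudaMemcpy(cand.data(), d_cand, ncand * sizeof(int), cudaMemcpyDeviceToHost);

        // (2) Communication: share the candidates with every rank.
        std::vector<int> counts(nprocs), displs(nprocs);
        MPI_Allgather(&ncand, 1, MPI_INT, counts.data(), 1, MPI_INT, MPI_COMM_WORLD);
        int total = 0;
        for (int p = 0; p < nprocs; ++p) { displs[p] = total; total += counts[p]; }
        std::vector<int> all(total);
        MPI_Allgatherv(cand.data(), ncand, MPI_INT, all.data(), counts.data(),
                       displs.data(), MPI_INT, MPI_COMM_WORLD);

        // (3) Each rank keeps the unvisited vertices it owns as the next frontier.
        frontier.clear();
        for (int v : all)
            if (v >= lo && v < hi && dist[v] == -1) { dist[v] = level + 1; frontier.push_back(v); }

        // (4) Global termination test.
        int local = (int)frontier.size(), global = 0;
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (global == 0) break;
    }
    for (int v = lo; v < hi; ++v) printf("rank %d: dist[%d] = %d\n", rank, v, dist[v]);
    MPI_Finalize();
    return 0;
}

The sketch can be built with nvcc using an MPI wrapper as the host compiler (e.g. nvcc -ccbin mpicxx) and launched with mpirun; the same two-layer structure, CUDA kernels inside each node and MPI among nodes, carries over to the real implementation.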

Parallel Distributed Breadth First Search on the Kepler Architecture

We present the results obtained with an evolution of our CUDA-based solution for the exploration, via a breadth-first search, of large graphs. This latest version makes the best use of the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges per second using a cluster equipped with 4,096 Tesla K20X GPUs. (Paper, arXiv)
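The data-reduction idea can be illustrated with a small, hypothetical kernel: before a buffer of frontier candidates is handed to MPI, vertices that the local GPU already knows to be visited are dropped and the survivors are compacted, so that only potentially useful data travels across the network. The bitmap layout, the name prune_candidates and the atomic compaction are illustrative choices made for this page, not the mechanism of the paper, which combines several techniques (including duplicate removal) to cut traffic further.

// Hypothetical sketch of pruning a send buffer before the GPU-to-GPU exchange:
// candidates already marked in the local visited bitmap are dropped, the rest
// are compacted into a smaller buffer via an atomic counter.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// 'visited' is a bitmap with one bit per vertex. Only candidates whose bit is
// still clear are appended to the pruned buffer.
__global__ void prune_candidates(const int *candidates, int ncand,
                                 const unsigned int *visited,
                                 int *pruned, int *pruned_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ncand) return;
    int v = candidates[i];
    if ((visited[v >> 5] >> (v & 31)) & 1u) return;   // already visited: drop
    pruned[atomicAdd(pruned_size, 1)] = v;
}

int main() {
    // Toy data: 8 candidate vertices, vertices 1 and 4 already visited locally.
    std::vector<int> cand = {0, 1, 2, 4, 4, 5, 6, 1};
    std::vector<unsigned int> visited = {(1u << 1) | (1u << 4)};

    int *d_cand, *d_pruned, *d_size;
    unsigned int *d_vis;
    cudaMalloc(&d_cand, cand.size() * sizeof(int));
    cudaMalloc(&d_pruned, cand.size() * sizeof(int));
    cudaMalloc(&d_size, sizeof(int));
    cudaMalloc(&d_vis, visited.size() * sizeof(unsigned int));
    cudaMemcpy(d_cand, cand.data(), cand.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vis, visited.data(), visited.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_size, 0, sizeof(int));

    prune_candidates<<<1, 128>>>(d_cand, (int)cand.size(), d_vis, d_pruned, d_size);

    int size;
    cudaMemcpy(&size, d_size, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> pruned(size);
    cudaMemcpy(pruned.data(), d_pruned, size * sizeof(int), cudaMemcpyDeviceToHost);
    printf("kept %d of %zu candidates:", size, cand.size());
    for (int v : pruned) printf(" %d", v);
    printf("\n");
    return 0;
}

Note that duplicates among the unvisited candidates are kept here; removing them as well, as the paper does, shrinks the exchanged buffers further.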

Efficient breadth first search on multi-GPU systems

Simple algorithms for performing a Breadth-First Search on large graphs, when run on clusters of GPUs, suffer from load imbalance among threads and uncoalesced memory accesses, resulting in poor performance. To obtain a significant improvement on a single GPU and to scale across multiple GPUs, we resort to a suitable combination of operations to rearrange the data before processing it. We propose a novel technique for mapping threads to data that achieves perfect load balance by leveraging prefix-sum and binary-search operations. To reduce the communication overhead, we prune the set of edges that needs to be exchanged at each BFS level. The result is an algorithm that makes the best use of the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes. (Paper)
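A minimal sketch of the prefix-sum plus binary-search mapping is given below. The exclusive prefix sum of the frontier degrees turns the frontier into a single range of edge indices; one thread is then assigned to each edge, and a binary search on the prefix sum recovers the frontier vertex that the edge belongs to, so every thread does the same amount of work regardless of the degree distribution. The code is a self-contained toy example written for this page (host-side scan; names such as expand_frontier and owner_of_edge are our own), not the actual Bfs2D kernels.

// Minimal sketch of the prefix-sum + binary-search thread-to-data mapping
// (illustrative, not the Bfs2D code): one thread per frontier edge, so the
// work per thread is uniform even on graphs with a skewed degree distribution.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Largest slot i such that edge_offsets[i] <= e, where edge_offsets is the
// exclusive prefix sum of the frontier degrees.
__device__ int owner_of_edge(const int *edge_offsets, int nfrontier, int e) {
    int lo = 0, hi = nfrontier - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) >> 1;
        if (edge_offsets[mid] <= e) lo = mid; else hi = mid - 1;
    }
    return lo;
}

// CSR graph (row_ptr / col_idx). Each thread handles exactly one frontier
// edge: the binary search recovers the frontier vertex, simple arithmetic the
// neighbor; unvisited neighbors are claimed with an atomic CAS.
__global__ void expand_frontier(const int *row_ptr, const int *col_idx,
                                const int *frontier, const int *edge_offsets,
                                int nfrontier, int total_edges, int level,
                                int *dist, int *next_frontier, int *next_size) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= total_edges) return;
    int slot = owner_of_edge(edge_offsets, nfrontier, e);
    int u = frontier[slot];
    int v = col_idx[row_ptr[u] + (e - edge_offsets[slot])];
    if (atomicCAS(&dist[v], -1, level + 1) == -1)
        next_frontier[atomicAdd(next_size, 1)] = v;
}

int main() {
    // Toy CSR graph: edges 0-1, 0-2, 1-3, 2-3, 3-4 (undirected).
    std::vector<int> row_ptr = {0, 2, 4, 6, 9, 10};
    std::vector<int> col_idx = {1, 2, 0, 3, 0, 3, 1, 2, 4, 3};
    int n = 5, source = 0;

    int *d_row, *d_col, *d_dist, *d_front, *d_next, *d_nsize, *d_off;
    cudaMalloc(&d_row, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_col, col_idx.size() * sizeof(int));
    cudaMalloc(&d_dist, n * sizeof(int));
    cudaMalloc(&d_front, n * sizeof(int));
    cudaMalloc(&d_next, n * sizeof(int));
    cudaMalloc(&d_off, n * sizeof(int));
    cudaMalloc(&d_nsize, sizeof(int));
    cudaMemcpy(d_row, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, col_idx.data(), col_idx.size() * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<int> dist(n, -1), frontier = {source};
    dist[source] = 0;
    cudaMemcpy(d_dist, dist.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    for (int level = 0; !frontier.empty(); ++level) {
        // Exclusive prefix sum of the frontier degrees (done on the host here
        // for brevity; the real code would use a device scan).
        std::vector<int> off(frontier.size());
        int total = 0;
        for (size_t i = 0; i < frontier.size(); ++i) {
            off[i] = total;
            total += row_ptr[frontier[i] + 1] - row_ptr[frontier[i]];
        }
        if (total == 0) break;
        cudaMemcpy(d_front, frontier.data(), frontier.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_off, off.data(), off.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_nsize, 0, sizeof(int));

        int threads = 128, blocks = (total + threads - 1) / threads;
        expand_frontier<<<blocks, threads>>>(d_row, d_col, d_front, d_off,
                                             (int)frontier.size(), total, level,
                                             d_dist, d_next, d_nsize);
        int nsize;
        cudaMemcpy(&nsize, d_nsize, sizeof(int), cudaMemcpyDeviceToHost);
        frontier.resize(nsize);
        if (nsize) cudaMemcpy(frontier.data(), d_next, nsize * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(dist.data(), d_dist, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v) printf("dist[%d] = %d\n", v, dist[v]);
    return 0;
}

In a production kernel the scan itself would also run on the device (e.g. with Thrust or CUB) and the frontier would stay resident in GPU memory between levels; only the structure of the mapping is meant to carry over.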