What is PrIter?

The PrIter project started at University of Massachusetts Amherst from 2011. PrIter is a modified version of Hadoop MapReduce framework that supports prioritized iterative computation, which support a large collection of iterative algorithms, including pagerank and shortest path. PrIter runs on a cluster of commodity PCs or in Amazon EC2 cloud. It ensures faster convergence of iterative process by reorganizing the update order of data items. Priter also supports online queries and generates top-k result snapshot every period of time. For details, please read our paper accepted in SOCC 2011.

Currently, PrIter is just a prototype. It is better for understanding the priority idea than for production usage. Any feedback is appreciated and we welcome your involvement in this project. If you have any question, please contact me (yanfengzhang@ecs.umass.edu).

Recently, the file-based PrIter is implemented, which maintains data in files instead of in memory. This ensures PrIter to scale much larger data sets. Memory-based PrIter is for better performance, while file-based PrIter is for better scalability. But notice that, we have tested that the file-based PrIter is only 2 times slower than the memory-based PrIter for Pagerank computation.

Ongoing Works


Getting Started

PrIter is implemented based on Hadoop 0.19.2 and HOP. The Hadoop jobs can also run in PrIter with the same code.

  1. Download hadoop-priter-0.1.tar.gz.
  2. Unpack this tar and deploy PrIter cluster using one or more commodity PCs. The deployment of PrIter in a distributed environment is the same as Hadoop's deployment. You can refer to Hadoop Quick Start instructions, if you've never used Hadoop. There are a few notes:
  3. Go to {priter_path}/apps directory, the shell scripts of four sample applications are provided. Here we take the pagerank script as an example:

The pagerank code can be found in {priter_path}/src/examples. It first distributes the graph data to workers by an MapReduce job and then performs pagerank by prioritized iteration. Every period of time snapshot_interval, PrIter outputs a top-k snapshot on HDFS (HDFS_output_path). The computation terminates when the difference between two iteration progresses measured by two consecutive snapshot time points are smaller thantermination_difference. Users can set queue_size to control the prioritization degree, which is a balance between priority benefit and priority overhead. * Run other applications: More applications scripts can be found in {priter_path}/apps, including shortest path, adsorption, katz metric. Their source codes are included in the example source dir.

Something that should be noticed when running PrIter. * It is normal that the job progress is stuck at 0%. Because the iterations are within a single job, we perform the termination check by measuring the distance between two snapshot results (terminate when the distance is smaller than a threshold, the number of iterations is unknown), it is difficult to estimate the iteration progress. So we skip the progress monitor implementation. You can see the progress by checking the generated snapshots on HDFS or tracking the task logs (or do something in the iterate() interface). * The setting of the snapshot interval parameter (set by -i) is important. PrIter performs termination check along with generating the snapshot, so the snapshot interval is the same as the termination check interval. Suppose we measure the distance between two consecutive snapshots, snapshot1 and snapshot2. If |snapshot1-snapshot2|

The following pages provide further implementation details that help you know how to program in PrIter:

An brief introduction of PrIter

PrIter API

PrIter system parameters

PageRank code