This is my project for the Insight Data Engineering fellowship (Silicon Valley, Winter 2017).
In this project, I implemented a data pipeline to analyze a stream of Venmo payments. I was particularly interested in studying which types of transactions were most responsible for forming 'cliques'. The cliqueness, or clustering, of the Venmo social network can be quantified by calculating its transitivity. Transitivity measures the level of clustering in a graph, and is calculated as three times the number of triangles (cycles of length 3) divided by the number of wedges (paths of length 2), i.e. the fraction of wedges that are closed.
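To make the definition concrete, here is a minimal example that computes exact transitivity on a toy graph. It uses networkx purely for illustration; networkx is not part of this pipeline.

```python
import networkx as nx

# Toy graph: a triangle (0-1-2) plus a pendant edge (2-3).
# Triangles T = 1, wedges W = 5, so transitivity = 3*T/W = 0.6.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print(nx.transitivity(G))  # 0.6
```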
Determining exact transitivity is roughly an O(n³) calculation, and is impractical for large graphs. Furthermore, measuring transitivity from a stream of edges introduces another level of complexity to the problem.
My solution to this problem was to implement the one-pass streaming transitivity approximation algorithm outlined by Jha et al. 2013. The algorithm maintains a random sample of wedges from the stream and tracks the fraction of sampled wedges that get "closed" into triangles by incoming edges. To verify the accuracy of the approximation, I compared its results against the exact transitivity calculated on a smaller sample of Venmo transactions. Finally, I also calculated some global statistics on the counts and transitivity of transaction types across the whole Venmo community.
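The sketch below illustrates the wedge-sampling idea in a simplified form: it keeps a single fixed-size wedge reservoir rather than the paper's paired edge and wedge reservoirs, and the class name, reservoir size, and sampling details are illustrative rather than the exact implementation in ./src/.

```python
import random
from collections import defaultdict

class WedgeSampler:
    """Simplified one-pass wedge sampler in the spirit of Jha et al. 2013."""

    def __init__(self, reservoir_size=1000):
        self.reservoir_size = reservoir_size
        self.wedges = []                   # sampled wedges as (endpoint, center, endpoint)
        self.closed = []                   # parallel flags: closed into a triangle yet?
        self.adjacency = defaultdict(set)
        self.total_wedges = 0              # wedges generated by the stream so far

    def add_edge(self, u, v):
        # 1) Mark any sampled wedge whose two endpoints this edge connects.
        for i, (a, _, b) in enumerate(self.wedges):
            if {u, v} == {a, b}:
                self.closed[i] = True

        # 2) Each existing neighbor of u or v forms a new wedge with edge (u, v);
        #    reservoir-sample these so every wedge seen so far is retained with
        #    equal probability.
        new_wedges = [(w, u, v) for w in self.adjacency[u] if w != v] + \
                     [(w, v, u) for w in self.adjacency[v] if w != u]
        for wedge in new_wedges:
            self.total_wedges += 1
            if len(self.wedges) < self.reservoir_size:
                self.wedges.append(wedge)
                self.closed.append(False)
            elif random.random() < self.reservoir_size / self.total_wedges:
                j = random.randrange(self.reservoir_size)
                self.wedges[j] = wedge
                self.closed[j] = False

        self.adjacency[u].add(v)
        self.adjacency[v].add(u)

    def transitivity(self):
        if not self.wedges:
            return 0.0
        # A sampled wedge can only be marked closed by an edge arriving after both
        # of its own edges, so at most one wedge per triangle is detected as closed;
        # the closed fraction therefore approximates T/W, and 3x that approximates
        # the transitivity 3T/W.
        return 3.0 * sum(self.closed) / len(self.wedges)
```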
Initially, my implementation of the streaming transitivity algorithm returned extremely poor approximations, differing from the actual transitivity by up to almost 20%. This is because the algorithm as described by Jha et al. 2013 does not account for repeated edges, which are a frequent occurrence in Venmo transactions (multiple food-related transactions may happen between the same two users). To circumvent this, I used a Scalable Bloom Filter to track and ignore repeated edges, which yielded much better results.
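A minimal sketch of the deduplication step, assuming the pybloom_live package; the package choice, capacity, and error rate here are assumptions for illustration, and the real implementation may configure this differently.

```python
from pybloom_live import ScalableBloomFilter

# The filter grows as more distinct edges arrive, keeping the false-positive
# rate bounded without knowing the stream length in advance.
seen_edges = ScalableBloomFilter(initial_capacity=100000, error_rate=0.001)

def process_edge(u, v, sampler):
    # Canonical key so (u, v) and (v, u) count as the same pair of users.
    key = "{}:{}".format(min(u, v), max(u, v))
    if key in seen_edges:
        return                    # repeated transaction between the same users: skip it
    seen_edges.add(key)
    sampler.add_edge(u, v)        # only first-time edges reach the wedge sampler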
Historical Venmo data is stored on Amazon S3, which I used for batch processes as well as simulated streams.
For the batch part of the pipeline, I used Spark to process the data into colored edges, which were then stored in Redis. On the streaming side, I emulated a stream by feeding transactions from S3 into Kafka, then processed them with my Kafka consumer; the results were also stored in Redis. My Flask application then queried the Redis database for the relevant data.
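Below is a hedged sketch of what the streaming consumer might look like, tying together the pieces above; the topic name, broker address, message fields, and Redis keys are hypothetical stand-ins, not the actual names used in ./src/.

```python
import json

import redis
from kafka import KafkaConsumer

# Hypothetical topic, broker, and field names, for illustration only.
consumer = KafkaConsumer(
    'venmo-transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)
store = redis.StrictRedis(host='localhost', port=6379, db=0)
sampler = WedgeSampler()

for message in consumer:
    txn = message.value
    u, v, category = txn['actor'], txn['target'], txn['category']
    process_edge(u, v, sampler)                       # Bloom-filter dedup + wedge sampling
    store.hincrby('transaction_counts', category, 1)  # per-type counts for the Flask app
    store.set('estimated_transitivity', sampler.transitivity())
```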
- ./src/ contains all relevant files for replicating the batch and streaming portions of the project
- ./webapps/ contains the relevant files for the Flask web application
- ./sample-data contains some sample Venmo transaction data
- ./config contains configurations for setting up AWS clusters using Pegasus
- ./dev-scripts contains other in-progress scripts I wrote during the project that aren't part of the final project (messing around with different databases, trying out Spark Streaming, etc.); potentially handy for future use