Spark+Scala implementation of neighborhood sampling and feature learning for large graphs. This repository includes an implementation of node2vec with the following features:
- Scalable node2vec
- Second-order random walk (a minimal sketch of the biased transition follows this list)
- Memory-efficient graph data structure, introduced at Spark Summit 2017 in "Random Walks on Large Scale Graphs with Apache Spark" by Min Shen (LinkedIn)
- Leverages graph partition information to reduce communication and speed up the random-walk computation
- Compatible with Spark-JobServer
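
The second-order walk is controlled by the return parameter p and the in-out parameter q from the node2vec paper (see References): stepping back to the previous vertex is weighted by 1/p, staying within the previous vertex's neighborhood by 1, and moving farther away by 1/q. Below is a minimal Scala sketch of that transition rule, assuming a simple in-memory adjacency map; this is illustrative only, not the repository's memory-efficient structure, and all names are hypothetical:

```scala
import scala.util.Random

object BiasedWalkSketch {
  // Hypothetical adjacency representation: vertex -> (neighbor, edge weight) pairs.
  type Graph = Map[Int, Seq[(Int, Double)]]

  // Unnormalized node2vec bias for stepping from `curr` to `next`, given the
  // previously visited vertex `prev`: 1/p to return to `prev`, 1.0 if `next`
  // is also a neighbor of `prev`, and 1/q to move farther away.
  def bias(graph: Graph, prev: Int, next: Int, p: Double, q: Double): Double =
    if (next == prev) 1.0 / p
    else if (graph(prev).exists(_._1 == next)) 1.0
    else 1.0 / q

  // One biased step: sample the next vertex with probability proportional to
  // edgeWeight * bias. Assumes `curr` has at least one neighbor.
  def step(graph: Graph, prev: Int, curr: Int, p: Double, q: Double, rnd: Random): Int = {
    val weighted = graph(curr).map { case (nbr, w) => (nbr, w * bias(graph, prev, nbr, p, q)) }
    var r = rnd.nextDouble() * weighted.map(_._2).sum
    weighted.find { case (_, w) => r -= w; r <= 0.0 }.map(_._1).getOrElse(weighted.last._1)
  }
}
```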

Requirements:
- Scala 2.11 or later
- Maven 3+
- Java 8+
- Apache Spark 2.2.0 or later
- (Optional) Spark-JobServer 0.8.0 or later
Get the random walk application source code:
- Using git: git clone git@github.com:data61/stellar-random-walk.git
- Using HTTP: download the zip file and unzip it.
Go to the source code directory. A pre-built jar file, randomwalk-0.0.1-SNAPSHOT.jar, is available in ./target; use this jar to run the application. To build the jar from the source code instead, install Apache Maven and run:

```
mvn clean package
```
Download Apache Spark 2.2.0 or later (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7). To run the application on your machine, use the spark-submit script: go to the Apache Spark directory and run:

```
bin/spark-submit --class au.csiro.data61.randomwalk.Main [random walk dir]/target/randomwalk-0.0.1-SNAPSHOT.jar
```
To run the application with Spark-JobServer, make sure the prerequisites are installed:
- Apache Spark (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7)
- Java Virtual Machine (e.g., 9.0.1)
- sbt

Clone the Spark-JobServer repository:

```
git clone https://github.com/spark-jobserver/spark-jobserver.git
```
Create a .bashrc with the paths to the JVM, sbt, and Spark; e.g., for macOS users:

```
export SBT=/usr/local/Cellar/sbt/1.1.0
export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk
export PATH=$JAVA_HOME/bin:$SBT/bin:$PATH
```

Run:

```
source .bashrc
```
Go to the Spark folder and run start-all.sh:

```
cd spark-2.2.1-bin-hadoop2.7/
sbin/start-all.sh
```
From the Spark-JobServer directory, start an sbt shell:

```
cd spark-jobserver/job-server/
sbt
```

Once inside the sbt shell, run reStart.
Go to http://localhost:8090/ and make sure the Spark-JobServer UI is working. (Note: in Chrome the binaries were not updated properly, while Firefox worked fine.)
Upload the randomwalk jar to the server:

```
curl --data-binary @randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar localhost:8090/jars/randomwalk
```
Submit a job:

```
curl -d "rw.input = --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 --directed false --input [random walk dir]/src/test/resources/karate.txt --output [output dir] --partitioned false" 'localhost:8090/jobs?appName=randomwalk&classPath=au.csiro.data61.randomwalk.Main'
```
Check the job status in the Spark-JobServer UI.
The following options are available (the number after each description is its default value):

```
--walkLength <value>     walkLength: 80
--numWalks <value>       numWalks: 10
--p <value>              return parameter p: 1.0
--q <value>              in-out parameter q: 1.0
--rddPartitions <value>  number of RDD partitions used for the random walk and word2vec: 200
--weighted <value>       weighted: true
--directed <value>       directed: false
--w2vPartitions <value>  number of partitions in word2vec: 10
--input <value>          input edge-file/paths-file: empty
--output <value>         output path: empty
--singleOutput <value>   whether to write the output in a single file: true
--cmd <value>            command: randomwalk/embedding/node2vec (node2vec runs randomwalk + embedding)
--partitioned <value>    whether the graph is partitioned: false
--lr <value>             learning rate in word2vec: 0.025
--iter <value>           number of iterations in word2vec: 10
--dim <value>            number of dimensions in word2vec: 128
--window <value>         window size in word2vec: 10
```
For example:

```
bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
  --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 \
  --input [input edge list] --output [output directory] --partitioned false
```
You can choose the algorithm to run with the --cmd option:
- randomwalk: run only the second-order random walk.
- embedding: run only word2vec, given paths as input.
- node2vec: run randomwalk + embedding.
The input graph must be an edge list with integer vertex IDs. For example:

```
src1-id dst1-id
src1-id dst2-id
...
```

If the graph is weighted, each edge must include its weight in the last column. For example:

```
src1-id dst1-id 1.0
```

If the graph is partitioned, each edge should carry a partition number, i.e., be assigned to a partition. The partition number must be in the third column of the edge list (with the weight, if present, in the fourth). For example:

```
src1-id dst1-id 1 1.0
src1-id dst2-id 1 1.0
src3-id dst1-id 2 1.0
...
```
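
For illustration only, here is one way these layouts could be parsed in plain Scala; the Edge case class and parseEdge function are hypothetical names, not the repository's API:

```scala
object EdgeListSketch {
  // Hypothetical record for one edge-list line: src dst [partition] [weight].
  case class Edge(src: Int, dst: Int, partition: Option[Int], weight: Double)

  // Parse the three layouts shown above, assuming whitespace-separated columns.
  def parseEdge(line: String): Edge = line.trim.split("\\s+") match {
    case Array(s, d)       => Edge(s.toInt, d.toInt, None, 1.0)                 // unweighted
    case Array(s, d, w)    => Edge(s.toInt, d.toInt, None, w.toDouble)          // weighted
    case Array(s, d, p, w) => Edge(s.toInt, d.toInt, Some(p.toInt), w.toDouble) // partitioned
    case _                 => sys.error(s"Unexpected edge line: $line")
  }
}
```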
The application itself replicates (cuts) the vertices that span multiple partitions.
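
Continuing the hypothetical Edge representation from the sketch above, finding the vertices that span multiple partitions (and would therefore be replicated) might look like this:

```scala
// Illustrative only: a vertex listed under more than one partition id is a
// boundary vertex, which would be copied into each partition it touches.
def boundaryVertices(edges: Seq[EdgeListSketch.Edge]): Set[Int] = {
  val partitionsPerVertex: Map[Int, Set[Int]] =
    edges
      .flatMap(e => e.partition.toSeq.flatMap(p => Seq(e.src -> p, e.dst -> p)))
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }
  partitionsPerVertex.collect { case (v, parts) if parts.size > 1 => v }.toSet
}
```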
The application writes its output to disk in the directory given by the --output parameter. Three subfolders may be created there:
- /path: the random-walk results.
- /vec: vector representations of vertices.
- /bin: the word2vec model's metadata.
The randomwalk output (in the /path directory) is tab-separated integers (vertex IDs), where each line represents the random steps starting from one vertex. The number of lines equals the number of generated paths (--numWalks * |V|). The result is written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
The embeddings are also tab-separated numbers, one line per vertex, where the first number is the vertex ID and the remaining numbers are that vertex's vector representation. The result is likewise written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
An example of randomwalk output:

```
4 2 4 2 4
32 26 32 33 16
20 2 31 2 31
...
```
An example of vector representations:

```
34 -0.055403516 -0.007816158 0.053844377 -0.038890388 0.13484207 -0.07744342
12 -0.025323598 0.054739047 -0.077121474 -0.016575111 -0.041371193 -0.046176624
8 -0.07687951 0.120588094 -0.11744399 -0.058044944 -0.061426725 -0.13111477
...
```
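
A small Scala sketch for loading such a vector file, assuming it was written with --singleOutput true; the part-00000 file name is an assumption based on Spark's usual text-output naming, not something the repository guarantees:

```scala
import scala.io.Source

// Illustrative only: read tab-separated "id v1 v2 ..." lines into a map of
// vertex id -> embedding vector.
def loadEmbeddings(path: String): Map[Int, Array[Double]] =
  Source.fromFile(path).getLines().map { line =>
    val cols = line.split("\t")
    cols.head.toInt -> cols.tail.map(_.toDouble)
  }.toMap

// e.g. loadEmbeddings("[output dir]/vec/part-00000")
```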
References:
- Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Shen, Min. "Random Walks on Large Scale Graphs with Apache Spark." Spark Summit 2017.