This is the repository for the demo paper *SparkSQL+: Next-generation Query Planning over Spark*.
- Java 1.8
- Scala 2.12.10
- Maven 3.8.6
- Spark 3.0.1
- HDFS 2.7.7 (optional)
SparkSQL+ is organized as a Maven multi-module project for build and dependency management. It consists of the following submodules:
- sqlplus-core: contains the parser, planner, and code generator of SparkSQL+.
- sqlplus-lib: contains the library functions used by the Scala programs generated by SparkSQL+.
- sqlplus-web: contains the web-based interface.
- sqlplus-cli: contains the command line interface.
- sqlplus-example: contains the implementation of built-in example queries.
Run `git clone [email protected]:hkustDB/SparkSQLPlus.git`.
- Local mode
  - Use Local mode if the Spark Standalone cluster is deployed on the same machine.
  - HDFS is not needed in Local mode.
- Remote mode
  - Use Remote mode if the Spark Standalone cluster is deployed on different machines.
  - The jars will be uploaded to HDFS before execution.
  - The input data must be placed in HDFS before execution.
- Local port forwarding is necessary when the remote Spark and HDFS clusters are not directly accessible. In this case, enable the `experiment.forwarding` option (see below) and manually set up local port forwarding for the following ports (e.g., through `ssh -L`):
  - 6066 (Spark REST submission)
  - 7077 (Spark master)
  - 8080 (Spark master web UI)
  - 8081 (Spark worker web UI)
  - 9000 (HDFS NameNode)
  - 50070 (HDFS NameNode web UI)
  - 50075 (HDFS DataNode web UI)
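As a sketch, the forwardings for all of the ports above can be assembled into a single `ssh` invocation. Here `user@gateway-host` is a placeholder for a host that can reach the remote Spark and HDFS clusters; substitute your own.

```shell
# Build one ssh command that forwards every port required in Remote mode.
# "user@gateway-host" is a placeholder; replace it with a host that can
# reach the remote Spark and HDFS clusters.
PORTS="6066 7077 8080 8081 9000 50070 50075"
FWD=""
for p in $PORTS; do
  FWD="$FWD -L $p:localhost:$p"
done
echo "ssh -N$FWD user@gateway-host"
```

Running the printed command in a separate terminal keeps the tunnels open (`-N` skips starting a remote shell), so the remote ports behave as if the clusters were local.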
- Run `cp src/main/resources/application.yml.template src/main/resources/application.yml`.
- Edit `src/main/resources/application.yml`.
```yaml
server:
  port: 8848

logging:
  config: classpath:log4j2-spring.xml

sqlplus:
  home: /Users/sqlplus/Projects/SparkSQLPlus  # the absolute path of SparkSQLPlus

experiment:
  mode: local        # local or remote
  forwarding: false  # whether to use local port forwarding
  spark:
    master:
      host: localhost
      port: 7077
    submission:
      port: 6066
    web-ui:
      port: 8080
    driver:
      memory: 4g
      cores: 1
    executor:
      memory: 4g
      cores: 1
    default:
      parallelism: 1
  timeout: 300       # timeout setting, in seconds
  hdfs:
    host: localhost
    port: 50070
    path: /Users/sqlplus  # base path for the uploaded jars
    user: sqlplus
  data:
    path: /Users/sqlplus/data  # base path for the input data
  result:
    type: web-ui
```
- Edit `$SPARK_HOME/conf/spark-defaults.conf` and add `spark.master.rest.enabled true` to the end.
- Edit `$SPARK_HOME/conf/log4j.properties` and add the following configuration.
```properties
log4j.logger.SparkSQLPlusExperiment=INFO, SparkSQLPlus
log4j.appender.SparkSQLPlus=org.apache.log4j.ConsoleAppender
log4j.appender.SparkSQLPlus.target=System.out
log4j.appender.SparkSQLPlus.layout=org.apache.log4j.PatternLayout
log4j.appender.SparkSQLPlus.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```
Run `mvn clean package`.
Run `bash examples/data/download.sh` to download a graph from SNAP.
It is also possible to use other input data as long as the columns are separated by commas.
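For reference, the following creates a tiny comma-separated input by hand. The file name and the two-column `src,dst` edge layout are illustrative assumptions; any comma-separated columns matching your DDL would work.

```shell
# Create a tiny comma-separated graph (an edge list) as illustrative input.
# The two-column src,dst layout is an assumption for illustration only.
cat > /tmp/sample_graph.csv <<'EOF'
1,2
2,3
3,1
EOF
cat /tmp/sample_graph.csv
```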
The example queries are in `examples/query`.
`sparksql-plus` compiles the input SQL file into SparkSQL+ code.

```
syntax: sparksql-plus [OPTIONS] <query>
options:
 -d,--ddl <path>           Set the path to the ddl file.
 -h,--help                 Show the help message.
 -n,--name <object name>   Set the object name for the output object.
 -o,--output <path>        Set the path to the output file.
 -p,--pkg <package name>   Set the package name for the output object.
```
The following command generates SparkSQL+ code for `examples/query/q1`.

```shell
./bin/sparksql-plus -d examples/query/q1/graph.ddl -o examples/query/q1/q1.scala examples/query/q1/query.sql
```
Run `java -jar sqlplus-web/target/sparksql-plus-web-jar-with-dependencies.jar`.
Visit http://localhost:8848/ in the browser.
- Submit a query.
- Select a candidate.
- Persist the generated SparkSQL+ code.
Click the `Experiment` tab at the top and submit the experiments. The results will be fetched and displayed automatically.