SparkSQL+

This is the repository for the demo paper SparkSQL+: Next-generation Query Planning over Spark.

Prerequisites

  • Java 1.8
  • Scala 2.12.10
  • Maven 3.8.6
  • Spark 3.0.1
  • HDFS 2.7.7 (optional)

Project Structure

SparkSQL+ is organized as a Maven multi-module project for project organization and dependency management. It consists of the following submodules:

  • sqlplus-core: contains the parser, planner, and code generator of SparkSQL+.
  • sqlplus-lib: contains the library functions used by the Scala programs generated by SparkSQL+.
  • sqlplus-web: contains the web-based interface.
  • sqlplus-cli: contains the command line interface.
  • sqlplus-example: contains the implementation of built-in example queries.

Setup

Clone

Run git clone git@github.com:hkustDB/SparkSQLPlus.git.

Configurations

SparkSQL+ modes

  • Local mode
    • Use Local mode if the Spark Standalone Cluster is deployed on the same machine.
    • HDFS is not needed in Local mode.
  • Remote mode
    • Use Remote mode if the Spark Standalone Cluster is deployed on different machines.
    • The jars will be uploaded to HDFS before execution.
    • The input data should be placed in HDFS before execution.

Local forwarding

  • Local port forwarding is necessary when the remote Spark and HDFS clusters are not directly accessible. In this case, set experiment.forwarding to true (see below).
  • Manually enable local port forwarding for the following ports (e.g., through ssh -L):
    • 6066
    • 7077
    • 8080
    • 8081
    • 9000
    • 50070
    • 50075
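A single ssh invocation can forward all of these ports at once. The sketch below builds the -L flags programmatically and prints the resulting command; the remote host and user names are placeholders, not values from this project, and the target of each forward assumes every service listens on localhost of the remote machine.

```shell
#!/bin/sh
# Build one ssh command that forwards every port SparkSQL+ needs.
# REMOTE_HOST and REMOTE_USER are placeholders -- substitute your own.
REMOTE_HOST="spark-master.example.com"
REMOTE_USER="sqlplus"

FLAGS=""
for port in 6066 7077 8080 8081 9000 50070 50075; do
    # Forward localhost:<port> to the same port on the remote machine.
    FLAGS="$FLAGS -L ${port}:localhost:${port}"
done

# -N: forward ports only, do not run a remote command.
echo "ssh -N$FLAGS ${REMOTE_USER}@${REMOTE_HOST}"
```

Running the printed command in a separate terminal keeps the tunnels open while SparkSQL+ is in use.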

SparkSQL+ configurations

  • Run cp src/main/resources/application.yml.template src/main/resources/application.yml.
  • Edit src/main/resources/application.yml.
server:
  port: 8848

logging:
  config: classpath:log4j2-spring.xml

sqlplus:
  home: /Users/sqlplus/Projects/SparkSQLPlus    # the absolute path of SparkSQLPlus

experiment:
  mode: local                                   # local or remote
  forwarding: false                             # whether using local port forwarding
  spark:
    master:
      host: localhost
      port: 7077
      submission:
        port: 6066
      web-ui:
        port: 8080
    driver:
      memory: 4g
      cores: 1
    executor:
      memory: 4g
      cores: 1
    default:
      parallelism: 1
  timeout: 300                                  # timeout setting, in seconds 
  hdfs:
    host: localhost
    port: 50070
    path: /Users/sqlplus                        # base path for the uploaded jars
    user: sqlplus
  data:
    path: /Users/sqlplus/data                   # base path for the input data
  result:
    type: web-ui

Spark configurations

  • Edit $SPARK_HOME/conf/spark-defaults.conf and add spark.master.rest.enabled true to the end.
  • Edit $SPARK_HOME/conf/log4j.properties and add the following configurations.
log4j.logger.SparkSQLPlusExperiment=INFO, SparkSQLPlus
log4j.appender.SparkSQLPlus=org.apache.log4j.ConsoleAppender
log4j.appender.SparkSQLPlus.target=System.out
log4j.appender.SparkSQLPlus.layout=org.apache.log4j.PatternLayout
log4j.appender.SparkSQLPlus.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
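The spark-defaults.conf edit above can be scripted so it is safe to re-run. The sketch below works on a temporary file for demonstration; on a real installation, point CONF at $SPARK_HOME/conf/spark-defaults.conf instead.

```shell
#!/bin/sh
# Idempotently enable Spark's REST submission gateway in spark-defaults.conf.
# CONF is a temporary file here for demonstration purposes only.
CONF="$(mktemp)"

enable_rest() {
    # Append the setting only if it is not already present.
    grep -q '^spark.master.rest.enabled' "$1" \
        || echo 'spark.master.rest.enabled true' >> "$1"
}

enable_rest "$CONF"   # adds the line
enable_rest "$CONF"   # second call is a no-op
cat "$CONF"
```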

Build

Run mvn clean package.

Download data

Run bash examples/data/download.sh to download a graph from SNAP. It is also possible to use other input data as long as the columns are separated by commas.
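Since any comma-separated file works as input, a quick sanity check is to confirm that every row has the same number of columns. The sketch below does this with awk on a small fabricated edge list (the data and file path are illustrative, not part of the project):

```shell
#!/bin/sh
# Create a small sample edge list in the expected comma-separated layout.
DATA="$(mktemp)"
cat > "$DATA" <<'EOF'
1,2
2,3
3,1
EOF

# Count the distinct per-row column counts; a well-formed file yields 1.
DISTINCT="$(awk -F',' '{ print NF }' "$DATA" | sort -u | wc -l | tr -d ' 	')"
echo "distinct column counts: $DISTINCT"
```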

Usage

The example queries are in examples/query.

Command line interface

sparksql-plus compiles the input SQL file into SparkSQL+ code.

syntax: sparksql-plus [OPTIONS] <query>
  options:
     -d,--ddl <path>           Set the path to the ddl file.
     -h,--help                 Show the help message.
     -n,--name <object name>   Set the object name for the output object.
     -o,--output <path>        Set the path to the output file.
     -p,--pkg <package name>   Set the package name for the output object.

The following command generates SparkSQL+ code for examples/query/q1.

./bin/sparksql-plus -d examples/query/q1/graph.ddl -o examples/query/q1/q1.scala examples/query/q1/query.sql
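Once the generated Scala file has been compiled into a jar, it can be run with spark-submit. The sketch below only assembles and prints the invocation; the object name, jar path, and master URL are placeholders for illustration and must be adjusted to match your build and cluster.

```shell
#!/bin/sh
# Assemble a spark-submit invocation for the generated query.
# MAIN_CLASS, JAR, and MASTER are placeholders -- adjust to your setup.
MAIN_CLASS="SparkSQLPlusExample"
JAR="sqlplus-example/target/sqlplus-example.jar"
MASTER="spark://localhost:7077"

CMD="spark-submit --class $MAIN_CLASS --master $MASTER $JAR"
echo "$CMD"
```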

Web-based interface

Start the application

Run java -jar sqlplus-web/target/sparksql-plus-web-jar-with-dependencies.jar.

Access the web

Visit http://localhost:8848/ in the browser.

Compile a query

  • Submit a query.
  • Select a candidate.
  • Persist the generated SparkSQL+ code.

Run experiments

Click the Experiment tab on the top and submit the experiments. The result will be fetched and displayed automatically.