- PySpark is the Python API for Spark.
- The purpose of this PySpark tutorial is to demonstrate basic distributed algorithms using PySpark.
- PySpark has an interactive shell (`$SPARK_HOME/bin/pyspark`) for basic testing and debugging; it is not intended for production use.
- You may use the `$SPARK_HOME/bin/spark-submit` command to run PySpark programs (suitable for both testing and production environments).
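As a minimal sketch, a self-contained script (hypothetical file name `my_first_job.py`) that can be run with `spark-submit` might look like this; unlike the interactive shell, a submitted program must create its own `SparkContext`:

```python
from pyspark import SparkContext

if __name__ == "__main__":
    # A submitted program creates (and stops) its own SparkContext;
    # the interactive shell predefines one as sc.
    sc = SparkContext(appName="my-first-job")   # hypothetical application name
    rdd = sc.parallelize([1, 2, 3, 4])          # tiny hypothetical data set
    print(rdd.sum())                            # prints 10
    sc.stop()
```

To run it: `$SPARK_HOME/bin/spark-submit my_first_job.py`. Hedged sketches of the tutorial topics listed below appear right after the list; each assumes the interactive shell, where `sc` is predefined, and uses small hypothetical inputs.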
- DNA Base Counting
- Classic Word Count
- Find Frequency of Bigrams
- Join of Two Relations R(K, V1), S(K, V2)
- Basic Mapping of RDD Elements
- How to add all RDD elements together
- How to multiply all RDD elements together
- Find Top-N and Bottom-N
- Find average by using combineByKey()
- How to filter RDD elements
- How to find average
- Cartesian Product: rdd1.cartesian(rdd2)
- Sort By Key: sortByKey() ascending/descending
- How to Add Indices
- Map Partitions: mapPartitions() by Examples
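For DNA Base Counting, a minimal sketch (with hypothetical sequences): split each sequence into single bases, then count by key:

```python
dna = sc.parallelize(["ATCGGCTA", "TTAACG"])           # hypothetical sequences
base_counts = (dna.flatMap(lambda seq: list(seq))      # one record per base
                  .map(lambda base: (base, 1))
                  .reduceByKey(lambda a, b: a + b))
print(base_counts.collect())                           # e.g. [('A', 4), ('T', 4), ...]
```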
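The Classic Word Count follows the same shape; a minimal sketch with hypothetical input lines:

```python
lines = sc.parallelize(["a fox jumped", "a fox ran"])  # hypothetical input lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())                                # e.g. [('a', 2), ('fox', 2), ...]
```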
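For bigram frequencies, a sketch that emits adjacent word pairs within each line (the pairing helper is hypothetical):

```python
def line_bigrams(line):
    # Emit ((word_i, word_i+1), 1) for each adjacent word pair in the line.
    words = line.split()
    return [((words[i], words[i + 1]), 1) for i in range(len(words) - 1)]

lines = sc.parallelize(["to be or not to be"])         # hypothetical input
freq = lines.flatMap(line_bigrams).reduceByKey(lambda a, b: a + b)
print(freq.collect())                                  # e.g. [(('to', 'be'), 2), ...]
```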
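Joining two relations R(K, V1) and S(K, V2) maps directly onto `join()`; a sketch with hypothetical pairs:

```python
R = sc.parallelize([("k1", 1), ("k2", 2)])             # hypothetical R(K, V1)
S = sc.parallelize([("k1", "a"), ("k1", "b")])         # hypothetical S(K, V2)
print(R.join(S).collect())   # [('k1', (1, 'a')), ('k1', (1, 'b'))]
```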
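Basic mapping and whole-RDD sums/products are one-liners with `map()` and `reduce()`; a sketch:

```python
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: 2 * x)            # basic element-wise mapping
total   = nums.reduce(lambda a, b: a + b)      # add all elements -> 10
product = nums.reduce(lambda a, b: a * b)      # multiply all elements -> 24
```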
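Top-N and Bottom-N can be sketched with the built-in actions `top()` and `takeOrdered()`:

```python
nums = sc.parallelize([10, 1, 7, 3, 9])
print(nums.top(2))           # Top-N (largest first): [10, 9]
print(nums.takeOrdered(2))   # Bottom-N (smallest first): [1, 3]
```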
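A per-key average with `combineByKey()` tracks (sum, count) pairs; a sketch with hypothetical (key, value) data:

```python
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 4)])   # hypothetical (key, value) data
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                          # createCombiner: first value per key
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold a value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners: merge partial (sum, count)s
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(averages.collect())                      # [('a', 2.0), ('b', 4.0)]
```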
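Filtering and a whole-RDD average; `mean()` is a built-in action, shown alongside `filter()`:

```python
nums = sc.parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0)   # keep elements matching the predicate
print(evens.collect())                      # [0, 2, 4, 6, 8]
print(nums.mean())                          # whole-RDD average: 4.5
```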
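Cartesian products and key-based sorting; a sketch:

```python
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize(["a", "b"])
print(rdd1.cartesian(rdd2).collect())   # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

pairs = sc.parallelize([("b", 2), ("a", 1)])
print(pairs.sortByKey().collect())                 # ascending: [('a', 1), ('b', 2)]
print(pairs.sortByKey(ascending=False).collect())  # descending: [('b', 2), ('a', 1)]
```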
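Adding indices to RDD elements is built in via `zipWithIndex()`:

```python
letters = sc.parallelize(["a", "b", "c"])
print(letters.zipWithIndex().collect())   # [('a', 0), ('b', 1), ('c', 2)]
```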
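`mapPartitions()` applies a function to each whole partition rather than to each element; a sketch that sums per partition (the helper name is hypothetical):

```python
def partition_sum(iterator):
    # Receives all elements of one partition; yields one result per partition.
    yield sum(iterator)

nums = sc.parallelize([1, 2, 3, 4], 2)               # explicitly 2 partitions
print(nums.mapPartitions(partition_sum).collect())   # e.g. [3, 7]
```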
- Getting started with PySpark - Part 1
- Getting started with PySpark - Part 2
- A really really fast introduction to PySpark
- PySpark
- Basic Big Data Manipulation with PySpark
- Working in Pyspark: Basics of Working with Data and RDDs
- LinkedIn: Mahmoud Parsian's profile
- Email: [email protected]
- Twitter: @mahmoudparsian
Thank you!
Best regards,
Mahmoud Parsian