Modify your complex structures in spark-sql dataframes with optics.
Need to set an inner element in a complex structure?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val df: DataFrame = ???
import org.hablapps.sparkOptics._
df.select(Lens("field.subfield")(df.schema).set(lit(13)):_*)
Want to try it right now, click on the binder icon to lunch a interactive notebook.
Compiled for Scala 2.11 with Spark 2.3 and Scala 2.12 with Spark 2.4. Tested with spark 2.3, 2.4 in scala 2.11 and for 3.0.1 in scala 2.12.
libraryDependencies += "org.hablapps" %% "spark-optics" % "1.0.0"
Spark lens doesn't have any dependencies beyond Spark itself.
Optic to focus in a column of a provided schema.
Optic equivalent to a lens, but without an specific schema, so that it can be applied to any dataframe that contains the specified column.
Working with complex structures in Spark Sql's DataFrames can be hard. A common case if you are working with complex structures is to modify the inner elements of the structure. For instance:
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)
val employee = Employee("john", Company("awesome inc", Address("london", Street(23, "high street"))))
val df = List(employee).toDS.toDF
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = true)
| | | |-- number: integer (nullable = false)
| | | |-- name: string (nullable = true)
In order to modify an inner element, like changing the name of the street, we need to do something like this:
val mDF = df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
mDF.printSchema
val longCodeEmployee = mDF.as[Employee].head
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
mDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
longCodeEmployee: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET"))))
This can be simplified by using spark-optics. It allows you to focus in the element that you want to modify, and the optics will recreate the structure for you.
import org.hablapps.sparkOptics._
import org.apache.spark.sql.functions._
val df: DataFrame = List(employee).toDF
val streetNameLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(streetNameLens.modify(upper):_*)
modifiedDF.printSchema
modifiedDF.as[Employee].head
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
streetNameLens: Lens = Lens(company.address.street.name)
modifiedDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
res19_3: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET")))
)