This guide describes how to use Transport UDFs on each of the supported platforms. For general information about the project, please refer to the documentation index.
The Transport framework automatically generates UDF artifacts for each supported platform. These artifacts are distinguished by Maven classifiers appended to the original UDF artifact coordinates. The sections below describe how to identify the correct artifact and class for your platform, and how to use it on that platform.
As mentioned above, the Transport Plugin automatically generates artifacts for each platform. Once these artifacts are published to an Ivy repository, you can consume them using the corresponding Ivy coordinates, with the platform name as a Maven classifier. E.g. if the UDF has the Ivy coordinate `com.linkedin.transport-example:example-udf:1.0.0`, then the coordinate for the platform-specific UDF would be `com.linkedin.transport-example:example-udf:1.0.0?classifier=PLATFORM-NAME`, where `PLATFORM-NAME` is `hive`, `presto`, or `spark`.
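As an illustration, a consuming build can request the platform-specific artifact by classifier. The sketch below uses sbt and the example coordinates above; sbt is an assumption here (Gradle and Maven expose the same classifier mechanism), and the exact form may depend on how your repository serves the classified artifacts:

```scala
// build.sbt (sketch): pull in the Spark-specific UDF artifact by classifier.
// Coordinates are the example ones from this guide; substitute your own.
libraryDependencies +=
  ("com.linkedin.transport-example" % "example-udf" % "1.0.0").classifier("spark")
```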
If you are building the UDF project locally, the platform-specific artifacts are built alongside the UDF artifact in the output directory, with the platform name as a file suffix. If the built UDF is located at `/path/to/example-udf.ext`, then the platform-specific artifact is located at `/path/to/example-udf-PLATFORM-NAME.ext`, where `PLATFORM-NAME` is `hive`, `presto`, or `spark`.
If the UDF class is `com.linkedin.transport.example.ExampleUDF`, then the platform-specific UDF class will be `com.linkedin.transport.example.PLATFORM-NAME.ExampleUDF`, where `PLATFORM-NAME` is `hive`, `presto`, or `spark`.
In Hive:

- Add the UDF jar to the Hive session.

  To add the jar from a local file:

  ```
  hive (default)> ADD JAR /path/to/example-udf-hive.jar;
  ```

  To add the jar from an Ivy repository:

  ```
  hive (default)> ADD JAR ivy://com.linkedin.transport-example:example-udf:1.0.0?classifier=hive;
  ```
- Register the UDF with the function registry:

  ```
  hive (default)> CREATE TEMPORARY FUNCTION example_udf AS 'com.linkedin.transport.example.hive.ExampleUDF';
  ```
- Call the UDF in a query:

  ```
  hive (default)> SELECT example_udf(some_column, 'some_constant');
  ```
In Spark:

- Add the UDF jar to the classpath of the Spark application.

  If you are launching Spark through the Spark shell, use the `--jars` option to include the local UDF jar file (e.g. `spark-shell --jars /path/to/example-udf-spark.jar`). If you are writing the Spark application in Scala, use the dependency management solution of your build tool (e.g. Gradle/Maven) to include the UDF's Spark jar as a compile-time dependency.
- Register the UDF with the function registry:

  ```scala
  import com.linkedin.transport.example.spark.ExampleUDF

  val exampleUDF = ExampleUDF.register("example_udf")
  ```
- Call the UDF.

  You can use the UDF either through Spark SQL or the Spark DataFrame API.

  Spark SQL:

  ```scala
  spark.sql("""SELECT example_udf(some_column, 'some_constant')""")
  ```

  DataFrame API:

  ```scala
  dataframe.withColumn("result", exampleUDF(col("some_column"), lit("some_constant")))
  ```

  OR

  ```scala
  import org.apache.spark.sql.functions.callUDF

  dataframe.withColumn("result", callUDF("example_udf", col("some_column"), lit("some_constant")))
  ```
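Putting the Spark steps together, below is a minimal end-to-end sketch. The UDF class and registered name are the example ones from this guide, and the input path and column names are hypothetical placeholders; adapt them to your own UDF and data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

import com.linkedin.transport.example.spark.ExampleUDF  // generated Spark wrapper (example)

object ExampleUDFApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("transport-udf-example").getOrCreate()

    // Register the UDF; the returned handle can be used with the DataFrame API.
    val exampleUDF = ExampleUDF.register("example_udf")

    val dataframe = spark.read.parquet("/path/to/input")  // hypothetical input

    // Spark SQL usage: the data needs a table name, hence the temp view.
    dataframe.createOrReplaceTempView("input_table")
    spark.sql("SELECT example_udf(some_column, 'some_constant') FROM input_table").show()

    // DataFrame API usage via the registered handle.
    dataframe.withColumn("result", exampleUDF(col("some_column"), lit("some_constant"))).show()

    spark.stop()
  }
}
```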
In Presto:

- Add the UDF to the Presto installation.

  Unlike Hive and Spark, Presto currently does not allow dynamically loading jar files once the Presto server has started. In Presto, the jar is deployed to the `plugin` directory. However, a small patch is required for the Presto engine to recognize the jar as a plugin, since the generated Presto UDFs implement the `SqlScalarFunction` API, which is currently not part of Presto's SPI architecture. You can find the patch here and apply it before deploying your UDF jar to the Presto engine.
- Call the UDF in a query.

  To call the UDF, use the function name defined in the Transport UDF definition:

  ```
  presto-cli> SELECT example_udf(some_column, 'some_constant');
  ```