Release Cask Data Application Platform v4.2.0 · cdapio/cdap

Summary

Spark Enhancements: Added support for Apache Spark 2.x. Users have the option to configure CDAP to use Spark 1.x or Spark 2.x on their cluster. Also added the capability to run interactive Spark code within CDAP.
Enhanced Data Preparation: Added capabilities in data preparation to connect to the File System (Local and HDFS) and relational databases, browse and select their existing data, and import into Data Preparation for cleansing, preparing and transforming.
Event Driven Schedules: Added capabilities to start CDAP programs based on the availability of data partitions in HDFS and to impose run constraints to intelligently orchestrate CDAP Workflows.

New Features

Spark Enhancements

Added support for Spark 2.x. In environments where multiple Spark versions exist, CDAP must be configured to use one or the other (CDAP-7875)
Added the capability to run interactive Spark code within CDAP (CDAP-11409)
Added capabilities to run arbitrary Spark code in CDAP Pipelines (CDAP-11410)
Enhancements to speed up launching Spark programs (CDAP-11411)

Enhanced Data Preparation

Added a File System Browser component to browse the Local and HDFS file systems from Data Preparation (CDAP-9290)
Added Data Quality information to the Data Preparation table. Currently, it shows the completeness of each column (CDAP-9517)
Added point-and-click interactions for applying directives such as parsing, splitting, find and replace, filling null or empty rows, copying and deleting columns in Data Preparation. They can be invoked by using the dropdown menu for each column (CDAP-9524)
Added point-and-click interaction for cleansing column names (CDAP-11333)
Added a point-and-click interaction to set all column names in Data Preparation (CDAP-11334)
Added the ability to ingest data one-time from Data Preparation to a CDAP Dataset (CDAP-11424)
Added macro support for Data Preparation directives (CDAP-9556)

Event Driven Schedules

Introduced a new, event-driven scheduling system that can start programs based on data availability in HDFS partitions (CDAP-7593)
Users can now configure constraints for schedules, such as duration since last run and allowed time range for program execution (CDAP-11338)
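
The interaction between a partition trigger and run constraints can be pictured with a small sketch. This is not the CDAP scheduling API; RunConstraints, should_start and all of their parameters are hypothetical names chosen for illustration, and the sketch assumes a time window that does not wrap past midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class RunConstraints:
    """Hypothetical model of schedule constraints (not a CDAP class)."""
    min_since_last_run: timedelta  # required gap since the previous run
    window_start: time             # start of the allowed time range (inclusive)
    window_end: time               # end of the allowed time range (exclusive)


def should_start(new_partitions: int,
                 required_partitions: int,
                 now: datetime,
                 last_run: Optional[datetime],
                 constraints: RunConstraints) -> bool:
    """Return True only if the data trigger fired and every constraint holds."""
    # Trigger: enough new partitions must be available.
    if new_partitions < required_partitions:
        return False
    # Constraint: enough time must have passed since the last run, if any.
    if last_run is not None and now - last_run < constraints.min_since_last_run:
        return False
    # Constraint: the current time must fall inside the allowed window.
    return constraints.window_start <= now.time() < constraints.window_end
```

In this model the trigger and the constraints are checked independently, so a program whose data is ready is still held back until it is both outside the cool-down period and inside the permitted time range.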

Other New Features

Added capability for CDAP Services to dynamically list available artifacts and dynamically load artifacts (CDAP-11498)
Added support for EMR 5.0 - 5.3 (CDAP-7873)
Added the ability for Data Preparation to handle byte arrays of data for processing binary data (CDAP-11486)
Added an API to Spark Streaming sources to provide number of streams being used by a streaming source (CDAP-11422)
Users can now upload, view, and use plugins of type 'sparksink' in Studio (CDAP-11681)
Modified the log viewer to only show ERROR, WARN, and INFO levels of logs by default, instead of all logs as previously (CDAP-8668)

Bug fixes

Fixed a bug where the log level was always set to INFO at the root logger (CDAP-8289)
Fixed a bug where extra characters after an artifact version range were being ignored instead of being recognized as invalid (CDAP-7727)
Fixed a bug where users could not read from real Datasets while previewing CDAP Pipelines (CDAP-7884)
Fixed a bug that prevented users from adding extra classpath to Apache Spark drivers and executors (CDAP-9422)
Fixed a bug where impersonated workflow was not creating local datasets with the correct impersonated user (CDAP-9456)
Fixed a bug in Parquet and Avro File sinks that would cause them to fail if they received ByteBuffers instead of byte arrays (CDAP-11417)
Fixed a bug where writes could only succeed in one MongoDB sink even when multiple MongoDB sinks were present in a pipeline (CDAP-11558)
Fixed a thread leakage bug in Spark (SPARK-20935) that occurred after a Spark Streaming program completed (CDAP-11577)
Fixed a bug in TMS where fetching from the payload table raised an exception if the fetch had an empty result (CDAP-11588)
Fixed a bug in the Purchase example that could cause purchases to overwrite each other (CDAP-11643)
Fixed a bug that prevented the use of logback.xml in Apache Spark Streaming programs (CDAP-11651)
Fixed an issue where pipeline metrics were not showing up in pipelines with a large number of nodes (CDAP-9284)
Fixed an issue with retrieving workflow state if it contained an exception without a message (CDAP-11795)
Fixed an issue with the CDAP Ambari service definition where the "cdap" headless user was not unique to the cluster (CDAP-11445)
Fixed the CDAP Upgrade tool to not fail when encountering a non-CDAP table that follows the CDAP naming convention (CDAP-4887)
Fixed an issue where the driver process of a CDAP Workflow was getting restarted when it ran out of memory, causing the Workflow to be executed again from the start node (CDAP-5067)
Fixed an issue with the detection of Apache Spark on HDP 2.5 and above, which caused excess noise on the console (CDAP-7429)
Fixed an issue with the YARN container allocation logic so that the correct container size is used (CDAP-8888)
Fixed the stream container to terminate cleanly and cleaned up the CDAP Master's Apache Twill JAR files after master shutdown (CDAP-8911)
Fixed an issue where redeployment of an application with a deleted schedule would fail (CDAP-8918)
Fixed warnings about /opt/cdap/master/artifacts not being a directory in unit tests (CDAP-8961)
Fixed an issue where CDAP entity roles were not cleaned up when the entity was deleted (CDAP-9026)
Fixed an issue where cdap-security.xml was not written under Ambari unless security.enabled in cdap-site.xml was set to true (CDAP-9378)
Fixed the Azure Blob Store source to work with Avro and Parquet formats (CDAP-10475)
Fixed the Azure Blob Store source to work with CDAP FileSets (CDAP-11384)
Fixed the "value is" filter in the Data Preparation UI (CDAP-11557)
Fixed impersonation while upgrading datasets in the Upgrade tool (CDAP-11815)

Deprecations

Added the property "metrics.processor.queue.size" (default value 20000) to limit the maximum size of the in-memory queue where the metrics processor temporarily stores newly fetched metrics before persisting them. Added the property "metrics.processor.max.delay.ms" (default value 3000 milliseconds) to specify the maximum delay allowed between the latest metrics timestamp and the time when it is processed. The larger this value, the more often the metrics processor can sleep between fetching batches of metrics, but the longer the delay between metrics emission and processing. Deprecated the property "metrics.messaging.fetcher.limit" (CDAP-8327)
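
As an illustration, the two new properties could be set as follows. This is a minimal sketch that assumes they are overridden in cdap-site.xml using the usual Hadoop-style property layout; the values shown are simply the defaults stated above, not tuning recommendations.

```xml
<configuration>
  <!-- Maximum number of newly fetched metrics held in memory before persisting -->
  <property>
    <name>metrics.processor.queue.size</name>
    <value>20000</value>
  </property>
  <!-- Maximum allowed delay, in milliseconds, between the latest metrics
       timestamp and the time it is processed -->
  <property>
    <name>metrics.processor.max.delay.ms</name>
    <value>3000</value>
  </property>
</configuration>
```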
