# Spark DICOM connector in Scala
Once the connector is on the classpath of your Spark cluster, you can load DICOM data in Spark using the `dicomFile` format as follows:

```scala
val df = spark.read.format("dicomFile").load("/some/hdfs/path").select("PatientName", "StudyDate", "StudyTime")
```
You can select DICOM attributes by the keyword assigned to them in the DICOM standard registry.
Each attribute is written to a column with a Spark data type equivalent to its VR (value representation). The mapping is as follows:
| VR | Spark data type |
|---|---|
| AE, AS, AT, CS, DS, DT, IS, LO, LT, SH, ST, UC, UI, UR, UT | String |
| PN | {"Alphabetic": String, "Ideographic": String, "Phonetic": String} |
| FL, FD | [Double] |
| SL, SS, US, UL | [Integer] |
| SV, UV | [Long] |
| DA | String (formatted as DateTimeFormatter.ISO_LOCAL_DATE) |
| TM | String (formatted as DateTimeFormatter.ISO_LOCAL_TIME) |
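For instance, since PN attributes are read as a struct, nested components can be selected with dot notation. A minimal sketch, reusing `df` from above (the alias name is illustrative):

```scala
import org.apache.spark.sql.functions.col

// PatientName has VR PN, so it is read as a struct with
// Alphabetic / Ideographic / Phonetic string fields.
val names = df.select(
  col("PatientName.Alphabetic").alias("patient_name_alphabetic"),
  col("StudyDate") // VR DA: an ISO_LOCAL_DATE-formatted string
)
```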
The `PixelData` attribute in a DICOM file can be very large and can crash Spark, so reading it is disabled by default. In order to be able to select the `PixelData` column, turn the `includePixelData` option on:

```scala
spark.read.format("dicomFile").option("includePixelData", true).load("/some/hdfs/path").select("PixelData")
```
The `isDicom` column is `true` if the file was read as a DICOM file, `false` otherwise.
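This makes it easy to restrict a dataframe to files that were actually parsed as DICOM. A minimal sketch, assuming `df` was read with the `dicomFile` format as above:

```scala
import org.apache.spark.sql.functions.col

// Keep only the rows corresponding to files parsed as DICOM.
val dicomOnly = df.filter(col("isDicom"))
```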
The DICOM dataframe can be de-identified according to the Basic Confidentiality Profile in the DICOM standard. To use the de-identifier, do the following in Scala:
```scala
import ai.kaiko.spark.dicom.deidentifier.DicomDeidentifier._

var df = spark.read.format("dicomFile").load("/some/hdfs/path")
df = deidentify(df)
```
The resulting dataframe will have its columns dropped, emptied, or dummified according to the actions described in the Basic Confidentiality Profile.
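One way to see which columns were dropped is to diff the schemas before and after de-identification. A minimal sketch, assuming the import from the snippet above (`dropped` is an illustrative name):

```scala
// Columns present before de-identification but absent after were dropped;
// the remaining columns may have been emptied or dummified in place.
val before = spark.read.format("dicomFile").load("/some/hdfs/path")
val after = deidentify(before)
val dropped = before.columns.diff(after.columns)
```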
To perform the de-identification with any of the supported options, use:
```scala
import ai.kaiko.spark.dicom.deidentifier.DicomDeidentifier._
import ai.kaiko.spark.dicom.deidentifier.options._

val config: Map[DeidOption, Boolean] = Map(
  CleanDesc -> true,
  RetainUids -> true
)

var df = spark.read.format("dicomFile").load("/some/hdfs/path")
df = deidentify(df, config)
```
Current limitations of the de-identification are:
| Expected behavior | Current behavior |
|---|---|
| Tags with SQ VR are de-identified | Tags with SQ VR are ignored |
| Private tags are de-identified | Private tags are ignored |
| The U action pseudonymizes the value | The U action replaces the value with `ToPseudonimize` |
| The C action cleans the value of PHI/PII | The C action replaces the value with `ToClean` |
A reproducible development environment is provided using Nix:

```
$ nix-shell
```

It provides the JDK, sbt, and all other required tools.
Build the JAR artifact:

```
$ nix-build
```
When changing sbt build dependencies, update `depsSha256` in `default.nix` as instructed.
CI is handled by GitHub Actions, using Nix for dependency management, testing, building, and caching (with Cachix).
Note: for CI to run tests, the Nix build needs to run them in its `checkPhase`.
You can run the CI locally using `act` (provided in the Nix shell).
Creating a release is done with the help of the sbt-sonatype, sbt-pgp, and sbt-release plugins.
Before starting, make sure to set the Sonatype credentials as environment variables (`SONATYPE_USERNAME` and `SONATYPE_PASSWORD`). In addition, make sure to have the `gpg` utility installed and the release GPG key available in your keyring.
Then, run:

```
$ nix-shell
$ sbt
> release
```

(`release` is run at the sbt prompt.)
You will be prompted for the release version, the next version, and the GPG key passphrase. Make sure to follow the SemVer versioning scheme. If all went well, the new release should be available on Maven Central within 10 minutes.