The framework is a crude approximation of the CMS data processing software, CMSSW. CMSSW is a rather generic framework for processing independent chunks of data. In CMS these chunks of data correspond to triggered proton-proton collisions and are called events. The events are processed by modules, which in this project are all "producers" that can read objects from the event and insert new objects into the event (CMSSW has, in addition, analyzers, which can only read objects from the event, and filters, which can decide to stop the processing of the event).
The modules form a DAG based on their data dependencies. The modules are implemented in C++ (C++17 in general, CUDA code is in C++14). A CMSSW job is configured in Python (this project does not provide a configuration mechanism).
The CMSSW framework is multi-threaded using Threading Building Blocks (TBB). An integral part of the multi-threading is the concept of a "concurrent event processor" that we call a "stream" (to disambiguate from CUDA streams, these streams are called "EDM streams" from now on). An EDM stream processes one event at a time ("processing" meaning that each module in the DAG is run on the event in some order respecting the data dependencies). A job may have multiple EDM streams, in which case the EDM streams process their events concurrently. Furthermore, modules processing the same event that are independent in the DAG are also run concurrently. All this potential concurrency is exposed as tasks to be run by the TBB task scheduler. We do not make any assumptions on how TBB runs these tasks in threads (e.g. the number of EDM streams and the number of threads may differ). For more information on the CMSSW framework see e.g.
- CMS TWiki pages
  - SWGuideFrameWork
  - MultithreadedFrameworkDesignDiscussions
- The CMS Offline WorkBook
- The CMS Offline SW Guide
- Papers
The processing time and memory requirements can vary a lot across events. In addition, the filtering capability may affect which modules in the DAG can be run.
The main approximations of the framework in this project with respect to CMSSW are
- producer modules only, and only stream and stream-ExternalWork producers
- modules and the event products have no labels, and therefore the event can hold at most one product of each C++ type
- no run time configuration mechanism
- input data are fully read in memory at the beginning of the job
- EventSetup system has a single (implicit) record, one IOV, all products are read from binary dumps at the beginning of the job.
Our overall aims are to avoid blocking synchronization as much as possible, and to keep all processing units (CPU cores, GPUs) as busy as we can doing useful work. It follows that we try to make all operations (memory transfers, kernel calls) asynchronous with the use of CUDA streams, and that we use callback functions (cudaStreamAddCallback()) to notify the CMSSW framework when the asynchronous work has finished.
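
As a concrete illustration, below is a minimal self-contained sketch of this pattern. It is not the actual framework API: the kernel, the buffer sizes, and the std::promise standing in for the framework's notification handle are all hypothetical; only the use of cudaMemcpyAsync(), a CUDA stream, and cudaStreamAddCallback() reflects what the text above describes.

```cuda
// Standalone sketch of the "ExternalWork" idea: all GPU work is queued
// asynchronously on a CUDA stream, the "acquire" phase returns immediately,
// and a stream callback notifies the framework when the work has finished.
// Here a std::promise stands in for the framework's notification handle.
#include <cuda_runtime.h>
#include <cstdio>
#include <future>

__global__ void scaleKernel(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}

// Invoked by the CUDA runtime once all preceding work in the stream has completed.
void CUDART_CB workDone(cudaStream_t, cudaError_t status, void* userData) {
  static_cast<std::promise<cudaError_t>*>(userData)->set_value(status);
}

int main() {
  constexpr int n = 1024;
  float *host, *device;
  cudaMallocHost(&host, n * sizeof(float));  // pinned host memory, needed for truly async copies
  cudaMalloc(&device, n * sizeof(float));
  for (int i = 0; i < n; ++i)
    host[i] = 1.f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  std::promise<cudaError_t> done;  // stand-in for the framework's "holder"
  auto finished = done.get_future();

  // "acquire" phase: nothing here blocks the CPU thread.
  cudaMemcpyAsync(device, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(device, n, 2.f);
  cudaMemcpyAsync(host, device, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaStreamAddCallback(stream, workDone, &done, 0);

  // The "produce" phase would be scheduled by the framework after the
  // notification; here we simply wait on the future to keep the example
  // self-contained.
  finished.get();
  std::printf("first element after GPU work: %f\n", host[0]);

  cudaStreamDestroy(stream);
  cudaFree(device);
  cudaFreeHost(host);
  return 0;
}
```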
We use a "caching allocator" (based on the one from
CUB library) for both device and
pinned host memory allocations. This approach allows us to amortize
the cost of the cudaMalloc()
/cudaFree()
etc, while being able to
easily re-use device/pinned host memory regions for temporary
workspaces, to avoid "conservative" overallocation of memory, and to
avoid constraining the scheduling of modules to multiple devices.
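
For illustration, the sketch below uses CUB's cub::CachingDeviceAllocator directly to show the allocation pattern; the project's allocator is based on this one (and also covers pinned host memory), so the exact interface differs.

```cuda
// Sketch of the caching-allocator idea: allocations are associated with a CUDA
// stream, and "freeing" only returns the block to the cache, so per-event
// temporary buffers stop paying the cudaMalloc()/cudaFree() cost after the
// first few events.
#include <cub/util_allocator.cuh>
#include <cuda_runtime.h>

int main() {
  cub::CachingDeviceAllocator allocator;  // bin sizes, growth factor etc. left at their defaults
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  for (int event = 0; event < 100; ++event) {
    // Per-event temporary workspace; after the first iterations this request is
    // served from the cache instead of calling cudaMalloc().
    void* workspace = nullptr;
    allocator.DeviceAllocate(&workspace, 1 << 20, stream);

    // ... kernels using `workspace` would be launched on `stream` here ...

    // Returns the block to the cache: it becomes reusable immediately on this
    // stream, and on other streams once the work queued so far has completed.
    // No cudaFree() is issued.
    allocator.DeviceFree(workspace);
  }

  cudaStreamDestroy(stream);
  allocator.FreeAllCached();  // actually release the cached memory at the end of the job
  return 0;
}
```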
We use one CUDA stream for each EDM stream ("concurrent event") and each linear chain of GPU modules that pass data from one to the other in device memory. In case of branches in the DAG of modules, additional CUDA streams are used, since there is sub-event concurrency in the DAG that we want to expose to the CUDA runtime.
For more information see cms-sw/cmssw:HeterogeneousCore/CUDACore/README.md.
The pixel tracking GPU prototype consists of five modules that are run according to their data dependencies (roughly in the following order):
- BeamSpotToCUDA (A)
- SiPixelRawToClusterCUDA (B)
- SiPixelRecHitCUDA (C)
- CAHitNtupletCUDA (D)
- PixelVertexProducerCUDA (E)
The data dependencies of the modules form the following DAG
In addition, there are two modules to transfer the tracks and vertices back to the CPU (enabled only with the --transfer parameter).
With the transfers the module DAG becomes
The application reads uncompressed raw data for just the pixel detector (about 250 kB/event). This configuration is somewhat artificial: for example, almost nothing is transferred back to the CPU, at the moment there are no modules that would consume the data in the format produced by this workflow, and in offline data processing the input data are compressed.
For more information on the application see
- https://patatrack.web.cern.ch/patatrack/wiki/
- CHEP 2019: A. Bocci: Heterogeneous online reconstruction at CMS
- ACAT 2019: A. Bocci: Towards a heterogeneous High Level Trigger farm for CMS
- Connecting The Dots 2019: F. Pantaleo: Patatrack: accelerated Pixel Track reconstruction in CMS
- F. Pantaleo: New Track Seeding Techniques for the CMS Experiment (PhD thesis)
BeamSpot transfer (BeamSpotToCUDA)
This module transfers information about the average beam collision region ("beam spot") from the host to the device for each event.
Operation | Description |
---|---|
memcpy H2D 44 B | Transfer BeamSpotCUDA::Data for position and other information about the beam spot |
This module essentially only transfers data from the CPU to the GPU.
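
A minimal sketch of such a transfer is shown below; the struct layout is illustrative (chosen to add up to 44 bytes) and not the actual BeamSpotCUDA::Data definition.

```cuda
// Sketch: a small constant-size beam-spot struct is copied from pinned host
// memory to the device, asynchronously on the module's CUDA stream.
#include <cuda_runtime.h>

struct BeamSpotPOD {  // illustrative stand-in; 11 floats = 44 bytes
  float x, y, z;      // position
  float sigmaZ;
  float beamWidthX, beamWidthY;
  float dxdz, dydz;
  float emittanceX, emittanceY;
  float betaStar;
};

int main() {
  BeamSpotPOD* host;
  BeamSpotPOD* device;
  cudaMallocHost(&host, sizeof(BeamSpotPOD));  // pinned, so the copy can be truly asynchronous
  cudaMalloc(&device, sizeof(BeamSpotPOD));
  *host = {};  // the real values come from the beam spot data read at the beginning of the job

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(device, host, sizeof(BeamSpotPOD), cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);  // in the real module the framework is notified via a callback instead
  cudaFree(device);
  cudaFreeHost(host);
  cudaStreamDestroy(stream);
  return 0;
}
```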
Raw-to-cluster (SiPixelRawToClusterCUDA)
This module unpacks and reformats the raw data into something usable downstream, applies some calibrations, and forms clusters of pixels on each pixel detector module.
The following memory transfers are done once per job on the first event from SiPixelRawToClusterCUDA.cc
Operation | Description |
---|---|
memcpy H2D 1.44 MB | Transfer SiPixelFedCablingMapGPU for pixel detector cabling map |
memcpy H2D 57.6 kB | Transfer an array of module indices to be unpacked. This is to support regional unpacking in CMSSW, even though this functionality is not used in this project. |
memcpy H2D 3.09 MB | Transfer gain calibration data |
memcpy H2D 24.05 kB | Transfer SiPixelGainForHLTonGPU for gain calibration |
memcpy H2D 8 B | Set the gain calibration data pointer in SiPixelGainForHLTonGPU struct |
The following CUDA operations are issued for each event from SiPixelRawToClusterGPUKernel.cu
Operation | Description |
---|---|
memcpy H2D 40 B | Transfer SiPixelDigisCUDA::DeviceConstView for SoA of pixel digis (= unpacked raw data) |
memset 3.6 MB | Zero unpacking error array |
memcpy H2D 16 B | Transfer GPU::SimpleVector<PixelErrorCompact> to provide a std::vector-like interface for the unpacking error array |
memcpy H2D 32 B | Transfer SiPixelClustersCUDA::DeviceConstView for SoA of pixel clusters |
memcpy H2D | Transfer raw data words |
memcpy H2D | Transfer IDs for FEDs that provided data |
pixelgpudetails::RawToDigi_kernel() | Kernel to unpack the raw data into a data structure usable by subsequent kernels |
memcpy D2H 16 B | Transfer GPU::SimpleVector<PixelErrorCompact>, i.e. essentially the number of errors, to host |
gpuCalibPixel::calibDigis() | Calibrate pixel digis (ADC counts) |
gpuClustering::countModules() | Fills starting index into the ADC (etc.) arrays for each active module |
memcpy D2H 4 B | Transfer number of active modules to host |
gpuClustering::findClus() | Cluster digis on each pixel module |
gpuClustering::clusterChargeCut() | Select clusters whose aggregated electric charge is above a given threshold |
pixelgpudetails::fillHitsModuleStart() | Compute the index of the first cluster of each module (cumulative sum of the number of clusters per module) |
memcpy D2H 4 B | Transfer number of pixel clusters to host |
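
To illustrate one of these steps, the sketch below shows the idea behind gpuClustering::countModules(); it is a simplification with hypothetical argument names, and the real kernel handles further details such as invalid module indices.

```cuda
// Simplified sketch: the unpacked digis are stored together with their module
// index, and a digi whose module index differs from that of the previous digi
// marks the first digi of an "active" module. Its position is recorded so that
// later kernels can process one module per thread block.
#include <cstdint>

__global__ void countModulesSketch(uint16_t const* moduleId,  // module index of each digi
                                   uint32_t* moduleStart,     // [0]: number of active modules;
                                                              // [1..]: index of the first digi of each module
                                   int numDigis) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numDigis)
    return;
  // The first digi of the event, or a digi belonging to a different module than
  // the previous digi, starts a new module.
  if (i == 0 || moduleId[i] != moduleId[i - 1]) {
    uint32_t slot = atomicAdd(&moduleStart[0], 1u);
    moduleStart[slot + 1] = static_cast<uint32_t>(i);
  }
}
```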
RecHits (SiPixelRecHitCUDA)
This module computes the 3D position estimate for each cluster.
The following CUDA operations are issued for each event from PixelRecHits.cu
Operation | Description |
---|---|
memcpy H2D 152 B | Transfer TrackingRecHit2DSOAView for SoA of pixel hits |
gpuPixelRecHits::getHits() | Calculates 3D position for each cluster |
setHitsLayerStart() | Set index of the first hit for each pixel detector layer |
cudautils::countFromVector() | First kernel of four to fill a phi-binned histogram of the hits. This kernel counts the number of elements for each bin |
cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cudautils::fillFromVector() | Last kernel of four to fill a phi-binned histogram of the hits. This kernel fills each bin with the hits. |
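
The count / prefix-sum / fill pattern behind the last four kernels is sketched below with simplified data types and a hypothetical phi-to-bin mapping; it is not the project's histogram class, but it shows how cub::DeviceScan::InclusiveSum() turns the per-bin counts into offsets.

```cuda
// Sketch of the generic count / prefix-sum / fill pattern: one kernel counts
// the entries per bin, cub::DeviceScan::InclusiveSum() turns the counts into
// offsets, and a second kernel scatters the hit indices into their bins.
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>
#include <cstdint>

constexpr int kNumBins = 128;

__device__ inline int phiBin(float phi) {
  // simplified mapping of phi in [-pi, pi] onto kNumBins bins
  constexpr float pi = 3.14159265f;
  return static_cast<int>((phi + pi) / (2.f * pi) * kNumBins) % kNumBins;
}

__global__ void countFromVectorSketch(float const* hitPhi, int nHits, uint32_t* binCounts) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nHits)
    atomicAdd(&binCounts[phiBin(hitPhi[i])], 1u);
}

__global__ void fillFromVectorSketch(float const* hitPhi, int nHits,
                                     uint32_t const* binEnds,  // inclusive prefix sum of the counts
                                     uint32_t* binFillCount,   // per-bin fill counters, zeroed by the caller
                                     uint32_t* binContent) {   // hit indices, grouped by bin
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nHits)
    return;
  int bin = phiBin(hitPhi[i]);
  uint32_t start = (bin == 0) ? 0u : binEnds[bin - 1];
  uint32_t pos = start + atomicAdd(&binFillCount[bin], 1u);
  binContent[pos] = static_cast<uint32_t>(i);
}

void buildPhiHistogramSketch(float const* hitPhi, int nHits,
                             uint32_t* binCounts,     // device, kNumBins entries, zeroed by the caller
                             uint32_t* binEnds,       // device, kNumBins entries
                             uint32_t* binFillCount,  // device, kNumBins entries, zeroed by the caller
                             uint32_t* binContent,    // device, nHits entries
                             cudaStream_t stream) {
  if (nHits <= 0)
    return;
  int blocks = (nHits + 255) / 256;
  countFromVectorSketch<<<blocks, 256, 0, stream>>>(hitPhi, nHits, binCounts);

  // Prefix sum over the per-bin counts; these are the cub::DeviceScanInitKernel /
  // cub::DeviceScanKernel launches visible in the table above. In the real code
  // the temporary workspace comes from the caching allocator, so nothing blocks.
  void* temp = nullptr;
  size_t tempBytes = 0;
  cub::DeviceScan::InclusiveSum(nullptr, tempBytes, binCounts, binEnds, kNumBins, stream);
  cudaMalloc(&temp, tempBytes);
  cub::DeviceScan::InclusiveSum(temp, tempBytes, binCounts, binEnds, kNumBins, stream);

  fillFromVectorSketch<<<blocks, 256, 0, stream>>>(hitPhi, nHits, binEnds, binFillCount, binContent);

  cudaStreamSynchronize(stream);  // only so that `temp` can be freed in this simple sketch
  cudaFree(temp);
}
```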
Pattern recognition (with Cellular Automaton) (CAHitNtupletCUDA)
This module performs the "pattern recognition":
- create pairs of pixel hits on adjacent layers
- connect the pairs to form ntuplets (triplets or quadruplets)
- fit the ntuplets with a helix
The following CUDA operations are issued for each event from CAHitNtupletGeneratorKernels.cu and BrokenLineFitOnGPU.cu
Operation | Description |
---|---|
memset 4 B | Zero the number of doublets (CA cells) |
memset 36 B | Zero counter of ntuplets |
memset 197 kB | Zero association structure from hits to ntuplets |
gpuPixelDoublets::initDoublets() | Initialize pair/doublet finding data structure |
gpuPixelDoublets::getDoubletsFromHisto() | Create the hit pairs/doublets |
memset 131 kB | Zero ntuplet finding data structure |
kernel_connect() | Connect compatible pairs/doublets |
gpuPixelDoublets::fishbone() | Identify duplicate ntuplets (a single particle can induce multiple hits per layer because of redundancies in the detector) |
kernel_find_ntuplets() | Find ntuplets from the doublet connection graph (the actual Cellular Automaton) |
cudautils::finalizeBulk() | ? |
kernel_earlyDuplicateRemover() | Clean duplicate ntuplets |
kernel_countMultiplicity() | Count the number of ntuplets with different numbers of hits |
cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
kernel_fillMultiplicity() | Fills a nhit-binned histogram of the ntuplets. |
kernel_fillHitDetIndices() | ? |
kernelBLFastFit<3>() | First step of fitting triplets |
kernelBLFit<3>() | Second step of fitting triplets |
kernelBLFastFit<4>() | First step of fitting quadruplets |
kernelBLFit<4>() | Second step of fitting quadruplets |
kernelBLFastFit<4>() | First step of fitting pentuplets (only first 4 hits) |
kernelBLFit<4>() | Second step of fitting pentuplets (only first 4 hits) |
kernel_classifyTracks() | Classify fitted tracks according to certain quality criteria |
kernel_fastDuplicateRemover() | Identify duplicate tracks |
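
The structure of the fit launches can be sketched as below; the kernel bodies are placeholders (the actual broken-line fit mathematics is omitted) and the function names are hypothetical.

```cuda
// Structural sketch: for each ntuplet size N the work is split into a "fast
// fit" kernel that computes a first estimate and a second kernel that performs
// the full broken-line fit, both templated on N so that triplets and
// quadruplets get their own instantiations; pentuplets are fitted using only
// their first 4 hits.
#include <cuda_runtime.h>

template <int N>
__global__ void fastFitSketch(float const* hits, int nTuples, float* fastParams) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTuples)
    return;
  // ... compute a fast circle + line estimate from the N hits of tuple i ...
  fastParams[i] = 0.f;  // placeholder
}

template <int N>
__global__ void brokenLineFitSketch(float const* hits, float const* fastParams,
                                    int nTuples, float* trackParams) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTuples)
    return;
  // ... full broken-line fit of the N hits, seeded by the fast-fit estimate ...
  trackParams[i] = fastParams[i];  // placeholder
}

template <int N>
void launchFit(float const* hits, int nTuples, float* fastParams, float* trackParams,
               cudaStream_t stream) {
  if (nTuples <= 0)
    return;
  int blocks = (nTuples + 63) / 64;
  fastFitSketch<N><<<blocks, 64, 0, stream>>>(hits, nTuples, fastParams);
  brokenLineFitSketch<N><<<blocks, 64, 0, stream>>>(hits, fastParams, nTuples, trackParams);
}

// Usage: one pair of launches per multiplicity, all on the same CUDA stream, e.g.
//   launchFit<3>(tripletHits, nTriplets, fast3, params3, stream);
//   launchFit<4>(quadHits, nQuads, fast4, params4, stream);
//   launchFit<4>(pentHits, nPents, fast5, params5, stream);  // first 4 hits only
```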
Vertexing (PixelVertexProducerCUDA)
This module reconstructs vertices, i.e. finds clusters of the tracks' points of origin along the beam line.
The following CUDA operations are issued for each event from gpuVertexFinderImpl.h
Operation | Description |
---|---|
gpuVertexFinder::init() | Zero data structures |
gpuVertexFinder::loadTracks() | Fill vertexing data structures with information from tracks |
gpuVertexFinder::clusterTracksByDensityKernel() | Cluster tracks to vertices |
gpuVertexFinder::fitVerticesKernel() | Fit vertex parameters (first pass) |
gpuVertexFinder::splitVerticesKernel() | Split vertices that are likely to be merged from multiple proton-proton interactions |
gpuVertexFinder::fitVerticesKernel() | Fit vertex parameters (second pass, for split vertices) |
gpuVertexFinder::sortByPt2() | Sort vertices by the sum of the square of the transverse momentum of the tracks contained by a vertex |
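
A heavily simplified sketch of the density-clustering idea is given below; it is not the actual kernel (which bins the tracks in z and performs several further steps), and all names and the O(n^2) neighbour search are for illustration only.

```cuda
// Sketch: each track is characterized by its z position at the beam line, a
// track's "density" is the number of tracks within a window in z, and tracks
// are then attached to a nearby track of higher density, so that local density
// maxima seed the vertex candidates.
#include <cmath>

__global__ void trackDensitySketch(float const* zTrack, int nTracks, float eps,
                                   int* density,   // number of neighbours within eps
                                   int* nearest) { // index of the closest neighbour within eps
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks)
    return;

  int count = 0;
  int best = i;
  float bestDz = eps;
  for (int j = 0; j < nTracks; ++j) {  // O(n^2) for clarity; the real code uses a z-binned histogram
    if (j == i)
      continue;
    float dz = fabsf(zTrack[i] - zTrack[j]);
    if (dz < eps) {
      ++count;
      if (dz < bestDz) {
        bestDz = dz;
        best = j;
      }
    }
  }
  density[i] = count;
  nearest[i] = best;
  // A follow-up step (omitted) keeps the link only if density[nearest[i]] > density[i];
  // tracks that remain their own "nearest" seed a vertex, and following the links
  // assigns every other track to one of these seeds.
}
```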
Transfer tracks to host (PixelTrackSoAFromCUDA)
This module transfers the pixel tracks from the device to the host.
Operation | Description |
---|---|
memcpy D2H X B | Transfer pixeltrack::TrackSoA |
Transfer vertices to host (PixelVertexSoAFromCUDA)
This module transfers the pixel vertices from the device to the host.
Operation | Description |
---|---|
memcpy D2H X B | Transfer ZVertexSoA |
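
Both transfer modules boil down to a single asynchronous device-to-host copy of a struct-of-arrays (pixeltrack::TrackSoA or ZVertexSoA, respectively); the sketch below shows the pattern with a simplified, hypothetical interface.

```cuda
// Sketch: in the ExternalWork acquire() phase the device-side SoA is copied
// asynchronously into a pinned host buffer, on the same CUDA stream that
// produced it, and produce() later puts the host copy into the event after the
// callback-based notification has fired.
#include <cuda_runtime.h>

template <typename SoA>
void transferToHost(SoA const* deviceSoA,
                    SoA* hostSoA,  // must be pinned host memory
                    cudaStream_t stream) {
  cudaMemcpyAsync(hostSoA, deviceSoA, sizeof(SoA), cudaMemcpyDeviceToHost, stream);
  // No cudaStreamSynchronize() here: completion is signalled to the framework
  // through the usual stream callback mechanism.
}
```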