Code structure

Framework

The framework is a crude approximation of the CMS data processing software, CMSSW. CMSSW is a rather generic framework to process independent chunks of data. In CMS these chunks of data correspond to triggered proton-proton collisions, and are called events. The events are processed by modules, which in this project are all "producers" that can read objects from the event and insert new objects into the event (in CMSSW there are, in addition, analyzers, which can only read objects from the event, and filters, which can decide to stop the processing of the event).

The modules form a DAG based on their data dependencies. The modules are implemented in C++ (C++17 in general; the CUDA code is in C++14). A CMSSW job is configured in Python (this project does not provide a configuration mechanism).

The CMSSW framework is multi-threaded using Threading Building Blocks (TBB). An integral part of the multi-threading is the concept of a "concurrent event processor" that we call "a stream" (to disambiguate from CUDA streams, these streams are called "EDM streams" from now on). An EDM stream processes one event at a time ("processing" meaning that each module in the DAG is run on the event in some order respecting the data dependencies). A job may have multiple EDM streams, in which case the EDM streams process their events concurrently. Furthermore, modules processing the same event that are independent in the DAG are also run concurrently. All this potential concurrency is exposed as tasks to be run by the TBB task scheduler. We do not make any assumptions on how TBB runs these tasks in threads (e.g. the number of EDM streams and the number of threads may differ). For more information on the CMSSW framework see e.g.
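
As a rough illustration of how this concurrency is exposed, independent producers of the same event can be handed to the TBB scheduler as separate tasks. The two produce functions below are hypothetical stand-ins for real modules, and the real framework chains continuation tasks instead of blocking in wait(); this is only a sketch of the idea.

```cpp
#include <tbb/task_group.h>
#include <cstdio>

// Hypothetical stand-ins for two producers that have no data dependency
// on each other within one event; the names are illustrative only.
void produceA() { std::printf("module A\n"); }
void produceB() { std::printf("module B\n"); }

int main() {
  tbb::task_group group;
  // Independent modules of the same event are offered to TBB as tasks;
  // TBB decides on which threads they actually run.
  group.run(produceA);
  group.run(produceB);
  group.wait();  // the real framework uses continuation tasks instead of a blocking wait
}
```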

The processing time and memory requirements can vary a lot across events. In addition, the filtering capability may affect which modules in the DAG can be run.

The main approximations of the framework in this project with respect to CMSSW are

  • producer modules only, and only stream and stream-ExternalWork producers (the ExternalWork pattern is sketched after this list)
  • modules and the event products have no labels, and therefore the event can hold at most one product of each C++ type
  • no run-time configuration mechanism
  • input data are fully read into memory at the beginning of the job
  • EventSetup system has a single (implicit) record and one IOV; all products are read from binary dumps at the beginning of the job.
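
The "ExternalWork" variant splits a producer into an acquire() step, which launches the asynchronous GPU work, and a produce() step, which the framework calls only after that work has completed. The sketch below shows only the shape of that interface; the type and member names are illustrative stand-ins, not the framework's actual declarations.

```cpp
// Illustrative sketch of the ExternalWork split; the types here are
// hypothetical stand-ins, not the framework's actual classes.
struct Event {};              // stand-in for the framework event
struct WaitingTaskHolder {};  // stand-in for the completion-notification handle

class ExampleGpuProducer {
public:
  // Called first: launch asynchronous GPU work (memory copies, kernels) on a
  // CUDA stream and arrange for 'holder' to be notified when the work is done.
  void acquire(Event const& event, WaitingTaskHolder holder);

  // Called by the framework only after the asynchronous work has completed;
  // here the products are inserted into the event.
  void produce(Event& event);
};
```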

Overall view on the use of CUDA

Our overall aims are to avoid blocking synchronization as much as possible and to keep all processing units (CPU cores, GPUs) as busy as we can doing useful work. It follows that we try to make all operations (memory transfers, kernel calls) asynchronous using CUDA streams, and that we use callback functions (cudaStreamAddCallback()) to notify the CMSSW framework when the asynchronous work has finished.
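
A minimal standalone sketch of this pattern follows. The kernel, the buffer names, and the notifyFrameworkDone() callback are illustrative; a real module would notify the framework instead of printing, and would not synchronize at the end.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void processOnGpu(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.f;
}

// Hypothetical stand-in for handing control back to the framework.
void CUDART_CB notifyFrameworkDone(cudaStream_t, cudaError_t status, void*) {
  std::printf("GPU work finished with status %d\n", (int)status);
}

int main() {
  constexpr int n = 1024;
  float* hostData;
  float* deviceData;
  cudaMallocHost(&hostData, n * sizeof(float));  // pinned memory, needed for truly async copies
  cudaMalloc(&deviceData, n * sizeof(float));
  for (int i = 0; i < n; ++i) hostData[i] = 1.f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // All operations are queued asynchronously on the same CUDA stream ...
  cudaMemcpyAsync(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  processOnGpu<<<(n + 255) / 256, 256, 0, stream>>>(deviceData, n);
  cudaMemcpyAsync(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  // ... and completion is signalled via a callback instead of blocking the host.
  cudaStreamAddCallback(stream, notifyFrameworkDone, nullptr, 0);

  cudaStreamSynchronize(stream);  // only for this standalone example
  cudaFree(deviceData);
  cudaFreeHost(hostData);
  cudaStreamDestroy(stream);
}
```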

We use a "caching allocator" (based on the one from CUB library) for both device and pinned host memory allocations. This approach allows us to amortize the cost of the cudaMalloc()/cudaFree() etc, while being able to easily re-use device/pinned host memory regions for temporary workspaces, to avoid "conservative" overallocation of memory, and to avoid constraining the scheduling of modules to multiple devices.

We use one CUDA stream for each EDM stream ("concurrent event") and each linear chain of GPU modules that pass data from one to the other in device memory. In case of branches in the DAG of modules, additional CUDA streams are used, since there is sub-event concurrency in the DAG that we want to expose to the CUDA runtime.
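
At a branch point, the additional CUDA stream must not start consuming data before the producing stream has made them available. One way to express such a dependency without blocking the host is with CUDA events, roughly as in the sketch below; the kernels and buffer names are placeholders, not the project's actual code.

```cpp
#include <cuda_runtime.h>

__global__ void producerKernel(float* data) { data[threadIdx.x] = threadIdx.x; }
__global__ void consumerKernelA(const float* data, float* out) { out[threadIdx.x] = data[threadIdx.x] + 1.f; }
__global__ void consumerKernelB(const float* data, float* out) { out[threadIdx.x] = data[threadIdx.x] * 2.f; }

int main() {
  float *data, *outA, *outB;
  cudaMalloc(&data, 32 * sizeof(float));
  cudaMalloc(&outA, 32 * sizeof(float));
  cudaMalloc(&outB, 32 * sizeof(float));

  cudaStream_t mainStream, branchStream;
  cudaStreamCreate(&mainStream);
  cudaStreamCreate(&branchStream);
  cudaEvent_t dataReady;
  cudaEventCreateWithFlags(&dataReady, cudaEventDisableTiming);

  // The producer runs on the "main" CUDA stream of this chain ...
  producerKernel<<<1, 32, 0, mainStream>>>(data);
  cudaEventRecord(dataReady, mainStream);

  // ... one consumer stays on the same stream ...
  consumerKernelA<<<1, 32, 0, mainStream>>>(data, outA);

  // ... and the branch gets its own CUDA stream, which waits (on the device,
  // not on the host) until the producer's data are ready.
  cudaStreamWaitEvent(branchStream, dataReady, 0);
  consumerKernelB<<<1, 32, 0, branchStream>>>(data, outB);

  cudaDeviceSynchronize();  // only for this standalone example
  cudaEventDestroy(dataReady);
  cudaStreamDestroy(mainStream);
  cudaStreamDestroy(branchStream);
  cudaFree(data); cudaFree(outA); cudaFree(outB);
}
```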

For more information see cms-sw/cmssw:HeterogeneousCore/CUDACore/README.md.

Application structure

The pixel tracking GPU prototype consists of five modules, described in the sections below, that are run according to their data dependencies (roughly in the order in which the sections appear).

The data dependencies of the modules form the following DAG:

[Figure: module DAG]

In addition, there are two modules that transfer the tracks and vertices back to the CPU (enabled only with the --transfer parameter).

With the transfers the module DAG becomes:

[Figure: module DAG with transfers]

The application reads uncompressed raw data for just the pixel detector (about 250 kB/event). This configuration is somewhat artificial: for example, almost nothing is transferred back to the CPU, at the moment there are no modules that would consume the data in the format produced by this workflow, and in offline data processing the input data are compressed.

For more information on the application see

BeamSpot transfer (BeamSpotToCUDA)

This module transfers information about the average beam collision region ("beam spot") from the host to the device for each event.

| Operation | Description |
| --- | --- |
| memcpy H2D 44 B | Transfer BeamSpotCUDA::Data for position and other information about the beam spot |

This module essentially only transfers data from the CPU to the GPU.

Raw-to-cluster (SiPixelRawToClusterCUDA)

This module unpacks and reformats the raw data into something usable downstream, applies some calibration, and forms clusters of pixels on each pixel detector module.

The following memory transfers are done once per job on the first event from SiPixelRawToClusterCUDA.cc

| Operation | Description |
| --- | --- |
| memcpy H2D 1.44 MB | Transfer SiPixelFedCablingMapGPU for the pixel detector cabling map |
| memcpy H2D 57.6 kB | Transfer an array of module indices to be unpacked. This is to support regional unpacking in CMSSW, even though this functionality is not used in this project. |
| memcpy H2D 3.09 MB | Transfer gain calibration data |
| memcpy H2D 24.05 kB | Transfer SiPixelGainForHLTonGPU for gain calibration |
| memcpy H2D 8 B | Set the gain calibration data pointer in the SiPixelGainForHLTonGPU struct |

The following CUDA operations are issued for each event from SiPixelRawToClusterGPUKernel.cu

| Operation | Description |
| --- | --- |
| memcpy H2D 40 B | Transfer SiPixelDigisCUDA::DeviceConstView for the SoA of pixel digis (= unpacked raw data) |
| memset 3.6 MB | Zero the unpacking error array |
| memcpy H2D 16 B | Transfer GPU::SimpleVector<PixelErrorCompact> to provide a std::vector-like interface for the unpacking error array (the push_back() idea is sketched after this table) |
| memcpy H2D 32 B | Transfer SiPixelClustersCUDA::DeviceConstView for the SoA of pixel clusters |
| memcpy H2D | Transfer the raw data words |
| memcpy H2D | Transfer the IDs of the FEDs that provided data |
| pixelgpudetails::RawToDigi_kernel() | Kernel to unpack the raw data into a data structure usable by subsequent kernels |
| memcpy D2H 16 B | Transfer GPU::SimpleVector<PixelErrorCompact>, i.e. essentially the number of errors, to the host |
| gpuCalibPixel::calibDigis() | Calibrate pixel digis (ADC counts) |
| gpuClustering::countModules() | Fill the starting index into the ADC (etc.) arrays for each active module |
| memcpy D2H 4 B | Transfer the number of active modules to the host |
| gpuClustering::findClus() | Cluster digis on each pixel module |
| gpuClustering::clusterChargeCut() | Select clusters whose aggregated electric charge is above a given threshold |
| pixelgpudetails::fillHitsModuleStart() | Compute the starting cluster (hit) index for each pixel detector module |
| memcpy D2H 4 B | Transfer the number of pixel clusters to the host |
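
The GPU::SimpleVector used for the error array above provides a std::vector-like push_back() usable from device code, with a fixed capacity and an atomic size counter. The following is a stripped-down sketch of that idea, not the actual class; the recordErrors() kernel and its arguments are hypothetical.

```cpp
#include <cuda_runtime.h>

// Stripped-down illustration of a std::vector-like container usable from
// device code: fixed capacity, externally owned storage, atomic push_back.
template <typename T>
struct SimpleVectorSketch {
  int size;
  int capacity;
  T* data;

  __device__ int push_back(const T& element) {
    int previousSize = atomicAdd(&size, 1);  // reserve a slot atomically
    if (previousSize < capacity) {
      data[previousSize] = element;
      return previousSize;
    }
    atomicSub(&size, 1);                     // container full, roll back
    return -1;
  }
};

// Hypothetical usage: many threads append concurrently without data races.
__global__ void recordErrors(SimpleVectorSketch<int>* errors, const int* flags, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && flags[i] != 0) {
    errors->push_back(flags[i]);
  }
}
```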

RecHits (SiPixelRecHitCUDA)

This module computes the 3D position estimate for each cluster.

The following CUDA operations are issued for each event from PixelRecHits.cu

| Operation | Description |
| --- | --- |
| memcpy H2D 152 B | Transfer TrackingRecHit2DSOAView for the SoA of pixel hits |
| gpuPixelRecHits::getHits() | Calculate the 3D position for each cluster |
| setHitsLayerStart() | Set the index of the first hit for each pixel detector layer |
| cudautils::countFromVector() | First kernel of four to fill a phi-binned histogram of the hits. This kernel counts the number of elements for each bin (the count / prefix-sum / fill pattern is sketched after this table) |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cudautils::fillFromVector() | Last kernel of four to fill a phi-binned histogram of the hits. This kernel fills each bin with the hits. |
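
The four histogram kernels above follow a common count / prefix-sum / fill pattern. The sketch below illustrates that pattern with cub::DeviceScan::InclusiveSum; the binning function, the buffer names and the plain cudaMalloc() (instead of the caching allocator) are simplifications for illustration, not the actual hit data structures.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

constexpr int kBins = 128;

// Placeholder binning of phi in [-pi, pi] into kBins bins.
__host__ __device__ inline int binOf(float phi) {
  int bin = static_cast<int>((phi + 3.2f) * (kBins / 6.4f));
  return bin < 0 ? 0 : (bin >= kBins ? kBins - 1 : bin);
}

__global__ void countKernel(const float* phi, int n, int* count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&count[binOf(phi[i])], 1);        // pass 1: count per bin
}

__global__ void fillKernel(const float* phi, int n, int* offset, int* content) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int slot = atomicSub(&offset[binOf(phi[i])], 1) - 1;  // pass 2: place element
    content[slot] = i;                                    // store the hit index
  }
}

void fillHistogram(const float* d_phi, int n, int* d_offset, int* d_content, cudaStream_t stream) {
  cudaMemsetAsync(d_offset, 0, kBins * sizeof(int), stream);
  countKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_phi, n, d_offset);

  // Inclusive prefix sum turns per-bin counts into per-bin end offsets.
  void* d_temp = nullptr;
  std::size_t tempBytes = 0;
  cub::DeviceScan::InclusiveSum(d_temp, tempBytes, d_offset, d_offset, kBins, stream);
  cudaMalloc(&d_temp, tempBytes);  // the real code would take this from the caching allocator
  cub::DeviceScan::InclusiveSum(d_temp, tempBytes, d_offset, d_offset, kBins, stream);
  cudaFree(d_temp);

  fillKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_phi, n, d_offset, d_content);
}
```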

Pattern recognition (with Cellular Automaton) (CAHitNtupletCUDA)

This module performs the "pattern recognition":

  • create pairs of pixel hits on adjacent layers
  • connect the pairs to form ntuplets (triplets or quadruplets)
  • fit the ntuplets with a helix

The following CUDA operations are issued for each event from CAHitNtupletGeneratorKernels.cu and BrokenLineFitOnGPU.cu

| Operation | Description |
| --- | --- |
| memset 4 B | Zero the number of doublets (CA cells) |
| memset 36 B | Zero the counter of ntuplets |
| memset 197 kB | Zero the association structure from hits to ntuplets |
| gpuPixelDoublets::initDoublets() | Initialize the pair/doublet finding data structure |
| gpuPixelDoublets::getDoubletsFromHisto() | Create the hit pairs/doublets |
| memset 131 kB | Zero the ntuplet finding data structure |
| kernel_connect() | Connect compatible pairs/doublets |
| gpuPixelDoublets::fishbone() | Identify duplicate ntuplets (a single particle can induce multiple hits per layer because of redundancies in the detector) |
| kernel_find_ntuplets() | Find ntuplets from the doublet connection graph (the actual Cellular Automaton) |
| cudautils::finalizeBulk() | ? |
| kernel_earlyDuplicateRemover() | Clean duplicate ntuplets |
| kernel_countMultiplicity() | Count the number of ntuplets with different numbers of hits |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| kernel_fillMultiplicity() | Fill an nhit-binned histogram of the ntuplets |
| kernel_fillHitDetIndices() | ? |
| kernelBLFastFit<3>() | First step of fitting triplets |
| kernelBLFit<3>() | Second step of fitting triplets |
| kernelBLFastFit<4>() | First step of fitting quadruplets |
| kernelBLFit<4>() | Second step of fitting quadruplets |
| kernelBLFastFit<4>() | First step of fitting pentuplets (only the first 4 hits are used) |
| kernelBLFit<4>() | Second step of fitting pentuplets (only the first 4 hits are used) |
| kernel_classifyTracks() | Classify fitted tracks according to certain quality criteria |
| kernel_fastDuplicateRemover() | Identify duplicate tracks |

Vertexing (PixelVertexProducerCUDA)

This module reconstructs vertices, i.e. finds clusters of track "origins" along the beam line.

The following CUDA operations are issued for each event from gpuVertexFinderImpl.h

| Operation | Description |
| --- | --- |
| gpuVertexFinder::init() | Zero the data structures |
| gpuVertexFinder::loadTracks() | Fill the vertexing data structures with information from the tracks |
| gpuVertexFinder::clusterTracksByDensityKernel() | Cluster tracks to vertices |
| gpuVertexFinder::fitVerticesKernel() | Fit vertex parameters (first pass) |
| gpuVertexFinder::splitVerticesKernel() | Split vertices that are likely to be merged from multiple proton-proton interactions |
| gpuVertexFinder::fitVerticesKernel() | Fit vertex parameters (second pass, for the split vertices) |
| gpuVertexFinder::sortByPt2() | Sort the vertices by the sum of the squared transverse momenta of the tracks contained in each vertex (one way to express this sort is sketched after this table) |
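
As an illustration of the final sorting step, the sketch below orders vertex indices by decreasing sum of squared pT with cub::DeviceRadixSort::SortPairsDescending. This is not how the actual sortByPt2() kernel is implemented; it is just one way to express the same operation, with hypothetical buffer names.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

// Sketch only: order vertices by decreasing sum of squared track pT using a
// key/value radix sort over (sum-pT^2, vertex index) pairs.
void sortVerticesByPt2(const float* d_sumPt2, const int* d_vertexIndex,
                       float* d_sortedSumPt2, int* d_sortedIndex,
                       int numVertices, cudaStream_t stream) {
  void* d_temp = nullptr;
  std::size_t tempBytes = 0;
  // First call only queries the required temporary storage size.
  cub::DeviceRadixSort::SortPairsDescending(d_temp, tempBytes,
                                            d_sumPt2, d_sortedSumPt2,
                                            d_vertexIndex, d_sortedIndex,
                                            numVertices, 0, 32, stream);
  cudaMalloc(&d_temp, tempBytes);  // the real code would use the caching allocator
  cub::DeviceRadixSort::SortPairsDescending(d_temp, tempBytes,
                                            d_sumPt2, d_sortedSumPt2,
                                            d_vertexIndex, d_sortedIndex,
                                            numVertices, 0, 32, stream);
  cudaFree(d_temp);
}
```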

Transfer tracks to host (PixelTrackSoAFromCUDA)

This module transfers the pixel tracks from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy D2H X B | Transfer pixeltrack::TrackSoA |

Transfer vertices to host (PixelVertexSoAFromCUDA)

This module transfers the pixel vertices from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy D2H X B | Transfer ZVertexSoA |