This code is part of the astropath processing pipeline; it initializes a SlideID for each incoming specimen. These SlideIDs are used to standardize slide naming and replace the SampleNames (names applied to the slides during the scanning process) on all subfiles for a specimen and in the scanning plan ('annotations.xml') files. The code prepares an AstropathAPIDdef.csv for each project to record the pairings of SlideIDs and SampleNames. This file contains the following 5 columns (described below):
SlideID, SampleName, ProjectID, CohortID, BatchID
Additionally, the code initializes the necessary processing folders for each project directory and saves a project level AstropathAPIDdef_PP.csv (PP indicates the ProjectID) in the upkeep_and_progress folder, containing only the specimens related to that project.
Next, this code updates the Batch files in the <Dpath>\<Dname>\Batch location by copying any files in <Spath>\<Dname>\Batch that are missing.
The format for the SlideID is APpppXXXX, where the ppp indicates the numeric ProjectID and the XXXX is a slide number which is unique within a project. SlideIDs are generated by comparing the AstropathAPIDdef_PP.csv to the cohort specific SpecimenTable.xlsx and assigning each new specimen a new value in sequential ordering (AP0010001, AP0010002, AP0020001 …).
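As a minimal sketch of the APpppXXXX format, the helpers below build and parse a SlideID. The function names are hypothetical and not part of the pipeline, and the fixed widths (3 digits for the ProjectID, 4 for the slide number, zero-padded) are an assumption inferred from the examples above.

```python
import re
from typing import Tuple

# Assumed layout: "AP", a 3-digit ProjectID, then a 4-digit slide number.
SLIDE_ID_PATTERN = re.compile(r"^AP(\d{3})(\d{4})$")


def format_slide_id(project_id: int, slide_number: int) -> str:
    """Build a SlideID such as AP0010002 from its numeric parts (hypothetical helper)."""
    if not 0 <= project_id <= 999:
        raise ValueError(f"ProjectID out of range: {project_id}")
    if not 1 <= slide_number <= 9999:
        raise ValueError(f"slide number out of range: {slide_number}")
    return f"AP{project_id:03d}{slide_number:04d}"


def parse_slide_id(slide_id: str) -> Tuple[int, int]:
    """Split a SlideID into (ProjectID, slide number); raise if it is malformed."""
    match = SLIDE_ID_PATTERN.match(slide_id)
    if match is None:
        raise ValueError(f"not a valid SlideID: {slide_id!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, `format_slide_id(1, 2)` gives `AP0010002`, and `parse_slide_id("AP0020001")` gives `(2, 1)`.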
Please Note: Since this code needs direct access to the Specimen_Table.xlsx, which contains PPI, the naming generator code should only be run on the <Spath> server (HIPAA side server) to avoid opening the Specimen_Table.xlsx over the network. In addition, SlideIDs are assigned in the order that specimens are added to their respective Specimen_Table.xlsx files.
- The ProjectID and CohortID are defined in the AstropathCohortsProgress.csv file.
  - A description of this file is contained here
- The BatchID is generated for each staining batch
  - a longer description of this value can be found in the description here
- New SampleNames are detected from the SpecimenTable.xlsx files contained in each cohort folder
  - The SampleNames are the names defined during the scanning process
  - A description of the SpecimenTable.xlsx file is contained here
- SlideIDs are the names for the specimens in the astropath processing pipeline and replace the SampleNames on all corresponding files and inside the scanning plan, annotations.xml, files generated during the scanning process
  - using these names allows us to avoid outside-the-organization changes to naming conventions
  - The IDs have the format: APpppXXXX
    - ppp indicates the numeric ProjectID
    - XXXX is a slide number which is unique within a project
  - The IDs are generated by comparing the AstropathAPIDdef.csv to the cohort specific SpecimenTable.xlsx
  - we assign each new specimen a new value in sequential ordering (AP0010001, AP0010002, AP0020001 …)
We use path specifiers to shorten descriptions. Further description of these paths can be found in the additional documentation here:
- <Mpath>: the main path for all the astropath processing .csv configuration files; the current location of this path is \\bki04\astropath_processing
- <Dname>: the data name or the name of the clinical specimen folder
- <Dpath>: the data or destination path; this is the path to the project's data on the bki servers
- <Spath>: the source path to the project's data

NOTE: the <path> variables do not contain the <Dname>
For Python, download the repository and install astroidgen. Then launch using:
ASTgen.py <Mpath>
<Mpath>: should contain the AstropathCohortsProgress.csv, AstropathPaths.csv, and the AstropathCohorts.csv files; a description of these files can be found here
We begin by opening the AstropathCohortsProgress.csv file from the <Mpath>. We process each available cohort from this file sequentially. We first check that the data folder for the cohort on the bki servers (<Dpath>\<Dname>) and the source folder for the cohort (<Spath>\<Dname>) exist. Then, we either initialize or ensure that the following folders are initialized in the <Dpath>\<Dname> for processing.
- upkeep_and_progress
  - For any upkeep and progress tracking files
  - Location of the AstropathAPIDdef_PP.csv files, where PP indicates the numeric project id
- flatfield
  - Location of the flatfield parameter files
  - These files are named Flatfield_BatchID_BB.bin, replacing the BB for the appropriate batch id.
- logfiles
  - Project level log files for the astropath pipeline
- Batch
  - The batch and merge tables
  - These tables are described in further documentation located here
- Clinical
  - Location of the clinical table
  - These tables should be labeled as Clinical_Table_Specimen_CSID_MMDDYYYY.csv, where the CSID indicates the number on the <Dname> folder.
  - We always use the clinical table with the most recent date in the data upload
- Ctrl
  - Location of control TMA data output
- dbload
  - Location of the files used for the database upload
- tmp_inform_data
  - Location of the inform data output and inform algorithms used
- reject
  - Location of the rejected slides
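The rule for the Clinical folder (use the table with the most recent date) can be sketched as below. This is an illustration only: the function name is hypothetical, and it assumes the date is always the final MMDDYYYY field of the file name, as in the naming convention above.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed naming convention: Clinical_Table_Specimen_CSID_MMDDYYYY.csv
CLINICAL_PATTERN = re.compile(r"^Clinical_Table_Specimen_(.+)_(\d{8})\.csv$")


def latest_clinical_table(filenames: Iterable[str]) -> Optional[str]:
    """Return the clinical table name with the most recent date, or None (hypothetical helper)."""
    best_name, best_date = None, None
    for name in filenames:
        match = CLINICAL_PATTERN.match(name)
        if match is None:
            continue  # not a clinical table
        try:
            date = datetime.strptime(match.group(2), "%m%d%Y")
        except ValueError:
            continue  # eight digits that do not form a valid MMDDYYYY date
        if best_date is None or date > best_date:
            best_name, best_date = name, date
    return best_name
```

The dates are parsed rather than compared as text because MMDDYYYY strings do not sort chronologically: as text, 12312020 sorts after 01052021 even though it is the earlier date.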
Information on these folders is also located in the documentation here.
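A minimal sketch of the folder initialization step is shown below. The function name is hypothetical. Raising an error when <Dpath>\<Dname> is missing is also an assumption, since the text above only says that the folder's existence is checked.

```python
import os
from typing import List

# Folder names taken from the list above.
PROCESSING_FOLDERS = [
    "upkeep_and_progress", "flatfield", "logfiles", "Batch", "Clinical",
    "Ctrl", "dbload", "tmp_inform_data", "reject",
]


def ensure_processing_folders(dpath: str, dname: str) -> List[str]:
    """Create any missing processing folders; return the names that were created."""
    root = os.path.join(dpath, dname)
    if not os.path.isdir(root):
        # Assumption: a missing cohort folder is treated as an error here.
        raise FileNotFoundError(f"cohort folder not found: {root}")
    created = []
    for folder in PROCESSING_FOLDERS:
        target = os.path.join(root, folder)
        if not os.path.isdir(target):
            os.makedirs(target, exist_ok=True)
            created.append(folder)
    return created
```

Calling the function a second time on the same cohort returns an empty list, which matches the "initialize or ensure" wording: existing folders are left untouched.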
Next, we update the available Batch folder <Dpath>\<Dname>\Batch with any possible missing Batch files in <Spath>\<Dname>\Batch.
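The Batch update can be sketched as follows. The function name is hypothetical, and the sketch assumes that "missing" means no file of the same name exists in the destination. Files that are already present are never overwritten or compared by content.

```python
import os
import shutil
from typing import List


def copy_missing_batch_files(spath: str, dpath: str, dname: str) -> List[str]:
    """Copy files from <Spath>/<Dname>/Batch that are absent in <Dpath>/<Dname>/Batch."""
    source = os.path.join(spath, dname, "Batch")
    dest = os.path.join(dpath, dname, "Batch")
    if not os.path.isdir(source):
        return []  # nothing to copy from
    os.makedirs(dest, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(source)):
        src_file = os.path.join(source, name)
        dst_file = os.path.join(dest, name)
        # Assumption: only regular files are copied, matched by file name.
        if os.path.isfile(src_file) and not os.path.exists(dst_file):
            shutil.copy2(src_file, dst_file)  # copy2 also preserves timestamps
            copied.append(name)
    return copied
```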
Finally, we compile the AstropathAPIDdef.csv as follows:
- Check that Specimen_Table.xlsx exists.
  - If it does not exist, move to the next directory.
  - If it does, extract the Patient # and the BatchID columns for each sample
- Open the AstropathAPIDdef.csv if it exists
  - If the file does not exist start SlideIDs at APppp0001
  - If the file exists
    - Compare the SampleNames in AstropathAPIDdef.csv to the 'Patient #' in Specimen_Table.xlsx to determine new specimens
    - Acquire next available SlideID from the AstropathAPIDdef.csv
- Update the new SampleName rows to the AstropathAPIDdef.csv file
- Compare the local AstropathAPIDdef_PP.csv file with AstropathAPIDdef.csv
  - If there are no new SampleNames, do not update AstropathAPIDdef_PP.csv
  - If there are new SampleNames, recreate AstropathAPIDdef_PP.csv with all project-relevant entries
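The assignment logic in the steps above can be sketched on in-memory rows, leaving out the reading of the .xlsx and .csv files. The function name is hypothetical. Two assumptions are made: the "next available" SlideID is one more than the highest slide number already used in the project, and known SampleNames are matched within the same project only.

```python
from typing import Dict, List, Sequence, Tuple


def assign_slide_ids(
    existing_rows: Sequence[Dict[str, object]],
    specimens: Sequence[Tuple[str, int]],
    project_id: int,
    cohort_id: int,
) -> List[Dict[str, object]]:
    """Return new AstropathAPIDdef rows for specimens that do not have a SlideID yet.

    existing_rows: rows already in AstropathAPIDdef.csv (may be empty).
    specimens: (SampleName, BatchID) pairs in Specimen_Table order, where the
               SampleName comes from the 'Patient #' column.
    """
    prefix = f"AP{project_id:03d}"
    project_rows = [r for r in existing_rows if str(r["SlideID"]).startswith(prefix)]
    known = {r["SampleName"] for r in project_rows}
    # Assumption: next available number = highest used number + 1 (1 if none exist).
    next_number = max(
        (int(str(r["SlideID"])[len(prefix):]) for r in project_rows), default=0
    ) + 1
    new_rows = []
    for sample_name, batch_id in specimens:
        if sample_name in known:
            continue  # this specimen already has a SlideID
        new_rows.append({
            "SlideID": f"{prefix}{next_number:04d}",
            "SampleName": sample_name,
            "ProjectID": project_id,
            "CohortID": cohort_id,
            "BatchID": batch_id,
        })
        known.add(sample_name)
        next_number += 1
    return new_rows
```

An empty result means there are no new SampleNames, which corresponds to the case where AstropathAPIDdef_PP.csv is left unchanged.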
Once all cohorts have been checked, the code waits for 30 minutes, reloads the AstropathCohortsProgress.csv file, and rechecks all directories, so that it runs in a continuous loop.
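The polling behavior can be sketched as a simple loop. The function and parameter names are hypothetical. The `check_all_cohorts` callback is assumed to reload AstropathCohortsProgress.csv and process every cohort, and `max_cycles` exists only so the loop can be stopped when testing.

```python
import time
from typing import Callable, Optional


def run_polling_loop(
    check_all_cohorts: Callable[[], None],
    wait_seconds: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> int:
    """Call check_all_cohorts repeatedly, waiting between passes; return the pass count."""
    cycles = 0
    while True:
        check_all_cohorts()  # assumed to reload the cohorts file itself
        cycles += 1
        if max_cycles is not None and cycles >= max_cycles:
            return cycles
        sleep(wait_seconds)  # 30 minutes by default
```

With `max_cycles=None` (the default) the loop never returns, which matches the continuous operation described above.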