The input file should be a resulting PSM (peptide-spectrum match) output file from a database search such as Mascot or Andromeda in tab delimited format. The minimum required columns are :
- Sequence
- The amino acid sequence of a PSM
- Modifications
- A column that describes the modifications of the sequence.
(Note for MaxQuant this information is already provided in Modified Sequence
)
- PrecursorArea
- (or Intensity) the AUC (or SPCs) for each PSM
- IonScore
- Mascot (or search engine equivalent) score for each PSM
- q_value
- Percolator (or search engine equivalent) score for each PSM Note that PEP can be used in place of q_value with the rough observational approximation that PEP / 10 = q_value
- Charge
A base configuration file for renaming column headers is provided with gpgrouper
To generate it, simply run gpgrouper getconfig
and a config file will be generated.
The configuration file can be specified by the --configfile
flag in gpgrouper run
, but if not specified
and the default gpgrouper_config.ini
file exists, it will be used.
This file can be edited with addition of new column aliases as needed, though it should be set up to work with ProteomeDiscoverer+Mascot and MaxQuant already.
For MaxQuant output, the evidence.txt
PSMs file should be used as input. gpgrouper is designed
to be run separately for each experiment. As of writing, it is not configured to work with multiple
label-free experiments searched together under say MaxQuant. Therefore, separate experiments
need to be separated. In other words, if MaxQuant is used to search multiple experiments at
once (potentially with match between runs) the evidence.txt file needs to be separated into separate “experiment” files.
A simple script is available for doing just this.
The database file should be a pre-constructed tab delimited file for matching
PSMs to their respective GeneIDs.
The required columns are TaxonID
, HomologeneID
, GeneID
,
ProteinGI
, FASTA
.
Note that PyGrouper uses GeneID
to group PSMs, so if a GeneID is lacking for
a desired grouping another identifier can be substituted in such as ProteinGI
.
HomologeneID
can be an empty column if this information is not available
or desired.