readme

Flowgram
1. Preprocessing
   Data 01 (mzxml), Data 02 (mzxml)
   -> Pre-processing the data
   -> Data 01 (mat file), Data 02 (mat file)
2. Training data & model parameters generation
   Common Data 01 & Data 02 Tandem Peptide List
   -> Training Data & Model Parameters Generation
   -> Training Dataset, Model Parameters
3. Alignment part
   Data 01 & Data 02 Testing List (different from training peptides), Model Parameters
   -> Align Testing List
   -> Alignment Result
4. Aligning Result Verification
   Alignment Result excel file, Input Ground truth file
   -> Verification
   -> Result
Readme
1. The pre-processing script, main_pre_processing.m, takes the raw LC/MS data files to be
aligned in mzxml format, parses them, and saves the LC/MS level-one scans in two Matlab
data files named data02levelone.mat and data03levelone.mat. These two files are the
input to both the training data and model parameter generation script, main_training.m,
and the alignment testing script, main_testing.m.
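As an illustration of what "keeping the level-one scans" means, here is a minimal Python
sketch. It is not the code in main_pre_processing.m. It assumes the usual mzXML layout
(scan elements with num, msLevel and retentionTime attributes, and a base64-encoded peaks
child holding m/z-intensity pairs), and it assumes the peaks are uncompressed and stored
in network byte order. The helper names and the tiny demo document are made up.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}scan'."""
    return tag.rsplit("}", 1)[-1]


def level_one_scans(mzxml_text):
    """Return (scan number, retention time string, peak list) for each MS1 scan."""
    scans = []
    for elem in ET.fromstring(mzxml_text).iter():
        if local_name(elem.tag) != "scan" or elem.get("msLevel") != "1":
            continue  # skip non-scan elements and tandem (level-two) scans
        peaks = []
        for child in elem:
            if local_name(child.tag) != "peaks" or not (child.text or "").strip():
                continue
            # Assumption: uncompressed peaks, network (big-endian) byte order.
            raw = base64.b64decode(child.text.strip())
            fmt, size = ("d", 8) if child.get("precision") == "64" else ("f", 4)
            values = struct.unpack(">%d%s" % (len(raw) // size, fmt), raw)
            peaks = list(zip(values[0::2], values[1::2]))  # (m/z, intensity)
        scans.append((int(elem.get("num")), elem.get("retentionTime"), peaks))
    return scans


def encode_peaks(pairs):
    """Demo helper: pack (m/z, intensity) pairs as 32-bit big-endian floats."""
    flat = [v for pair in pairs for v in pair]
    return base64.b64encode(struct.pack(">%df" % len(flat), *flat)).decode("ascii")


# Made-up two-scan document: one level-one scan and one level-two scan.
demo = """<mzXML><msRun>
<scan num="1" msLevel="1" retentionTime="PT1.5S"><peaks precision="32">%s</peaks></scan>
<scan num="2" msLevel="2" retentionTime="PT2.0S"><peaks precision="32">%s</peaks></scan>
</msRun></mzXML>""" % (
    encode_peaks([(400.0, 10.0), (500.5, 20.0)]),
    encode_peaks([(100.0, 1.0)]),
)
scans = level_one_scans(demo)
print(scans)
```

Only the scan with msLevel="1" is kept. Files with compressed peaks would need an extra
decompression step that this sketch leaves out.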
2. The training data and model parameter generation script, main_training.m, selects the
training peptides for the Alignment R (AR) statistics model and the Alignment Time (AT)
model. The input is a peptide list Excel file that lists the common Tandem MS identified
peptides from the two data files to be aligned, data02 and data03 (a demo file of such a
list is TandemCommonPeplist.xlsx). The training peptides are chosen based on the
conditions outlined in the paper. The output is a training dataset (a list of peptides)
and the model parameters. The training dataset is saved as an Excel file
(CommonTandemTrainingPeplistInput.xlsx, for example) that contains the peptide sequence,
the charge state in data02, the Tandem MS identified time point in data02, the charge
state in data03, and the Tandem MS identified time point in data03. The model parameter
file (training_parameter.mat) contains all the parameters for the Gamma AR model and the
Student-t AT model. The names of the input and output files can be specified by the user
in the main_training.m script.
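The common peptide list used as input here can be pictured as the intersection of the two
Tandem MS identification lists. The Python sketch below is only an illustration of that
idea, assuming each identification is a (sequence, charge, MS2 time) tuple. The function
name, the choice to keep the first identification of a repeated sequence, and the demo
peptides are all hypothetical and are not taken from main_training.m.

```python
def common_peptides(ids_a, ids_b):
    """Return rows (sequence, charge_a, ms2_time_a, charge_b, ms2_time_b)
    for sequences identified in both lists.

    If a sequence appears more than once in a list, the first identification
    is kept. This is an arbitrary choice made for the sketch.
    """
    first_b = {}
    for seq, charge, ms2_time in ids_b:
        first_b.setdefault(seq, (charge, ms2_time))
    rows, seen = [], set()
    for seq, charge, ms2_time in ids_a:
        if seq in first_b and seq not in seen:
            seen.add(seq)
            rows.append((seq, charge, ms2_time) + first_b[seq])
    return rows


# Made-up identifications for the two data files.
data02_ids = [("AAAK", 2, 12.5), ("CCCR", 3, 30.1), ("AAAK", 3, 12.9)]
data03_ids = [("DDDK", 2, 8.0), ("AAAK", 2, 13.4)]
common = common_peptides(data02_ids, data03_ids)
print(common)
```

Each output row has the same five fields as the training dataset file described above.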
3. The alignment testing script, main_testing.m, is used to align the testing datasets. The
input is a peptide list annotated with peptide sequence and charge state information. See
CommonTandemTestingPeplistInput.xlsx for an example. Note that elution time information
is optional: the data02ms2timepoint and data03ms2timepoint columns can be set to all
zeros.
The input parameters are loaded from the training_parameter.mat file. The alignment
result is an Excel file that contains 8 columns. The 1st column is titled pepseqence and
is the same as in the input file. The 2nd column is named XIC02exist and indicates whether
a peptide can be found in data02 (1 if found, 0 if not). The 3rd column is named
XIC03exsit and has the same meaning as the 2nd column, but for data03. The 4th column is
named XIC0203exist and shows whether the peptide can be found in both datasets. The 5th
and 6th columns are named T02start and T02end and give the times at which the peptide's
elution starts and ends in data02. The last two columns record the elution start and end
times in data03. See CommonTandemMsInspectTestingPeplistAlignmentResult.xlsx for an
example.
The input and output file names can be modified in the main_testing.m script.
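To make the column layout concrete, the following Python sketch reads a small made-up
result table and lists the peptides found in both datasets. It uses CSV text as a
stand-in for the Excel file. The first six column names are the ones given above; the
names T03start and T03end for the last two columns, the zero placeholders for a peptide
that was not found, and all values are assumptions made for illustration.

```python
import csv
import io

# Made-up result table. T03start and T03end are assumed column names.
demo_csv = """pepseqence,XIC02exist,XIC03exsit,XIC0203exist,T02start,T02end,T03start,T03end
AAAK,1,1,1,10.2,10.9,11.0,11.8
CCCR,1,0,0,25.0,25.6,0,0
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))

# The 4th column should be 1 only when both per-dataset flags are 1.
consistent = all(
    int(r["XIC0203exist"]) == int(r["XIC02exist"]) * int(r["XIC03exsit"])
    for r in rows
)

# Peptides found in both datasets, with their elution interval in data02.
aligned = [
    (r["pepseqence"], float(r["T02start"]), float(r["T02end"]))
    for r in rows
    if r["XIC0203exist"] == "1"
]
print(consistent, aligned)
```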
4. The main_verification.m script is used to verify the alignment result when the user has
the tandem time points in the testing Excel file (see CommonTandemTestingPeplistInput.xlsx
for an example). The data02ms2timepoint and data03ms2timepoint columns are taken as the
ground truth. The script compares the intervals detected by LABAHT to the ground truth.
The result is saved in Detection_verification.mat, in which the 1st column is for data02
(1 for detected, 0 for not detected) and the 2nd column is for data03.
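The comparison can be sketched in Python as follows. The criterion used here, that a
peptide counts as detected when its ground-truth tandem time point lies inside the
detected elution interval, is an assumption; the exact rule in main_verification.m is not
stated above. The function name and the numbers are made up.

```python
def verify(intervals, ms2_times):
    """Return 1 where the ground-truth time lies in [start, end], else 0.

    Assumption: inclusive interval bounds. The real script may use another rule.
    """
    return [
        1 if start <= t <= end else 0
        for (start, end), t in zip(intervals, ms2_times)
    ]


# Made-up detected intervals and tandem time points for two peptides.
data02_flags = verify([(10.2, 10.9), (25.0, 25.6)], [10.5, 27.0])
data03_flags = verify([(11.0, 11.8), (26.1, 26.9)], [11.8, 26.5])

# One row per peptide: 1st column for data02, 2nd column for data03.
verification = list(zip(data02_flags, data03_flags))
print(verification)
```

The second peptide is flagged as not detected in data02 because its tandem time point
falls outside the detected interval.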