MSSimulator

A highly configurable simulator for mass spectrometry experiments.

This implementation is described in

Bielow C, Aiche S, Andreotti S, Reinert K
MSSimulator: Simulation of Mass Spectrometry Data
Journal of Proteome Research (2011), DOI: 10.1021/pr200155f

The most important features are:

- Simulation of capillary electrophoresis and HPLC as separation step
- Simulation of MS spectra
- Simulation of MS/MS spectra with configurable precursor-selection strategy
- Simulation of iTRAQ labels
- Simulation of different noise models and instrument types (resolution, peak shape)

Look at the INI file (via "MSSimulator -write_ini myini.ini") to see the available parameters and more functionality.

Input: FASTA files

Protein sequences can be provided as a FASTA file. A special tag in the description of each entry specifies the protein abundance. To create a complex FASTA file with a Gaussian protein abundance model in log space, see the Python script shipped with your OpenMS installation (e.g., <OpenMS-dir>/share/OpenMS/examples/simulation/FASTAProteinAbundanceSampling.py). It supports (random) sampling from a large FASTA file and protein weight filtering, and adds an intensity tag to each entry.

If multiplexed data is simulated (like SILAC or iTRAQ), you need to supply multiple FASTA input files. For the label-free setting, all FASTA input files are merged into one before simulation.

For MS/MS simulation only a test model is shipped with OpenMS.
Please find trained models at: http://sourceforge.net/projects/open-ms/files/Supplementary/Simulation/.

To specify intensity values for certain proteins, add an abundance tag for the corresponding protein in the FASTA input file:

Add '[# <key>=<value> #]' at the end of the '>' line to specify:
- intensity

For RT control (disable digestion to make this work!):
- rt (subjected to a small local error by randomization)
- RT (used as is, without local error)

e.g.

>seq1 optional comment [# intensity=567.4 #]
ASQYLATARHGFLPRHRDTGILP
>seq2 optional comment [# intensity=117.4, RT=405.3 #]
QKRPSQRHGLATARHGTGGGDRA
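
As a minimal sketch of how such a header line could be read back in a script (this is not the parser OpenMS itself uses; the function name `parse_abundance_tag` and the regular expression are illustrative assumptions), the tag can be split into key/value pairs:

```python
import re

# Hypothetical helper, not part of OpenMS: matches a trailing '[# ... #]' tag.
TAG_PATTERN = re.compile(r"\[#(.*?)#\]\s*$")


def parse_abundance_tag(header):
    """Return a dict mapping tag keys to float values (empty if no tag is present)."""
    match = TAG_PATTERN.search(header)
    if match is None:
        return {}
    values = {}
    for item in match.group(1).split(","):
        if "=" not in item:
            continue  # skip malformed items instead of failing
        key, value = item.split("=", 1)
        # Keys stay case-sensitive: 'rt' and 'RT' have different meanings.
        values[key.strip()] = float(value)
    return values


print(parse_abundance_tag(">seq2 optional comment [# intensity=117.4, RT=405.3 #]"))
```

Keeping the keys case-sensitive matters here because 'rt' and 'RT' are treated differently by the simulator.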

The command line parameters of this tool are:

MSSimulator -- A highly configurable simulator for mass spectrometry experiments.
Version: 2.3.0 Jan  9 2018, 17:46:23, Revision: 38ae115

Usage:
  MSSimulator <options>

This tool has algorithm parameters that are not shown here! Please check the ini file for a detailed description or use the --helphelp option.

Options (mandatory options marked with '*'):
  -in <files>*       Input protein sequences (valid formats: 'FASTA')
  -out <file>        Output: simulated MS raw (profile) data (valid formats: 'mzML')
  -out_pm <file>     Output: ground-truth picked (centroided) MS data (valid formats: 'mzML')
  -out_fm <file>     Output: ground-truth features (valid formats: 'featureXML')
  -out_cm <file>     Output: ground-truth features, grouping ESI charge variants of each parent peptide (valid formats: 'consensusXML')
  -out_lcm <file>    Output: ground-truth features, grouping labeled variants (valid formats: 'consensusXML')
  -out_cntm <file>   Output: ground-truth features caused by contaminants (valid formats: 'featureXML')
  -out_id <file>     Output: ground-truth MS2 peptide identifications (valid formats: 'idXML')
                     
Common UTIL options:
  -ini <file>        Use the given TOPP INI file
  -threads <n>       Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  -write_ini <file>  Writes the default configuration file
  --help             Shows options
  --helphelp         Shows all options (including advanced)

The following configuration subsections are valid:
 - algorithm   Algorithm parameters section

You can write an example INI file using the '-write_ini' option.
Documentation of subsection parameters can be found in the doxygen documentation or the INIFileEditor.
Have a look at the OpenMS documentation for more information.
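
For scripted runs, the options listed above can be assembled programmatically. The following is a sketch only: the helper name `build_mssimulator_command` and all file names are hypothetical, and it assumes that several FASTA files are passed to '-in' as separate space-separated arguments.

```python
import shlex


def build_mssimulator_command(fasta_files, out_mzml, out_features=None, ini_file=None):
    """Assemble an MSSimulator argument list from the options documented above."""
    if not fasta_files:
        raise ValueError("at least one FASTA input file is required ('-in' is mandatory)")
    command = ["MSSimulator", "-in", *fasta_files, "-out", out_mzml]
    if out_features is not None:
        command += ["-out_fm", out_features]  # ground-truth features (featureXML)
    if ini_file is not None:
        command += ["-ini", ini_file]  # e.g. a file created via -write_ini
    return command


# Hypothetical file names, for illustration only.
command = build_mssimulator_command(
    ["proteins.fasta"], "sim.mzML", out_features="truth.featureXML", ini_file="myini.ini"
)
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` if MSSimulator is available on the PATH.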

INI file documentation of this tool:


+MSSimulator: A highly configurable simulator for mass spectrometry experiments.

version (default: '2.3.0'): Version of the tool that generated this parameters file.

++1: Instance '1' section for 'MSSimulator'

in (default: []): Input protein sequences (input file; valid formats: '*.FASTA')

out (default: ''): Output: simulated MS raw (profile) data (output file; valid formats: '*.mzML')

out_pm (default: ''): Output: ground-truth picked (centroided) MS data (output file; valid formats: '*.mzML')

out_fm (default: ''): Output: ground-truth features (output file; valid formats: '*.featureXML')

out_cm (default: ''): Output: ground-truth features, grouping ESI charge variants of each parent peptide (output file; valid formats: '*.consensusXML')

out_lcm (default: ''): Output: ground-truth features, grouping labeled variants (output file; valid formats: '*.consensusXML')

out_cntm (default: ''): Output: ground-truth features caused by contaminants (output file; valid formats: '*.featureXML')

out_id (default: ''): Output: ground-truth MS2 peptide identifications (output file; valid formats: '*.idXML')

log (default: ''): Name of log file (created only when specified)

debug (default: '0'): Sets the debug level

threads (default: '1'): Sets the number of threads allowed to be used by the TOPP tool

no_progress (default: 'false'): Disables progress logging to command line (valid: 'true', 'false')

force (default: 'false'): Overwrite tool specific checks. (valid: 'true', 'false')

test (default: 'false'): Enables the test mode (needed for internal use only) (valid: 'true', 'false')

+++algorithm: Algorithm parameters section

++++MSSim

+++++Digestion

enzyme (default: 'Trypsin'): Enzyme to use for digestion (select 'no cleavage' to skip digestion) (valid: 'Chymotrypsin', 'Asp-N', 'PepsinA', 'Lys-C/P', 'V8-E', 'Arg-C', 'Trypsin/P', 'Arg-C/P', 'CNBr', 'Lys-N', 'Formic_acid', 'glutamyl endopeptidase', '2-iodobenzoate', 'no cleavage', 'leukocyte elastase', 'proline endopeptidase', 'Alpha-lytic protease', 'Asp-N/B', 'V8-DE', 'Asp-N_ambic', 'Chymotrypsin/P', 'Lys-C', 'unspecific cleavage', 'TrypChymo', 'Trypsin')

model (default: 'naive'): The cleavage model to use for digestion. 'trained' is based on a log likelihood model (see DOI:10.1021/pr060507u). (valid: 'trained', 'naive')

min_peptide_length (default: '3'): Minimum peptide length after digestion (shorter ones will be discarded) (range: 1:∞)

++++++model_trained

threshold (default: '0.5'): Model threshold for calling a cleavage. Higher values increase the number of cleavages. -2 will give no cleavages, +4 almost full cleavage. (range: -2:4)

++++++model_naive

missed_cleavages (default: '1'): Maximum number of missed cleavages considered. All possible resulting peptides will be created. (range: 0:∞)
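
To illustrate what 'missed_cleavages' and 'min_peptide_length' mean in the naive model, here is a small sketch of a tryptic digest. It is not the OpenMS implementation: the function name is hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin rule, assumed here for illustration.

```python
def naive_tryptic_digest(sequence, missed_cleavages=1, min_peptide_length=3):
    """Return all peptides with up to 'missed_cleavages' missed cleavage sites."""
    # Assumed rule: cleave after K or R, except when the next residue is P.
    cut_points = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            cut_points.append(i + 1)
    cut_points.append(len(sequence))

    peptides = []
    n_fragments = len(cut_points) - 1
    for start in range(n_fragments):
        # Join a fully cleaved fragment with up to 'missed_cleavages' following ones.
        for missed in range(missed_cleavages + 1):
            end = start + missed + 1
            if end > n_fragments:
                break
            peptide = sequence[cut_points[start]:cut_points[end]]
            if len(peptide) >= min_peptide_length:
                peptides.append(peptide)  # shorter peptides are discarded
    return peptides


print(naive_tryptic_digest("ASQYLATARHGFLPRHRDTGILP", missed_cleavages=1))
```

With one missed cleavage allowed, the short fragment 'HR' is dropped on its own (length 2) but still appears inside the longer peptides that span it.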

+++++RT

rt_column (default: 'HPLC'): Modelling of an RT or CE column (valid: 'none', 'HPLC', 'CE')

auto_scale (default: 'true'): Scale predicted RTs/MTs to the given 'total_gradient_time'? If 'true', for CE this means that 'CE:lenght_d', 'CE:length_total', and 'CE:voltage' have no influence. (valid: 'true', 'false')

total_gradient_time (default: '2500'): The duration [s] of the gradient. (range: 1e-05:∞)

sampling_rate (default: '2'): Time interval [s] between consecutive scans (range: 0.01:60)

++++++scan_window

min (default: '500'): Start of RT scan window [s] (range: 0:∞)

max (default: '1500'): End of RT scan window [s] (range: 1:∞)

++++++variation: Random component that simulates technical/biological variation

feature_stddev (default: '3'): Standard deviation of shift in retention time [s] from predicted model (applied to every single feature independently)

affine_offset (default: '0'): Global offset in retention time [s] from predicted model

affine_scale (default: '1'): Global scaling in retention time from predicted model
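
The three 'variation' parameters can be read as a global affine transform plus an independent per-feature shift. The sketch below shows one plausible way they combine; the order of operations (scale, then offset, then a normally distributed shift) and the function name are assumptions, not taken from the OpenMS source.

```python
import random


def vary_retention_time(predicted_rt, affine_scale=1.0, affine_offset=0.0,
                        feature_stddev=3.0, rng=random):
    """Apply a global affine transform and a per-feature random shift to an RT [s]."""
    # Global transform shared by all features (assumed order: scale, then offset).
    rt = affine_scale * predicted_rt + affine_offset
    # Independent shift per feature, drawn from a normal distribution.
    return rt + rng.gauss(0.0, feature_stddev)


rng = random.Random(42)  # fixed seed, comparable to a 'reproducible' run
print(vary_retention_time(1000.0, rng=rng))
```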

++++++column_condition

distortion (default: '0'): Distortion of the elution profiles. Good presets are 0 for a perfect elution profile, 1 for a slightly distorted elution profile, etc. For trapping instruments (e.g. Orbitrap), distortion should be >4. (range: 0:10)

++++++profile_shape

+++++++width: Width of the EGH elution shape, i.e. the sigma^2 parameter, which is computed using 'value' + rnd_cauchy('variance')

value (default: '9'): Width of the Exponential Gaussian Hybrid distribution shape of the elution profile. This does not correspond directly to the width in [s]. (range: 0:∞)

variance (default: '1.6'): Random component of the width (set to 0 to disable randomness), i.e. scale parameter for the Lorentzian variation of the variance (Note: The scale parameter has to be >= 0). (range: 0:∞)

+++++++skewness: Skewness of the EGH elution shape, i.e. the tau parameter, which is computed using 'value' + rnd_cauchy('variance')

value (default: '0.1'): Asymmetric component of the EGH. Higher absolute(!) values lead to more skewness (negative values cause fronting, positive values cause tailing). Tau parameter of the EGH, i.e. time constant of the exponential decay of the Exponential Gaussian Hybrid distribution shape of the elution profile.

variance (default: '0.3'): Random component of skewness (set to 0 to disable randomness), i.e. scale parameter for the Lorentzian variation of the time constant (Note: The scale parameter has to be >= 0). (range: 0:∞)
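
For orientation, the published form of the Exponential-Gaussian Hybrid (EGH) function can be sketched as follows. Whether MSSimulator evaluates it in exactly this way is an assumption; the function names are hypothetical, and `draw_shape_parameter` merely mirrors the 'value' + rnd_cauchy('variance') description above without any clamping.

```python
import math
import random


def egh(t, apex_rt, sigma_square, tau, height=1.0):
    """Exponential-Gaussian hybrid evaluated at time t (apex at apex_rt)."""
    dt = t - apex_rt
    denominator = 2.0 * sigma_square + tau * dt
    if denominator <= 0.0:
        return 0.0  # the EGH is defined as zero where the denominator is not positive
    return height * math.exp(-(dt * dt) / denominator)


def draw_shape_parameter(value, variance, rng=random):
    """'value' plus Cauchy (Lorentzian) noise with scale 'variance'; 0 disables randomness."""
    if variance == 0:
        return value
    return value + variance * math.tan(math.pi * (rng.random() - 0.5))


# A positive tau gives tailing: the profile is higher 3 s after the apex than 3 s before it.
print(egh(103.0, 100.0, 9.0, 0.1), egh(97.0, 100.0, 9.0, 0.1))
```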

++++++HPLC

model_file (default: 'examples/simulation/RTPredict.model'): SVM model for retention time prediction

++++++CE

pH (default: '3'): pH of buffer (range: 0:14)

alpha (default: '0.5'): Exponent alpha used to calculate mobility (range: 0:1)

mu_eo (default: '0'): Electroosmotic flow (range: 0:5)

lenght_d (default: '70'): Length of capillary [cm] from injection site to MS (range: 0:1000)

length_total (default: '75'): Total length of capillary [cm] (range: 0:1000)

voltage (default: '1000'): Voltage applied to capillary (range: 0:∞)

+++++Detectability

dt_simulation_on (default: 'false'): Modelling detectability enabled? This can serve as a filter to remove peptides which ionize badly, thus reducing the peptide count (valid: 'true', 'false')

min_detect (default: '0.5'): Minimum peptide detectability accepted. Peptides with a lower score will be removed

dt_model_file (default: 'examples/simulation/DTPredict.model'): SVM model for peptide detectability prediction

+++++Ionization

++++++esi

ionized_residues (default: [Arg, Lys, His]): List of residues (as three letter code) that will be considered during ES ionization. The N-term is always assumed to carry a charge. This parameter will be ignored during MALDI ionization. (valid: 'Ala', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Sec', 'Ser', 'Thr', 'Val', 'Trp', 'Tyr')

charge_impurity (default: [H+:1]): List of charged ions that contribute to charge with weight of occurrence (their sum is scaled to 1 internally), e.g. ['H:1'] or ['H:0.7' 'Na:0.3'], ['H:4' 'Na:1'] (which internally translates to ['H:0.8' 'Na:0.2'])

max_impurity_set_size (default: '3'): Maximal number of combinations of charge impurities allowed (each generating one feature) per charge state. E.g. assuming charge=3 and this parameter is 2, then we could choose to allow '3H+, 2H+Na+' features (given certain 'charge_impurity' constraints), but not '3H+, 2H+Na+, 3Na+'

ionization_probability (default: '0.8'): Probability for the binomial distribution of the ESI charge states
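
The weight scaling described for 'charge_impurity' and the binomial model behind 'ionization_probability' can be sketched as below. Both functions are illustrative and hypothetically named; in particular, how MSSimulator counts the number of ionizable sites (here simply passed in as `n_sites`) is an assumption.

```python
import math


def normalize_charge_impurity(entries):
    """Scale weights so they sum to 1, e.g. ['H:4', 'Na:1'] -> {'H': 0.8, 'Na': 0.2}."""
    weights = {}
    for entry in entries:
        ion, weight = entry.rsplit(":", 1)
        weights[ion] = weights.get(ion, 0.0) + float(weight)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {ion: weight / total for ion, weight in weights.items()}


def charge_state_probabilities(n_sites, ionization_probability=0.8):
    """Binomial probability of observing charge k = 0..n_sites."""
    p = ionization_probability
    return [math.comb(n_sites, k) * p ** k * (1 - p) ** (n_sites - k)
            for k in range(n_sites + 1)]


print(normalize_charge_impurity(["H:4", "Na:1"]))
print(charge_state_probabilities(3))
```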

++++++maldi

ionization_probabilities (default: [0.9, 0.1]): List of probabilities for the different charge states during MALDI ionization (the list must sum up to 1.0)

++++++mz

lower_measurement_limit (default: '200'): Lower m/z detector limit. (range: 0:∞)

upper_measurement_limit (default: '2500'): Upper m/z detector limit. (range: 0:∞)

+++++RawSignal

enabled (default: 'true'): Enable RAW signal simulation? (select 'false' if you only need feature-maps) (valid: 'true', 'false')

peak_shape (default: 'Gaussian'): Peak shape used around each isotope peak. Be aware that the area under the curve is constant for both types, but the maximal height will differ (~ 2:3 = Lorentz:Gaussian) due to the wider base of the Lorentzian. (valid: 'Gaussian', 'Lorentzian')

++++++resolution

value (default: '50000'): Instrument resolution at 400 Th.

type (default: 'linear'): How does resolution change with increasing m/z? QTOFs usually show 'constant' behavior, FTs have linear degradation, and on Orbitraps the resolution decreases with the square root of mass. (valid: 'constant', 'linear', 'sqrt')
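
To make the three resolution types concrete, the sketch below scales the resolution given at 400 Th to another m/z and converts it to a peak width via R = (m/z) / FWHM. The exact scaling formulas (proportional to 400/mz for 'linear' and to sqrt(400/mz) for 'sqrt') are assumptions chosen to match the descriptions above, and the function names are hypothetical.

```python
import math


def resolution_at(mz, resolution_400=50000.0, kind="linear"):
    """Resolution at the given m/z, anchored to the value at 400 Th."""
    if kind == "constant":
        return resolution_400
    if kind == "linear":
        return resolution_400 * 400.0 / mz  # assumed: degrades linearly with m/z
    if kind == "sqrt":
        return resolution_400 * math.sqrt(400.0 / mz)  # assumed: square-root decay
    raise ValueError("kind must be 'constant', 'linear' or 'sqrt'")


def fwhm_at(mz, resolution_400=50000.0, kind="linear"):
    """Peak width (FWHM, in Th) implied by the resolution definition R = mz / FWHM."""
    return mz / resolution_at(mz, resolution_400, kind)


for kind in ("constant", "linear", "sqrt"):
    print(kind, resolution_at(1600.0, kind=kind), fwhm_at(1600.0, kind=kind))
```

Together with 'sampling_points' (raw data points per FWHM), such a width would determine how densely a peak is sampled.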

++++++baseline: Baseline modeling for MALDI ionization

scaling (default: '0'): Scale of baseline. Set to 0 to disable simulation of baseline. (range: 0:∞)

shape (default: '0.5'): The baseline is modeled by an exponential probability density function (pdf) with f(x) = shape*e^(- shape*x) (range: 0:∞)

++++++mz

sampling_points (default: '3'): Number of raw data points per FWHM of the peak. (range: 2:∞)

++++++contaminants

file (default: 'examples/simulation/contaminants.csv'): Contaminants file with sum formula and absolute RT interval. See 'OpenMS/examples/simulation/contaminants.txt' for details.

++++++variation: Random components that simulate biological and technical variations of the simulated data.

+++++++mz: Shifts in mass to charge dimension of the simulated signals.

error_stddev (default: '0'): Standard deviation for m/z errors. Set to 0 to disable simulation of m/z errors.

error_mean (default: '0'): Average systematic m/z error (Da)

+++++++intensity: Variations in intensity to model randomness in feature intensity.

scale (default: '100'): Constant scale factor of the feature intensity. Set to 1.0 to get the real intensity values provided in the FASTA file. (range: 0:∞)

scale_stddev (default: '0'): Standard deviation of peak intensity (relative to the scaled peak height). Set to 0 to get simple rescaled intensities. (range: 0:∞)

++++++noise: Parameters modeling noise in mass spectrometry measurements.

+++++++shot: Parameters of Poisson and Exponential for shot noise modeling (set :rate OR :mean = 0 to disable).

rate (default: '0'): Poisson rate of shot noise per unit m/z. Set this to 0 to disable simulation of shot noise. (range: 0:∞)

intensity-mean (default: '1'): Shot noise intensity mean (exponentially distributed with given mean).

+++++++white: Parameters of Gaussian distribution for white noise modeling (set :mean AND :stddev = 0 to disable).

mean (default: '0'): Mean value of white noise being added to each measured signal.

stddev (default: '0'): Standard deviation of white noise being added to each measured signal.

+++++++detector: Parameters of Gaussian distribution for detector noise modeling (set :mean AND :stddev = 0 to disable).

mean (default: '0'): Mean value of the detector noise being added to the complete measurement.

stddev (default: '0'): Standard deviation of the detector noise being added to the complete measurement.
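
The shot and white noise models described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the OpenMS code: shot-noise positions are generated as a Poisson process (exponential gaps with the given rate per unit m/z), intensities are exponentially distributed with the given mean, and white noise is a Gaussian value added to each signal. Function names are hypothetical.

```python
import random


def simulate_shot_noise(mz_min, mz_max, rate, intensity_mean, rng):
    """Return (m/z, intensity) pairs of shot noise within [mz_min, mz_max)."""
    peaks = []
    if rate <= 0 or intensity_mean <= 0:
        return peaks  # a rate or mean of 0 disables shot noise
    # Exponential gaps between events yield a Poisson process with 'rate' events per Th.
    position = mz_min + rng.expovariate(rate)
    while position < mz_max:
        peaks.append((position, rng.expovariate(1.0 / intensity_mean)))
        position += rng.expovariate(rate)
    return peaks


def add_white_noise(intensities, mean, stddev, rng):
    """Add an independent Gaussian value to every measured signal."""
    return [value + rng.gauss(mean, stddev) for value in intensities]


rng = random.Random(1)
shot = simulate_shot_noise(200.0, 210.0, rate=2.0, intensity_mean=1.0, rng=rng)
print(len(shot), "shot-noise peaks")
print(add_white_noise([10.0, 20.0], mean=0.0, stddev=0.5, rng=rng))
```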

+++++RawTandemSignal

status (default: 'disabled'): Create Tandem-MS scans? (valid: 'disabled', 'precursor', 'MS^E')

tandem_mode (default: '0'): Algorithm to generate the tandem-MS spectra. 0 - fixed intensities, 1 - SVC prediction (abundant/missing), 2 - SVR prediction of peak intensity (range: 0:2)

svm_model_set_file (default: 'examples/simulation/SvmModelSet.model'): File containing the filenames of SVM Models for different charge variants

++++++Precursor

ms2_spectra_per_rt_bin (default: '5'): Number of allowed MS/MS spectra in a retention time bin. (range: 1:∞)

min_mz_peak_distance (default: '2'): The minimal distance (in Th) between two peaks for concurrent selection for fragmentation. Also used to define the m/z width of an exclusion window (distance +/- from m/z of precursor). If you set this lower than the isotopic envelope of a peptide, you might get multiple fragment spectra pointing to the same precursor. (range: 0.0001:∞)

mz_isolation_window (default: '2'): All peaks within a mass window (in Th) of a selected peak are also selected for fragmentation. (range: 0:∞)

exclude_overlapping_peaks (default: 'false'): If true, overlapping or nearby peaks (within 'min_mz_peak_distance') are excluded for selection. (valid: 'true', 'false')

charge_filter (default: [2, 3]): Charges considered for MS2 fragmentation. (range: 1:5)

+++++++Exclusion

use_dynamic_exclusion (default: 'false'): If true, dynamic exclusion is applied. (valid: 'true', 'false')

exclusion_time (default: '100'): The time (in seconds) a feature is excluded. (range: 0:∞)

+++++++ProteinBasedInclusion

max_list_size (default: '1000'): The maximal number of precursors in the inclusion list. (range: 1:∞)

++++++++rt

min_rt (default: '960'): Minimal rt in seconds. (range: 0:∞)

max_rt (default: '3840'): Maximal rt in seconds. (range: 0:∞)

rt_step_size (default: '30'): rt step size in seconds. (range: 1:∞)

rt_window_size (default: '100'): rt window size in seconds. (range: 1:∞)

++++++++thresholds

min_protein_id_probability (default: '0.95'): Minimal protein probability for a protein to be considered identified. (range: 0:1)

min_pt_weight (default: '0.5'): Minimal pt weight of a precursor (range: 0:1)

min_mz (default: '500'): Minimal mz to be considered in protein based LP formulation. (range: 0:∞)

max_mz (default: '5000'): Maximal mz to be considered in protein based LP formulation. (range: 0:∞)

use_peptide_rule (default: 'false'): Use peptide rule instead of minimal protein id probability (valid: 'true', 'false')

min_peptide_ids (default: '2'): If use_peptide_rule is true, this parameter sets the minimal number of peptide ids for a protein id (range: 1:∞)

min_peptide_probability (default: '0.95'): If use_peptide_rule is true, this parameter sets the minimal probability for a peptide to be safely identified (range: 0:1)

++++++MS_E

add_single_spectra (default: 'false'): If true, the MS2 spectra for each peptide signal are included in the output (might be a lot). They will have a meta value 'MSE_DebugSpectrum' attached, so they can be filtered out. Native MS_E spectra will have 'MSE_Spectrum' instead. (valid: 'true', 'false')

++++++TandemSim

+++++++Simple

add_isotopes (default: 'false'): If set to true, isotope peaks of the product ion peaks are added (valid: 'true', 'false')

max_isotope (default: '2'): Defines the maximal isotopic peak which is added; add_isotopes must be set to true

add_metainfo (default: 'false'): Adds the type of peaks as metainfo to the peaks, like y8+, [M-H2O+2H]++ (valid: 'true', 'false')

add_losses (default: 'false'): Adds common losses to those ions expected to have them; only water and ammonia losses are considered (valid: 'true', 'false')

add_precursor_peaks (default: 'false'): Adds peaks of the precursor to the spectrum, which happen to occur sometimes (valid: 'true', 'false')

add_all_precursor_charges (default: 'false'): Adds precursor peaks with all charges in the given range (valid: 'true', 'false')

add_abundant_immonium_ions (default: 'false'): Add most abundant immonium ions (valid: 'true', 'false')

add_first_prefix_ion (default: 'false'): If set to true, e.g. b1 ions are added (valid: 'true', 'false')

add_y_ions (default: 'true'): Add peaks of y-ions to the spectrum (valid: 'true', 'false')

add_b_ions (default: 'true'): Add peaks of b-ions to the spectrum (valid: 'true', 'false')

add_a_ions (default: 'false'): Add peaks of a-ions to the spectrum (valid: 'true', 'false')

add_c_ions (default: 'false'): Add peaks of c-ions to the spectrum (valid: 'true', 'false')

add_x_ions (default: 'false'): Add peaks of x-ions to the spectrum (valid: 'true', 'false')

add_z_ions (default: 'false'): Add peaks of z-ions to the spectrum (valid: 'true', 'false')

y_intensity (default: '1'): Intensity of the y-ions

b_intensity (default: '1'): Intensity of the b-ions

a_intensity (default: '1'): Intensity of the a-ions

c_intensity (default: '1'): Intensity of the c-ions

x_intensity (default: '1'): Intensity of the x-ions

z_intensity (default: '1'): Intensity of the z-ions

relative_loss_intensity (default: '0.1'): Intensity of loss ions, in relation to the intact ion intensity

precursor_intensity (default: '1'): Intensity of the precursor peak

precursor_H2O_intensity (default: '1'): Intensity of the H2O loss peak of the precursor

precursor_NH3_intensity (default: '1'): Intensity of the NH3 loss peak of the precursor

+++++++SVM

add_isotopes (default: 'false'): If set to true, isotope peaks of the product ion peaks are added (valid: 'true', 'false')

max_isotope (default: '2'): Defines the maximal isotopic peak which is added; add_isotopes must be set to true

add_metainfo (default: 'false'): Adds the type of peaks as metainfo to the peaks, like y8+, [M-H2O+2H]++ (valid: 'true', 'false')

add_first_prefix_ion (default: 'false'): If set to true, e.g. b1 ions are added (valid: 'true', 'false')

hide_y_ions (default: 'false'): Hide peaks of y-ions in the spectrum (valid: 'true', 'false')

hide_y2_ions (default: 'false'): Hide peaks of y2-ions in the spectrum (valid: 'true', 'false')

hide_b_ions (default: 'false'): Hide peaks of b-ions in the spectrum (valid: 'true', 'false')

hide_b2_ions (default: 'false'): Hide peaks of b2-ions in the spectrum (valid: 'true', 'false')

hide_a_ions (default: 'false'): Hide peaks of a-ions in the spectrum (valid: 'true', 'false')

hide_c_ions (default: 'false'): Hide peaks of c-ions in the spectrum (valid: 'true', 'false')

hide_x_ions (default: 'false'): Hide peaks of x-ions in the spectrum (valid: 'true', 'false')

hide_z_ions (default: 'false'): Hide peaks of z-ions in the spectrum (valid: 'true', 'false')

hide_losses (default: 'false'): Hide common losses (only water and ammonia losses are considered) (valid: 'true', 'false')

y_intensity (default: '1'): Intensity of the y-ions

b_intensity (default: '1'): Intensity of the b-ions

a_intensity (default: '1'): Intensity of the a-ions

c_intensity (default: '1'): Intensity of the c-ions

x_intensity (default: '1'): Intensity of the x-ions

z_intensity (default: '1'): Intensity of the z-ions

relative_loss_intensity (default: '0.1'): Intensity of loss ions, in relation to the intact ion intensity

+++++Global

ionization_type (default: 'ESI'): Type of ionization (MALDI or ESI) (valid: 'MALDI', 'ESI')

+++++Labeling

type (default: 'labelfree'): Select the labeling type you want for your experiment (valid: 'ICPL', 'SILAC', 'itraq', 'labelfree', 'o18')

++++++ICPL: ICPL labeling on MS1 level of lysines and n-term (on protein or peptide level) with either two or three channels.

ICPL_fixed_rtshift (default: '0'): Fixed retention time shift between labeled pairs. If set to 0.0, only the retention times computed by the RT model step are used.

label_proteins (default: 'true'): Enables protein-labeling. (select 'false' if you only need peptide-labeling) (valid: 'true', 'false')

ICPL_light_channel_label (default: 'UniMod:365'): UniMod Id of the light channel ICPL label.

ICPL_medium_channel_label (default: 'UniMod:687'): UniMod Id of the medium channel ICPL label.

ICPL_heavy_channel_label (default: 'UniMod:364'): UniMod Id of the heavy channel ICPL label.

++++++SILAC: SILAC labeling on MS1 level with up to 3 channels and custom modifications.

fixed_rtshift (default: '0.0001'): Fixed retention time shift between labeled peptides. If set to 0.0, only the retention times computed by the RT model step are used. (range: 0:∞)

+++++++medium_channel: Modifications for the medium SILAC channel.

modification_lysine (default: 'UniMod:481'): Modification of Lysine in the medium SILAC channel

modification_arginine (default: 'UniMod:188'): Modification of Arginine in the medium SILAC channel

+++++++heavy_channel: Modifications for the heavy SILAC channel. If you want to use only 2 channels, just leave the labels as they are and provide only 2 input files.

modification_lysine (default: 'UniMod:259'): Modification of Lysine in the heavy SILAC channel. If left empty, two-channel SILAC is assumed.

modification_arginine (default: 'UniMod:267'): Modification of Arginine in the heavy SILAC channel. If left empty, two-channel SILAC is assumed.

++++++itraq: iTRAQ labeling on MS2 level with up to 4 (4plex) or 8 (8plex) channels.

iTRAQ (default: '4plex'): 4plex or 8plex iTRAQ? (valid: '4plex', '8plex')

reporter_mass_shift (default: '0.1'): Allowed shift (uniformly distributed - left to right) in Da from the expected position (of e.g. 114.1, 115.1) (range: 0:0.5)

channel_active_4plex (default: [114:myReference]): Four-plex only: Each channel that was used in the experiment and its description (114-117) in format <channel>:<name>, e.g. "114:myref","115:liver".

channel_active_8plex (default: [113:myReference]): Eight-plex only: Each channel that was used in the experiment and its description (113-121) in format <channel>:<name>, e.g. "113:myref","115:liver","118:lung".

isotope_correction_values_4plex (default: [114:0/1/5.9/0.2, 115:0/2/5.6/0.1, 116:0/3/4.5/0.1, 117:0.1/4/3.5/0.1]): Override default values (see Documentation); use the following format: <channel>:<-2Da>/<-1Da>/<+1Da>/<+2Da>; e.g. '114:0/0.3/4/0', '116:0.1/0.3/3/0.2'

isotope_correction_values_8plex (default: [113:0/0/6.89/0.22, 114:0/0.94/5.9/0.16, 115:0/1.88/4.9/0.1, 116:0/2.82/3.9/0.07, 117:0.06/3.77/2.99/0, 118:0.09/4.71/1.88/0, 119:0.14/5.66/0.87/0, 121:0.27/7.44/0.18/0]): Override default values (see Documentation); use the following format: <channel>:<-2Da>/<-1Da>/<+1Da>/<+2Da>; e.g. '113:0/0.3/4/0', '116:0.1/0.3/3/0.2'

Y_contamination (default: '0.3'): Efficiency of labeling tyrosine ('Y') residues. 0=off, 1=full labeling (range: 0:1)
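
When preparing custom isotope correction values, it can help to validate the entries before writing them to the INI file. The sketch below assumes each entry consists of a channel number, a colon, and four slash-separated numbers for the -2 Da, -1 Da, +1 Da and +2 Da positions, as in the examples above; the function name is hypothetical and this is not the parser OpenMS uses.

```python
def parse_isotope_correction(entry):
    """Parse e.g. '114:0/1/5.9/0.2' into (114, [0.0, 1.0, 5.9, 0.2])."""
    channel, values = entry.split(":", 1)
    shifts = [float(v) for v in values.split("/")]
    if len(shifts) != 4:
        raise ValueError("expected four values (-2Da/-1Da/+1Da/+2Da) in %r" % entry)
    return int(channel), shifts


defaults_4plex = ["114:0/1/5.9/0.2", "115:0/2/5.6/0.1", "116:0/3/4.5/0.1", "117:0.1/4/3.5/0.1"]
for channel, shifts in map(parse_isotope_correction, defaults_4plex):
    print(channel, shifts)
```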

++++++o18: 18O labeling on MS1 level with 2 channels, requiring trypsin digestion.

labeling_efficiency (default: '1'): Describes the distribution of the labeled peptide over the different states (unlabeled, mono- and di-labeled) (range: 0:1)

++++RandomNumberGenerators: Parameters for generating the random aspects (e.g. noise) in the simulated data. The generation is separated into two parts, the technical part, like noise in the raw signal, and the biological part, like systematic deviations in the predicted retention times.

biological (default: 'random'): Controls the 'biological' randomness of the generated data (e.g. systematic effects like deviations in RT). If set to 'random' each experiment will look different. If set to 'reproducible' each experiment will have the same outcome (given that the input data is the same). (valid: 'reproducible', 'random')

technical (default: 'random'): Controls the 'technical' randomness of the generated data (e.g. noise in the raw signal). If set to 'random' each experiment will look different. If set to 'reproducible' each experiment will have the same outcome (given that the input data is the same). (valid: 'reproducible', 'random')