Shimizu Lab., Graduate School of Information Science and Technology, Osaka University

Regular expressions of MS/MS spectra

Motivation

Small molecule identification based on MS/MS spectral data may need another approach. For example, methods for

  • Evaluating spectral similarities among structurally related compounds
  • Describing fragmentation motifs commonly observed in MS/MS spectra

are needed to perform partial annotation and characterization of metabolite structures of interest.

Text representation of MS/MS spectra, MS/MS strings

The MS/MS spectrum of the protonated molecule ([M+H]+) of L-histidine (MassBank record PR100321) shows two major fragment ions with the molecular formulas C5H8N3 ([M+H-CO-H2O]+) and C5H5N2 ([M+H-CO-H2O-NH3]+), produced by the loss of H2O and CO and by the further loss of NH3, respectively.

The MS/MS spectrum could be described as the following MS/MS string by ignoring the signal intensity information:

C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2; (1)

The MS/MS string (1) consists of a sequence of expressions of the form [chemical formula of neutral loss]:[chemical formula of fragment signal];. For example, the expression "C5H8N3;" indicates the chemical formula of a fragment or precursor (the right-hand separator is “;”), and "H3N1:" indicates the chemical formula of the neutral loss between 2 fragments, C5H5N2; and C5H8N3; (the right-hand separator is “:”). The first neutral loss is counted from m/z 0, so it equals the formula of the first fragment.
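The format described above can be read back into pairs with a few lines of Python. This is a minimal sketch, not the authors' tool: the function name `parse_msms_string` is hypothetical, and the L-histidine string is written with explicit element counts (H3N1 for the ammonia loss), as in the table of strings further down this page.

```python
def parse_msms_string(msms_string):
    """Split an MS/MS string into (neutral_loss, fragment) pairs.

    Each unit has the form 'loss:fragment;' as described in the text.
    """
    pairs = []
    for unit in msms_string.split(";"):
        if not unit:  # skip the empty piece after the final ';'
            continue
        loss, fragment = unit.split(":")
        pairs.append((loss, fragment))
    return pairs


# Full complete MS/MS string of L-histidine (PR100321).
histidine = "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;"

for loss, fragment in parse_msms_string(histidine):
    print(f"neutral loss {loss:8s} -> fragment {fragment}")
```

The last pair always carries the heaviest signal of the string, which is the precursor when the string is a full string.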

The MS/MS string (1) is a full complete MS/MS string with 4 signals: it is full because it covers the whole spectrum (from m/z 0 to the precursor), and complete because it includes all 4 fragment signals within that region. Several full incomplete strings (full strings that skip one or more fragments) and partial strings (strings that cover only part of the range between m/z 0 and the precursor) can also be generated from the MS/MS spectrum of L-histidine. Together, this population of MS/MS strings includes all possible combinations of fragments and neutral losses, that is, all structure-related information available from the fragmentation pattern in the MS/MS spectrum.

Type             Length  MS/MS string
partial          1       C1H1N1:C5H8N3;
partial          1       C1H-2:C5H5N2;
partial          1       C1H2O2:C6H10N3O2;
partial          1       C1H5N1O2:C6H10N3O2;
partial          1       C2H3N1O2:C6H10N3O2;
partial          1       C4H7N2:C4H7N2;
partial          1       C5H5N2:C5H5N2;
partial          1       C5H8N3:C5H8N3;
full incomplete  1       C6H10N3O2:C6H10N3O2;
partial          1       H3N1:C5H8N3;
partial          2       C1H1N1:C5H8N3;C1H2O2:C6H10N3O2;
partial          2       C1H-2:C5H5N2;C1H5N1O2:C6H10N3O2;
partial          2       C1H-2:C5H5N2;H3N1:C5H8N3;
partial          2       C4H7N2:C4H7N2;C1H1N1:C5H8N3;
partial          2       C4H7N2:C4H7N2;C1H-2:C5H5N2;
full incomplete  2       C4H7N2:C4H7N2;C2H3N1O2:C6H10N3O2;
full incomplete  2       C5H5N2:C5H5N2;C1H5N1O2:C6H10N3O2;
partial          2       C5H5N2:C5H5N2;H3N1:C5H8N3;
full incomplete  2       C5H8N3:C5H8N3;C1H2O2:C6H10N3O2;
partial          2       H3N1:C5H8N3;C1H2O2:C6H10N3O2;
partial          3       C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;
full incomplete  3       C4H7N2:C4H7N2;C1H1N1:C5H8N3;C1H2O2:C6H10N3O2;
full incomplete  3       C4H7N2:C4H7N2;C1H-2:C5H5N2;C1H5N1O2:C6H10N3O2;
partial          3       C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;
full incomplete  3       C5H5N2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;
full complete    4       C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;
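The population of strings listed above can be regenerated from the four signal formulas alone. The following is a sketch under stated assumptions rather than the authors' generator: all function names are hypothetical, formulas are written in the order C, H, then other elements alphabetically with explicit counts and zero counts dropped (the convention the listed strings appear to follow), and the type labels are assigned from the definitions in the text (full = from m/z 0 to the precursor, complete = no signal skipped).

```python
import re
from itertools import combinations


def parse_formula(formula):
    """'C5H8N3' -> {'C': 5, 'H': 8, 'N': 3}; a missing count means 1."""
    return {el: int(n) if n else 1
            for el, n in re.findall(r"([A-Z][a-z]?)([0-9]*)", formula)}


def format_formula(counts):
    """Write C, H, then the rest alphabetically; drop elements with count 0."""
    order = sorted(counts, key=lambda el: (el not in ("C", "H"), el))
    return "".join(f"{el}{counts[el]}" for el in order if counts[el] != 0)


def difference(upper, lower):
    """Neutral loss between two signals; lower == '' stands for m/z 0."""
    a, b = parse_formula(upper), parse_formula(lower)
    return format_formula({el: a.get(el, 0) - b.get(el, 0)
                           for el in set(a) | set(b)})


def msms_strings(signals):
    """signals: formulas ordered by increasing m/z, precursor last."""
    nodes = [""] + list(signals)
    last = len(nodes) - 1
    rows = []
    for size in range(2, len(nodes) + 1):
        for path in combinations(range(len(nodes)), size):
            text = "".join(f"{difference(nodes[j], nodes[i])}:{nodes[j]};"
                           for i, j in zip(path, path[1:]))
            if path[0] == 0 and path[-1] == last:
                kind = "full complete" if size == len(nodes) else "full incomplete"
            else:
                kind = "partial"
            rows.append((kind, size - 1, text))
    return rows


table = msms_strings(["C4H7N2", "C5H5N2", "C5H8N3", "C6H10N3O2"])
for kind, length, text in table:
    print(f"{kind:16s} {length}  {text}")
```

For n signals this enumeration yields 2^(n+1) - n - 2 strings, so the number of strings grows quickly with the number of peaks in a spectrum.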

Regular expression of MS/MS strings 

For searching similar MS/MS spectra, query functions for neutral-loss and fragment strings have been implemented in LipidXplorer and MS2Analyzer. This concept is expanded here by introducing regular expressions for more flexible searching of MS/MS strings. Since the metacharacters and syntax employed in this study are identical to those of basic regular expressions, an investigator can search for MS/MS strings with a suitable regular expression using programming languages such as Perl, Java, and Python, as well as the text search function of text editors. All regular expressions of MS/MS strings described in this study were tested using Perl 5.12.

For example, various regular expressions can be generated from the MS/MS string of L-histidine (PR100321):

C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;

Regular expressions that match:

all MS/MS strings including "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;":

C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;

only the MS/MS string "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;":

^C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;$

MS/MS strings with the neutral loss pattern of histidine, "C4H7N2<=>C1H-2<=>H3N1<=>C1H2O2". The expression "([CHONS][0-9]*)+" matches any chemical formula composed of C, H, O, N, and S. Hereafter, an expression like (2) is referred to as a "neutral-loss string":

C4H7N2:([CHONS][0-9]*)+;C1H-2:([CHONS][0-9]*)+;H3N1:([CHONS][0-9]*)+;C1H2O2:([CHONS][0-9]*)+; (2)

MS/MS strings with the neutral loss patterns of histidine and methylated histidine (methylation position is unclear; each alternative adds C1H2 to the corresponding histidine loss):

(C4H7N2|C5H9N2):([CHONS][0-9]*)+;(C1H-2|C2):([CHONS][0-9]*)+;(H3N1|C1H5N1):([CHONS][0-9]*)+;(C1H2O2|C2H4O2):([CHONS][0-9]*)+;

MS/MS strings of monomethylated histidine (methylation position is unclear):

(C4H7N2|C5H9N2):([CHONS][0-9]*)+;(C1H-2|C2):([CHONS][0-9]*)+;(H3N1|C1H5N1):([CHONS][0-9]*)+;(C1H2O2|C2H4O2):C7H12N3O2;
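The difference between a substring search, an anchored search, and a neutral-loss string can be checked with Python's standard `re` module. This is a minimal sketch: the three test strings are the full complete, one full incomplete, and one partial string of L-histidine, written with explicit element counts (H3N1), and the helper `hits` is a hypothetical name introduced only for illustration.

```python
import re

FORMULA = r"([CHONS][0-9]*)+"  # any formula built from C, H, O, N and S

# MS/MS strings of L-histidine (PR100321).
full_complete = "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;"
full_incomplete = "C4H7N2:C4H7N2;C1H1N1:C5H8N3;C1H2O2:C6H10N3O2;"
partial = "H3N1:C5H8N3;C1H2O2:C6H10N3O2;"
strings = [full_complete, full_incomplete, partial]


def hits(pattern):
    """Return the strings in which the pattern is found."""
    return [s for s in strings if re.search(pattern, s)]


# 1. An unanchored literal finds every string that contains it.
containing = hits(partial)

# 2. Anchoring with ^ and $ restricts the hit to the exact string.
exact = hits("^" + partial + "$")

# 3. A neutral-loss string fixes the losses and leaves the fragments open.
losses = ["C4H7N2", "C1H-2", "H3N1", "C1H2O2"]
neutral_loss_string = "".join(f"{loss}:{FORMULA};" for loss in losses)
same_losses = hits(neutral_loss_string)

print(len(containing), len(exact), len(same_losses))
```

The formulas can be used as literal patterns here because the only unusual character, the minus sign in "C1H-2", has no special meaning outside a character class.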

Similarity search

The following regular expression is a query for searching hydroxylated (+O1), methylated (+C1H2), and methoxylated (+C1H2O1) derivatives of histidine. Each group lists the histidine neutral loss followed by the same loss with one of these modifications added:

(C4H7N2|C4H7N2O1|C5H9N2|C5H9N2O1):([CHONS][0-9]*)+;(C1H-2|C1H-2O1|C2|C2O1):([CHONS][0-9]*)+;(H3N1|H3N1O1|C1H5N1|C1H5N1O1):([CHONS][0-9]*)+;(C1H2O2|C1H2O3|C2H4O2|C2H4O3):([CHONS][0-9]*)+; (3)

Indeed, the regular expression matches the MS/MS string of alpha-methyl-DL-histidine (PR100379, Fig. 1b):

C5H9N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;
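A derivative query of this kind can be assembled by adding each modification to each neutral loss of the parent compound. The sketch below is an illustration under assumptions, not the authors' procedure: the function names are hypothetical, formulas use explicit element counts in the order C, H, then alphabetical, and the alternatives are obtained purely by arithmetic (for example C4H7N2 + C1H2 = C5H9N2 and H3N1 + O1 = H3N1O1). The alpha-methyl-DL-histidine string is the one given in this section, with the ammonia loss written as H3N1.

```python
import re

FORMULA = r"([CHONS][0-9]*)+"  # any formula built from C, H, O, N and S


def parse(formula):
    """'C1H-2' -> {'C': 1, 'H': -2}; explicit counts are required."""
    return {el: int(n) for el, n in re.findall(r"([A-Z])(-?[0-9]+)", formula)}


def fmt(counts):
    """Write C, H, then the rest alphabetically; drop elements with count 0."""
    order = sorted(counts, key=lambda el: (el not in ("C", "H"), el))
    return "".join(f"{el}{counts[el]}" for el in order if counts[el] != 0)


def add(formula, delta):
    """Add a modification (e.g. 'C1H2') to a neutral loss."""
    a, b = parse(formula), parse(delta)
    return fmt({el: a.get(el, 0) + b.get(el, 0) for el in set(a) | set(b)})


def derivative_query(losses, modifications):
    """One alternation group per loss: the loss itself plus each modified loss."""
    units = []
    for loss in losses:
        options = [loss] + [add(loss, m) for m in modifications]
        units.append("(" + "|".join(options) + "):" + FORMULA + ";")
    return "".join(units)


query = derivative_query(
    ["C4H7N2", "C1H-2", "H3N1", "C1H2O2"],  # neutral losses of histidine
    ["O1", "C1H2", "C1H2O1"],               # hydroxylation, methylation, methoxylation
)

histidine = "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;"
methyl_histidine = "C5H9N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;"

print(query)
print(bool(re.search(query, histidine)), bool(re.search(query, methyl_histidine)))
```

Because the query allows one modification per neutral loss independently, it also accepts strings in which several losses are modified at once; restricting it to a single modification would need one alternative per position instead.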

Spectral motifs

Natural product chemists usually estimate metabolite structures from fixed patterns in MS/MS spectra (of course, the complete and true structure of a metabolite cannot be determined from fixed patterns alone).

For example, the MS/MS spectra of flavone C-glycosides show a specific pattern.

This empirical rule, or MS/MS spectral motif, can be described as the regular expression:

:C16H11O5;C1H2O1:([CHONS][0-9]*)+;C2:([CHONS][0-9]*)+;C2H8O4:([CHONS][0-9]*)+;

Since MS/MS spectral data are insufficient to characterize (i) the hydroxylation positions in the flavone and (ii) the detailed structure of the hexoside (such as glucose or galactose, alpha- or beta-form), the regular expression can be considered a spectral motif of "tetrahydroxyflavone-C-hexoside" and a member of the ChEBI ontology class "flavone C-glycoside (CHEBI:83280)".
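A small motif dictionary shows how such an expression could be used for partial annotation. This is a hedged sketch: the names `MOTIFS` and `annotate` are hypothetical, and the example string is NOT a database record. It was constructed for illustration by summing the neutral losses of the motif onto C16H11O5 (C17H13O6, C19H13O6, C21H21O10).

```python
import re

FORMULA = r"([CHONS][0-9]*)+"  # any formula built from C, H, O, N and S

# Motif name -> regular expression (the expression given in the text).
MOTIFS = {
    "tetrahydroxyflavone-C-hexoside":
        f":C16H11O5;C1H2O1:{FORMULA};C2:{FORMULA};C2H8O4:{FORMULA};",
}


def annotate(msms_string):
    """Return the names of all motifs found in an MS/MS string."""
    return [name for name, pattern in MOTIFS.items()
            if re.search(pattern, msms_string)]


# Hypothetical string built by adding the motif's losses to C16H11O5.
constructed = "C16H11O5:C16H11O5;C1H2O1:C17H13O6;C2:C19H13O6;C2H8O4:C21H21O10;"
# Full complete string of L-histidine (PR100321), which lacks the motif.
histidine = "C4H7N2:C4H7N2;C1H-2:C5H5N2;H3N1:C5H8N3;C1H2O2:C6H10N3O2;"

print(annotate(constructed))
print(annotate(histidine))
```

Keeping motifs as named patterns means a single pass over an MS/MS string dataset can attach every applicable class label to each spectrum.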

Future directions

The regular expression of MS/MS spectra is a promising approach for partial annotation of small molecules. The remaining technical problems are:

  • Probability-based scoring of similarity search results for p-value estimation
  • Construction of a decoy MS/MS string database for false discovery rate estimation
  • Automated extraction of spectral motifs from MS/MS spectra datasets

Test Data (Available from DropMet@RIKEN CSRS)

  • MS/MS string dataset generated from the MassBank dataset
  • MS/MS string dataset generated from the Arabidopsis and rice MS/MS spectra library

Acknowledgement

I am highly grateful to Prof. Takaaki Nishioka (Nara Institute of Science and Technology), Prof. Masanori Arita (National Institute of Genetics), and Mr. Yuya Ojima (MassBank) for providing the MS/MS spectra dataset from Metabolome.jp (http://metabolomics.jp/wiki/Index:MassBank). I also thank Dr. Nozomu Sakurai and Dr. Nayumi Akimoto (Kazusa DNA Research Institute), and Dr. Yuji Sawada and Dr. Yutaka Yamada (RIKEN CSRS), for their helpful comments on this manuscript and the databasing work.