Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to...

8
Contents lists available at ScienceDirect BioSystems journal homepage: www.elsevier.com/locate/biosystems Prediction of antimicrobial activity of large pool of peptides using quasi- SMILES Alla P. Toropova a, , Andrey A. Toropov a , Emilio Benfenati a , Danuta Leszczynska b , Jerzy Leszczynski c a Department of Environmental Health Science, Laboratory of Environmental Chemistry and Toxicology, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy b Interdisciplinary Nanotoxicity Center, Department of Civil and Environmental Engineering, Jackson State University, 1325 Lynch Street, Jackson, MS 39217-0510, USA c Interdisciplinary Nanotoxicity Center, Department of Chemistry and Biochemistry, Jackson State University, 1400 J. R. Lynch Street, P.O. Box 17910, Jackson, MS 39217, USA ARTICLE INFO Keywords: Peptide Antimicrobial activity quasi-SMILES Bioinformatics Monte carlo method CORAL software ABSTRACT The purpose of this study was the estimation of ability of the so-called optimal descriptors calculated to be a tool to predict the antimicrobial activity of large pool of peptides. Traditional simplied molecular input-line entry system (SMILES) is an ecient tool to represent the molecular structure of dierent compounds. Quasi-SMILES represents an extension of traditional SMILES. This approach provides the possibility to involve dierent eclectic conditions related to analyzed endpoint in the modelling process. In addition, the quasi-SMILES can be used to represent structure of peptides via abbreviations of corresponding amino acids. In this study, quasi-SMILES represents sequences of amino acids in peptides that were tested as the basis to predict antimicrobial activity of 1581 peptides. Predictive potential of binary classication for antimicrobial activity for dierent splits is quite good when it comes to the training, invisible training, calibration, and validation sets. For the external validation sets, the statistical criteria are ranged: (i) sensitivity 0.82097; (ii) specicity 0.880.99; (iii) accuracy 0.870.98; and (iv) Matthews correlation coecient 0.730.97. The suggested optimal descriptors calculated with data on composition of amino acids in peptides can be a tool to predict antimicrobial activity of peptides. 1. Introduction Microorganisms may cause considerable problems for human health and the agricultural industry (Porto et al., 2012). Antimicrobial pep- tides oer an attractive alternative to traditional drugs (Gabere and Noble, 2017; Speck-Planche et al., 2012; Yousenejad et al., 2012). In addition, the antimicrobial peptides are an important component of cosmetic industry (Vandebriel and Loveren, 2010). Thus, the mathe- matical modelling of antimicrobial activity of peptides is a very at- tractive way to solve problems of human health and agribusiness. Hence, it is unsurprisingly, that databases for antimicrobial activity of various peptides together with dierent algorithms for prediction of activity of peptides untested in biochemical experiments were sug- gested. Widely used databases on antimicrobial peptides are DBAASP (Pirtskhalava et al., 2016); CS-AMPPred (Porto et al., 2012); CAMPR3 (Waghu et al., 2016); BACTIBASE (Hammami et al., 2007). The basis of the majority of algorithms, which are aimed to build up predictive models of activity of peptides, are physicochemical and biochemical parameters of amino acids (Yount and Yeaman, 2004; Speck-Planche et al., 2016). The physicochemical data usually used as descriptors to develop models for antimicrobial activity of peptides are polarity, electrostatics charges, 3D geometry, as well as descriptors of quantum mechanics (Pirtskhalava et al., 2016). The above mentioned parameters are usually involved in algorithms of partial least squares (PLS) (Jenssen et al., 2008); articial neural networks (ANN) (Torrent et al., 2011); random forest (RF) (Breiman, 2001); super vector machine (SVM) (Webb-Robertson, 2009); Nearest Neighbor Algorithm (Wang et al., 2011); Incremental Feature Selection (Gabere and Noble, 2017), and others. The CORAL software is a conceptual alternative of the above men- tioned approaches (Toropova and Toropov, 2017a,b). The software has been used to build up predictive models for (i) organic compounds (Toropov et al., 2017a,b); (ii) nanomaterials (Toropova and Toropov, 2017b); and peptides (Toropova et al., 2015). The CORAL software was developed as a tool to build up quantitative structure property/ac- tivity relationships (QSPRs/QSARs) for traditional organic compounds using simplied molecular input-line entry systems (SMILES) (Weininger, 1988). However, further application of the software https://doi.org/10.1016/j.biosystems.2018.05.003 Received 7 March 2018; Received in revised form 10 May 2018; Accepted 14 May 2018 Corresponding author. E-mail address: [email protected] (A.P. Toropova). BioSystems 169–170 (2018) 5–12 Available online 22 May 2018 0303-2647/ © 2018 Elsevier B.V. All rights reserved. T

Transcript of Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to...

Page 1: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

Contents lists available at ScienceDirect

BioSystems

journal homepage: www.elsevier.com/locate/biosystems

Prediction of antimicrobial activity of large pool of peptides using quasi-SMILES

Alla P. Toropovaa,⁎, Andrey A. Toropova, Emilio Benfenatia, Danuta Leszczynskab,Jerzy Leszczynskic

a Department of Environmental Health Science, Laboratory of Environmental Chemistry and Toxicology, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via LaMasa 19, 20156 Milano, Italyb Interdisciplinary Nanotoxicity Center, Department of Civil and Environmental Engineering, Jackson State University, 1325 Lynch Street, Jackson, MS 39217-0510, USAc Interdisciplinary Nanotoxicity Center, Department of Chemistry and Biochemistry, Jackson State University, 1400 J. R. Lynch Street, P.O. Box 17910, Jackson, MS39217, USA

A R T I C L E I N F O

Keywords:PeptideAntimicrobial activityquasi-SMILESBioinformaticsMonte carlo methodCORAL software

A B S T R A C T

The purpose of this study was the estimation of ability of the so-called optimal descriptors calculated to be a toolto predict the antimicrobial activity of large pool of peptides. Traditional simplified molecular input-line entrysystem (SMILES) is an efficient tool to represent the molecular structure of different compounds. Quasi-SMILESrepresents an extension of traditional SMILES. This approach provides the possibility to involve different eclecticconditions related to analyzed endpoint in the modelling process. In addition, the quasi-SMILES can be used torepresent structure of peptides via abbreviations of corresponding amino acids. In this study, quasi-SMILESrepresents sequences of amino acids in peptides that were tested as the basis to predict antimicrobial activity of1581 peptides. Predictive potential of binary classification for antimicrobial activity for different splits is quitegood when it comes to the training, invisible training, calibration, and validation sets. For the external validationsets, the statistical criteria are ranged: (i) sensitivity 0.82–097; (ii) specificity 0.88–0.99; (iii) accuracy0.87–0.98; and (iv) Matthews correlation coefficient 0.73–0.97. The suggested optimal descriptors calculatedwith data on composition of amino acids in peptides can be a tool to predict antimicrobial activity of peptides.

1. Introduction

Microorganisms may cause considerable problems for human healthand the agricultural industry (Porto et al., 2012). Antimicrobial pep-tides offer an attractive alternative to traditional drugs (Gabere andNoble, 2017; Speck-Planche et al., 2012; Yousefinejad et al., 2012). Inaddition, the antimicrobial peptides are an important component ofcosmetic industry (Vandebriel and Loveren, 2010). Thus, the mathe-matical modelling of antimicrobial activity of peptides is a very at-tractive way to solve problems of human health and agribusiness.Hence, it is unsurprisingly, that databases for antimicrobial activity ofvarious peptides together with different algorithms for prediction ofactivity of peptides untested in biochemical experiments were sug-gested. Widely used databases on antimicrobial peptides are DBAASP(Pirtskhalava et al., 2016); CS-AMPPred (Porto et al., 2012); CAMPR3(Waghu et al., 2016); BACTIBASE (Hammami et al., 2007). The basis ofthe majority of algorithms, which are aimed to build up predictivemodels of activity of peptides, are physicochemical and biochemicalparameters of amino acids (Yount and Yeaman, 2004; Speck-Planche

et al., 2016). The physicochemical data usually used as descriptors todevelop models for antimicrobial activity of peptides are polarity,electrostatics charges, 3D geometry, as well as descriptors of quantummechanics (Pirtskhalava et al., 2016). The above mentioned parametersare usually involved in algorithms of partial least squares (PLS)(Jenssen et al., 2008); artificial neural networks (ANN) (Torrent et al.,2011); random forest (RF) (Breiman, 2001); super vector machine(SVM) (Webb-Robertson, 2009); Nearest Neighbor Algorithm (Wanget al., 2011); Incremental Feature Selection (Gabere and Noble, 2017),and others.

The CORAL software is a conceptual alternative of the above men-tioned approaches (Toropova and Toropov, 2017a,b). The software hasbeen used to build up predictive models for (i) organic compounds(Toropov et al., 2017a,b); (ii) nanomaterials (Toropova and Toropov,2017b); and peptides (Toropova et al., 2015). The CORAL software wasdeveloped as a tool to build up quantitative structure – property/ac-tivity relationships (QSPRs/QSARs) for traditional organic compoundsusing simplified molecular input-line entry systems (SMILES)(Weininger, 1988). However, further application of the software

https://doi.org/10.1016/j.biosystems.2018.05.003Received 7 March 2018; Received in revised form 10 May 2018; Accepted 14 May 2018

⁎ Corresponding author.E-mail address: [email protected] (A.P. Toropova).

BioSystems 169–170 (2018) 5–12

Available online 22 May 20180303-2647/ © 2018 Elsevier B.V. All rights reserved.

T

Page 2: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

(Toropova et al., 2012; Veselinović et al., 2015; Toropov et al., 2012;Toropova et al., 2015) has shown the ability of the CORAL approach tobe a tool of building up predictive models based on all available eclecticinformation represented by so-called quasi-SMILES (Toropov andToropova, 2015). Traditional SMILES is a sequence of symbols, whichare a representation of the molecular structure. The quasi-SMILES isalso a sequence of symbols. However, these symbols are a representa-tion of not only molecular structure, but also of “all available eclecticdata” (Toropov and Toropova, 2015). The sequence of amino acids is aversion of the quasi-SMILES for the case of building up a predictivemodel for behavior of antimicrobial peptides.

The aim of this study is building up classification models of anti-bacterial activity of peptides (i.e. active – inactive) by using the optimal

descriptors calculated with sequences of amino acids (which are re-presented by the one-symbol abbreviations of amino acids).

2. Method

2.1. Data

The experimental data on the antimicrobial activity (1 means activeand −1 means inactive) of a large set of peptides was taken from theliterature (Speck-Planche et al., 2016). These peptides (n=1581) wererandomly distributed four times into the training (≈25%), invisibletraining (≈25%), calibration (≈25%), and validation sets (≈25%).

It is to be noted, that fourth split was selected as the best from onehundred random splits that obey the above rules of distributions. Thiswas done in order to confirm the hypothesis, “QSAR is a random event”,i.e. successful and unsuccessful distributions exist (Toropov et al.,2013). Table 1 shows that splits considered in our study are not iden-tical.

2.2. Building up predictive model

In order to build up a classification of peptides into two classes (i)active (+1); and (ii) inactive (-1) so-called semi-correlations describedin the literature (Toropova and Toropov, 2017a) have been used. Fig. 1contains a graphical representation of the traditional correlation andthe semi-correlation. The semi-correlation is a special case of the tra-ditional correlations, where dots in coordinates x, y (i.e. observed,predicted) are localized along two parallel lines (Fig. 1).

Fig. 2 contains the general scheme of building up the categoricalmodel for activity of peptides. Peptides are interpreted as sequences ofamino acids (represented by the one-symbol abbreviation). These se-quences are similar to SMILES, which were used as the basis to build upQSPR/QSAR models for endpoints of traditional molecules (Toropovet al., 2017a,b). The correlation weights for each one-symbol codes arecalculated by the Monte Carlo method. The numerical data on thecorrelation weights of amino acids are results of the optimization withtarget function defined as the following:

TF=R+R’− 0.1 * ABS (R− R’) (1)

where R and R’ are correlation coefficients between the DCW(T*,N*)and categorical values antimicrobial activity of peptides (1 for activepeptides; and −1 for inactive peptides), for the training and invisibletraining sets, respectively. The DCW(T*,N*) is the optimal descriptorcalculated as the following:

DCW (T*, N*)= Σ CW (Ak) (2)

Table 1The measure (%) of non-identity of splits into the training, calibration, andvalidation sets examined in this work.

=+

IdentityNN N

(%)0.5*( )

*100i j

i j

,

Ni is the number of substances which are distributed into the set for i-th split;Nj is the number of substances which are distributed into the set for j-th split.

=+

IdentityNN N

(%)0.5*( )

*100i j

i j

,

Ni,j is the number of substances which are distributed into the same set for bothi-th split and j-th split (set= training, invisible training, calibration, and vali-dation).Ni is the number of substances which are distributed into the set for i-th split;Nj is the number of substances which are distributed into the set for j-th split.Shaded values indicates the diagonal elements, in other words, any split isabsolutely identical to itself.

Fig. 1. The comparison of generalized traditional correlation and generalized semi-correlation.

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

6

Page 3: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

the CW(Ak) are the correlation weights for amino acids Ak.The T and N are parameters of the Monte Carlo optimization. The T

is threshold, i.e. integer to discriminate amino acids into two classes: (i)rare, if the number of Ak in the training set is equal or less than T; and(ii) active non-rare, if the number of Ak in the training set is larger thanT. The N is the number of epochs of the Monte Carlo optimization. Theoptimization stops if the maximal correlation coefficient between theexperimental and predicted values for the calibration set is reached.Fig. 2 shows generalized scheme of evolution of semi-correlations forthe training, invisible training, and calibration sets. Fig. 2 shows thatT=T* and N]N* are values of these parameters which give the beststatistics for the calibration set.

Having numerical data on the correlation weights of amino acidsCW(Ak), one can calculate with Eq. (2), optimal descriptors DCW(T*,N*) and antimicrobial activity (AA) of peptides as the following:

AA=C0+C1 * DCW(T*,N*) (3)

The semi-correlation model calculated with Eq. (3) gives possibilityto build up binary classification as the following:

= ⎧⎨⎩

>− ≤

ClassifAA

ifAA1, 01, 0 (4)

The statistical quality of the classification models has been char-acterized by sensitivity (Sn), specificity (Sp), accuracy (Ac), andMatthews correlation coefficient (MCC)

=+

SnTP FN

TP(5)

=+

SpFP TN

TN(6)

= ++ + +

AcTP FP FN TN

TP TN(7)

= × − ×+ + + +

MCCTP FP TP FN TN FP TN FN

TP TN FP FN( )( )( )( ) (8)

TP=true positive; TN=true negative; FP=false positive; andFN= false negative.

The definition of domain of applicability for models built up herecontains two components: (i) the models are oriented to peptides,

Fig. 2. The general scheme of building up CORAL model for antimicrobial activity of peptides using semi-correlations. The R is correlation coefficient for the semi-correlation (Fig. 1).

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

7

Page 4: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

which are simple linear sequences of amino acids, without branchings;and (ii) peptides characterized by statistical defect less than a thresholdthat is calculated with peptides of the training set. The general schemeof the calculation of statistical defects is the following (Gobbi et al.,2016):

A. Defect of amino acids (A)

= −+

Defect A P A P AN A N A

( ) ( ) '( )( ) '( ) (9)

where P(A) and P’(A) are probabilities of amino acids in the trainingand calibration sets, respectively; N(A) and N’(A) are frequencies of Ain the training and calibration sets, respectively.

B. The defect of quasi-SMILES that is representation of a peptide:

∑=Defect quasiSMILES defect A( ) ( ) (10)

In other words, the statistical defect of peptide is the sum of sta-tistical defects of all amino acids of the peptide. A peptide represent byquasiSMILES falls into the domain of applicability if

=Defect quasiSMILES defect quasiSMILES( ) 2*( ( )) (11)

Table 2Correlation weights amino acids obtained for splits 1–4. The N1, N2, N3 are thenumber of Ak in the training, invisible training, and calibration sets, respec-tively.

SAk CW(SAk) N1 N2 N3

Split 1A −0.49753 271 235 277C 0.87232 95 80 88D −1.56611 75 55 80E −1.12334 78 65 73F 1.25429 243 225 225G −0.81342 290 260 283H 1.62611 99 101 106I 1.37294 272 280 282K 1.75358 353 343 340L 1.62366 335 328 339M 0.12709 71 66 69N −0.31738 125 99 137P 0.05932 118 133 151Q −0.81197 81 80 86R 1.25451 171 195 194S −1.75349 217 210 206T 0.24958 134 124 125V 1.56271 238 228 262W 2.74690 122 147 153Y −0.31082 62 54 68

Split 2A −0.30987 264 256 253C 0.87314 81 96 82D −0.87721 73 71 61E −0.80864 78 64 65F 1.87213 229 223 235G −1.49509 276 274 272H 2.62762 103 96 102I 2.00068 287 281 265K 2.44168 363 333 343L 2.00231 347 326 332M 0.87299 81 74 54N −0.50274 124 116 113P −0.49565 132 138 135Q −0.74682 85 88 83R 1.75036 191 169 193S −1.81331 215 208 212T −0.05763 138 135 122V 1.75239 259 247 222W 3.43799 139 137 141Y 0.87035 62 62 51

Split 3A −0.31490 244 271 252C 0.87363 80 95 77D −0.87132 67 68 62E −0.55819 69 87 74F 1.81340 237 238 218G −1.31100 269 276 261H 1.44000 99 107 120I 1.87884 267 288 270K 1.93782 332 358 339L 2.12606 336 340 320M 0.43752 69 67 68N −0.94035 107 139 105P −0.43627 128 129 134Q −0.37875 95 81 81R 1.44136 189 199 180S −1.87537 206 221 198T 0.12451 124 129 121V 1.75410 240 251 225W 3.12509 147 143 130Y 1.12366 59 68 55

Split 4A −0.23288 248 260 261C 0.58958 86 91 81D −0.79258 77 79 62E −0.26528 80 83 69F 1.28029 244 223 239

Table 2 (continued)

G −0.88428 264 261 287H 1.26707 99 112 99I 1.22581 278 272 271K 1.34200 353 338 337L 1.47974 326 330 333M 0.50839 73 74 64N −0.18393 121 124 114P −0.32207 130 125 130Q −0.74995 79 103 84R 1.02612 195 195 182S −1.19328 195 207 216T 0.06818 132 138 124V 1.10559 234 241 240W 1.96769 148 150 124Y 0.25508 61 74 50

Table 3Example of calculation AA, and activity for peptide, which is re-presented by sequence.

Amino acid, Ak Correlation weight, CW(Ak)

R 1.2545W 2.7469C 0.8723V 1.5627Y −0.3108A −0.4975Y −0.3108V 1.5627R 1.2545V 1.5627R 1.2545G −0.8134V 1.5627L 1.6237V 1.5627R 1.2545Y −0.3108R 1.2545R 1.2545C 0.8723W 2.7469

RWCVYAYVRVRGVLVRYRRCW.DCW (1, 17)=21.9593.AA=−1.7822 + 0.11930 * 21.9593=0.8375.

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

8

Page 5: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

where defect quasiSMILES( ) is average defect for the training set.C. Split defect is the sum of defects of all quasi-SMILES distributed

into the training set:

∑=SplitDefect defect SMILES( ) (12)

Thus, the defect quasiSMILES( ) is the basis to define the “statistical”domain of applicability i.e. the sub-set of peptides with not too largedefect according to inequality (11) (Gobbi et al., 2016).

3. Results and discussion

The models for antimicrobial activity based on semi-correlations arethe following:

Split 1

AA=−1.7822 (± 0.0036) + 0.11930(± 0.00027) * DCW(1,17)(13)

Split 2

AA=−1.8379 (± 0.0044) + 0.08947 (± 0.00024) * DCW(1,11)(14)

Split 3

AA=−1.9117 (± 0.0039) + 0.10698 (± 0.00024) * DCW(1,13)(15)

Split 4

AA=−1.6193 (± 0.0050) + 0.12526 (± 0.00041) * DCW(1,15)(16)

Table 2 contains numerical data on the correlation weights obtainedfor splits 1–4. Table 3 contains example of calculations of antimicrobialactivity for a sequence of amino acids.

Using models calculated with Eqs. (13)–(16), the classificationmodels for four splits were built up. Table 4 contains the statisticalcharacteristics of models developed here for antibacterial activity to-gether with statistical characteristics of models for antimicrobial ac-tivity suggested in the literature (Speck-Planche et al., 2016; Gabereand Noble, 2017). The comparison of the statistical quality of modelsbuilt up in this study, with the statistical quality of the above modelssuggested in the literature, confirms that the CORAL models based onthe semi-correlation are comparable or even better.

The predictive potential (i.e. statistical characteristics for externalvalidation set) of the CORAL models for distributions 1, 2, and 3 islower than the predictive potential of the model described in the lit-erature (Speck-Planche et al., 2016). However, in the above work(Speck-Planche et al., 2016), relative contributions of the amino acidsto the antibacterial activities were involved in the process of buildingup the predictive model. This information has not been used to build upmodels suggested here. In other words, the CORAL models are based

Table 4The statistical characteristics of models for antibacterial activity of peptides.

Split Set TP TN FP FN Sensitivity Specificity Accuracy MCC

1 Training 146 216 10 24 0.8588 0.9558 0.9141 0.8252Invisible training 154 205 8 28 0.8462 0.9624 0.9089 0.8194Calibration 131 217 18 38 0.7751 0.9234 0.8614 0.7142Validation 137 202 16 31 0.8155 0.9266 0.8782 0.7522

2 Training 150 207 16 31 0.8287 0.9283 0.8837 0.7651Invisible training 134 214 9 30 0.8171 0.9596 0.8992 0.7952Calibration 156 206 10 24 0.8667 0.9537 0.9141 0.8279Validation 140 219 11 24 0.8537 0.9522 0.9112 0.8170

3 Training 147 213 11 27 0.8448 0.9509 0.9045 0.8067Invisible training 172 204 12 21 0.8912 0.9444 0.9193 0.8385Calibration 139 222 8 18 0.8854 0.9652 0.9328 0.8605Validation 138 200 22 27 0.8364 0.9009 0.8734 0.7404

4 Training 138 194 16 47 0.7459 0.9238 0.8405 0.6852Invisible training 140 186 18 51 0.7330 0.9118 0.8253 0.6577Calibration 123 242 0 30 0.8039 1.0000 0.9241 0.8457Validation 145 233 3 15 0.9063 0.9873 0.9545 0.9063

Predictive model of antibacterial activity for peptides suggested in the literature: (Speck-Planche et al., 2016)

Training – – – – 0.94 0.94 0.94 0.89Validation – – – – 0.93 0.96 0.95 0.90

(Gabere and Noble, 2017), DAMPD dataset, n=3282

CAMPR3(RF) 505 1987 748 42 0.9232 0.7265 0.7593 0.4984CAMPR3(SVM) 493 1972 763 54 0.9013 0.7210 0.7511 0.4772ADAM 460 1884 851 87 0.8409 0.6888 0.7142 0.4031MLAMP 348 2250 485 199 0.6362 0.8227 0.7916 0.3930DBAASP 121 2540 195 426 0.2212 0.9287 0.8108 0.1894

(Gabere and Noble, 2017), APD3 dataset, n=10278

CAMPR3(RF) 1624 7146 1419 89 0.9480 0.8343 0.8533 0.6387CAMPR3(SVM) 1552 6987 1578 161 0.9060 0.8158 0.8308 0.5845ADAM 1560 5273 3292 153 0.9107 0.6156 0.6648 0.3929MLAMP 1295 6664 1901 418 0.7560 0.7781 0.7744 0.4300DBAASP 1076 7850 715 637 0.6281 0.9165 0.8685 0.5351

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

9

Page 6: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

solely on information about sequences of amino acids in peptides andexperimental data on the antimicrobial activity.

In addition, one needs to note that the predictive model is a randomevent (Toropov et al., 2013; Toropova and Toropov, 2017b) affected bythe split of available data into the training and validation subsets. Thepredictive potential of model for fourth split is slightly better thanpredictive potential of model described in the literature (Speck-Plancheet al., 2016). However, the estimation of predictive potential of anapproach should be based on group of splits, taking into account thatsuccessful model is “good luck” but not “mathematical expectation”.

The obtained results provide some mechanistic interpretation of thestudied phenomena. Having results of several runs of the Monte Carlooptimization, one can obtain attributes of the quasi-SMILES of twocategories: (i) attributes with solely positive correlation weights, thiscategory is promoter of increase for the antibacterial activity; and (ii)attributes with solely negative correlation weights, this category ispromoter of decrease for the activity. It is to be noted, that there arealso attributes with both positive and negative correlation weightsobtained in different runs of the optimization. These attributes shouldbe qualified as attributes with an unclear role. Table 5 contains the listof promoters of increase and decrease for antibacterial activity ofpeptides. Consequently, the CORAL models for antimicrobial activityhave mechanistic interpretation (Gobbi et al., 2016). Fig. 3 representsdata on the semi-correlations together with the numerical data on theTP, TN, FP, and FP which are necessary to calculate the statisticalquality of the categorical classifications. Thus, the described approachgives reasonable model for antibacterial activity of peptides. Thetechnical details for the models are available in the Supplementary ma-terials section.

4. Conclusions

Antimicrobial activity of 1581 peptides has been modelled usingQSAR approach. The suggested categorical classification that was usedto develop QSAR models is characterized by good values of the statis-tical criteria: sensitivity, specificity, accuracy, and Matthews correla-tion coefficient. The different random distributions lead to variation inthe statistical quality of the models. However, in all considered casesthe statistical quality of the CORAL models is satisfactory. The de-scribed models are built up according to OECD principles 1. A definedendpoint: antibacterial activity of peptides; 2. An unambiguous algo-rithm: the CORAL software available on the Internet; 3. A defined do-main of applicability: inequality (11) for sequences of amino acidswithout branching; 4. Appropriate measures of goodness-of-fit: Eqs.(5)–(8); 5. A mechanistic interpretation: promoters of increase or de-crease for antimicrobial activity of peptides (Table 5).

Conflict of interest

The authors declare that they have no conflict of interest.

Table 5Amino acids, which are statistically stable promoters of increase or decrease forantimicrobial activity of peptides according to semi-correlations established inthis work for all four splits examined here. The N1, N2, and N3 are numbers ofamino acids in the training, invisible training, and calibration sets, respectively.

No. Amino acids CWs Run 1 CWs Run 2 CWs Run 3 N1 N2 N3

Split 1Promoters of increase

1 K 2.31749 1.93581 1.93522 353 343 3402 L 2.05902 1.68542 1.75037 335 328 3393 I 1.93880 1.62345 1.50193 272 280 2824 F 1.68431 1.24682 1.44102 243 225 2255 V 1.87571 1.69068 1.74750 238 228 2626 R 1.74947 1.37065 1.25362 171 195 1948 W 3.43552 2.68371 2.94069 122 147 1539 H 2.12589 1.81009 1.99536 99 101 10610 C 0.87418 0.69237 0.93739 95 80 88

Promoters of decrease1 G −1.25191 −1.00064 −1.00424 290 260 2832 A −0.49760 −0.43670 −0.50005 271 235 2773 S −2.12580 −1.62625 −1.74951 217 210 2064 N −0.43383 −0.30842 −0.37569 125 99 1375 Q −0.94077 −0.80841 −1.12680 81 80 866 E −1.25349 −1.06376 −0.75011 78 65 737 D −1.31037 −1.43411 −1.06421 75 55 80

Split 2Promoters of increase

1 K 2.00283 2.30891 2.49986 363 333 3432 L 1.75115 1.99915 2.31093 347 326 3323 I 1.56226 1.87872 2.43678 287 281 2654 V 1.37493 1.68609 1.87600 259 247 2225 F 1.49759 1.87470 1.87815 229 223 2356 R 1.37167 1.49938 1.75107 191 169 1937 W 2.81250 3.12246 3.50209 139 137 1418 H 2.00492 2.50095 2.87462 103 96 1029 C 0.87148 1.00230 0.99588 81 96 82

Promoters of decrease1 G −1.05845 −1.50212 −1.49803 276 274 2722 A −0.24925 −0.50018 −0.25098 264 256 2533 S −1.49847 −2.00045 −1.50133 215 208 2125 N −0.56592 −0.87772 −0.62946 124 116 1136 Q −0.62131 −1.06090 −1.05956 85 88 837 E −0.75072 −0.31395 −0.93298 78 64 658 D −0.62300 −0.99904 −0.87572 73 71 61

Split 3Promoters of increase

1 L 2.06367 2.31616 2.00251 336 340 3202 K 1.87102 2.18912 1.93597 332 358 3393 I 1.80884 2.00321 1.74682 267 288 2704 V 1.68686 1.93442 1.68802 240 251 2255 F 1.74583 1.99523 1.68924 237 238 2186 R 1.37197 1.55993 1.37678 189 199 1807 W 2.99966 3.43525 2.93808 147 143 1308 T 0.12920 0.12221 0.12147 124 129 1219 H 1.37006 1.62043 1.37553 99 107 12010 C 0.81522 0.93653 0.81293 80 95 77

Promoters of decrease1 G −1.24724 −1.37462 −1.19156 269 276 2612 A −0.37420 −0.43766 −0.37317 244 271 2523 S −1.81559 −2.12671 −1.81131 206 221 1985 N −0.87270 −0.99698 −0.87921 107 139 1056 Q −0.44080 −0.43384 −0.37853 95 81 817 E −0.49544 −0.62737 −0.56143 69 87 748 D −0.87969 −1.00078 −0.87445 67 68 62

Split 4Promoters of increase

1 K 2.43739 2.37230 2.12003 353 338 3372 L 2.49925 2.43357 2.49510 326 330 3333 I 2.18458 1.99560 2.00320 278 272 2714 F 2.00463 2.06298 1.87819 244 223 2395 V 1.93555 2.00168 1.87922 234 241 2406 R 2.12041 1.87850 1.93735 195 195 1827 W 3.37182 3.37551 3.12130 148 150 1248 H 2.06141 2.00281 1.87639 99 112 999 C 0.87482 0.50179 1.12417 86 91 81

Promoters of decrease1 G −1.31345 −1.43674 −1.37450 264 261 287

Table 5 (continued)

No. Amino acids CWs Run 1 CWs Run 2 CWs Run 3 N1 N2 N3

2 A −0.49702 −0.49579 −0.37325 248 260 2613 S −1.74719 −1.49946 −1.81243 195 207 2166 N −0.24550 −0.24620 −0.25353 121 124 1147 E −1.12017 −1.12841 −1.00248 80 83 698 Q −0.55856 −0.62070 −0.74747 79 103 849 D −1.62465 −0.62139 −1.12868 77 79 62

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

10

Page 7: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

Fig. 3. The graphical representation of models for antimicrobial activity of peptides together with statistical characteristics of semi-correlations (n is the number ofpeptides in a set; R2 is correlation coefficient; s is root mean squared error; F is Fischer F-ratio).

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

11

Page 8: Prediction of antimicrobial activity of large pool of ...pubs.ccmsi.us/pubs/BioSys18-5.pdf · to predict the antimicrobial activity of large pool of peptides. Traditional simplified

Ethical approval

This article does not contain any studies with human participants oranimals performed by any of the authors.

Acknowledgements

APT, AAT and EB are grateful for the contribution of the projectLIFE-COMBASE contract (LIFE15 ENV/ES/000416). D.L. and J.L weresupported by the NSF CREST Interdisciplinary Nanotoxicity CenterGrant# HRD- 1547754.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in theonline version, at https://doi.org/10.1016/j.biosystems.2018.05.003.

References

Breiman, L., 2001. Random forest. Mach. Learn. 45, 5–32.Gabere, M.N., Noble, W.S., 2017. Empirical comparison of web-based antimicrobial

peptide prediction tools. Bioinformatics 33 (13), 1921–1929.Gobbi, M., Beeg, M., Toropova, M.A., Toropov, A.A., Salmona, M., 2016. Monte Carlo

method for predicting of cardiac toxicity: hERG blocker compounds. Toxicol. Lett.250 (-251), 42–46.

Hammami, R., Zouhir, A., Hamida, J.B., Fliss, I., 2007. BACTIBASE: a new web-accessibledatabase for bacteriocin characterization. BMC Microbiol. 7, 89.

Jenssen, H., Fjell, C.D., Cherkasov, A., Hancock, R.E., 2008. QSAR modeling and com-puter-aided design of antimicrobial peptides. J. Pept. Sci. 14, 110–114.

Pirtskhalava, M., Gabrielian, A., Cruz, P., Griggs, H.L., Squires, R.B., Hurt, D.E.,Grigolava, M., Chubinidze, M., Gogoladze, G., Vishnepolsky, B., Alekseev, V.,Rosenthal, A., Tartakovsky, M., 2016. DBAASP v.2: an enhanced database of struc-ture and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic.Acids Res. 44, D1104–D1112.

Porto, W.F., Pires, A.S., Franco, O.L., 2012. CS-AMPPred: an updated SVM model forantimicrobial activity prediction in cysteine-stabilized peptides. PLoS One 7 (12),e51444.

Speck-Planche, A., Kleandrova, V.V., Luan, F., Cordeiro, M.N.D.S., 2012. In Silico dis-covery and virtual screening of multi-target inhibitors for proteins in mycobacteriumtuberculosis. Comb. Chem. High. Throughput. Screen 15 (8), 666–673.

Speck-Planche, A., Kleandrova, V.V., Ruso, J.M., Cordeiro, M.N.D.S., 2016. First multi-target chemo-bioinformatic model to enable the discovery of antibacterial peptidesagainst multiple gram-positive pathogens. J. Chem. Inf. Model. 56 (3), 588–598.

Toropov, A.A., Toropova, A.P., 2015. Quasi-SMILES and nano-QFAR: united model for

mutagenicity of fullerene and MWCNT under different conditions. Chemosphere 139,18–22.

Toropov, A.A., Toropova, A.P., Raska Jr., I., Benfenati, E., Gini, G., 2012. QSAR modelingof endpoints for peptides which is based on representation of the molecular structureby a sequence of amino acids. Struct. Chem. 23 (6), 1891–1904.

Toropov, A.A., Toropova, A.P., Puzyn, T., Benfenati, E., Gini, G., Leszczynska, D.,Leszczynski, J., 2013. QSAR as a random event: modeling of nanoparticles uptake inPaCa2 cancer cells. Chemosphere 92 (1), 31–37.

Toropov, A.A., Toropova, A.P., Beeg, M., Gobbi, M., Salmona, M., 2017a. QSAR model forblood-brain barrier permeation. J. Pharmacol. Toxicol. Methods 88, 7–18.

Toropov, A.A., Toropova, A.P., Marzo, M., Dorne, J.L., Georgiadis, N., Benfenati, E.,2017b. QSAR models for predicting acute toxicity of pesticides in rainbow trout usingthe CORAL software and EFSA’s OpenFoodTox database. Environ. Toxicol.Pharmacol. 53, 158–163.

Toropova, A.P., Toropov, A.A., 2017a. CORAL: binary classifications (active/inactive) fordrug-induced liver injury. Toxicol. Lett. 268, 51–57.

Toropova, A.P., Toropov, A.A., 2017b. Nano-QSAR in cell biology: model of cell viabilityas a mathematical function of available eclectic data. J. Theor. Biol. 416, 113–118.

Toropova, A.P., Toropov, A.A., Rasulev, B.F., Benfenati, E., Gini, G., Leszczynska, D.,Leszczynski, J., 2012. QSAR models for ACE-inhibitor activity of tri-peptides basedon representation of the molecular structure by graph of atomic orbitals and SMILES.Struct. Chem. 23 (6), 1873–1878.

Toropova, M.A., Veselinović, A.M., Veselinović, J.B., Stojanović, D.B., Toropov, A.A.,2015. QSAR modeling of the antimicrobial activity of peptides as a mathematicalfunction of a sequence of amino acids. Comput. Biol. Chem. 59, 126–130.

Torrent, M., Andreu, D., Nogues, V.M., Boix, E., 2011. Connecting peptide physico-chemical and antimicrobial properties by a rational prediction model. PLoS One 6,e16968.

Vandebriel, R.J., Loveren, H.V., 2010. Non-animal sensitization testing: state-of-the-art.Crit. Rev. Toxicol. 40 (5), 389–404.

Veselinović, A.M., Veselinović, J.B., Toropov, A.A., Toropova, A.P., Nikolić, G.M., 2015.In silico prediction of the b-cyclodextrin complexation based on Monte Carlo method.Int. J. Pharm. 495, 404–409.

Waghu, F.H., Barai, R.S., Gurung, P., Idicula-Thomas, S., 2016. CAMPR3: a database onsequences: structures and signatures of antimicrobial peptides. Nucl. Acids Res. 44,D1094–D1097.

Wang, P., Hu, L., Liu, G., Jiang, N., Chen, X., Xu, J., Zheng, W., Li, L., Tan, M., Chen, Z.,Song, H., Cai, Y.D., Chou, K.C., 2011. Prediction of antimicrobial peptides based onsequence alignment and feature selection methods. PLoS One 6 (4), e18476.

Webb-Robertson, B.J., 2009. Support vector machines for improved peptide identificationfrom tandem mass spectrometry database search. Methods Mol. Biol. 492, 453–460.

Weininger, D., 1988. SMILES, a chemical language and information system. 1.Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28,31–36.

Yount, N.Y., Yeaman, M.R., 2004. Multidimensional signatures in antimicrobial peptides.Proc. Natl. Acad. Sci. U. S. A. 101 (19), 7363–7368.

Yousefinejad, S., Hemmateenejad, B., Mehdipour, A.R., 2012. New autocorrelationQTMS-based descriptors for use in QSAM of peptides. J. Iran. Chem. Soc. 9 (4),569–577.

A.P. Toropova et al. BioSystems 169–170 (2018) 5–12

12