
Ph.D. Thesis

Development of computational tools for RNA tertiary structure prediction

Opracowanie narzędzi komputerowych do przewidywania struktury trzeciorzędowej RNA

Marcin Magnus

Supervisor:

Professor Janusz Marek Bujnicki

The work has been conducted in

the Laboratory of Bioinformatics and Protein Engineering

at the International Institute of Molecular and Cell Biology in Warsaw

and at

the Department of Biochemistry of Stanford University, USA

(under the supervision of professor Rhiju Das).

The Graduate School of Molecular Biology at

the Institute of Biochemistry and Biophysics Polish Academy of Sciences, in Warsaw

Warsaw, 2017


Cover image by Marcin Magnus and Janusz M. Bujnicki: An accurate 3D model of the HCV IRES RNA structure, obtained with the fully automated RNA modeling method SimRNAweb (Magnus et al. 2016), using only the RNA sequence as input. The model has an RMSD of 5.52 Å to the experimentally determined structure (PDB id: 1KH6 (Kieft et al. 2002)). The superposition was done and visualized using PyMOL, and the image was repainted in the style of the painting “Transverse Line” by Wassily Kandinsky (1923), using the machine learning method DeepArt.io.

https://academic.oup.com/nar/issue/45/1#127533-2871059


Abstract

Ribonucleic acid (RNA) is one of the key types of molecules found in living cells. It is involved in a number of highly important biological processes, not only as the carrier of genetic information but also in catalytic, scaffolding, and structural roles. Interest in non-coding RNA has been increasing for the past few decades, with new types of non-coding RNAs discovered every year. As with proteins, the 3D structure of an RNA molecule determines its function. To determine the 3D structure of an RNA molecule, one can use high-resolution experimental techniques, such as biocrystallography. However, experimental techniques are tedious, time-consuming, and expensive, require specialized equipment, and cannot always be applied. Computational modeling methods offer an alternative. However, the results of RNA-Puzzles, a collective experiment for blind RNA structure prediction, show that accurate modeling of RNA is still very difficult and that there is much room for improvement.

To facilitate the task of RNA 3D structure prediction, two new approaches were proposed in

this study: one for the prediction of relative model accuracy, and another for the generation of

RNA 3D structure models.

First, I developed a new approach for answering a question: How to choose a structural

model that is closest to the native structure? This task is called “model evaluation” and it is

an important step in RNA 3D structure prediction. A few methods have been developed so far, but their accuracy is insufficient and they behave differently depending on the dataset they are applied to. This motivated the development of a meta-predictor, mqapRNA, which combines the primary methods and uses a deep learning model to take advantage of their combined strengths and to eliminate their individual weaknesses. In addition, mqapRNA is equipped

with a module that helps the user to refine the prediction by applying distance restraints

obtained from an experimental method or evolutionary analysis. The method is available as

an easy-to-use web server.

Second, I developed a new approach for RNA 3D structure prediction named EvoClustRNA

that takes advantage of incorporation of evolutionary information from distant sequence

homologs, based on a classic strategy of protein structure prediction. Based on the empirical


observation that RNA sequences from the same RNA family typically fold into similar 3D

structures, I tested whether it is possible to guide in silico modeling by seeking a global

helical arrangement, for the target sequence, that is shared across de novo models of

numerous sequence homologs. EvoClustRNA performs a multi-step modeling process and

can be coupled with any method for RNA structure prediction, such as SimRNA or Rosetta.

The EvoClustRNA approach was tested on two blind RNA-Puzzles challenges. The predictions

ranked as the first of all submissions for the L-glutamine riboswitch and as the second for the

ZMP riboswitch. The method was also benchmarked on the testing dataset.

As a complementary activity, I developed a software package called rna-pdb-tools. It is a

Python library and a set of tools dedicated to handling and manipulating RNA structure files, including (1) rebuilding missing atoms in RNA structures, (2) structural clustering, (3) standardization of the PDB format to comply with the format required by RNA-Puzzles, (4) visualization of RNA secondary structures and drawing of arc diagrams of secondary structure, triggered from Python scripts or Jupyter Notebooks, and much more. The code is

modular and well documented which should encourage new developers to build new

applications on top of rna-pdb-tools. The code is open and free to use, and can serve as an

example of “scientific software that computes”.

The ability to predict RNA 3D structure opens great opportunities for the new developments

in biotechnology and basic science. However, it is not possible to take advantage of these

opportunities without the understanding of the structures of these molecules. The proposed

projects can make investigations of RNA structures much more effective.

All developed tools are available online: https://genesilico.pl/mqapRNA/,

https://github.com/mmagnus/EvoClustRNA, https://github.com/mmagnus/rna-pdb-tools.



Streszczenie (Summary)

Ribonucleic acid (RNA) is one of the basic types of molecules found in living cells. It is involved in numerous important biological processes, not only as a carrier of genetic information but also in catalytic, regulatory, and structural roles. Because the function of many families of RNA molecules depends on their spatial structure, we can try to understand how a given type of RNA works by learning the shape of the molecule. High-resolution experimental techniques, such as biocrystallography, can be used to determine the spatial structure of RNA. Experimental techniques, however, are time-consuming and costly, require specialized equipment, and do not always succeed. Computational modeling methods are an alternative to experimental techniques. However, as shown by the results of the RNA-Puzzles experiment, in which scientists from around the world model RNA structures for a given sequence, the problem is very difficult and the predictions are often unsatisfactory.

To bring us closer to a solution, in this dissertation I propose two new computational protocols: one for modeling RNA 3D structure and one for predicting the accuracy of models.

In typical computational modeling tasks, the researcher obtains a set of alternative structural models of a given RNA and faces a very important question: How to choose the model closest to the real (native) structure? Models obtained in silico must undergo a procedure that estimates their quality and ranks them according to the resulting values. Several such methods have been developed so far, but their accuracy is insufficient and they behave differently depending on the dataset used for testing. To address this problem, I developed a new meta-predictor, mqapRNA. The tool combines the scores obtained from several component methods and then uses a statistical model based on deep machine learning to assess the quality of structural models in a broader context. In addition, mqapRNA is equipped with a module that helps the user refine the program's results by adding distance restraints obtained experimentally or from evolutionary analysis. The method is available as an easy-to-use web server.

Apart from methods for assessing the quality of computationally predicted models, the prediction process itself is a major challenge. I therefore decided to develop a new approach to RNA 3D structure prediction, one that had previously been used with great success to model protein structures. Based on the observation that RNA sequences belonging to the same family fold into very similar tertiary structures, I examined whether this phenomenon can be used to improve the results of RNA modeling. Independent folding simulations were carried out for different homologous sequences in order to detect a spatial arrangement of helical regions common to them. The EvoClustRNA program performs a multi-step modeling process and can be coupled with any RNA structure prediction method, for example SimRNA or Rosetta. The EvoClustRNA approach was tested on two blind predictions in the RNA-Puzzles competition. For the L-glutamine-binding riboswitch, the model obtained with EvoClustRNA ranked first in the final ranking, and the model of the ZMP riboswitch ranked second. The method was also tested on a benchmark set.

In the third part of this dissertation I describe the software package whose development made the above projects possible. rna-pdb-tools is a Python library and a set of tools for handling and modifying RNA structure files in PDB format, including rebuilding missing atoms in RNA structures and standardizing PDB formats, which can be run from Python scripts or an interactive Jupyter notebook, among many others.

Effective computational prediction of RNA spatial structure opens new possibilities for biotechnology and basic science. However, without understanding how RNA structure depends on its sequence, it will not be possible to take advantage of them. The developed tools can significantly facilitate the understanding of this dependence.

The tools are available at https://genesilico.pl/mqapRNA/, https://github.com/mmagnus/EvoClustRNA, https://github.com/mmagnus/rna-pdb-tools.




Acknowledgements

Thank You

To my parents Danuta and Jerzy, and my siblings Ania, Adam, Natalia, and Patryk, who kept their fingers crossed for me and always supported me.

To Janusz Bujnicki, for his guidance and for always nurturing my hard work, critical spirit, and motivation.

To the present and past members of the Bujnicki Lab, Agnieszka Faliszewska, Dorota Niedzałek, Filip Stefaniak, Michał Boniecki, Tomasz Wirecki, Kasia Merdas, Pietro Boccaletto, Adrianna Żyła, Elżbieta Purta, Dawid Głów, Radosław Pluta, Astha, Catarina Almeida, Błażej Bagiński, Krzysztof Szczepaniak, Dorota Matelska, Grzegorz Chojnowski, Grzegorz Łach, Wayne Dawson, Łukasz Kozłowski, Stanisław Dunin-Horkawicz, Tomasz Puton, Irina Tuszynska, Magdalena Machnicka, Magda Byszewska, Diana Toczydłowska, Paweł Piątkowski, and Marcin Pawłowski, for the precious help, support, and fun.

To Magda Konarska & Rhiju Das for their mentorship, and long discussions about SCIENCE.

To Henri Sara, Elmar Bucher, Matthias Nees, John Patrick Mpindi for such an enjoyable time at VTT.

To all guests and members of the Do Science Family, for being such an inspiring and lovely community.

To Wojtek Siwek for always being there and his friendship.

To Grzegorz Lorek for inspiration, deep insight, free thought, and joy of "ONE" life.

To the IIMCB team: Jacek Kuźnicki, Marcin Nowotny, Daria Goś, Agnieszka Potęga, Dorota Libiszowska, Justyna Szopa, Agata Skaruz, and Hanna Iwaniukowicz, for precious help and support in all crazy activities.

To the developers of all open source tools used in this work.

To Paulina for her love and patience.

To SCIENCE!


Abbreviations

CM - Covariance model

Cryo-EM - cryo-electron microscopy

DCA - direct coupling analysis

DNA - deoxyribonucleic acid

EC - enrichment score

ESR/EPR - electron spin/paramagnetic resonance

FRET - Förster resonance energy transfer

INF - interaction network fidelity

MD - molecular dynamics

MOHCA - multiplexed hydroxyl radical cleavage analysis

MSA - multiple sequence alignment

NM - normal mode

NMR - nuclear magnetic resonance

PDB - Protein Data Bank

RMSD - root mean square deviation

RNA - ribonucleic acid

SHAPE - selective 2'-hydroxyl acylation analyzed by primer extension

ZMP - 5-aminoimidazole-4-carboxamide riboside 5′-monophosphate


Publications

This thesis partially covers the results described in the following scientific publications:

1. Z. Miao, R. W. Adamiak, M. Antczak, R. T. Batey, A. J. Becka, M. Biesiada, M. J.

Boniecki, J. M. Bujnicki, S.-J. Chen, C. Y. Cheng, F.-C. Chou, A. R. Ferré-D'Amaré, R. Das,

W. K. Dawson, F. Ding, N. V. Dokholyan, S. Dunin-Horkawicz, C. Geniesse, K. Kappel, W.

Kladwang, A. Krokhotin, G. E. Łach, F. Major, T. H. Mann, M. Magnus, K. Pachulska-

Wieczorek, D. J. Patel, J. A. Piccirilli, M. Popenda, K. J. Purzycka, A. Ren, G. M. Rice, J.

Santalucia, J. Sarzynska, M. Szachniuk, A. Tandon, J. J. Trausch, S. Tian, J. Wang, K. M.

Weeks, B. Williams, Y. Xiao, X. Xu, D. Zhang, T. Zok, and E. Westhof, “RNA-Puzzles

Round III: 3D RNA structure prediction of five riboswitches and one ribozyme.,” RNA, vol.

23, no. 5, pp. 655–672, May 2017.

2. M. Magnus*, M. J. Boniecki*, W. Dawson, and J. M. Bujnicki, “SimRNAweb: a web

server for RNA 3D structure modeling with optional restraints.,” Nucleic Acids Research,

vol. 44, no. W1, pp. W315–W319, Jul. 2016.

3. P. Piatkowski, J. M. Kasprzak, D. Kumar, M. Magnus, G. Chojnowski, and J. M.

Bujnicki, “RNA 3D Structure Modeling by Combination of Template-Based Method

ModeRNA, Template-Free Folding with SimRNA, and Refinement with QRNAS.,” Methods

Mol. Biol., vol. 1490, no. Suppl, pp. 217–235, 2016.

4. Z. Miao, R. W. Adamiak, M.-F. Blanchet, M. Boniecki, J. M. Bujnicki, S.-J. Chen, C.

Cheng, G. Chojnowski, F.-C. Chou, P. Cordero, J. A. Cruz, A. R. Ferré-D'Amaré, R. Das, F.

Ding, N. V. Dokholyan, S. Dunin-Horkawicz, W. Kladwang, A. Krokhotin, G. Lach, M.

Magnus, F. Major, T. H. Mann, B. Masquida, D. Matelska, M. Meyer, A. Peselis, M.

Popenda, K. J. Purzycka, A. Serganov, J. Stasiewicz, M. Szachniuk, A. Tandon, S. Tian, J.

Wang, Y. Xiao, X. Xu, J. Zhang, P. Zhao, T. Zok, and E. Westhof, “RNA-Puzzles Round II:

assessment of RNA structure prediction programs applied to three large RNA structures.,”

RNA, vol. 21, no. 6, pp. 1066–1084, Jun. 2015.

5. M. Magnus*, D. Matelska*, G. Lach, G. Chojnowski, M. J. Boniecki, E. Purta, W.

Dawson, S. Dunin-Horkawicz, and J. M. Bujnicki, “Computational modeling of RNA 3D

structures, with the aid of experimental restraints.,” RNA Biol, vol. 11, no. 5, pp. 522–536,

2014.

* joint first authorship


Funding

This work was supported by the following sources.

Foundation for Polish Science (FNP) grant to professor Janusz Bujnicki, Modeling of RNA

and RNA-protein complexes: from sequence to structure to function, TEAM/2009-4/2/styp3.

Mazovia Scholarship to Marcin Magnus, awarded under the Operational Programme Human Capital, Priority 8.2.2, and addressed to PhD students engaged in innovative scientific research in areas considered particularly important for the development of the Mazovia Voivodeship, 2014/2015 NR 669.

National Science Centre (NCN) grant Etiuda 2 to Marcin Magnus, Development and

application of bioinformatics tools to assess the quality of RNA structures,

2014/12/T/NZ2/00501.

National Science Centre (NCN) grant Preludium 9 to Marcin Magnus, RNA structure prediction based on modeling the target sequence and homologous sequences, UMO-2015/17/N/NZ2/03360.


Table of Contents

Abstract
Streszczenie
Acknowledgements
Abbreviations
Publications
Funding
Table of Contents
1 Introduction
1.1 Ribonucleic acid (RNA)
1.2 RNA structure
1.2.1 RNA secondary structure
1.2.2 RNA tertiary structure
1.3 RNA structure prediction with low-resolution experimental data
1.4 RNA families
1.5 RNA-Puzzles
2 Aim of this work
3 Materials & Methods
3.1 Hardware
3.2 Software
3.3 Structure visualizations
3.4 Databases
3.5 Development of mqapRNA
3.5.1 Datasets
3.5.2 Primary methods
3.5.3 Secondary structure comparison
3.5.4 Standardization of PDB files
3.5.5 Evaluation of scoring functions
3.5.6 Statistical analyses
3.5.7 Implementation of the web server
3.6 Development of EvoClustRNA
3.6.1 Multiple sequence alignment generation and selection of homologs
3.6.2 Modeling of sequences with SimRNA/SimRNAweb and Rosetta
3.6.3 Clustering routine
4 Results
4.1 mqapRNA
4.1.1 Implementation of mqapRNA
4.1.2 Performance of mqapRNA
4.1.3 mqapRNA web server: quality prediction with optional restraints
4.2 EvoClustRNA
4.2.1 Implementation of EvoClustRNA
4.2.2 Blind predictions with EvoClustRNA in the RNA-Puzzles
4.2.3 Performance of EvoClustRNA
4.3 rna-pdb-tools
5 Discussion
5.1 mqapRNA
5.1.1 Similar tools or approaches
5.2 EvoClustRNA
5.2.1 Similar tools or approaches
5.3 rna-pdb-tools
5.3.1 Future directions
5.4 Potential limitations of the RNA 3D structure prediction methods
5.4.1 RNA-ligand interactions
5.4.2 Non-canonical interactions
5.4.3 Loop modeling
5.4.4 Sampling of conformational space
6 Conclusions
7 Supplementary data
S1. List of all the sequences and secondary structures used in the benchmark of EvoClustRNA and a list of links to the SimRNAweb predictions
Table of Figures
Table of Tables
References


1 Introduction

1.1 Ribonucleic acid (RNA)

Ribonucleic acid (RNA) is one of the key types of molecules that are essential for the functioning of living cells. It is involved in a number of highly important biological processes, serving catalytic, scaffolding, and structural functions. The discovery that RNAs can catalyze reactions dramatically changed the view that RNA serves simply as an information-transfer molecule. Such catalytic RNAs are called ribozymes, and for their discovery Sidney Altman and Thomas Cech received the Nobel Prize in Chemistry in 1989. Today we know that RNAs not only serve as intermediaries between DNA and proteins, but also catalyze reactions and are involved in a variety of cellular processes, such as translation, transcription, and the regulation of gene expression.

The more we learn about RNA molecules, the more ways we discover to use them in medicine, biotechnology, and basic science. For example, riboswitches, which occur predominantly in bacteria and show great diversity and wide distribution (McCown et al. 2017), have become a promising target for antibacterial treatments. Fluorescent riboswitches, combined with “interchangeable” aptamer domains that can bind various ligands, are becoming an important tool in basic science for monitoring metabolites in living cells (Kellenberger et al. 2015; Strack et al. 2013). This could lead to a revolution similar to the one that followed the discovery of the green fluorescent protein (GFP; Nobel Prize in 2008). MicroRNAs are used in medicine for new therapies and in molecular biology to silence genes of interest (Hayes et al. 2014). Scientists are investigating CRISPR-Cas9 (Pennisi 2013), a prokaryotic immune system, as a tool for genome editing. It has also been shown that long noncoding RNAs are involved in cancer development (Li et al. 2017; Xu et al. 2017). Many antibiotics, e.g., aminoglycosides (Kulik et al. 2015), bind to ribosomal RNA and disable bacterial protein synthesis. The function of the newly discovered circular RNAs (Szabo and Salzman 2016), however, is not yet known. Because of its ability to self-assemble (Chworos et al. 2004), RNA seems ideal for creating nanorobots: biodevices that can be programmed, for example, to detect microRNAs related to human diseases in the blood (Aw et al. 2016) or to regulate gene expression (Berens et al. 2015).


When investigating the features mentioned above, we must always keep in mind that, in order to perform any function, an RNA molecule must fold into a specific structure.

1.2 RNA structure

The structure of RNA is hierarchical, which means that we can distinguish three levels of organization: (1) a primary structure (the nucleic acid sequence), (2) a secondary structure (canonical interactions between the bases in an RNA chain), and (3) a tertiary structure (the arrangement of secondary structure elements in three-dimensional space).

The first level of organization is an RNA sequence, the so-called RNA primary structure. The

primary structure is described as a chain of ribonucleotides. This chain is a linear polymer,

linked by phosphodiester bonds. Each ribonucleotide (Fig. 1.2.1) consists of a nucleoside

(a ribose and a base) and a phosphate residue.

Figure 1.2.1: Ribonucleotide - a building block of RNA. Source (Wikimedia-Commons)

Four common ribonucleotides are the building blocks of RNA molecules. They differ in the base connected to the ribose: the purines adenine (A) and guanine (G), and the pyrimidines cytosine (C) and uracil (U).

At this level of organization RNA is very similar to DNA. However, there are important and biologically relevant differences. First, the ribose in RNA carries a hydroxyl group at the C2′ position, that is, one more oxygen atom than the deoxyribose in DNA. This extra group makes RNA molecules much more reactive and prone to degradation. The second difference is the presence of uracil instead of thymine (T). Like DNA, RNA has only four standard building blocks, which makes sequence searches and sequence alignment far more difficult than for proteins, which consist of twenty standard amino acids.
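The effect of alphabet size can be illustrated with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not made in the thesis: residues are treated as independent and uniformly distributed. The function names are hypothetical. It compares the probability of a chance match at one aligned position and the maximum information a single position can carry for a four-letter and a twenty-letter alphabet.

```python
import math


def chance_identity(alphabet_size):
    """Probability that two independent, uniformly random residues match."""
    return 1.0 / alphabet_size


def bits_per_position(alphabet_size):
    """Maximum information content of one sequence position, in bits."""
    return math.log2(alphabet_size)


# Simplifying assumption: uniform residue frequencies in both alphabets.
for name, size in (("RNA", 4), ("protein", 20)):
    print(name, chance_identity(size), round(bits_per_position(size), 2))
```

Under this assumption a random match is five times more likely per position in RNA than in proteins, and each RNA position carries less than half the information, so sequence similarity alone gives a weaker signal for detecting homologs.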

Many RNAs found in nature contain additional residues beyond the standard ones. They arise from chemical modifications introduced post-transcriptionally by various enzymes; modifications usually occur at the 2ʹ-OH group of the ribose moiety and/or at one or more atoms of the base moiety. One of the most common modifications is pseudouridine, in which the uracil is linked to the ribose via a carbon-carbon bond instead of a nitrogen-carbon bond. According to the MODOMICS database (Czerwoniec et al. 2009; Dunin-Horkawicz et al. 2006; Machnicka et al. 2013), over 160 RNA modifications were known as of September 2017. However, at the current stage of RNA structural bioinformatics, all methods except ModeRNA (Rother et al. 2011a) neglect modified residues and are designed to predict structures composed of standard residues only.

Another important difference between DNA and RNA is the typical feature of RNA to fold

into complex three-dimensional (3D) structures. DNA molecules usually consist of two

strands coiled around each other to form a very well defined, and very long double helix. By

contrast, RNA molecules tend to be relatively short, with a single strand folded into short

helices interspersed by loops, and their functional form requires an intrinsic, complex

structure.

We can distinguish two levels of this organization: secondary structure, and tertiary structure.

1.2.1 RNA secondary structure

Nucleic acid bases can interact in various ways, including base-base stacking and edge-to-edge pairing (canonical and non-canonical). While base stacking interactions provide the key driving force for folding of an RNA molecule, the edge-to-edge pairing interactions, mediated by hydrogen bonds, provide directionality and specificity (Leontis and Westhof 2001). Leontis and Westhof proposed a classification based on the observation that in planar, edge-to-edge, hydrogen-bonding interactions each RNA base uses one of three distinct edges: the Watson–Crick (W) edge, the Hoogsteen (H) edge, or the Sugar (S) edge (Fig. 1.2.2A). Moreover, the two bases in a pair can adopt either of two relative orientations of their glycosidic bonds, cis or trans, defined with respect to the axis of the hydrogen bonds (Fig. 1.2.2B). This gives twelve geometric base pair families (Fig. 1.2.2C) and, because pairs that involve two different edges are asymmetric, eighteen base pairing relations. In addition, bases can form triples, which can also be classified and characterized (Abu Almakarem et al. 2012).
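The counts of twelve families and eighteen relations follow from simple combinatorics, which can be checked with a short enumeration. This is a minimal sketch, not code from the thesis: the tuple representation and function names are chosen here for illustration. A family is taken as an unordered pair of edges plus an orientation, and a relation as an ordered pair of edges plus an orientation.

```python
from itertools import combinations_with_replacement, product

EDGES = ("W", "H", "S")          # Watson-Crick, Hoogsteen, Sugar
ORIENTATIONS = ("cis", "trans")  # relative orientation of the glycosidic bonds


def geometric_families():
    """Unordered edge pairs times orientation: W/H and H/W count once."""
    return [(o, a, b)
            for o in ORIENTATIONS
            for a, b in combinations_with_replacement(EDGES, 2)]


def pairing_relations():
    """Ordered edge pairs times orientation: which base uses which edge matters."""
    return [(o, a, b)
            for o in ORIENTATIONS
            for a, b in product(EDGES, repeat=2)]


print(len(geometric_families()), len(pairing_relations()))  # 12 18
```

The six extra relations come from the three mixed edge combinations (W/H, W/S, H/S) in two orientations, where swapping the roles of the two bases gives a distinct relation.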


Figure 1.2.2: Leontis/Westhof classification of base pairings. (A) Each RNA base, adenine (A), cytosine (C), guanine (G), or uracil (U), interacts through one of three distinct edges: the Watson–Crick (W) edge, the Hoogsteen (H) edge, or the Sugar (S) edge. (B) The bases in each pair can interact in either cis or trans orientation with respect to the glycosidic bonds. (C) As a result, all base pairs can be grouped into twelve geometric base pair families and eighteen pairing relationships (bases are represented as triangles). Each pair is represented by a symbol that can be used in secondary and tertiary structure diagrams. Filled symbols denote the cis configuration and open symbols the trans configuration. (D) Bases can also form triples, which have their own classification devised by Leontis and coworkers (Abu Almakarem et al. 2012). (Creative Commons License)

Canonical base pairs are G-C, connected by three hydrogen bonds, and A-U, connected by

two hydrogen bonds. These pairs are characterized by their isostericity (geometrical

equivalence), which gives rise to a regular A-form double helix, and allows each of the four

combinations of canonical pairs to substitute for each other, without distorting the 3D helical

structure.

The secondary structure is defined as a set of canonical interactions between the bases in an

RNA chain, while the tertiary structure is described as the positions of the atoms in the three-

dimensional space. The fundamental elements of the secondary structure of RNA are single-

stranded fragments and paired fragments (helices). Depending on the structural context,

several types of unpaired fragments can be distinguished: (1) single stranded fragments at the

5′ or 3′ ends of the RNA chain, (2) hairpin loops, occurring at the ends of double stranded

fragments, (3) bulges of single nucleotides, (4) interior loops, consisting of several unpaired

nucleotides inside the helix, and (5) junctions, connecting several helices. RNA molecules

also form pseudoknots, a structural configuration where one single stranded region folds back

on itself and connects another single stranded region within a stem.
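
The secondary structure elements listed above are commonly written in dot-bracket notation, which is not introduced in the text itself. The following minimal sketch assumes that notation, with round brackets for nested pairs and square brackets for a second, possibly crossing, set of pairs; the function names are hypothetical. It extracts the base pairs and tests whether any two of them cross, which is the defining feature of a pseudoknot.

```python
def parse_dot_bracket(structure):
    """Return sorted (i, j) base pairs (0-based) from a dot-bracket string.

    Unbalanced closing brackets raise IndexError in this simple sketch.
    """
    openers = {"(": ")", "[": "]"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol in openers:
            stacks[symbol].append(j)
        elif symbol in closers:
            i = stacks[closers[symbol]].pop()
            pairs.append((i, j))
    return sorted(pairs)


def has_pseudoknot(pairs):
    """True if any two pairs cross, i.e. i < k < j < l."""
    return any(i < k < j < l for i, j in pairs for k, l in pairs)


hairpin = parse_dot_bracket("((((....))))")
knot = parse_dot_bracket("((..[[..))..]]")
print(has_pseudoknot(hairpin), has_pseudoknot(knot))
```

A hairpin contains only nested pairs, whereas in the second string the pairs written with square brackets cross those written with round brackets.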

The number of possible secondary RNA structures increases exponentially with the length of

the sequence. Although secondary structure prediction is a key step, it remains an unsolved

problem in the structural biology of RNA. The earliest algorithms used dynamic programming to search for the

secondary structure with the lowest free energy, taking into account the hydrogen bonding

energy of the canonical base pairs (Nussinov et al. 1978). Later generations of methods took

into account the energy of the base stacking (Zuker and Stiegler 1981), and the possibility of

creating pseudoknots (Rivas and Eddy 1999). The CompaRNA web server (Puton et al. 2013)

provides a continuous benchmark of automated standalone and web server methods for RNA

secondary structure prediction, and collects predictions of over 40 tools. This server was

published in 2013, so one should expect that there are even more methods for RNA

secondary structure prediction today.
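
The dynamic-programming idea behind the earliest algorithms can be sketched as follows. This is a simplified illustration in the spirit of (Nussinov et al. 1978), not a reproduction of that method: instead of minimizing a free energy, it maximizes the number of nested canonical pairs, with every pair counting equally. The function name and the `min_loop` parameter (minimum number of unpaired residues in a hairpin loop) are assumptions of this sketch.

```python
def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested canonical (A-U, G-C) pairs in `seq`."""
    canonical = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]              # case 1: residue i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in canonical:
                    inside = best[i + 1][k - 1]  # enclosed by the pair i-k
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


print(max_nested_pairs("GGGAAAUCCC"))
```

For the example sequence, the three G-C pairs can close a four-residue loop, while the single possible A-U pairs would leave fewer than three unpaired residues between the partners. Because only nested pairs are considered, pseudoknots are out of reach for this recursion, which is why they required the later extensions mentioned above.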

Elements of secondary structure can fold to create more complex tertiary shapes.

1.2.2 RNA tertiary structure

The tertiary structure of RNA is formed by an appropriate spatial arrangement of secondary

structure elements. Its formation depends on long-range interactions formed between the

single-stranded regions. The phosphate groups of RNA are negatively charged, making RNA

a charged molecule. For this reason, mono- and divalent metal cations, including K+ and

Mg2+, which neutralize the negative charge, play an important role in RNA folding. It is

important to mention that most computational methods for RNA tertiary structure prediction

neglect the presence of ions and do not model them explicitly.

In nature, RNAs form complicated molecular 3D architectures. Some RNAs can perform

their function only when folded into a particular shape. By studying the spatial structure of

RNA, we can try to understand the mechanism of action of a particular type of RNA. To

determine the spatial structure of RNA, researchers can use experimental techniques, such as

biocrystallography or nuclear magnetic resonance (NMR) spectroscopy. However, the

experimental techniques are laborious, expensive, and require specialized equipment. Computer

modeling methods offer an alternative to the experimental techniques. Although

computer modeling methods are not as accurate as the experiments mentioned above, they can be

successfully used to investigate the function and mechanism of action of the RNA molecules

(Kladwang et al. 2012). Therefore, there is a need for computational methods that can

provide reliable models of RNA structures efficiently and cheaply, using only information on

a nucleotide sequence. The goal of computational structural bioinformatics is not to replace

experimental techniques, but to complement them, especially when the answers to scientific

questions are beyond their reach. Unfortunately, although computational methods

are being continuously improved, they do not always predict the correct structures of RNA.

An example secondary structure and the corresponding tertiary structure of a self-cleaving

RNA (the Pistol ribozyme) are shown in Figure 1.2.3.

Figure 1.2.3: Collation of an example secondary (A) and the corresponding tertiary structure

(B) of the Pistol ribozyme (PDB code: 5K7C (Ren et al. 2016)). This RNA adopts a

compact tertiary architecture stabilized by an embedded pseudoknot fold (violet) and is

composed of three helical regions: P1 (green), P2 (blue), and P3 (orange). This is a self-cleaving

ribozyme that is widely distributed in nature (Jimenez et al. 2015). The cleavage site is

marked in yellow. The secondary structure diagram was generated with VARNA (Darty et al.

2009), and the tertiary structure figure was generated with PyMOL (DeLano 2002).

1.2.2.1 RNA tertiary structure computational prediction

The secondary structure determination (or prediction) is often the starting point for the spatial

(3D) structure determination of RNA. Programs for predicting tertiary structure of RNA

generally fall into two categories: (1) methods based on the laws of physics, and (2) methods

that extrapolate knowledge from experimentally solved structures.

The first approach is based on Anfinsen's hypothesis (Anfinsen 1973), formulated in 1973 for

proteins, and later adapted to other macromolecules, including RNA. According to Anfinsen,

under the environmental conditions at which folding occurs, the native structure is a unique,

stable and kinetically accessible minimum of the free energy. Since accurate quantum-chemical

calculations of the free energy derived directly from the Schrödinger equation are

very costly, many approximations are used. The potential energy function of the

system (i.e., the force field) is written as a sum of several terms, accounting for

the geometry of covalent bonds and the distances between atoms, parameterized using

quantum-chemical calculations or experimental measurements. The most popular force fields

used for simulating biomolecular systems are Amber (Case et al. 2005) and CHARMM

(Brooks et al. 2009). However, their computational cost prevents Molecular Dynamics simulations of

whole macromolecular structures, and they are usually used only to optimize the geometry of

models obtained by other methods. Force field methods are also used to simulate short

processes, such as ligand binding, and to investigate the stability of RNA fragments. DMD

(Discrete Molecular Dynamics) (Ding et al. 2008) is a program that uses discrete molecular

dynamics and a mostly physics-based energy function. To make physics-based calculation

feasible, an RNA molecule is reduced to a coarse-grained representation.
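
For orientation, the additive form of a force field described above can be written schematically as below. This is the generic textbook form used by Amber-type force fields, given here as an illustration and not as the exact expression of any particular parameter set; the symbols ($k_b$, $r_0$, $k_\theta$, $\theta_0$, $V_n$, $\gamma$, $A_{ij}$, $B_{ij}$, $q_i$, $\varepsilon$) are the usual fitted parameters.

```latex
\[
E_{\mathrm{total}} =
  \sum_{\mathrm{bonds}} k_b \,(r - r_0)^2
+ \sum_{\mathrm{angles}} k_\theta \,(\theta - \theta_0)^2
+ \sum_{\mathrm{dihedrals}} \frac{V_n}{2}\,\bigl[1 + \cos(n\phi - \gamma)\bigr]
+ \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
+ \frac{q_i q_j}{\varepsilon\, r_{ij}} \right]
\]
```

The first three sums describe the geometry of covalent bonds (bond lengths, bond angles, torsions), and the last sum describes non-bonded van der Waals and electrostatic interactions as a function of the interatomic distance $r_{ij}$.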

The second group of methods is based on extrapolating the knowledge of structures. For

some methods in this group further simplifications are used, such as grouping of atoms to be

represented as single pseudo-atoms. In programs using a coarse-grained representation of a

molecule, the energy function is derived from solved molecular structures, in search

of a model that imitates the rules of RNA folding. The effectiveness of this approach, in the

prediction of RNA 3D models, has been documented for several large RNA molecules

modeled with constraints on the secondary structure and tertiary interactions (Jonikas et al.

2009; Miao et al. 2017). NAST (Jonikas et al. 2009), ERNWIN (Kerpedjiev et al. 2015) and

SimRNA (Boniecki et al. 2016) are good examples of state-of-the-art programs that use

a coarse-grained approach for RNA folding simulations. Among the methods that

extrapolate from fragments of already solved structures are (1) assembly-based methods, (2)

homology modeling (comparative modeling), and (3) manual building of structures based on

fragments. In the first approach, structural motifs are found in a database of known structures

and a prediction of an assembly of these fragments in accordance with the predicted topology

of the whole molecule is made. An assembly is scored by the corresponding evaluation

function, and a final prediction is generated, or the process is repeated iteratively. Examples

of such methods are RNAComposer (Popenda et al. 2012), MC-Sym|MC-Fold (Parisien and

Major 2008), FARNA (Das and Baker 2007), and 3dRNA. Unlike fragment assembly,

comparative modeling methods, such as ModeRNA (Rother et al. 2011a; Rother et al.

2011b), RNA123 (Eriksson et al. 2014), and MacroMoleculeBuilder (Flores et al. 2010), require

a precise indication of the homologous structure of the RNA molecule and the alignment of

the corresponding sequence. Another subgroup comprises tools that can be used for manual structure

building, such as ERNA-3D (Zwieb and Müller 1997), MANIP (Massire and Westhof 1998),

Assemble (Jossinet et al. 2010), RNA2D3D (Martinez et al. 2008), S2D (Jossinet and

Westhof 2005), and the Nucleic Acid Builder program (NAB) (Macke and Case 2009). They have

been used with great success for modeling, for example, the architecture of group I catalytic

introns (Michel and Westhof 1990) and tmRNA (Burks et al. 2005). However, this thesis focuses

only on automated predictive methods (Table 1.2.1).

Protocols for RNA 3D structure prediction using both template-based modeling with ModeRNA and

template-free folding with SimRNA, followed by refinement with QRNAS, are described in detail in

(Piatkowski et al. 2016).

Table 1.2.1: Computational methods for RNA 3D structure prediction, based on (Magnus et al. 2014).

| Type | Method name | Description | Representation | Probing of conformations |
| --- | --- | --- | --- | --- |
| Folding simulation | DMD (Discrete Molecular Dynamics) | Coarse-grained simulation method that uses discrete molecular dynamics and a mostly physics-based energy function | Coarse-grained (3 centers / residue) | Discrete molecular dynamics |
| Folding simulation | SimRNA | Coarse-grained simulation method that uses Monte Carlo sampling and a knowledge-based energy function | Coarse-grained (5 centers / residue) | Monte Carlo |
| Folding simulation | NAST (The Nucleic Acid Simulation Tool) | Very coarse-grained simulation method that uses molecular dynamics and relies almost completely on restraints supplied by a user | Coarse-grained (1 center / residue) | Molecular dynamics |
| Fragment assembly | FARNA (Fragment Assembly of RNA) / FARFAR (Fragment Assembly of RNA with Full Atom Refinement) | Adaptation of the ROSETTA method for RNA structure prediction; assembles the structure from short single-stranded fragments using a Monte Carlo procedure and a hybrid physics/statistics-based scoring function, followed by full-atom refinement with a physics-based function | Full-atom | Monte Carlo |
| Fragment assembly | MC-Fold\|MC-Sym | A method that assembles RNA structures from nucleotide cyclic motifs (NCM), with the sampling defined as a constraint satisfaction problem, and evaluates the resulting conformations with a hybrid physics/statistics-based scoring function | Full-atom | Constraint satisfaction problem |
| Fragment assembly | RNAComposer | A method that can assemble large RNA structures from fragments taken from RNA FRABASE, using user-defined restraints, based on the machine translation principle | Full-atom | Machine translation workflow |

For projects covered in this thesis, two RNA 3D structure prediction methods were used:

SimRNA, developed by Dr. Michał Boniecki and colleagues in the laboratory of professor

Janusz Bujnicki, and FARNA (an extension of ROSETTA), developed by professor Rhiju Das

and colleagues, first in the laboratory of professor Baker and later in his own group. These

methods are described here briefly.

Michał Boniecki, Janusz Bujnicki and colleagues at the International Institute of Molecular

and Cell Biology in Warsaw developed SimRNA, a method for RNA folding simulations and

3D structure prediction that uses a coarse-grained representation of five atoms per residue

and a statistical potential methodology. The method predicts RNA 3D structure from

sequence alone, and, if available, can use additional structural information in the form of

secondary structure restraints and distance restraints that define the local arrangement of certain

atoms. Moreover, the method can jump-start the simulation with a 3D structure provided in a

PDB file. The energy function is based entirely on statistics derived from databases of known

structures. For space sampling, the Monte Carlo algorithm was implemented. SimRNA is

available as a standalone package that requires the user to have some computer skills and a

powerful computer: a typical simulation (of a sequence of ~70 residues) takes around 6 hours on

an 80-core machine. To help biologists with no bioinformatics background use SimRNA, a

web server, SimRNAweb, was implemented (Magnus et al. 2016). The web service

simplifies the steps of the standalone package, does not require the user to supply computing

power and memory, provides a simple interface for the user, and displays the progress of the

simulation in real time. This renders the approach available to an individual who is not

necessarily an expert in RNA structure and does not have access to state-of-the-art 3D

molecular modeling facilities, but who needs a model of the RNA 3D structure, for instance

to design biochemical experiments, or may want to observe the conformational changes of

the RNA as it folds.
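
Monte Carlo sampling of the kind mentioned above usually relies on an acceptance rule for trial moves. The sketch below shows the textbook Metropolis criterion as a generic illustration; it is not taken from the SimRNA source code, and the function name, the reduced temperature units (Boltzmann constant set to 1) and the injectable random number generator are assumptions made for this example.

```python
import math
import random


def metropolis_accept(energy_old, energy_new, temperature, rng=random):
    """Decide whether a trial conformation replaces the current one.

    Moves that lower the energy are always accepted; moves that raise it
    are accepted with probability exp(-(E_new - E_old) / T).
    """
    if energy_new <= energy_old:
        return True
    acceptance_probability = math.exp(-(energy_new - energy_old) / temperature)
    return rng.random() < acceptance_probability


# Downhill moves are always accepted; uphill moves only sometimes.
print(metropolis_accept(energy_old=-10.0, energy_new=-12.0, temperature=1.0))
```

Accepting some uphill moves is what allows the simulation to escape local minima of the energy function; raising the temperature makes such escapes more likely.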

Rhiju Das, David Baker and colleagues at the University of Washington developed the

Fragment Assembly of RNA (FARNA) tool based on the Rosetta protein modeling suite (Leaver-Fay

et al. 2011). The program uses a simplified representation of the RNA model, where each

nucleotide is represented by one pseudo-atom. The method predicts the tertiary

structure by assembling short 3-residue fragments, sampled using a Monte Carlo algorithm

guided by a knowledge-based energy function. The method was upgraded in 2010 by Das and his

team by adding extensive new energy terms to the force field. The new

method is called Fragment Assembly of RNA with Full-Atom Refinement (FARFAR).

FARFAR defines terms for hydrogen bonding between bases and backbone oxygen atoms

and, importantly, includes information about hydrogen bonds involving the O2′ hydroxyl

group (whose presence distinguishes RNA from DNA). It also includes an energy term for

C-H...O contacts, which contribute to the conformational preferences of the nucleotides and

participate in the formation of some non-Watson–Crick base pairs. A description of how to use

FARNA/FARFAR can be found in (Cheng et al. 2015). For short RNA fragments (up to 32

nucleotides), Rosetta can be accessed via the Rosetta Online Server That Includes Everyone

(ROSIE) (Moretti et al. 2017).

1.2.2.2 Model quality assessment programs

As a result of computer modeling, the researcher gets a set of alternative models of RNA

structure. How to choose a model that is the closest to the real one? This predictive task,

called “model evaluation”, “quality assessment”, or “scoring”, is a crucial step for 3D RNA

structure prediction.

Programs that try to solve this problem are called MQAPs (Model Quality Assessment

Programs). MQAPs analyze structural models and calculate for each of them a quality score,

which often aims at predicting the global and/or local accuracy of the model, as compared to

the “real” structure, which is typically not available in real-life cases. In addition, MQAPs

can provide the user with a list of errors for a given model, reporting, for example,

chain breaks, incorrect rotamers, unusual bond lengths, and steric conflicts. It is worth

mentioning that initially the methods we would today call MQAPs were not used to evaluate

theoretical models resulting from computer modeling. The very first MQAPs were used to

detect errors in structures determined using X-ray crystallography methods. The first

crystallographic structures, due to the low resolution and difficulty of tracing the protein

chain in the density map, often contained serious errors. For example, in the first crystallographic

model of the small Rubisco protein subunit, the polypeptide chain was traced in the opposite

direction compared to that in the true structure (Chapman et al. 1988). The best-known

model-evaluation methods of this kind, the so-called “stereochemistry MQAPs”, are PROCHECK

(Laskowski et al. 1993) and WHATCHECK (Dunbrack 2004).

The second group comprises MQAPs that are knowledge-based statistical potentials. An

example of these programs is Verify3D (Eisenberg et al. 1997). In this approach, first a

statistical potential has to be developed based on a database of experimentally solved

structures. Next, for each analyzed structural model, the statistical potential returns a

quality score, which reflects how often the given structural features occur in the

database. Models with rare structural features receive a poor quality score. MQAPs based on

statistical potentials are less sensitive to small errors and can be used to evaluate the quality

of theoretical models. With the development of structural bioinformatics, the need for such

methods has increased.
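
The principle of a knowledge-based statistical potential can be illustrated with the inverse Boltzmann relation, in which the score of a feature is the negative logarithm of its observed frequency relative to a reference frequency. The sketch below is a generic illustration and not the scoring scheme of Verify3D or of any other program named here; the function names, the feature labels and the pseudocount are assumptions of this example.

```python
import math


def statistical_potential(observed_counts, reference_counts, pseudocount=1.0):
    """Per-feature score -ln(P_observed / P_reference), in units of kT."""
    features = sorted(set(observed_counts) | set(reference_counts))
    observed_total = sum(observed_counts.get(f, 0) + pseudocount for f in features)
    reference_total = sum(reference_counts.get(f, 0) + pseudocount for f in features)
    potential = {}
    for f in features:
        p_observed = (observed_counts.get(f, 0) + pseudocount) / observed_total
        p_reference = (reference_counts.get(f, 0) + pseudocount) / reference_total
        potential[f] = -math.log(p_observed / p_reference)
    return potential


def score_model(model_features, potential):
    """Sum the potential over all features found in a model (lower is better)."""
    return sum(potential[f] for f in model_features)


# Hypothetical counts: one feature is common in solved structures, one is rare.
potential = statistical_potential(
    observed_counts={"common": 90, "rare": 10},
    reference_counts={"common": 50, "rare": 50},
)
print(score_model(["common", "common"], potential) < score_model(["rare", "rare"], potential))
```

Features that occur more often in the database than expected receive a negative (favorable) score, and rare features receive a positive (unfavorable) one, which is why models with rare features are ranked poorly.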

Another group of programs is called “clustering MQAPs”. These programs need multiple

alternative models (often tens or even thousands) to predict scores for them. The quality

assessment reflects the average similarity of the structural features of a given model to the

rest of the models in the analyzed pool. Models with overrepresented structural features are

rated as most likely to be most accurate. On the other hand, models with structural features

occurring only in a small number of other models get poor quality rating. Examples of such

programs include Pcons (Lundström et al. 2008), ModFOLDclust (McGuffin 2008) and 3D-

Jury (Ginalski et al. 2003).
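
The consensus idea behind clustering MQAPs can be sketched as follows. The example assumes that a symmetric matrix of pairwise structural distances between models (for instance RMSD values) has already been computed, and scores each model by the number of other models within a cutoff. This is a simplified stand-in for the average-similarity scoring described above, not the algorithm of Pcons, ModFOLDclust or 3D-Jury; the function name and the cutoff are hypothetical.

```python
def consensus_scores(distance_matrix, cutoff):
    """For each model, count how many other models lie within `cutoff`."""
    n = len(distance_matrix)
    return [
        sum(1 for j in range(n) if j != i and distance_matrix[i][j] <= cutoff)
        for i in range(n)
    ]


# Hypothetical pairwise distances (in Å) for four models: three similar, one outlier.
distances = [
    [0.0, 2.0, 3.0, 20.0],
    [2.0, 0.0, 2.5, 21.0],
    [3.0, 2.5, 0.0, 19.0],
    [20.0, 21.0, 19.0, 0.0],
]
print(consensus_scores(distances, cutoff=5.0))  # prints: [2, 2, 2, 0]
```

The three mutually similar models support each other and receive the highest score, while the outlier receives none. Note that such a score measures agreement within the pool, so it is informative only if the correct fold is well represented among the models.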

The last group of MQAPs are methods based on so-called “meta-predictors”. These programs

use statistical models to interpret scores calculated based on third party tools, called primary

predictors. Meta-predictors are based on machine learning methods, such as support vector machines,

linear regression, neural networks, or, more recently, deep neural networks. Examples of such programs are

QA-ModFOLD (McGuffin 2008) and Meta-MQAP (Pawlowski et al. 2008), developed in the laboratory of

professor Janusz Bujnicki. mqapRNA, a program described in this

study, is an attempt to bring the principle underlying Meta-MQAP to RNA structure

bioinformatics.

MQAPs can also be divided according to the type of quality assessment they compute: global or

local. Global MQAPs calculate a single quality value for a structural model. In contrast, local

MQAPs also assess the local quality of a model and can be used to detect parts of a model

that need refolding or further minimization.

For proteins, many of the mentioned approaches turned out to be very effective for scoring

models (Kryshtafovych et al. 2017). In contrast, model quality assessment of RNA models is

at a very early stage. However, several attempts have been made recently toward the

development of statistical potentials for quality assessment for 3D RNA models, e.g.,

Ribonucleic Acids Statistical Potential (RASP) (Capriotti et al. 2011), RNA KB potential

(Bernauer et al. 2011), 3dRNAscore (Wang et al. 2015), and εSCORE (Bottaro et al. 2014).

RASP is a statistical potential that is derived from a non-redundant set of 85 RNA structures.

The method is based on geometrical descriptors that explicitly account for base pairing and

base stacking interactions, and it includes a representation of local and non-local interactions

in RNA structures. In addition, the method is capable of a local quality assessment. The total

RASP score is the sum of the individual scores of all interactions found within an RNA

molecule. The method is easy to install and to use. Moreover, it can also be used via a web

server http://melolab.org/webrasp/home.php (Norambuena et al. 2013).

RNA KB includes two fully differentiable knowledge-based potentials, a coarse-grained one

and an all-atom one. The potentials were derived from a curated dataset of RNA structures.

Based on the distances observed in this dataset, a potential of mean force was

built, as described previously for proteins (Lu and Skolnick 2001). RNA KB potentials

implicitly incorporate all base interactions into distance-based potentials. The tool is quite

hard to use and requires a basic knowledge of Molecular Dynamics. RNA KB is distributed

as a force field that can be used for Molecular Dynamics simulations in the GROMACS

package.

3dRNAscore is a knowledge-based potential, which combines distance-dependent and

dihedral-dependent energies. The functional form of 3dRNAscore was derived from the

Boltzmann distribution and contains two energy terms: the distance-dependent energy and

the backbone dihedral-dependent energy. The parameters in the scoring function were

obtained based on a training set of non-redundant RNA tertiary structures.

εSCORE employs a coarse-grained representation (one bead per nucleotide) and is not

sequence dependent. εSCORE describes an RNA structure as a collection of vectors that

represent base-base and stacking interactions. The method was trained on the crystal

structure of the H. marismortui large ribosomal subunit. εSCORE is not only a scoring

function, but it is also a metric that can be used to compare two RNA structures. The software

for performing the calculations is freely available as a part of the Barnaba package.

Scoring methods can also be useful if we have a pool of relatively good quality models. Since

predicting the structure of large (>70 nucleotides) RNAs remains a challenging task (Laing and

Schlick 2010; Miao et al. 2017), experimental data can be used to increase the accuracy of RNA

structure prediction by bioinformatics tools, at both the secondary and the tertiary level.

1.3 RNA structure prediction with low-resolution experimental data

Experimental techniques for RNA secondary structure determination typically utilize

chemical or enzymatic probing and can be used either in vitro or in vivo (Table 1.3.1). The

main principle is that chemical reagents and nucleases used for this type of analysis interact

differentially with paired and unpaired nucleotide residues, e.g., ribonuclease (RNase) V1 is

reactive toward residues in double-stranded RNA, and nuclease S1 is reactive toward single-stranded

regions. The use of base-selective chemical reagents (DMS, kethoxal, CMCT; see

Table 1.3.1) provides structural information about the base stacking, hydrogen bonding, and

electrostatic environment adjacent to the base. Local nucleotide flexibility and dynamics can

be inferred from experiments that interrogate all four RNA nucleotides. For instance,

selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) technique uses

hydroxyl-selective electrophiles that react with the 2′-hydroxyl group at flexible or disordered

nucleotides (Merino et al. 2005). The in-line probing method does not require the use of any

chemicals but exploits the natural instability of RNA molecules. The RNA is incubated at

slightly alkaline pH, and the spontaneous cleavage of the sugar backbone by adjacent 2′-

hydroxyl groups, which reflects the local nucleotide flexibility, is monitored (Nahvi and

Green 2013). Although there is a clear correlation between the local reactivities of RNA

molecules and base pairing probabilities, the problem of how to incorporate the probing data

into computational modeling procedure is not straightforward. The difficulty originates from

the fact that reactivities depend on the structural context and are influenced by tertiary

contacts (Washietl et al. 2012). Thus, computational methods have been adapted to allow

transforming the reactivities to discrete states (paired or unpaired), or calibrating the

interaction energy term proportionally to the reactivities (Mathews et al. 2004). There have

also been attempts to integrate Molecular Dynamics simulations with SHAPE reactivities

(Kirmizialtin et al. 2015) and in-line probing experiments (Mlynsky and Bussi 2017).
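
The two ways of using reactivities described above, discretization into paired/unpaired states and conversion into an energy term, can be sketched as follows. Both functions are illustrative assumptions: the threshold is arbitrary, and the log-linear pseudo-energy is one form commonly used for SHAPE data, not necessarily the one implemented in the works cited here. The slope and intercept are default values chosen for this example and would have to be calibrated for real data.

```python
import math


def classify_reactivities(reactivities, threshold=0.5):
    """Label each nucleotide 'unpaired' if its reactivity exceeds the threshold."""
    return ["unpaired" if r > threshold else "paired" for r in reactivities]


def pseudo_energy(reactivity, slope=2.6, intercept=-0.8):
    """Log-linear pseudo-energy term for a nucleotide involved in a base pair.

    Low reactivity gives a negative (favorable) bonus for pairing, and
    high reactivity gives a positive penalty. Negative reactivities,
    which can arise from background subtraction, are clipped to zero.
    """
    return slope * math.log(max(reactivity, 0.0) + 1.0) + intercept


normalized = [0.05, 0.10, 0.90, 1.50]   # hypothetical normalized reactivities
print(classify_reactivities(normalized))
print([round(pseudo_energy(r), 2) for r in normalized])
```

The discrete labels can be passed to a folding program as hard constraints, whereas the pseudo-energies act as soft restraints that can be overridden when the thermodynamic model strongly favors a pairing.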

Experimental techniques can also be used to detect non-local interactions, and the data they

generate can be processed into a list of distance (long-range) restraints. Distance restraints are

important for the modeling process, as even a small number of them can reduce

the conformational space enough to allow accurate prediction of native RNA structures

(Lavender et al. 2010). A “mutate-and-map” strategy (Kladwang et al. 2011) is based on the

observation that when a paired nucleotide is mutated, its partner becomes more accessible to

reagent, which can be readily detected by subsequent chemical probing (e.g., by SHAPE).

Importantly, this strategy can reveal not only pairings in secondary structure, but also tertiary

contacts between sequentially distant fragments of the molecule. Multiplexed hydroxyl

radical cleavage analysis (MOHCA) (Das et al. 2008) is another technique that provides

information about long-range contacts. There, RNAs are created with randomly incorporated

nucleotides tethered to a Fe(II)–EDTA moiety, which can be used to induce through-space

cleavage of nearby residues in the RNA. Sites of that cleavage and the location of the probe

nucleotide can be identified by two-dimensional gel electrophoresis. Other experimental methods

that are used to probe long-range contacts include UV- or chemically induced cross-linking,

site-directed cleavage, fluorescence resonance energy transfer (FRET) (Klostermeier and

Millar 2001), and electron spin resonance (ESR/EPR) (Qin and Dieckmann 2004). All of the

experimental techniques mentioned above can be used both to guide the prediction process and

to select the best predictions from a pool of RNA 3D models. mqapRNA allows for

filtering based on a set of distance restraints and this aspect will be discussed in this thesis.
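
Filtering a pool of models with distance restraints can be sketched as follows. This is a hypothetical helper written for illustration and does not reproduce the mqapRNA interface: each model is assumed to be a dictionary mapping residue numbers to the coordinates of one representative atom, and each restraint is a tuple (residue i, residue j, maximum distance in Å).

```python
import math


def satisfied_fraction(coords, restraints):
    """Fraction of restraints (res_i, res_j, max_distance) met by one model."""
    if not restraints:
        return 1.0
    satisfied = sum(
        1
        for res_i, res_j, max_distance in restraints
        if math.dist(coords[res_i], coords[res_j]) <= max_distance
    )
    return satisfied / len(restraints)


def filter_models(models, restraints, min_fraction=0.8):
    """Names of models that satisfy at least `min_fraction` of the restraints."""
    return [
        name
        for name, coords in models.items()
        if satisfied_fraction(coords, restraints) >= min_fraction
    ]


# Hypothetical pool: residues 1 and 10 are expected to be within 10 Å.
models = {
    "compact": {1: (0.0, 0.0, 0.0), 10: (5.0, 0.0, 0.0)},
    "extended": {1: (0.0, 0.0, 0.0), 10: (30.0, 0.0, 0.0)},
}
print(filter_models(models, restraints=[(1, 10, 10.0)]))  # prints: ['compact']
```

Requiring only a fraction of the restraints to be satisfied, instead of all of them, leaves room for experimental noise in low-resolution data.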

| Type of restraints | Method | Description |
| --- | --- | --- |
| Secondary structure | SHAPE (Selective 2′-Hydroxyl Acylation analyzed by Primer Extension) | Method for quantitative detection of local nucleotide flexibility. 2′-OH in flexible, unpaired nucleotides reacts preferentially with a probing reagent, forming adducts that can be identified as stops to primer extension by reverse transcriptase. |
| Secondary structure | DMS (dimethylsulfate footprinting) | DMS reacts with adenine at N1 and cytosine at N3. Reactive cytosines and adenines can be detected by reverse transcription and are considered as unpaired. |
| Secondary structure | CMCT (1-cyclohexyl-(2-morpholinoethyl) carbodiimide metho-p-toluene sulfonate) | CMCT reacts with N3 of uridine and, to a lesser extent, N1 of guanine. Reactive residues can be detected by reverse transcription and are considered as unpaired. |
| Secondary structure | Kethoxal | Kethoxal specifically attacks accessible N1 and N2 of guanine, and it is used for detection of unpaired guanines. The modified sites can be detected by reverse transcription. |
| Secondary structure + tertiary contacts | Mutate-and-map | SHAPE/DMS/CMCT chemical probing for a large number (preferably all) of point mutants of the RNA sequence. Analysis of changes in secondary structures of the set of point mutants can be used to infer tertiary contacts. |
| Tertiary contacts | MOHCA (multiplexed hydroxyl radical cleavage analysis) | Enables the detection of pairs of contacting residues via random incorporation of radical cleavage agents. Contacting residues are detected from a cleavage pattern analyzed in two-dimensional gel electrophoresis. |
| Tertiary contacts | Cross-linking | Based on the formation of covalent bonds between spatially close regions of RNA that may be distant in sequence. Can be achieved using physical factors such as UV light or by chemical reagents. |
| Distances between labeled residues | FRET (Förster Resonance Energy Transfer) | Distances between fluorescent dyes linked to RNA molecule are inferred from the intensity of energy transfer. |
| Distances between labeled residues | ESR/EPR (Electron Spin/Paramagnetic Resonance) spectroscopy | Distances are derived from the measured spin–spin splittings for unpaired electrons localized on paramagnetic labels linked to RNA molecule. |

Table 1.3.1: Low-resolution experimental methods that generate particularly useful data for

computational prediction of RNA 3D structure, based on (Magnus et al. 2014). An accurate

secondary structure and/or distance restraints can be used with mqapRNA to refine the final

ranking.

Experimental techniques are a great source of information that can be used for RNA 3D

structure prediction. However, we can also learn a lot about the structure through a thoughtful

analysis of RNA alignments.

1.4 RNA families

Just like proteins, RNAs can be grouped into families that have evolved from a common

ancestor. Sequences of RNAs from the same family can be aligned to each other to give a

multiple sequence alignment (MSA). The analysis of patterns of sequence conservation or the

lack thereof can be used to detect important conserved regions, e.g., regions that bind ligands,

form active sites, or are involved in other important functions.
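
Sequence conservation in an MSA can be quantified per column, for example with the Shannon entropy. The sketch below is a generic illustration with a hypothetical function name and a toy alignment; it treats the gap character as an ordinary symbol, which is a simplifying assumption.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy (in bits) of each column of an alignment.

    `alignment` is a list of equal-length strings. An entropy of 0 means
    that the column is perfectly conserved.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("expected a non-empty alignment with rows of equal length")
    entropies = []
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(abs(entropy))  # abs() turns -0.0 into 0.0
    return entropies


toy_alignment = ["GCAU", "GCGU", "GCCU", "GC-U"]
print(column_entropies(toy_alignment))
```

In the toy alignment the first, second and fourth columns are invariant (entropy 0), while the third column contains four different symbols and therefore has the maximum entropy of 2 bits for four sequences.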

An accurate RNA sequence alignment can improve secondary structure prediction.

According to the CompaRNA (Puton et al. 2013) continuous benchmarking platform,

methods that exploit RNA alignments, such as PETfold (Seemann et al. 2008), outperform

single sequence predictive methods.

RNA alignments can also be used to improve tertiary structure prediction. Weinreb and coworkers

(Weinreb et al. 2016) adapted the maximum entropy model to RNA sequence alignments to

predict long-range contacts between residues for 180 RNA gene families. They applied the

information about predicted contacts to guide in silico simulations and observed significant

improvement in predictions for the five cases they investigated. mqapRNA, a method described in

this work, can process this type of restraint and use it for scoring

models. Another way to use RNA alignments is to take advantage of the observation that

members of the same family tend to fold into the same 3D shape (Fig. 1.4.1). RNA

alignments can be used to carry out independent folding simulations for a subset of the

homologous sequences in the MSA and then to identify the best models common to all

folded sequences via simultaneous clustering of the independent folding runs. This approach

was earlier implemented and benchmarked for proteins by Bonneau and coworkers (Bonneau

et al. 2001) and successfully applied to in silico modeling of tertiary structures of major protein

families (Bonneau et al. 2002). To the best of my knowledge, EvoClustRNA developed in

this study is the first attempt to use this approach for RNA 3D structure prediction.
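
To cluster models of different homologs together, the models first have to be reduced to a common set of residues. One plausible way to do this, sketched below, is to keep only the alignment columns that contain no gaps and to map them back to residue indices in each sequence. This is a hypothetical helper written for illustration and is not taken from the EvoClustRNA implementation.

```python
def conserved_core_indices(alignment, gap="-"):
    """For each aligned sequence, list the 0-based residue indices that
    fall in columns without gaps in any sequence."""
    indices = [[] for _ in alignment]
    position = [0] * len(alignment)       # next ungapped residue index per row
    for column in zip(*alignment):
        gap_free = all(symbol != gap for symbol in column)
        for row, symbol in enumerate(column):
            if symbol == gap:
                continue
            if gap_free:
                indices[row].append(position[row])
            position[row] += 1
    return indices


toy_alignment = ["GC-AU", "GCGAU", "G--AU"]
print(conserved_core_indices(toy_alignment))
# prints: [[0, 2, 3], [0, 3, 4], [0, 1, 2]]
```

Each inner list has the same length, so the selected residues of any two homologs correspond one to one, and structural distances between their models can be computed on this shared core.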

Figure 1.4.1: Members of an RNA family tend to fold into the same 3D shape. Structures of the c-di-AMP

riboswitch solved independently by three groups: for two different sequences obtained from

Thermoanaerobacter pseudethanolicus (PDB id: 4QK8) and Thermovirga lienii (PDB id:

4QK9) (Gao and Serganov 2014), for a sequence from Thermoanaerobacter tengcongensis

(PDB id: 4QLM) (Ren and Patel 2014) and for a sequence from Bacillus subtilis (PDB id:

4W90) (the molecule in blue is a protein used to facilitate crystallization) (Jones and Ferré-

D'Amaré 2014). There is some variation between structures in the peripheral parts (marked

with red arrows), but the overall structure of the core is preserved.

Information about RNA families is collected in the Rfam database (Nawrocki et al. 2015).

Each RNA family is represented by multiple sequence alignments, consensus secondary

structures and covariance models (CMs). Another source of information about RNA families

is RNArchitecture (http://genesilico.pl/RNArchitecture/) (Boccaletto et al. 2017). It is a

database developed in the laboratory of professor Bujnicki, which provides a comprehensive

description of relationships between known families of structured RNAs (taken from Rfam),

with a focus on structural similarities. RNArchitecture includes 2688 families of which only

2.6% (70 families) have a structural model solved experimentally (Fig. 1.4.2). Thus, there is

a great need for fast and accurate methods for RNA structure determination, both

experimental and computational, to provide structural insights into these RNA molecules.

Figure 1.4.2: According to the RNArchitecture database, only 3% of Rfam families (70) have

experimentally solved structures, whereas 97% (2,618 families) have no

known structures.

1.5 RNA-Puzzles

To track the progress of computational methods for RNA 3D structure prediction and how

close the predictions come to experimentally determined structures, the RNA-Puzzles

initiative was proposed and implemented by Professor Eric Westhof and coworkers. It is a

collective experiment for blind RNA structure prediction, modeled after the well-established

Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment

(Kryshtafovych et al. 2017). The organizers of RNA-Puzzles

receive from crystallographers an RNA sequence for which the structure has been solved in

their laboratories but is not yet publicly available. The sequence is sent out to groups

involved in the modeling of RNA structures all around the world. These groups have

approximately a month to apply available bioinformatics methods to model the structure for

the target sequence and to send relevant results to the organizers. The goal of the experiment

is to determine the capabilities and limitations of the current state-of-the-art methods for

RNA 3D structure prediction based on sequence. This challenge is also an opportunity to

evaluate the progress that has been made in the RNA structure prediction methodologies, as

well as what has to be done to achieve better solutions. The initiative identifies specific

bottlenecks that may hold back the field and promotes the available methods, providing

guidance in the choice of suitable tools for real-world problems. For each target, a ranking of

models is prepared and can be sorted according to various criteria (Fig. 1.5.1), such as

root-mean-square deviation (RMSD) or interaction network fidelity (INF). In addition, each

submitted model has its own page, where a JSmol (Hanson et al. 2013) visualization shows

the model superimposed on the native structure (Fig. 1.5.2). The rankings can be

found at http://ahsoka.u-strasbg.fr/rnapuzzlesv2/results/. Until now, twenty puzzles have been

set up and three publications, describing three rounds of the experiment, have been published

to summarize the results and discuss the progress in the field (Miao et al. 2017; Miao et al.

2015). In brief, substantial progress has been made since the first round, and some

of the models reached near-atomic accuracy, as in the case of the twister-sister ribozyme (RNA

Puzzle 19) (Liu et al. 2017) or a Zika virus domain (RNA Puzzle 20)

(Akiyama et al. 2016). An important remaining problem in RNA 3D structure prediction is the

prediction of non-Watson-Crick interactions, which are a key factor in RNA folding. Another

problem is that some of the submitted models have high clash scores.

Figure 1.5.1: The results of RNA Puzzle 13. The second model in the ranking (sorted

according to RMSD) is a model obtained with a prototype version of EvoClustRNA

developed at Stanford University. There is no single way to sort the models. Different

metrics have different properties, and a researcher should decide which is useful for his or her

application. RMSD reports the geometrical similarity between a prediction and the

crystallographic structure (the lower, the better). INF reports the similarity of

interaction networks and ranges from 0 to 1 (the higher, the better). Several partial INF scores

can be computed: INF WC (the canonical interactions only), INF NWC (the non-canonical

interactions only), INF stacking (the stacking interactions only). INF ALL takes into account

all the interactions mentioned above. This RNA-Puzzle shows one of the biggest problems in

RNA 3D structure prediction: a very low INF NWC in all submissions, which indicates a lack

of accurate prediction of non-canonical interactions.

Figure 1.5.2: The detailed view of the results of the ZMP riboswitch (RNA Puzzle 13). For

each submitted model a detailed summary is available online that includes a superposition of

a prediction, in this case, the EvoClustRNA prediction (red), on the crystallographic structure

(green). Various metrics are shown in the result summary.

2 Aim of this work

The prediction of three-dimensional structures for complex RNAs remains a challenging task,

despite progress made recently by many researchers working in this field of science. The aim

of this work is to develop and benchmark three tools that make this process more feasible:

(1) mqapRNA – a model quality assessment tool for RNA 3D models,

(2) EvoClustRNA – a predictive method based on simulations of homologs,

(3) rna-pdb-tools – a toolbox for RNA structural bioinformatics.

mqapRNA is a new scoring method that uses a deep learning algorithm to provide

improved quality prediction for RNA structural models. To test the method, a set of datasets

was prepared and a benchmark was performed. EvoClustRNA is a clustering routine applied to

evolutionarily conserved regions (helical regions) for RNA fold prediction. rna-pdb-tools is a

new package comprising a Python library and a set of over 50 tools to facilitate the development

of new applications and procedures in RNA structural bioinformatics.

3 Materials & Methods

3.1 Hardware

All calculations included in this work were performed on resources provided by the

International Institute of Molecular and Cell Biology: HPC (High-Performance Computing)

Cluster (hostname: Peyote2, operating system: Ubuntu 10.04.4 LTS), Apple MacBook laptop

(macOS Sierra), and a virtual machine mqapRNA-vm (Ubuntu 14.04.3 LTS). The initial

version of EvoClustRNA was run at Stanford University: HPC (hostname: Biox3,

operating system: CentOS 6).

3.2 Software

All tools created in this study (mqapRNA, EvoClustRNA, and rna-pdb-tools) are written in

Python (version 2.7, http://www.python.org/). Python is a scripting language that supports

object-oriented programming and is open source and free to use. The following Python libraries

were used in the projects: multiprocessing, pandas, pytest, argparse, pyflakes. The code follows

best practices described by Kristian Rother (Rother 2017), Robert Martin (Martin 2008), and

Andrew Hunt & David Thomas (Hunt and Thomas 1999), e.g., short functions, extensive

documentation, automated testing, and version control.

GNU Emacs (version 25.1, https://www.gnu.org/software/emacs/) is an extremely extensible

and customizable text editor that was used for this work in various areas: alignment

preparation, code editing, note-taking, and more. In addition to the standard installation of

GNU Emacs, the following extensions were used: magit, org-mode, markdown-mode,

python-mode.el, jedi, flycheck, yasnippet, projectile, sphinx-doc, RealGUD, autopep8, and

Emacs Speaks Statistics (ESS). The configuration file can be found at

https://github.com/mmagnus/emacs-env. This editor was also used for modifying PDB files

with pdb-mode and for alignment preparation with RALEE (Griffiths-Jones 2005).

Git (version 2.11, https://git-scm.com/) was used to manage all the scientific code. Git is a

free and open source distributed version control system. “Distributed” means that there is no

single central repository to which each developer must send his or her code. Each copy of a

repository has its own history and can later be easily merged with the repository of a team or

another developer. To host the code online and make it available for everyone to download,

GitHub (https://github.com/) is used. All local changes to the programs are sent to

GitHub and can then be seen by users all around the world. GitHub is free of charge for open

source projects. The GitHub repositories of the projects can be found at

https://github.com/mmagnus/EvoClustRNA and https://github.com/mmagnus/rna-pdb-tools.

A tutorial on how to start working with Git written by the author of this thesis can be found at

http://rna-pdb-tools.readthedocs.io/en/latest/git.html.

Documentation. All projects described in this work are well documented using Python

docstrings in classes, modules, and functions in concert with Sphinx (version 1.6.3,

http://www.sphinx-doc.org/en/stable/). Sphinx is a free, open source, very easy to use tool

that creates beautiful documentation in various formats, e.g., HTML, PDF, ePub. Sphinx is

run locally to generate documentation on local machines.

To be able to share the documentation publicly, Read the Docs (RTD, online,

https://readthedocs.org/) is used to provide a web interface for the documentation of the

projects. The servers of Read the Docs track changes in the GitHub repositories. If there is a

change in code or documentation, the RTD server is triggered, new documentation is

compiled and, after a few seconds, presented online. The RTD documentation can be found

at http://EvoClustRNA.rtfd.io and http://rna-pdb-tools.rtfd.io. For all projects,

Google style docstrings (https://google.github.io/styleguide/pyguide.html) are used via Napoleon.

Napoleon is a Sphinx extension that enables Sphinx to parse Google style docstrings,

the style recommended by Khan Academy

(http://www.sphinx-doc.org/en/stable/ext/napoleon.html#type-annotations).
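The docstring convention can be illustrated with a short example. The function shown here is hypothetical: its name, its arguments, and the 5 Å threshold are assumptions made only for illustration and are not taken from the projects described in this work. It shows the sections of a Google style docstring (Args, Returns, Raises) that Napoleon converts for Sphinx.

```python
def rmsd_to_label(rmsd, cutoff=5.0):
    """Classify a model as near-native or not, based on its RMSD.

    Hypothetical helper used only to illustrate the docstring layout.

    Args:
        rmsd (float): RMSD of the model to the reference structure, in angstroms.
        cutoff (float): Threshold below which a model counts as near-native.

    Returns:
        str: ``"near-native"`` if ``rmsd < cutoff``, otherwise ``"non-native"``.

    Raises:
        ValueError: If ``rmsd`` is negative.
    """
    if rmsd < 0:
        raise ValueError("RMSD cannot be negative")
    return "near-native" if rmsd < cutoff else "non-native"
```

With the Napoleon extension enabled in the Sphinx configuration, such a docstring is rendered as a formatted parameter list in the generated documentation.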

3.3 Structure visualizations

Structure visualizations in 3D were generated with PyMOL (version 1.7.4 Edu Enhanced for

Mac OS X by Schrödinger) (DeLano 2002). VARNA (version 3.93) (Darty et al. 2009) is a

plug-in, written in Java, dedicated to drawing RNA secondary structures. It was used to visualize

the secondary structures of RNA in this work.

3.4 Databases

Protein Data Bank (Berman et al. 2000) (http://www.pdb.org/) is a database of

experimentally-determined structures of proteins, nucleic acids, and biomolecular complexes.

All experimentally solved structures used in this study were obtained from the Protein

Data Bank database.

In the study, the Rfam database (http://rfam.xfam.org/) was also used, see Materials &

Methods 3.6.1.

3.5 Development of mqapRNA

3.5.1 Datasets

To train mqapRNA, two datasets were used: RASP and RNA KB.

The first dataset (RASP) was prepared by Capriotti and coworkers to develop the RASP method

(Capriotti et al. 2011). The dataset was obtained by generating a set of Gaussian restraints for

dihedral angles and atom distances from 85 native structures. For each native RNA structure,

a set of 500 decoy structures was built by randomly removing an increasing fraction of the

restraints generated from the native RNA structure. Each decoy was built with the MODELLER

program (Sali and Blundell 1993), using a subset of the restraints as Gaussian

potentials. This dataset can be downloaded via http://melolab.org/supmat.html. Four

structures were too big (over 200 nucleotides) to be considered for the development of

mqapRNA, and, therefore, were removed from the dataset.

The second dataset was prepared to develop the RNA KB potential (Bernauer et al. 2011). The

dataset contains two subsets. The first one, RNA KB-Molecular Dynamics (MD), is based on

a set of molecular dynamics simulations in explicit solvent that generated structures with

RMSD values a few angstroms (typically 2 Å) away from the native structure. The

subset contains five sequences with 3500 distorted models per sequence. The second subset,

RNA KB-Normal Mode (NM), was generated by normal mode perturbation of the crystal

structures. The subset contains 15 sequences with 500 models per sequence. All the decoys

can be downloaded from http://csb.stanford.edu/rna, and further details are described in the

corresponding article. In addition, one more dataset was prepared for testing mqapRNA only.

For testing, the third dataset was prepared from all models submitted to the RNA-Puzzles

organizers (http://ahsoka.u-strasbg.fr/rnapuzzles/). First, all models were manually

inspected to detect discrepancies that cannot be resolved automatically with rna-pdb-tools,

e.g., differing chain names. Second, all models were standardized with rna-pdb-tools. This

dataset is available at https://github.com/mmagnus/RNA-Puzzles-Normalized-submissions,

with detailed descriptions of how the models were edited. This dataset is unique and

should be valuable to the community.

3.5.2 Primary methods

mqapRNA includes a Python interface consisting of wrappers around the primary methods. If a

given method also returns additional subscores, e.g., the energy of stacking, all subscores were

collected and used for the statistical model (Table 3.5.1). The primary predictors were divided

into the following categories: (1) model quality assessment methods: 3RNAscore (Wang and

Xiao 2002), RASP (Capriotti et al. 2011), RNA KB (Bernauer et al. 2011), εSCORE (Bottaro et

al. 2014); (2) RNA structure modeling methods: SimRNA (Boniecki et al. 2016) and Rosetta (in

two modes: low resolution (Das and Baker 2007) and full-atom high resolution (Das et al.

2010)); (3) structural descriptors: a clash score calculator (using the Probe program from the

MolProbity suite (Adams et al. 2010)), a correctness of geometry analyzer (using the Suitename

program from the MolProbity suite (Adams et al. 2010)), and the radius of gyration,

implemented in a Python script.
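As an illustration of the last descriptor, the radius of gyration can be computed from atomic coordinates in a few lines of Python. This is a minimal sketch and not the script used in mqapRNA: the function names are hypothetical, all atoms are weighted equally (no mass weighting), and the coordinates are read from the fixed-width columns of ATOM/HETATM records of the PDB format.

```python
import math


def parse_atom_coordinates(pdb_text):
    """Return a list of (x, y, z) tuples for ATOM/HETATM records in PDB text."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB columns (1-based): x 31-38, y 39-46, z 47-54.
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of atoms from their centroid."""
    n = len(coords)
    if n == 0:
        raise ValueError("no coordinates given")
    cx = sum(c[0] for c in coords) / float(n)
    cy = sum(c[1] for c in coords) / float(n)
    cz = sum(c[2] for c in coords) / float(n)
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                  for x, y, z in coords)
    return math.sqrt(squared / n)
```

A compact model yields a small value and an extended model a large one, which is why this descriptor can serve as a simple measure of compactness.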

Method | Subscores

1. SimRNA | simrna_steps, simrna_total_energy, simrna_base_base, simrna_short_stacking, simrna_base_backbone, simrna_local_geometry, simrna_bonds_dist_cp, simrna_bonds_dist_pc, simrna_flat_angles_cpc, simrna_flat_angles_pcp, simrna_tors_eta_theta, simrna_sphere_penalty, simrna_chain_energy

2a. Rosetta - coarse-grained low resolution | farna_rna_vdw, farna_rna_base_backbone, farna_rna_backbone_backbone, farna_rna_repulsive, farna_rna_base_pair, farna_rna_base_axis, farna_rna_base_stagger, farna_rna_base_stack, farna_rna_base_stack_axis, farna_rna_rg, farna_atom_pair_constraint, farna_linear_chainbreak, farna_rna_data_backbone, farna_score_lowres

2b. Rosetta - full-atom high resolution | farna_fa_atr, farna_fa_rep, farna_fa_intra_rep, farna_lk_nonpolar, farna_fa_elec_rna_phos_phos, farna_ch_bond, farna_rna_torsion, farna_rna_sugar_close, farna_hbond_sr_bb_sc, farna_hbond_lr_bb_sc, farna_hbond_sc, farna_geom_sol, farna_atom_pair_constraint_hires, farna_linear_chainbreak_hires, farna_score_hires

3. RASP | rasp_c3_pdb_energy, rasp_c3_no_contacts, rasp_c3_norm_energy, rasp_c3_mean_energy, rasp_c3_sd_energy, rasp_c3_zscore, rasp_bb_pdb_energy, rasp_bb_no_contacts, rasp_bb_norm_energy, rasp_bb_mean_energy, rasp_bb_sd_energy, rasp_bb_zscore, rasp_bbr_pdb_energy, rasp_bbr_no_contacts, rasp_bbr_norm_energy, rasp_bbr_mean_energy, rasp_bbr_sd_energy, rasp_bbr_zscore, rasp_all_pdb_energy, rasp_all_no_contacts, rasp_all_norm_energy, rasp_all_mean_energy, rasp_all_sd_energy, rasp_all_zscore

4. RNA KB | rnakb_bond, rnakb_angle, rnakb_proper_dih, rnakb_improper_dih, rnakb_lj14, rnakb_coulomb14, rnakb_lj_sr, rnakb_coulomb_sr, rnakb_potential, rnakb_kinetic_en, rnakb_total_energy

5. 3RNAscore | x3rnascore

6. εSCORE | escore

7. Geometry Analysis | analyze_geometry

8. Clash Score | clash_score

Table 3.5.1: A list of subscores extracted from the primary methods used for training and

prediction with mqapRNA. For each analyzed structure, all these scores are provided in a

CSV output file, both in the standalone version and on the web server.

3.5.3 Secondary structure comparison

Secondary structure comparisons were calculated based on outputs of ClaRNA (Waleń et al.

2014) using the Interaction Network Fidelity (INF) value, which is computed as:

$$\mathrm{INF} = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}$$

where TP is the number of correctly predicted base–base interactions, FP is the number of

predicted base–base interactions with no correspondence in the solution model, and FN is the

number of base–base interactions in the solution model not present in the predicted model

(Miao et al. 2017).
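The INF value is the geometric mean of the precision, TP/(TP+FP), and the sensitivity, TP/(TP+FN), of the predicted interactions. A minimal sketch of this calculation is given here. It is not the ClaRNA-based code used in this work: the function name is hypothetical, and each interaction is assumed to be represented by a hashable item such as a tuple of two residue numbers.

```python
import math


def interaction_network_fidelity(predicted, reference):
    """INF between two collections of base-base interactions.

    Each interaction is assumed to be a hashable item, e.g. (i, j) with i < j.
    Returns 0.0 when no interaction is shared (including two empty sets).
    """
    predicted = set(predicted)
    reference = set(reference)
    tp = len(predicted & reference)   # predicted and present in the reference
    fp = len(predicted - reference)   # predicted but absent from the reference
    fn = len(reference - predicted)   # in the reference but not predicted
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    sensitivity = tp / float(tp + fn)
    return math.sqrt(precision * sensitivity)
```

For example, a prediction with two correct pairs, one spurious pair, and two missed pairs gives a precision of 2/3 and a sensitivity of 1/2, hence an INF of about 0.58.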

3.5.4 Standardization of PDB files

All structures before scoring were standardized with rna-pdb-tools

(https://github.com/mmagnus/rna-pdb-tools).


3.5.5 Evaluation of scoring functions

To assess the accuracy of the prediction, 5-fold cross-validation was performed on the RASP

and RNA KB datasets. None of the structures from the RNA-Puzzles dataset was used at any stage

of training the statistical model. The cross-validation was performed using the built-in

functionality of the H2O platform via the H2O Flow web interface.
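The principle of k-fold cross-validation can be sketched independently of H2O. The function here is a hypothetical illustration and does not reproduce the fold assignment used by H2O, which is not described in this work: samples are assigned to folds at random, and each fold serves once as the held-out set while the remaining folds are used for training.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return a list of k (train_indices, test_indices) pairs.

    Assumption: samples are shuffled and dealt into k folds of nearly equal
    size; every sample is held out exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

The error averaged over the k held-out folds is then reported as the cross-validated estimate.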

To assess the performance of the scoring functions, the rank correlation (Spearman, R) between

scores and RMSDs was calculated. To compare structural models to native structures, the root

mean square deviation (RMSD) was used. RMSD is defined by the following formula:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \delta_i^2}$$

where $\delta_i$ is the Euclidean distance between the i-th pair of corresponding atoms and N is

the number of such pairs. RMSD is calculated for all heavy atoms. R ranges from -1 to 1. If the

energy increases perfectly monotonically with the RMSD, R is equal to 1. If the energy is

unrelated to the RMSD (random), R is close to 0.
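Both measures can be sketched in plain Python. The functions here are hypothetical illustrations and not the code used in this work. The RMSD function assumes that the two structures are already superimposed and that atoms are given in the same order; the Spearman coefficient is computed as the Pearson correlation of ranks, with tied values receiving their average rank.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Because only ranks enter the calculation, any energy that orders the decoys in the same way as the RMSD gives R equal to 1, regardless of whether the relationship is linear.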

The second metric used to assess the performance was the Enrichment Score (ES), described in

the publication about RNA KB (Bernauer et al. 2011). The enrichment score is defined as:

$$\mathrm{ES} = \frac{|E_{top10\%} \cap R_{top10\%}|}{0.1 \times 0.1 \times N_{decoys}}$$

where $E_{top10\%}$ is the set of structures with energies in the top (best-scoring) 10%,

$R_{top10\%}$ is the set of structures with the RMSD in the lowest 10%, and $N_{decoys}$ is the

total number of decoys. $|E_{top10\%} \cap R_{top10\%}|$ is the number of structures

in the intersection of these two sets. The ES ranges from 0 to 10. If the ranking by energy

matches the ranking by RMSD perfectly, ES is equal to 10. If the energy is random, ES is equal to 1.
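The enrichment score is the size of the overlap between the best 10% of structures by energy and the best 10% by RMSD, divided by the overlap expected by chance (0.1 × 0.1 × number of decoys). A minimal sketch is given here, with a hypothetical function name and under the assumption that a lower energy means a better score.

```python
def enrichment_score(energies, rmsds, fraction=0.1):
    """Enrichment score for paired lists of energies and RMSDs.

    Assumption: lower energy is better, so the "top" set by energy contains
    the lowest-energy structures.
    """
    n = len(energies)
    if n == 0 or n != len(rmsds):
        raise ValueError("need two non-empty lists of equal length")
    k = int(round(fraction * n))
    if k == 0:
        raise ValueError("too few structures for the requested fraction")
    best_by_energy = set(sorted(range(n), key=lambda i: energies[i])[:k])
    best_by_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    overlap = len(best_by_energy & best_by_rmsd)
    # Expected overlap for a random score is fraction * fraction * n.
    return overlap / (fraction * fraction * n)
```

For 500 decoys, the two top sets contain 50 structures each and a random score is expected to share 5 of them, which corresponds to ES equal to 1.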

3.5.6 Statistical analyses

A wide range of statistical methods was applied to complete the project: Pearson and

Spearman's rank correlation coefficients, data normalization, statistical model building.

Statistical analyses were carried out using R (version 3.3) and Python (version 2.7) with

Jupyter (formerly IPython) (Pérez and Granger 2007). The final statistical model was built with

the H2O platform (https://www.h2o.ai/). H2O is a free and open source machine-learning

platform that allows for building statistical models trained on big (and small) data.

3.5.7 Implementation of the web server

The web server of mqapRNA (http://genesilico.pl/mqapRNA/) was implemented in Python

(https://www.python.org/, version 2.7) coupled with Django (https://www.djangoproject.com,

version 1.5.1) and SQLite (http://www.sqlite.org, version 3.8.2). SQLite is a self-contained,

high-reliability, embedded, full-featured, public-domain, SQL database engine that keeps all

the data in one file. The database management system is used to store information about users’

submissions. Django is a Python Web framework that is designed for fast development of

web services. mqapRNAweb provides a clean interface that is designed to be user-friendly

even for users without prior expertise in RNA bioinformatics.

3.6 Development of EvoClustRNA

3.6.1 Multiple sequence alignment generation and selection of homologs

For each sequence, the corresponding Rfam (Nawrocki et al. 2015) alignment was

downloaded. Rfam (http://rfam.sanger.ac.uk/) is a database of RNA

sequences grouped into RNA families. Each family is represented as a statistical model (CM,

covariance model), built with the Infernal software (Nawrocki et al. 2009), that combines

sequence and structural (secondary structure) information. Sequences in the alignments were

sorted by length, and redundancy was reduced at a sequence similarity threshold of 90% with

Jalview (Waterhouse et al. 2009). The four shortest sequences were selected for

modeling. The conserved regions were visually identified in Emacs using the RALEE

(Griffiths-Jones 2005) plugin. A new pseudo-sequence named “x” (Fig. 3.6.1) was created to

mark the conserved residues that should be cut out for clustering. If the target sequence

was not in the alignment, it was manually added. Based on the alignments, a set of FASTA input

files with sequences and their secondary structures was created with Jalview (Fig. 3.6.2) and

used as input for modeling.
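The selection of homologs described above was done interactively in Jalview. For illustration only, a comparable selection can be sketched in Python. The function names are hypothetical, and the redundancy filter is a simplified stand-in for the Jalview procedure: pairwise identity is computed over alignment columns in which both sequences have a residue, and sequences are accepted greedily from the shortest to the longest.

```python
GAP_CHARS = "-."


def ungapped_length(aligned_seq):
    """Number of residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for ch in aligned_seq if ch not in GAP_CHARS)


def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def select_homologs(alignment, n_homologs=4, max_identity=90.0):
    """Pick the shortest, mutually non-redundant sequences from an alignment.

    alignment: dict mapping sequence name to aligned sequence.
    """
    ordered = sorted(alignment.items(),
                     key=lambda item: (ungapped_length(item[1]), item[0]))
    selected = []
    for name, seq in ordered:
        if all(percent_identity(seq, alignment[s]) < max_identity
               for s in selected):
            selected.append(name)
        if len(selected) == n_homologs:
            break
    return selected
```

Short homologs are preferred here simply because they appear first in the length-sorted list; the target sequence itself is handled separately, as described in the text.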

Figure 3.6.1: The alignment preparation. The conserved residues are marked with “x” in the

pseudo-sequence “x”. The columns marked as conserved can be inspected in an

arc diagram of RNA secondary structures (Lai et al. 2012) as the pink line (at the very

bottom).

Figure 3.6.2: Using the Jalview program, each sequence and its associated secondary structure

was saved ("Save as") to a FASTA file and used at the next stage of modeling.

3.6.2 Modeling of sequences with SimRNA/SimRNAweb and Rosetta

For modeling with SimRNA (Boniecki et al. 2016), the SimRNAweb (Magnus et al. 2016)

(http://genesilico.pl/SimRNAweb) server was used with the default parameters (1% of the

lowest energy frames taken for clustering, 500 simulation steps). SimRNA

trajectories were downloaded from the server, and one hundred low-energy models were

obtained from each SimRNA trajectory with programs implemented in rna-pdb-tools

(https://rna-pdb-tools.readthedocs.io/en/latest/utils.html#simrna).
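Selecting low-energy frames from a trajectory amounts to sorting the frames by energy. The sketch here is a hypothetical illustration and does not show the rna-pdb-tools programs themselves; it assumes that the energy of every frame has already been read into a list, in trajectory order.

```python
def lowest_energy_frames(energies, n_models=100):
    """Indices of the n_models lowest-energy frames, lowest energy first."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    return order[:n_models]


def lowest_energy_fraction(energies, fraction=0.01):
    """Indices of the given fraction of frames with the lowest energy.

    At least one frame is always returned for a non-empty trajectory.
    """
    n_models = max(1, int(len(energies) * fraction))
    return lowest_energy_frames(energies, n_models)
```

The returned indices can then be used to extract the corresponding frames from the trajectory file.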

For Rosetta, a pipeline implemented in rna-pdb-tools (utils Rosetta,

https://rna-pdb-tools.readthedocs.io/en/latest/utils.html#rosetta) was used as described in the

work of Cheng and coworkers (Cheng et al. 2015). The procedure starts with the pre-assembly

of helices. Then Rosetta is run without minimization to obtain 10,000 output models. Next, the

1/6 (17%) of the models with the lowest energy is minimized. For each Rosetta run, one hundred

low-energy models

were selected for clustering with EvoClustRNA.

3.6.3 Clustering routine

The clustering procedure used in EvoClustRNA was originally implemented by Irina Tuszyńska

for use with DARS-RNP and QUASI-RNP (statistical potentials for protein-RNA docking)

(Tuszyńska and Bujnicki 2011). For EvoClustRNA, the procedure was slightly

modified, but the underlying principles remained the same. The program is an

implementation of the algorithm used for clustering with Rosetta in protein structure

prediction (Simons et al. 1999), also described by Bonneau and coworkers (Bonneau et al.

2001). Briefly, one hundred low-energy structures for each homolog are taken for clustering.

The clustering procedure is iterative and begins with calculating a list of neighbors for each

structure. Two structures are considered neighbors when the RMSD between them is smaller

than a given distance cutoff. To find a proper cutoff, the iterative procedure starts from 0.5 Å

and the cutoff is incremented by 0.5 Å until the three biggest clusters contain at least half of all

structures used for clustering. For example, for five homologs, 500 structures are clustered,

and the iterative clustering stops when there are at least 250 structures in the three biggest clusters.
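The search for the cutoff can be sketched as follows. This is an illustration under stated assumptions and not the EvoClustRNA code: the function names are hypothetical, the pairwise RMSD matrix is assumed to be precomputed, and clusters are formed greedily by taking the structure with the most neighbors as the cluster center and removing it together with its neighbors, in the spirit of the Rosetta clustering scheme cited above.

```python
def cluster_at_cutoff(dist, cutoff):
    """Greedy neighbor-count clustering of a symmetric distance matrix.

    Returns a list of (center_index, sorted_member_indices), largest first.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, None
        for i in sorted(remaining):
            # A structure is its own neighbor because dist[i][i] is 0.
            members = [j for j in remaining if dist[i][j] < cutoff]
            if best_members is None or len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, sorted(best_members)))
        remaining -= set(best_members)
    return clusters


def cluster_with_adaptive_cutoff(dist, start=0.5, step=0.5, n_top=3,
                                 min_fraction=0.5, max_cutoff=50.0):
    """Raise the cutoff until the n_top biggest clusters hold min_fraction of all.

    max_cutoff is a safety limit added for this sketch.
    """
    n_steps = 0
    while True:
        cutoff = start + n_steps * step
        if cutoff > max_cutoff:
            raise ValueError("stop criterion not met below max_cutoff")
        clusters = cluster_at_cutoff(dist, cutoff)
        sizes = sorted((len(members) for _, members in clusters), reverse=True)
        if sum(sizes[:n_top]) >= min_fraction * len(dist):
            return cutoff, clusters
        n_steps += 1
```

In EvoClustRNA, the distances would be RMSDs calculated over the conserved regions only, so that models of different homologs can be compared with each other.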

4 Results

4.1 mqapRNA

mqapRNA (where “mqap” stands for “model quality assessment program”) is a computer

program that analyses a set of models provided by the user in the PDB format and predicts

quality scores. It is a meta-predictor, i.e., a method designed to use other methods (called

primary methods) and to analyze their outputs with a dedicated statistical model. Such an

approach can provide a better prediction by overcoming the weaknesses of individual methods

and building on their individual strengths. The meta-prediction approach has been shown to be

successful in structural bioinformatics, in particular in protein (Albrecht et al. 2003) and

RNA secondary structure prediction (Siebert and Backofen 2005), protein fold-recognition

(Kurowski and Bujnicki 2003), identification of protein domains (Saini and Fischer 2005),

and evaluation of protein model quality (Pawlowski et al. 2008). Earlier, I used this kind of

approach to improve prediction of subcellular localization for proteins (Magnus et al. 2012).

4.1.1 Implementation of mqapRNA

Primary prediction methods. mqapRNA relies on existing methods and a statistical model

to potentially provide better prediction than each individual method. Based on the results

obtained for the series of primary predictors, mqapRNA uses their outputs to generate a

consensus prediction. Table 3.5.1 in the Materials & Methods section lists all the primary

predictors used in this work.

Figure 4.1.1: Graphical diagram of primary methods used by mqapRNA to describe the

analyzed model. (A) Other methods for model quality assessment, (B) RNA modeling

software, (C) others.

Primary methods are divided into three groups. The first group of programs includes dedicated

methods for model quality assessment: 3RNAscore, RASP, RNA KB, εSCORE (Fig. 4.1.1A).

The second group comprises methods for RNA structure modeling, which also allow for

calculating structural descriptors and energy values that can be used as input for the final

statistical model. For each analyzed structure, mqapRNA runs a single-step simulation with

SimRNA and Rosetta (the latter in two modes: low resolution (FARNA) and

full-atom high resolution (FARFAR)) to generate scores for the input model (Fig. 4.1.1B). The

third group contains methods for calculating the clash score, the correctness of geometry, and

the radius of gyration (Fig. 4.1.1C). For each analyzed structure, mqapRNA runs all the

above-mentioned programs to generate a list of scores and uses it for quality prediction with a

deep learning statistical model.

Datasets used for training and testing. To build a statistical model, two datasets were used:

RASP and RNA KB. These datasets consist of near-native (deliberately perturbed) structural

models (“decoys”). Decoy structures are used to test the discriminative power of scoring

methods. A good scoring method should be effective in identifying near-native decoys in a

pool of structures.

The RASP dataset was generated by MODELLER (Sali and Blundell 1993) with a set of

Gaussian restraints for dihedral angles and atom distances from 85 native structures. The

dataset includes 85 decoy sets, each containing 500 structures (Fig. 4.1.2).

Figure 4.1.2: Example of a decoy set from the RASP dataset of the adenine riboswitch (PDB

ID: 1Y26). (A) The native structure. (B-F) A set of structures (files in the PDB format)

selected from this decoy set with increasing deviation from the native structure (RMSDs to the

native structure are given in parentheses). Files: (B) 1y26X_M100 (RMSD: 1.7Å), (C) 1y26X_M200 (RMSD:

2.49Å), (D) 1y26X_M300 (RMSD: 3.23Å), (E) 1y26X_M400 (RMSD: 3.31Å), (F)

1y26X_M500 (RMSD: 5.12Å).

The second dataset used for training mqapRNA was RNA KB. This dataset includes two

subsets: RNA KB-Molecular Dynamics (MD) and RNA KB-Normal Mode (NM). The first

subset was generated by position-restrained molecular dynamics and Replica-Exchange

Molecular Dynamics (REMD) simulations and covers a wide near-native RMSD range (from

0.1 to 10 Å, Fig. 4.1.3). In the REMD protocol, 1 ns REMD simulations were performed for

each RNA structure. The subset contains five decoy sets, each containing 3500 structures.

The second subset of RNA KB was generated by the Normal Mode perturbation method. The

structures in this subset possess stereochemically correct bond lengths and angles but without

correct base pairing (Fig. 4.1.4). The subset contains 15 decoy sets, each including 500

structures.

The third dataset was used only for testing and includes all models submitted to the RNA-

Puzzle organizers by participating groups.

Figure 4.1.3: Histograms of RMSDs [Å] per dataset. In red, the datasets used for training

mqapRNA; in orange, the dataset used only for testing. X: number of structures (not scaled in

the same way for all plots because of the very diverse ranges), Y: RMSDs [Å].

All datasets have different structural properties (Fig. 4.1.3 and Fig. 4.1.4). The RASP dataset

covers RMSD from 0 Å to 10 Å (median: 3.94 Å) and structures with very distorted base

pairing (median of INF: 0.79). RNA KB-MD contains the structures closest to the native

structures in terms of geometry (median of RMSD: 1.59 Å) and base pairing (median of INF:

1.0). RNA KB-NM covers a narrower range of RMSD but presents geometrical

distortions different from those of the preceding physics-based force field method, with very

divergent secondary structure similarity (median of INF: 0.77). The RNA-Puzzles dataset

consists of structures that

are far from the native structures (median: 14.78 Å, standard deviation: 7.46 Å).

Figure 4.1.4: Histograms of Secondary Structure (INFs) per dataset. In red, the datasets used

for training mqapRNA, in orange, the dataset used only for testing. X: number of structures

(not scaled in the same way for all plots because of the very diverse ranges), Y: Secondary

Structure similarity of a given model to a secondary structure of a native structure (INFs).

Training of the statistical model. mqapRNA is based on machine learning methods that

recognize patterns in the data in order to make predictions about new data. It utilizes

“supervised learning”, where good and bad examples are presented to a statistical model in

order to train it to make predictions. For each structure of the training datasets, a list of scores

from primary methods was obtained. Each structure was described by 70 variables - scores

obtained from 8 primary methods. The response variable was the value of the structure

RMSD to the native structure (Fig. 4.1.5A), and this is the value that mqapRNA projects for

new structures (of unknown RMSD) based only on scores from the primary methods (Fig.

4.1.5B).

Figure 4.1.5: mqapRNA is a machine learning based method. (A) First, a statistical model

was built on a training dataset of structures of known RMSD to native structures. Each

structure is described by a list of scores, results of the primary methods. Since this is the

training set, the RMSD of each of these structures to its native structure is known. This process

allows mqapRNA to learn the correspondence between scores and RMSDs. (B) Next, the

statistical model is applied for new cases, where RMSD is unknown.

The statistical model used in mqapRNA is based on a deep learning algorithm. A series

of grid searches was performed to find an optimal set of parameters for the statistical model.

Five-fold cross-validation was performed to limit the bias towards the training dataset.

The whole procedure was performed using the machine learning platform, H2O. At the very

end of this process, the final statistical model was selected to be used in mqapRNA. An

accurate statistical model, based on the training data, should be able to discover links

between scores and RMSD values, therefore, for a new vector where only scores are known,

the method should predict the theoretical RMSD. The predicted RMSDs are a measure of the

quality of the structure.
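
The train-then-predict scheme with five-fold cross-validation described above can be illustrated with a small sketch. It is not the H2O deep learning model used in mqapRNA: a 1-nearest-neighbour regressor stands in for the statistical model, and the function names (`k_fold_indices`, `predict_1nn`, `cross_validate`) as well as the toy data are hypothetical.

```python
import math
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_x, train_y, query):
    """Stand-in model: return the RMSD of the closest training score vector."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[nearest]


def cross_validate(scores, rmsds, k=5):
    """Mean absolute error of the predicted RMSD over k held-out folds."""
    errors = []
    for fold in k_fold_indices(len(scores), k):
        held_out = set(fold)
        train_x = [s for i, s in enumerate(scores) if i not in held_out]
        train_y = [r for i, r in enumerate(rmsds) if i not in held_out]
        for i in fold:
            predicted = predict_1nn(train_x, train_y, scores[i])
            errors.append(abs(predicted - rmsds[i]))
    return sum(errors) / len(errors)


# Toy data: 20 structures, each described by 3 made-up scores
# (in the thesis each structure is described by 70 variables).
rng = random.Random(1)
toy_rmsds = [rng.uniform(1.0, 20.0) for _ in range(20)]
toy_scores = [[r + rng.gauss(0, 0.5), 2 * r, -r] for r in toy_rmsds]
mean_abs_error = cross_validate(toy_scores, toy_rmsds)
```

Each structure is held out exactly once, so the reported error is always measured on structures that the stand-in model has not seen during training.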

Figure 4.1.6: Contribution (“Importance”) of a given subscore (“Variable”) in the final deep

learning model developed for mqapRNA (a plot generated with the H2O Flow notebook).

The higher the importance, the more a given subscore contributes to accurate predictions of the statistical

model.

Figure 4.1.6 shows the impact (“variable importance”) of each variable in the final

statistical model. Surprisingly, the variable with the greatest impact was a score that describes

the radius of gyration of an analyzed model. This might mean that the appropriate radius of

gyration (compactness) of models is important for quality prediction. The second score

(scaled importance: 0.68) on the list is a component of RASP (“RASP All Interactions

Normalized Energy”), and the third is 3dRNAscore (scaled importance: 0.59). The statistical

model also depends on the number of chains (“No Chains”) and the length of an analyzed

structural model (7th and 8th in the ranking, respectively).

4.1.2 Performance of mqapRNA

To test how the scores of the methods correlate with the observed structural deviation from

the native conformation, rank correlations were calculated between the structural deviations,

measured as RMSD to the native structure, and scores on three datasets: RASP, RNA KB

(Molecular Dynamics & Normal Mode), and the submissions to RNA-Puzzles.

The first benchmark uses rank correlations (R) to show how well a given method is able to

rank all the models, from very good to very bad. Figure 4.1.7 shows the rank correlations

for each decoy set and scoring method. To compare the performance of the scoring methods on

datasets of different sizes, the weighted average was introduced. mqapRNA (Fig. 4.1.7, 3rd

column) outperformed all other scoring methods achieving a weighted average (Fig. 4.1.7,

the last row of the plot) of rank correlations of 0.77. The second was RASP with a weighted

average of 0.74. SimRNA scored as the third method with a weighted average of 0.71. Note

that mqapRNA also achieved very high rank correlations (an average per decoy set of over 0.8) for

the RASP, RNA KB-Molecular Dynamics, and RNA KB-Normal Mode datasets. Clash Score and

Analyze Geometry performed poorly, with weighted averages of 0.23 and 0.42, respectively.

However, all the methods scored poorly on the RNA-Puzzle datasets compared to the others.

This low prediction quality is due to the higher complexity of the distortions in the

RNA-Puzzle datasets. This might suggest that the datasets of RASP and RNA KB do not represent

the deviations of models that one might encounter in real-life case studies of RNA structure

prediction.
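
The first benchmark can be sketched in a few lines. The sketch assumes that the rank correlation is the Spearman coefficient (the Pearson correlation of ranks) and that the weighted average uses the number of models in each decoy set as the weight; both are assumptions made for illustration, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(rmsds, scores):
    """Spearman rank correlation between RMSDs and scores of one decoy set."""
    return pearson(average_ranks(rmsds), average_ranks(scores))


def weighted_average(correlations, set_sizes):
    """Average of per-decoy-set correlations, weighted by decoy set size."""
    total = sum(c * n for c, n in zip(correlations, set_sizes))
    return total / sum(set_sizes)
```

With this weighting, a large decoy set influences the final value more than a decoy set with only a handful of models.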

The second benchmark uses the Enrichment Score (ES) to show how many of the best 10% of

the models (by RMSD) were also placed by a given method among its top-scoring 10%. This metric tests the capability of methods to

identify the subset of the best models in a given decoy set. Figure 4.1.8 shows the Enrichment

Scores for each decoy set and scoring method. In this test, SimRNA achieved the

highest weighted average of 5.4 and outperformed mqapRNA with a weighted average of 5.3.

Once again, the RNA-puzzles dataset was the most difficult for the quality prediction. The

best method on this dataset was FARNA (in the high resolution mode) with an average of

2.3. mqapRNA was the second on this dataset with an average of 1.8. Interestingly,

Secondary Structure (INF), a score that compares the secondary structure

of a model with the true secondary structure obtained from a crystal structure, achieved an

average of 3.6. This score assumes that the secondary structure predicted for a given sequence is the

same as the secondary structure of the crystal structure, which in practice is very difficult to

obtain. For the first four consecutive RNA-Puzzles, almost all methods achieved an ES of zero.

For these RNA-Puzzles, the participating groups submitted only one or two models per group.

Thus, to get a high ES value for RNA-Puzzle 1 with only twelve submitted models (for

example), a scoring function should detect one particular model, since 10% of twelve is 1.2.

Interestingly, for all the methods, a huge difference in the performance has been recorded

between the RASP (the dark red area in Figure 4.1.8) and the RNA KB datasets (the mixed

red-blue area in the middle of Figure 4.1.8). This might suggest that the RASP dataset is

composed of a limited number of near-native models. Since the majority of models are far

from the native structures, they cannot be detected as good ones. In the case of the RNA KB

datasets, it appears that there are many near-native models, but the scoring functions have

problems distinguishing them from worse models. Figure 4.1.8 also suggests that the RNA KB

decoys are very similar to the native structures, which makes their assignment

to the best 10% of the models much harder.
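
The Enrichment Score can be illustrated with a minimal sketch. It assumes a commonly used definition, in which the number of models shared by the best 10% by RMSD and the best 10% by score is divided by 0.1 × 0.1 × N, so that the value ranges from 0 to 10. It also assumes that a lower score means a better model and that the size of the top 10% subset is rounded to at least one model. These choices and the function name are illustrative assumptions.

```python
def enrichment_score(rmsds, scores, fraction=0.1):
    """Overlap of the best models by RMSD and by score, scaled to 0..1/fraction.

    Assumes that lower RMSD and lower score both mean a better model.
    """
    n = len(rmsds)
    top = max(1, round(fraction * n))  # e.g. 12 models -> 1 model in the top 10%
    best_by_rmsd = set(sorted(range(n), key=lambda i: rmsds[i])[:top])
    best_by_score = set(sorted(range(n), key=lambda i: scores[i])[:top])
    return len(best_by_rmsd & best_by_score) / (fraction * fraction * n)


# A decoy set of twelve models, as in the RNA-Puzzle 1 example above:
# only one model belongs to the best 10%, so the score is either 0 or 1 / 0.12.
twelve_rmsds = [float(i) for i in range(1, 13)]
perfect_ranking = enrichment_score(twelve_rmsds, twelve_rmsds)
inverted_ranking = enrichment_score(twelve_rmsds, [-r for r in twelve_rmsds])
```

Under these assumptions a decoy set of twelve models can only yield two values, which illustrates why the metric becomes an all-or-nothing test for small sets of submissions.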

Figure 4.1.9 shows a close-up of the RNA-Puzzle 14 scorings. mqapRNA achieved an

ES of 7.7, being able to identify a group of near-native models. Other methods were not able

to rank the models properly. Note that both modes of FARNA (the high resolution mode and the

low resolution mode) scored second with an ES of 5.8.

Figure 4.1.7: Rank correlations for each decoy set and scoring method. mqapRNA (3rd

column) outperformed other scoring functions with a weighted average of rank correlations

of 0.77.

Figure 4.1.8: Enrichment Score for each decoy set and scoring method. mqapRNA (3rd

column) is outperformed by SimRNA (10th column) by 0.1 in terms of ES.

Figure 4.1.9: Close-up on the RNA-Puzzle 14 results in a form of RMSD [Å] vs Score plots.

A perfect method should follow the diagonal of a plot. mqapRNA achieved an ES of 7.7 and

was able to identify a group of the near-native models. Other methods were not able to rank

models properly.

4.1.3 mqapRNA web server: quality prediction with optional restraints

The mqapRNA web server is a workflow based on a combination of computational tools. It

offers a user-friendly web interface to submit RNA PDB structures and view the results. All

steps of the analysis are automated, which makes the process of scoring available to users

who would otherwise have to install many programs locally. All

intermediate results can be downloaded and processed by the user. The server is available

at http://genesilico.pl/mqapRNA/. The server is free and open to all users, with

no login requirement. Further details can be found at the documentation page of the server

(http://genesilico.pl/mqapRNA/documentation).

A user can submit RNA structural files in the PDB format in three different forms: (1) a

single file, (2) a single file with many models (“NMR-style”), or (3) a ZIP file with multiple

PDB files (Fig. 4.1.10). All PDB files are processed with rna-pdb-tools to obtain

standardized RNA structures on which the primary methods can be run. Since incorporation of

the information about RNA secondary structure improves the quality prediction of models,

users can provide their own secondary structure or let mqapRNA predict it. The secondary




structure can be predicted with the use of experimental chemical probing data (SHAPE).

The user can also provide a set of distance restraints to refine the quality prediction. The

distance restraints and secondary structure do not have to be provided upfront; the user can

submit them also at the result page. Moreover, both types of restraints can be easily

re-submitted at the result page to help the user select the right models. A complete quality

prediction result consists of a plot of mqapRNA scores (Fig. 4.1.11), a table of the scores (Fig.

4.1.11), and the distance restraints editor (Fig. 4.1.12). The server accepts distance restraints

in a flat text file. We tested whether the quality of prediction can be improved with

evolutionary restraints and MOHCA restraints. For evolutionary restraints, the suggested

distance is 7 Å, while for MOHCA-seq it is 25 Å (Das et al. 2008). Analysis of those scores can

help the user to decide which structure to select for further investigation. The raw output

files from each step of the prediction are also available and the user can carry out additional

data analysis, if desired.
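
How a set of distance restraints can discriminate between models may be sketched as follows. The restraint representation (pairs of residue numbers with a maximum distance), the use of one representative atom per residue, and the function name are hypothetical; the sketch does not reproduce the file format or the scoring of the mqapRNA server.

```python
import math


def fraction_satisfied(coords, restraints):
    """Fraction of distance restraints fulfilled by one structural model.

    coords: {residue_number: (x, y, z)} for one representative atom per residue.
    restraints: list of (residue_i, residue_j, max_distance_in_angstroms).
    """
    satisfied = 0
    for residue_i, residue_j, max_distance in restraints:
        if math.dist(coords[residue_i], coords[residue_j]) <= max_distance:
            satisfied += 1
    return satisfied / len(restraints)


# Toy model with three residues placed along the x axis.
toy_coords = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 3: (30.0, 0.0, 0.0)}
toy_restraints = [
    (1, 2, 7.0),   # evolutionary restraint, suggested distance of 7 Å
    (1, 3, 25.0),  # MOHCA-type restraint, suggested distance of 25 Å
]
toy_fraction = fraction_satisfied(toy_coords, toy_restraints)
```

Models that fulfil a larger fraction of the restraints would then be preferred among models with otherwise similar scores.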

Figure 4.1.10: The homepage of the mqapRNA web server.

Figure 4.1.11: A result page of mqapRNA. The page is divided into three panels: a plot of

mqapRNA score, a table of the score, and the restraints editor. The distance restraints can be

easily modified and re-submitted to the server. The results will be immediately updated,

which might encourage the user to try different sets of restraints.

Figure 4.1.12: Distance restraints editor at the bottom of the result page. The user can upload

a file with distance restraints or use an online editor to modify his/her query. After the re-

submission, the scores are re-calculated, and a new plot is generated.

4.2 EvoClustRNA

4.2.1 Implementation of EvoClustRNA

Based on the observation that RNA sequences from the same RNA family fold into a highly

conserved structure, together with professor Das, we made an assumption that a similar

process can be observed in silico. We assumed that computational modeling could be used to

detect global helical arrangements for the target sequence, based on the arrangements within

a subset of homologs. Thus, this project explores the use of multiple sequence alignment

information and parallel modeling of RNA homologs to improve ab initio RNA structure

prediction methods. To build a structural model of the target sequence, a multi-step modeling

process must be performed (Fig. 4.2.1).

Figure 4.2.1: The scheme of the proposed methodology. (A) Homologous sequences are

found for the target sequence, and an RNA alignment is created. (B) Using SimRNA

and/or Rosetta, structural models for all sequences are generated. (C) The

conserved regions are cut out and clustered. (D) The final prediction of the method is the

model containing the most commonly preserved structural arrangements in the set of

homologs.

First, a subset of homologous sequences for the target sequence is selected using an

alignment from the Rfam database. Alignments are processed as described in the Materials &

Methods section 3.6.1. Subsequently, independent folding simulations are performed with

SimRNAweb and Rosetta for the selected sequences to generate initial models. Then,

structural fragments, which are evolutionarily conserved helical regions that were determined

from the alignment, are extracted from all obtained models and clustered. The center (the model

with the highest number of neighbors) of the biggest cluster is taken as the final prediction.

In the current implementation of the method, the user should add a new line, “x”, to the

alignment that marks the regions selected for the clustering. This line can be created

automatically with rna-pdb-tools. However, the user can also define the regions for clustering manually.

This step is critical for the whole process, and the user should take care to include in clustering

only the wanted regions.
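
The role of the “x” line can be illustrated with a short sketch that maps the marked alignment columns onto residue numbers of each aligned sequence. The representation used here (equally long strings, “x” for a selected column, “-” or “.” for a gap) and the function name are assumptions made for illustration; they do not reproduce the parser used in EvoClustRNA.

```python
def residues_for_clustering(marker_line, aligned_sequence, gap_chars="-."):
    """Return 1-based residue numbers of a sequence that fall in marked columns."""
    if len(marker_line) != len(aligned_sequence):
        raise ValueError("marker line and sequence must have the same length")
    selected = []
    residue_number = 0
    for mark, nucleotide in zip(marker_line, aligned_sequence):
        if nucleotide in gap_chars:
            continue  # a gap does not correspond to any residue in the model
        residue_number += 1
        if mark == "x":
            selected.append(residue_number)
    return selected


# Toy alignment: the same marker line selects different residue numbers
# in the target and in a homolog, because the gaps are placed differently.
marker = "xxx--xx-"
target = "GGCA-GCU"
homolog = "GGCUAG-U"
target_residues = residues_for_clustering(marker, target)
homolog_residues = residues_for_clustering(marker, homolog)
```

Because the residue numbering differs between homologs, the selected columns have to be translated separately for every sequence before the corresponding fragments are cut out of the models.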

The initial version of the method, which was developed at Stanford University with professor

Rhiju Das, used models generated with Rosetta. However, the EvoClustRNA method itself is

independent from the source of analyzed initial structural models. For this reason, I decided

to also test EvoClustRNA using models generated with SimRNAweb, a

tool for RNA structure prediction developed in the laboratory of professor Janusz Bujnicki.

EvoClustRNA is implemented as a set of Python programs, which can be downloaded

together with the documentation and examples from the GitHub repository

(https://github.com/mmagnus/EvoClustRNA). The evoClustRNA.py main script requires an

input alignment and a folder with initial models of all homologs to generate an all-vs-all

distance matrix between selected clustering fragments. The next step is to use the

evoClust_autoclustix.py, which is an implementation of an iterative clustering procedure. As

a result of this script, a set of clusters is generated. The structure with the highest number of

neighbors of the first (biggest) cluster is taken as the final prediction.
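
The selection of the final prediction can be sketched with a simple neighbor-based clustering of an all-vs-all distance matrix. This is a minimal illustration under stated assumptions: the fixed cutoff, the greedy extraction of clusters, and the function names are hypothetical, and the sketch does not reproduce the iterative procedure implemented in evoClust_autoclustix.py.

```python
def neighbours(distances, model, candidates, cutoff):
    """Models among candidates that lie within the cutoff of the given model."""
    return {j for j in candidates if j != model and distances[model][j] <= cutoff}


def cluster_by_neighbours(distances, cutoff):
    """Greedy clustering of a symmetric all-vs-all distance matrix.

    The model with the highest number of neighbours becomes the centre of a
    cluster; the centre and its neighbours are removed and the procedure is
    repeated. Returns a list of (centre, members), biggest cluster first.
    """
    remaining = set(range(len(distances)))
    clusters = []
    while remaining:
        centre = max(
            sorted(remaining),
            key=lambda i: len(neighbours(distances, i, remaining, cutoff)),
        )
        members = neighbours(distances, centre, remaining, cutoff) | {centre}
        clusters.append((centre, sorted(members)))
        remaining -= members
    return clusters


# Toy distance matrix (e.g. RMSD in Å between the selected fragments of 5 models).
toy_distances = [
    [0, 1, 3, 10, 10],
    [1, 0, 1, 10, 10],
    [3, 1, 0, 10, 10],
    [10, 10, 10, 0, 1],
    [10, 10, 10, 1, 0],
]
toy_clusters = cluster_by_neighbours(toy_distances, cutoff=2)
final_prediction = toy_clusters[0][0]  # centre of the first (biggest) cluster
```

In the toy example, model 1 has two neighbors within the cutoff and is therefore reported as the center of the biggest cluster.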

4.2.2 Blind predictions with EvoClustRNA in the RNA-Puzzles

EvoClustRNA was tested on the RNA-Puzzle 13 problem. The target of 71 nucleotides was

an RNA 5-aminoimidazole-4-carboxamide riboside 5′-monophosphate (ZMP) riboswitch,

which can up-regulate de novo purine synthesis in response to increased intracellular levels of


ZMP (Trausch et al. 2015). The alignment for this riboswitch was downloaded from the

Rfam database (RF01750), whence ten homologs were selected for modeling with Rosetta.

The secondary structures for all homologs were devised with Jalview based on the Rfam

alignment. The pseudoknot was suggested in the available literature (Kim et al. 2015) and it

was used for modeling. The EvoClustRNA prediction, with an RMSD of 5.55 Å with respect

to the native structure (Fig. 4.2.2), was the second in the total ranking of RNA-Puzzles

(http://ahsoka.u-strasbg.fr/rnapuzzlesv2/result/Puzzle13/). The final prediction was made

based on the visual inspection of the best clusters, which were obtained by using the

EvoClustRNA method.

Figure 4.2.2: The RNA-Puzzle 13 - the ZMP riboswitch. The superposition of the native

structure (green) and the EvoClustRNA prediction (blue). The RMSD between structures is

5.55 Å; the prediction was ranked second in the total ranking of the RNA-Puzzles

(according to the RMSD values).

EvoClustRNA was also used in the RNA-Puzzles for modeling problem 14. The RNA

molecule of interest was the 61-nucleotide long L-glutamine riboswitch, which upon

glutamine binding undergoes a major conformational change in the P3 helix (Ren et al.

2015). It was the first RNA-Puzzle for which the participating groups were asked to model

two forms of the RNA molecule: one with a ligand (“bound”) and another one without a

ligand (“free”). However, the EvoClustRNA method was used only to model the “bound”

form. The alignment for this RNA family (RFAM ID: RF01739) was downloaded from the

Rfam database, whence two homologs were selected for modeling with Rosetta. It was

suggested in the literature (Westhof 2010) that the structure included an E-loop motif. This

motif was found in the PDB database and was used as a rigid fragment during the modeling.

Three independent simulations were performed and the final prediction was obtained in a

fully automated manner. The native structure of the riboswitch superimposed on the model

obtained with the EvoClustRNA method is shown in Fig. 4.2.3. The EvoClustRNA

prediction was ranked first in the overall ranking, with an RMSD of 5.56 Å with respect

to the native structure (http://ahsoka.u-strasbg.fr/rnapuzzlesv2/result/Puzzle14Bound/).

Figure 4.2.3: The RNA Puzzle 14 - L-glutamine riboswitch. The RMSD between the native

structure (green) and the EvoClustRNA prediction (blue) is 5.56 Å.

4.2.3 Performance of EvoClustRNA

To rigorously test the EvoClustRNA methodology, a dataset composed of nine RNAs with

known experimentally solved structures was used. This dataset included (1) five RNAs used

to benchmark modeling restraints from direct coupling analysis (DCA) by Weinreb and coworkers

(Weinreb et al. 2016), and (2) four RNA-Puzzles: 6, 13, 14, and 17 (Table 4.2.1, rows 6 to 9).

To compare the results obtained by Weinreb et al. with single-sequence predictions and

EvoClustRNA runs, Table 4.2.1 includes a column “DCA” with RMSDs calculated for

models from Weinreb’s publication. Single-sequence and EvoClustRNA predictions were

performed using both SimRNAweb and Rosetta.

According to our results, EvoClustRNA|SimRNAweb improved the results in 5 out of 9

cases. However, the improvement was relatively small, on average 0.30 Å RMSD. In the case of

EvoClustRNA|Rosetta, the obtained models were on average 0.36 Å RMSD less accurate than Rosetta

models generated for single sequences. Interestingly, SimRNAweb and Rosetta gave similar

results on average regarding RMSDs.
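
The summary statistics quoted above can be recomputed from the RMSD values listed in Table 4.2.1. The short script below is only a worked check of the arithmetic; the improvement is taken as the RMSD of the single-sequence model minus the RMSD of the corresponding EvoClustRNA model, so that a positive value means that EvoClustRNA gave a more accurate model. The variable names are hypothetical.

```python
# RMSD values [Å] from Table 4.2.1:
# (SimRNAweb, EvoClustRNA|SimRNAweb, Rosetta, EvoClustRNA|Rosetta)
table = {
    "Ade": (6.85, 7.52, 9.02, 13.89),
    "TPP": (22.37, 24.08, 20.88, 13.92),
    "tRNA": (14.37, 10.35, 13.11, 14.60),
    "cdiGMP": (12.26, 9.65, 11.41, 14.53),
    "THF": (12.22, 11.35, 4.83, 7.68),
    "COB #6": (31.02, 33.39, 31.44, 33.19),
    "ZMP #13": (6.42, 6.73, 8.32, 6.73),
    "GlnA #14": (4.71, 4.44, 6.54, 4.83),
    "Pistol #17": (12.19, 12.17, 12.72, 12.17),
}

# Positive improvement: the EvoClustRNA model has a lower RMSD.
improvement_simrnaweb = [single - evo for single, evo, _, _ in table.values()]
improvement_rosetta = [single - evo for _, _, single, evo in table.values()]

mean_simrnaweb = sum(improvement_simrnaweb) / len(table)
mean_rosetta = sum(improvement_rosetta) / len(table)
improved_simrnaweb = sum(1 for value in improvement_simrnaweb if value > 0)
improved_rosetta = sum(1 for value in improvement_rosetta if value > 0)

print(f"SimRNAweb: {improved_simrnaweb}/9 improved, mean {mean_simrnaweb:.2f} Å")
print(f"Rosetta:   {improved_rosetta}/9 improved, mean {mean_rosetta:.2f} Å")
```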

All sequences and secondary structures used for modeling are listed as Supplementary

Information S1.

Adenine riboswitch (Ade, PDB ID: 1Y26, RFAM ID: RF00167). The first RNA in Table 4.2.1

is the adenine riboswitch. The sequence used for modeling is 72 nucleotides long. This

riboswitch has a pseudoknot, which was used for modeling. The best RMSD was achieved by

SimRNAweb (6.85 Å), which was even better than modeling with the use of evolutionary

restraints by Weinreb et al. (9.23 Å). EvoClustRNA did not improve the results: the model

of EvoClustRNA|SimRNAweb gave 7.52 Å, while a value of 13.89 Å was obtained by using

EvoClustRNA|Rosetta (Fig. 4.2.4).

1    2           3     4      5          6              7               8        9              10
No   RNA         Len.  DCA    SimRNAweb  EvoClustRNA|   Improvement of  Rosetta  EvoClustRNA|   Improvement of
                                         SimRNAweb      EvoClustRNA|             Rosetta        EvoClustRNA|
                                                        SimRNAweb                               Rosetta
1.   Ade         72    9.23   6.85       7.52           -0.67           9.02     13.89          -4.87
2.   TPP         80    10.35  22.37      24.08          -1.71           20.88    13.92          6.96
3.   tRNA        76    8.58   14.37      10.35          4.02            13.11    14.6           -1.49
4.   cdiGMP      76    11.1   12.26      9.65           2.61            11.41    14.53          -3.12
5.   THF         89    8.84   12.22      11.35          0.87            4.83     7.68           -2.85
6.   COB #6      168   NA     31.02      33.39          -2.37           31.44    33.19          -1.75
7.   ZMP #13     71    NA     6.42       6.73           -0.31           8.32     6.73           1.59
8.   GlnA #14    61    NA     4.71       4.44           0.27            6.54     4.83           1.71
9.   Pistol #17  62    NA     12.19      12.17          0.02            12.72    12.17          0.55
-    Average     -     -      13.60      13.30          0.30            13.14    13.50          -0.36

Table 4.2.1: The performance of EvoClustRNA on the test dataset of nine RNAs.

Column 1, original numeration. Column 2, RNA type (with the RNA-Puzzle number, where

applicable). Column 3, sequence length. Column 4, RMSD [Å] of models obtained by Weinreb et

al., only for RNAs 1-5. Column 5, RMSD [Å] of the first cluster obtained with SimRNAweb.

Column 6, RMSD [Å] of the first cluster obtained with EvoClustRNA|SimRNAweb. Column

7, the improvement, i.e. column 5 minus column 6. Column 8, RMSD [Å] of the first cluster

obtained with Rosetta. Column 9, RMSD [Å] of the first cluster obtained with

EvoClustRNA|Rosetta. Column 10, the improvement, i.e. column 8 minus column 9. Positive

values in columns 7 and 10 are the improvements in RMSD when EvoClustRNA is used (marked in

green in the original table); negative values are the cases where EvoClustRNA worsened the

results (marked in red).

Figure 4.2.4: (A) The native structure (PDB ID: 1Y26). Models generated by (B) Weinreb et al.

(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F) EvoClustRNA|Rosetta.

All models exhibit the native-like fold. However, only models C and D exhibit an

orientation of secondary structure elements similar to that of the native structure.

Thiamine pyrophosphate-sensing riboswitch (TPP, PDB ID: 2GDI, RFAM ID:

RF00059). The model obtained with DCA restraints achieved an RMSD of 10.35 Å. This

riboswitch was predicted poorly by all four approaches, with RMSDs ranging from 13.92 Å

to 24.08 Å. The model obtained with EvoClustRNA|Rosetta was the most accurate with

RMSD of 13.92 Å (Fig. 4.2.5).

Figure 4.2.5: (A) The native structure (PDB ID: 2GDI). Models generated by (B) Weinreb et al.

(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F) EvoClustRNA|Rosetta.

Only model B shares the three-dimensional fold with the native structure, with RMSD of

13.92 Å.

HIV reverse-transcription primer tRNA (PDB ID: 1FIR, RFAM ID: RF00005). The best

model of this tRNA structure was obtained by Weinreb et al. using DCA restraints (RMSD

8.58 Å). The most accurate model from the four other approaches was generated by

EvoClustRNA|SimRNAweb with an RMSD of 10.35 Å (Fig. 4.2.6).

Figure 4.2.6: (A) The native structure (PDB ID: 1FIR). Models generated by (B) Weinreb

et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)

EvoClustRNA|Rosetta. Only model B shares the three-dimensional fold with the native

structure, with an RMSD of 10.35 Å.

c-di-GMP-II riboswitch (cdiGMP, PDB ID: 3Q3Z, RFAM ID: RF01786). Similarly to the

THF riboswitch, this structure is a long helix that folds back on itself and forms a

pseudoknot. The best model, in terms of RMSD, was generated with

EvoClustRNA|SimRNAweb (RMSD 9.65 Å). However, the fold of this RNA is more complicated

than that of the THF riboswitch; as a result, none of the methods generated this fold

correctly (Fig. 4.2.7).

Figure 4.2.7: (A) The native structure (PDB ID: 3Q3Z). Models generated by (B) Weinreb

et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)

EvoClustRNA|Rosetta. The RMSDs range from 9.65 Å to 14.53 Å.

Tetrahydrofolate riboswitch (THF, PDB ID: 4LVV, RFAM ID: RF01831). This structure

is a simple long helix that folds back on itself and forms a pseudoknot. The fold of this RNA

is relatively simple, and the models were predicted comparatively well, with RMSDs ranging from

4.83 Å to 12.22 Å. Interestingly, the DCA modeling was outperformed by Rosetta (RMSD

4.83 Å) and EvoClustRNA|Rosetta (RMSD 7.68 Å) (Figure 4.2.8).

Figure 4.2.8: (A) The native structure (PDB ID: 4LVV). Models generated by (B) Weinreb

et al. (C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)

EvoClustRNA|Rosetta. Model E is the closest to the native structure with an RMSD of 4.83 Å.

Adenosylcobalamin riboswitch - RNA Puzzle 6 (COB, PDB ID: 4GXY, RFAM ID:

RF00174). This RNA is a riboswitch, which was experimentally solved with the ligand.

Since none of the methods explicitly predicts RNA-ligand interactions, all generated models

were far from the native structure with RMSDs ranging from 31.02 Å to 33.39 Å (Fig. 4.2.9).

Figure 4.2.9: (A) The native structure (PDB ID: 4GXY). Models generated by (B)

SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta. Due to

missing RNA-ligand interactions, none of the models is close to the native structure (RMSDs

range from 31.02 Å to 33.39 Å).

ZMP (5-aminoimidazole-4-carboxamide ribonucleotide) riboswitch - RNA Puzzle 13

(PDB ID: 4XW7, RFAM ID: RF01750). The best model of this short (71-nucleotide long)

riboswitch was obtained with SimRNAweb (RMSD 6.42 Å). EvoClustRNA improved predictions

only in the case of EvoClustRNA|Rosetta by 1.59 Å. The P2 helix (Figure 4.2.10, in green) in

the native structure makes interactions with the binding pocket where the ZMP ligand binds.

These interactions are missing in all predictions as the P2 helix is protruding outward from

the binding pocket. Once again, missing RNA-ligand interactions hampered correct

modeling of the RNA.

Figure 4.2.10: (A) The native structure (PDB ID: 4XW7). Models generated by (B)

SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.

L-glutamine riboswitch - RNA Puzzle 14 (GlnA, PDB ID: 5DDO, RFAM ID: RF01739).

The best model of this 61-nucleotide long riboswitch was obtained with

EvoClustRNA|SimRNAweb (RMSD 4.44 Å). In all predictions, a fragment of an E-loop

motif was used. The RMSDs of the models ranged from 4.44 Å to 6.54 Å (Fig. 4.2.11).

EvoClustRNA improved predictions in both modes, using models from SimRNAweb

(improvement of 0.27 Å) and Rosetta (improvement of 1.71 Å). This structure was

experimentally solved with the ligand. However, the ligand was not modeled explicitly, as in

the case of previous RNAs.

Figure 4.2.11: (A) The native structure (PDB ID: 5DDO). Models generated by (B)

SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta. The

most accurate model of this riboswitch was generated with EvoClustRNA|SimRNAweb

(RMSD 4.44 Å).

Pistol ribozyme - RNA Puzzle 17 (PDB ID: 5K7C, RFAM ID: RF02679). The target of

RNA-Puzzle 17 is the Pistol ribozyme, a 62-nucleotide long, self-cleaving ribozyme.

Figure 4.2.12: (A) The native structure (PDB ID: 5K7C). Models generated by (B)

SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.

In all predicted models (Fig. 4.2.12), the substrate (Fig. 4.2.12, chain in red) is located

“behind” (SimRNAweb) or within (Rosetta) the molecule. The predictions reached RMSDs

around 12.5 Å, ranging from 12.17 Å to 12.72 Å.

4.3 rna-pdb-tools

To facilitate the daily work of a researcher working in RNA structural bioinformatics, a

project named rna-pdb-tools was initiated (https://github.com/mmagnus/rna-pdb-tools).

rna-pdb-tools is a Python library and a set of tools dedicated to handling and manipulating

RNA structural files, including (1) rebuilding of missing atoms in RNA structures, (2) structural

clustering, (3) standardization of PDB files to comply with the format required by

RNA-Puzzles, (4) visualization of RNA secondary structures and drawing of RNA arc diagrams of

secondary structure, triggered from Python scripts or Jupyter Notebooks, and much more.

Additionally, rna-pdb-tools should be considered as a library of functions rather than a closed

program with one fixed set of functionalities. rna-pdb-tools is a framework of various

functions, and if needed the user is invited to extend it with his/her own scripts on the top of

the existing package. In this way, it is possible to adapt the framework for every specific

case, for example to have a particular parser or converter that can be used by the user for a

very specific application.

Furthermore, to ensure quality control of the code, the software is tested automatically by

Travis CI every time a change is detected. Travis CI is a hosted, distributed, continuous

integration service used to build and test software projects (https://travis-ci.org/). To verify

the correctness of all operations performed by the software, a set of input files was prepared.

During each test, the input files are processed to get output files, and the output files are

compared with each other.

rna-pdb-tools is a core part of my other projects: NPDock (RNA/DNA-protein docking

method, http://genesilico.pl/NPDock/) (Tuszyńska et al. 2015), SimRNAweb (RNA 3D

structure prediction method) (Magnus et al. 2016), EvoClustRNA, and mqapRNA. The rna-

pdb-tools package has been recognized by the organizers of the RNA-Puzzles, and it has been

suggested at the homepage of the experiment as an approved tool to process structures for the

contest (http://ahsoka.u-strasbg.fr/rnapuzzles/). The step-by-step tutorial that explains how to

prepare files for submission to the RNA-Puzzles can be found at

https://rna-pdb-tools.readthedocs.io/en/latest/rna-puzzles.html.


The central part of the package is the rna_pdb_tools_lib library and the rna_pdb_toolsx.py

script. The script uses functions coded in the main library, and it is an interface to run them

from the command line. The full list of operations of rna_pdb_toolsx.py can be displayed

using “-h” as an argument of the command:

$ rna_pdb_toolsx.py -h
usage: rna_pdb_toolsx.py [-h] [--version] [-r] [-c] [--is_pdb] [--is_nmr]
                         [--un_nmr] [--orgmode] [--get_chain GET_CHAIN]
                         [--fetch] [--fetch_ba] [--get_seq] [--get_ss]
                         [--rosetta2generic] [--get_rnapuzzle_ready] [--rpr]
                         [--no_hr] [--renumber_residues]
                         [--dont_rename_chains] [--dont_fix_missing_atoms]
                         [--dont_report_missing_atoms] [--collapsed_view]
                         [--cv] [-v] [--replace_hetatm] [--inplace]
                         [--edit EDIT] [--delete DELETE]
                         file [file ...]

rna_pdb_tools - a swiss army knife to manipulation of RNA pdb structures

Usage

   $ for i in *pdb; do rna_pdb_toolsx.py --delete A:46-56 $i > ../rpr_rm_loop/$i ; done

   $ rna_pdb_toolsx.py --get_seq *
   # BujnickiLab_RNApuzzle14_n01bound
   > A:1-61
   # BujnickiLab_RNApuzzle14_n02bound
   > A:1-61
   CGUUAGCCCAGGAAACUGGGCGGAAGUAAGGCCCAUUGCACUCCGGGCCUGAAGCAACGCG
   [...]

positional arguments:
  file                  file

optional arguments:
  -h, --help            show this help message and exit
  --version
  -r, --report          get report
  -c, --clean           get clean structure
  --is_pdb              check if a file is in the pdb format
  --is_nmr              check if a file is NMR-style multiple model pdb
  --un_nmr              Split NMR-style multiple model pdb files into
                        individual models [biopython]
  --orgmode             get a structure in org-mode format <sick!>
  --get_chain GET_CHAIN
                        get chain, .e.g A
  --fetch               fetch file from the PDB db
  --fetch_ba            fetch biological assembly from the PDB db
  --get_seq             get seq
  --get_ss              get secondary structure
  --rosetta2generic     convert ROSETTA-like format to a generic pdb
  --get_rnapuzzle_ready
                        get RNApuzzle ready (keep only standard atoms).Be
                        default it does not renumber residues, use
                        --renumber_residues [requires biopython]
  --rpr                 alias to get_rnapuzzle ready)
  --no_hr               do not insert the header into files
  --renumber_residues   by default is false
  --dont_rename_chains  used only with --get_rnapuzzle_ready. By default
                        --get_rnapuzzle_ready rename chains from ABC.. to stop
                        behavior switch on this option
  --dont_fix_missing_atoms
                        used only with --get_rnapuzzle_ready
  --dont_report_missing_atoms
                        used only with --get_rnapuzzle_ready
  --collapsed_view
  --cv                  alias to collapsed_view
  -v, --verbose         tell me more what you're doing, please!
  --replace_hetatm      replace 'HETATM' with 'ATOM' [tested only with
                        --get_rnapuzzle_ready]
  --inplace             in place edit the file! [experimental, only for
                        get_rnapuzzle_ready, delete, get_ss, get_seq]
  --edit EDIT           edit 'A:6>B:200', 'A:2-7>B:2-7'
  --delete DELETE       delete the selected fragment, e.g. A:10-16

The functions of the package can also be imported into one’s own projects. For example, “--is_pdb”

can be accessed from the shell utility:

$ rna_pdb_toolsx.py --is_pdb input/1I9V_A.pdb

True

$ rna_pdb_toolsx.py --is_pdb input/image.png

False

but also from a Python script:

>>> from rna_pdb_tools_lib import *

>>> s = RNAStructure('input/1I9V_A.pdb')

>>> s.is_pdb()

True
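
The selection syntax accepted by the “--delete” option shown in the help message (e.g. A:46-56) can be illustrated with a standalone sketch. It is not the code of rna-pdb-tools: the functions `parse_selection` and `delete_fragment` are hypothetical, negative residue numbers are not handled, and the only thing relied upon is the fixed-column layout of the PDB format (chain identifier in column 22, residue number in columns 23-26).

```python
def parse_selection(selection):
    """Parse 'A:46-56' or 'A:46' into (chain, first_residue, last_residue)."""
    chain, _, residues = selection.partition(":")
    first, _, last = residues.partition("-")
    return chain, int(first), int(last or first)


def delete_fragment(pdb_lines, selection):
    """Return PDB lines without the atoms of the selected fragment."""
    chain, first, last = parse_selection(selection)
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            in_chain = line[21] == chain           # column 22: chain identifier
            residue_number = int(line[22:26])      # columns 23-26: residue number
            if in_chain and first <= residue_number <= last:
                continue
        kept.append(line)
    return kept


# Toy input: one phosphorus atom per residue, built with the PDB column layout.
template = "ATOM  {:>5d}  P   {:>3s} {}{:>4d}    {:8.3f}{:8.3f}{:8.3f}"
example_lines = [
    template.format(1, "G", "A", 44, 0.0, 0.0, 0.0),
    template.format(2, "C", "A", 46, 1.0, 0.0, 0.0),
    template.format(3, "U", "A", 56, 2.0, 0.0, 0.0),
    template.format(4, "A", "A", 57, 3.0, 0.0, 0.0),
    template.format(5, "G", "B", 50, 4.0, 0.0, 0.0),
]
without_loop = delete_fragment(example_lines, "A:46-56")
```

Residue 50 of chain B is kept in this example, because the selection refers only to chain A.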

rna-pdb-tools can also be used from the Emacs text editor (Fig. 4.3.1), and some of its

functions can be executed via plugins in my note-taking system Geekbook

(https://github.com/mmagnus/geekbook).

Figure 4.3.1: rna-pdb-tools can be run also from Emacs. A researcher can edit a PDB file

using the text-oriented functionality of this editor and then without leaving the editor can

apply the RNApuzzle function to standardize the file.

A list of example command-line utils included in the package:

The user: “I want to” | Command-line utils to use

get a sequence based on a PDB file | rna_pdb_toolsx.py --get_seq *.pdb

download a PDB file | rna_pdb_toolsx.py --fetch <PDB id>

compare text-content of PDB files | diffpdb.py <fn1.pdb> <fn2.pdb> (Fig. 4.3.4)

annotate secondary structure of my PDB files | clarna_app.py (a wrapper to ClaRNA) or rna_x3dna.py (a wrapper to 3dna)

calculate RMSDs between the target file and a set of other files | rmsd_calc_to_target.py -t <target.pdb> *.pdb

compare interaction networks (base pairs) between two 3D structures | rna_calc_inf.py -t <target.pdb> *.pdb (a wrapper to ClaRNA)

filter a set of PDB files to select ones that fulfil required distance restraints | rna_filter.py -s <restraints.txt> -s *.pdb (calculates distances based on given restraints on PDB files or SimRNA trajectories)

refine my models | rna_refinement.py -n <steps> *.pdb (a wrapper around QRNAS)

merge single files into an NMR-style multiple-model PDB file | rna_pdb_merge_into_one.py *.pdb > out.pdb

model an RNA sequence with Rosetta and process output files | a set of tools to work with Rosetta: rna_rosetta_check_progress.py, rna_rosetta_cluster.py, rna_rosetta_min.py, rna_rosetta_run.py

process output files of SimRNA/SimRNAweb | a set of tools to work with SimRNA: rna_simrna_cluster.py, rna_simrna_extract.py, rna_simrna_lowest.py

download output files from the SimRNAweb server for a given job id | rna_simrnaweb_download_job.py <job id>

edit occupancy or B-factor only in a part of a PDB file | e.g. rna_pdb_edit_occupancy_bfactor.py --occupancy --select A:1-40,B:1-22 --set-to 0 <pdb.pdb>

edit a part of a chain (change fragment A:1-75 to A:7-81) | e.g. rna_pdb_toolsx.py --edit 'A:1-75>A:7-81' 3q3z_rpr.pdb > 3q3z_rpr_A7-81.pdb

add missing bases | rna_pdb_toolsx.py --get_rnapuzzle_ready <fn.pdb> (Fig 4.3.2)
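The --edit syntax listed above ('A:1-75>A:7-81') maps a residue range onto a new range of the same length. The following is a minimal sketch of how such a string could be interpreted. The function names parse_selection and build_renumbering are hypothetical and are not part of rna-pdb-tools, and the sketch assumes non-negative residue numbers.

```python
def parse_selection(selection):
    """Parse 'A:1-75' or 'A:6' into (chain, first, last)."""
    chain, _, residues = selection.partition(":")
    first, _, last = residues.partition("-")
    first = int(first)
    last = int(last) if last else first
    return chain, first, last


def build_renumbering(edit):
    """Turn 'A:1-75>A:7-81' into {(chain, resi): (new_chain, new_resi)}."""
    source, target = edit.split(">")
    src_chain, src_first, src_last = parse_selection(source)
    dst_chain, dst_first, dst_last = parse_selection(target)
    if src_last - src_first != dst_last - dst_first:
        raise ValueError("source and target ranges differ in length")
    offset = dst_first - src_first
    return {
        (src_chain, resi): (dst_chain, resi + offset)
        for resi in range(src_first, src_last + 1)
    }


# Hypothetical usage: shift residues 1-75 of chain A to 7-81.
mapping = build_renumbering("A:1-75>A:7-81")
print(mapping[("A", 1)], mapping[("A", 75)])
```

A mapping of this kind could then be applied to the chain and residue-number columns of each ATOM record, leaving coordinates untouched.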


Figure 4.3.2: rna_pdb_toolsx.py is able to rebuild a missing base (drawn with thin lines) to complete a structure.

Figure 4.3.3: rna-pdb-tools comes with detailed documentation that can be viewed online or as a PDF file.

The rna-pdb-tools package is published as an open-source project (GPL-3.0 license), so it can be widely used, deployed and modified by the scientific community. rna-pdb-tools is well documented: online documentation and tutorials walk the user through various use cases (http://rna-pdb-tools.readthedocs.io/en/latest/), and a PDF manual is also available (over 130 pages as of September 2017) (Fig. 4.3.3).

Figure 4.3.4: diffpdb.py is a tool to detect differences in formatting between two PDB files.

First, the tool removes columns of coordinates, and next compares only columns with

annotation (atom naming, numbering).
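The idea described in the caption above (drop the coordinate columns, then compare only the annotation columns) can be sketched with the Python standard library. This is an illustration under stated assumptions and not the code of diffpdb.py: the helper names atom_line, annotation_only and diff_annotations are hypothetical, and the annotation is taken to be columns 1-30 of ATOM/HETATM records, which precede the x, y, z fields in the PDB format.

```python
import difflib


def atom_line(serial, name, resname, chain, resi, x, y, z):
    """Format a simplified ATOM record (columns 1-54 of the PDB format)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resi:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def annotation_only(pdb_text):
    """Keep columns 1-30 of ATOM/HETATM records, i.e. drop x, y, z and beyond."""
    return [line[:30].rstrip()
            for line in pdb_text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


def diff_annotations(text_a, text_b):
    """Return unified-diff lines for the annotation columns of two PDB texts."""
    return list(difflib.unified_diff(annotation_only(text_a),
                                     annotation_only(text_b),
                                     "a.pdb", "b.pdb", lineterm=""))


# Two toy "files": same atoms, different coordinates, one renamed atom.
pdb_a = "\n".join([atom_line(1, "P", "G", "A", 1, 1.0, 2.0, 3.0),
                   atom_line(2, "OP1", "G", "A", 1, 4.0, 5.0, 6.0)])
pdb_b = "\n".join([atom_line(1, "P", "G", "A", 1, 9.0, 9.0, 9.0),
                   atom_line(2, "O1P", "G", "A", 1, 8.0, 8.0, 8.0)])

for line in diff_annotations(pdb_a, pdb_b):
    print(line)
```

Because the coordinates are removed before the comparison, only the renamed atom (OP1 versus O1P) is reported, while the coordinate differences are ignored.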

The rna-pdb-tools package was extended by a set of functions for analyzing/editing/formatting RNA alignments. A set of operations that can be done with rna-pdb-tools is shown in Fig. 4.3.5 and can be explored in a Jupyter notebook available at https://github.com/mmagnus/rna-pdb-tools/blob/master/rna_pdb_tools/utils/rna_alignment/rna_alignment.ipynb. A user can easily load a new alignment, subset columns or sequences (rows), save a subset to a new file, plot an RChie plot, get a secondary structure and a sequence for each of the sequences in an alignment (Fig. 4.3.5), and more. The functions can be imported to a user’s own Python script but also to a Jupyter notebook. The scripts were used to process the data (RNA alignments, secondary structures, and tertiary structures) for RNArchitecture, a database and classification system of RNA families (http://genesilico.pl/RNArchitecture) (Boccaletto et al. 2017).
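One of the operations mentioned above, getting the sequence and the secondary structure of a single row of an alignment with gaps removed, can be sketched as follows. This is a minimal illustration and not the implementation used in rna-pdb-tools: the function name ungap_row is hypothetical, only round brackets are treated as base pairs (pseudoknot brackets are copied unchanged), and the consensus structure is assumed to be balanced.

```python
def ungap_row(aligned_seq, aligned_ss, gap_chars="-."):
    """Remove gap columns from one alignment row and its consensus structure.

    A bracket whose pairing partner falls in a gap column is replaced
    with '.', so the returned structure stays balanced.
    """
    if len(aligned_seq) != len(aligned_ss):
        raise ValueError("sequence and structure must have the same length")

    # Find the pairing partner of every '(' and ')' column.
    stack, partner = [], {}
    for i, symbol in enumerate(aligned_ss):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i

    seq, ss = [], []
    for i, (nt, symbol) in enumerate(zip(aligned_seq, aligned_ss)):
        if nt in gap_chars:
            continue  # drop the gap column entirely
        j = partner.get(i)
        if j is not None and aligned_seq[j] in gap_chars:
            symbol = "."  # the partner was a gap, so this base is unpaired
        seq.append(nt)
        ss.append(symbol)
    return "".join(seq), "".join(ss)


# Toy example: two of the three consensus pairs lose a partner to a gap.
print(ungap_row("GG-AAAC-C", "(((...)))"))
```

In the toy example only the outermost pair survives, because each of the two inner pairs has one partner in a gap column.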


Figure 4.3.5: A fragment of the demo of the RNA alignment functionality implemented in rna-pdb-tools. Top: a user can load a new alignment and plot an RChie plot. Bottom: a user can also get a secondary structure and a sequence for a row taken from an alignment (gaps are removed) in text format or get a visualization using VARNA. The functions can be imported to a user’s own Python scripts but also to a Jupyter notebook (as shown in the figure).


5 Discussion

5.1 mqapRNA

The aim of the first project described in this work was to facilitate the selection of the most accurate RNA 3D models from a pool of models obtained with various RNA 3D structure prediction methods. The new method, mqapRNA, is a meta-predictor that combines existing methods and uses a deep learning model to take advantage of their combined strengths and to compensate for their individual weaknesses. In the benchmark presented in this study, mqapRNA (on average) outperformed the other existing methods. At this stage, the method is a good starting point for further optimization of the statistical model and for improved training on larger datasets of more diverse structures. In addition, mqapRNA allows for interactive refinement of the predictions by applying distance restraints obtained from experimental methods or evolutionary analysis, and by using secondary structure information. The method is available as an easy-to-use web server.

However, it is important to realize how much theoretical datasets, generated for method development, can differ from cases observed in real life. The benchmark showed that, although all the methods perform very well on theoretical decoys, they perform poorly when scoring models created by researchers in a real-life scenario, e.g., for the RNA-Puzzles targets. One reason could be that the models submitted by the groups have patterns of distortions that the datasets do not account for. A second reason could be that, when a 3D RNA structure prediction starts from the sequence only, it is very hard to reach models accurate enough to be scored efficiently by the existing methods.

The benchmark devised for this study highlights the importance of a correct secondary structure. A correct secondary structure can be used as a reasonable evaluation criterion and can help to identify models of poor quality. Secondary structure prediction is a complex problem in its own right, and one should understand how to apply experimental data to obtain an accurate prediction. The RNA-Puzzles publications describe cases where a wrong secondary structure led to wrong three-dimensional models.


mqapRNA can only be useful if it is applied to score accurate RNA 3D structural models. However, the results of the RNA-Puzzles show that accurate modeling of RNA is still very challenging and that the obtained models are still far from near-native accuracy.

5.1.1 Similar tools or approaches

The program uses a deep learning statistical model to interpret the outputs of primary methods and provides quality predictions. According to the benchmark, the method outperforms other existing methods. However, there is still ample room for improvement, in particular in the case of models with non-trivial distortions, such as those in the RNA-Puzzles. One way would be to use more diverse decoys, which could improve the quality and consistency of predictions and clarify what accounts for a good or bad model of RNA. A robust feature selection analysis of the statistical model could help to identify the factors that are critical for accurate assessment. The accuracy of the method strongly depends on the training set. mqapRNA was not trained on decoys generated by SimRNA or Rosetta, or on experimentally observed intermediates, which could probably improve the statistical model. Another direction of development would be to add new primary methods that can score RNA structural models. This is a very active field, and one should expect more methods to appear within the next years.

mqapRNA can be run as an easy-to-use web server, similarly to the web server for RASP, called WebRASP (Norambuena et al. 2013) (http://melolab.org/webrasp/). RASP is very fast, easy to install and can be used for quality prediction. RNA KB is a force field in the GROMACS (Van Der Spoel et al. 2005) package for molecular dynamics simulations, which makes it hard to use for researchers without prior experience in running such simulations. The εSCORE and 3dRNAscore methods are easy to install and run; however, according to the benchmark, they are not as good as the best method, mqapRNA.

mqapRNA is able to predict the “global quality” of models and provides (just) one score per structure. However, one can think of a tool that assesses the quality at the level of individual residues. This approach is named “local quality assessment” and could be developed and tested in further implementations of mqapRNA. Such functionality is implemented in Meta-MQAP (Pawlowski et al. 2008), which is an analogous tool for protein quality assessment. With this kind of local quality assessment, it would be possible to (1) detect misfolded parts of RNA and apply further optimization, (2) replace a given molecular fragment with a new one from a database of fragments, or (3) refold the RNA entirely, using SimRNA. QA-RecombineIt (Pawlowski et al. 2013) is a method, developed in the Bujnicki laboratory, that assesses the quality of protein 3D structure models and improves the accuracy of these models by merging fragments of multiple input models.


5.2 EvoClustRNA

An efficient scoring method can be applied successfully only if the pool of input models contains near-native structures. The analysis of the decoys from the RNA-Puzzles experiment suggests that we need more accurate methods for RNA 3D structure prediction to begin with. To facilitate RNA structure prediction, EvoClustRNA, a new evolutionary approach for RNA 3D structure prediction, was implemented and benchmarked.

EvoClustRNA could be tested with models produced by other modeling methods, e.g., RNAComposer (Popenda et al. 2012), MC-Sym|MC-Fold (Parisien and Major 2008), iFoldRNA (Ding et al. 2008), etc.

EvoClustRNA could be potentially improved by a different set of parameters for clustering.

The procedure of selecting homologs could also be investigated and its variants tested.

Clustering visualized with Clans could improve the final selection of models.

However, combining EvoClustRNA with a DCA analysis would be the most beneficial. This is also the direction in which the protein version of the methodology has gone (Richard Bonneau, private communication).

One of the drawbacks of EvoClustRNA is the alignment preparation. At the current stage, alignments were prepared manually with some help of scripts from the rna-pdb-tools package. This could be further simplified with a new script developed as part of the package.

EvoClustRNA improved the results in some cases. However, EvoClustRNA depends heavily on the initial models, which makes it as limited as the original predictive methods. Thus, the major current challenges in RNA structure prediction lie in improving the algorithms for predicting (1) RNA-ligand interactions, (2) non-canonical interactions, and (3) loops.

5.2.1 Similar tools or approaches

EvoClustRNA was inspired by a similar approach that was used in protein structure prediction (Bonneau et al. 2001). The approach is still used for protein structure prediction (Richard Bonneau, private communication) and it was applied, for example, for modeling structures for major protein families (Bonneau et al. 2002). To the best of my knowledge, EvoClustRNA is the first application of this methodology to RNA.

However, there are other ways in which RNA sequence alignments could be used to improve tertiary structure prediction.

The first one is to use evolutionary restraints as described by Weinreb and coworkers (Weinreb et al. 2016) and De Leonardis and coworkers (De Leonardis et al. 2015). These methods require alignments with over 1000 sequences (De Leonardis et al. 2015) to provide sufficient statistics for detecting nucleotide coevolution, which is not always possible. In addition, these methods are very sensitive to false positives that can result in wrong models. In contrast, EvoClustRNA can be used even when only a few (3-5) homologs are available.

The second way is to apply methods such as RMdetect (Cruz and Westhof 2011), to detect

RNA motifs from an RNA alignment. However, this approach gives only information about

some part of an RNA molecule, and a tertiary structure prediction method must be used to

obtain a full-length model.


5.3 rna-pdb-tools

Structural bioinformatics of RNA is a relatively young area of science that is struggling with the lack of bioinformatics tools to facilitate the daily work of a researcher. The main problem of the existing tools is that there is no universal parser that solves all the problems one might have when working with PDB files and that suits the needs of various users. There are already many libraries developed for researchers to work with PDB structures, in languages such as R (Bio3d (Grant et al. 2006)), Haskell (hPDB (Gajda 2013)), and Python (BioPython (Cock et al. 2009), PyCogent (Knight et al. 2007)). The problem with these packages is that they are primarily designed to work on protein structural files. In principle, protein structure files are not different from RNA ones. However, for everyday work, researchers working on RNA structures need a set of RNA-related functions, such as preparation of the structure for the RNA-Puzzles competition, preparation for the SimRNA simulation, getting the secondary structure, etc. Several parsers of RNA structural files are available for the scientific community, for example a set of tools that comes with Rosetta, by professor Rhiju Das and coworkers (https://www.rosettacommons.org/docs/latest/application_documentation/rna/RNA-tools), and Forgi, by Peter Kerpedjiev and coworkers (Kerpedjiev et al. 2015) (https://github.com/ViennaRNA/forgi); both are written in Python. However, RNA-tools is intended to work on input and output files for Rosetta and it is not designed as a complete package. Forgi is a Python library for manipulating RNA secondary structure and can solve only a limited set of problems.

rna-pdb-tools provides an easy-to-adapt framework for a user’s own tools. Just by copying and pasting, and then modifying existing code, a user can build a new application very quickly. rna-pdb-tools also shows how some third-party tools can be efficiently wrapped into command-line utils, e.g., ClaRNA (Waleń et al. 2014). ClaRNA is a classifier of contacts in RNA 3D structures. The program is written in Python and, due to its single-thread architecture, it is relatively slow. rna-pdb-tools includes a wrapper around ClaRNA that can run multiple instances of ClaRNA on all available processors and make the whole procedure much faster. Moreover, rna_calc_inf.py provides the same interface for inputs as a script for calculating RMSDs, rna_calc_rmsd.py. For that reason, both utils can be run in a similar way: “rna_calc_inf.py -t <native.pdb> *.pdb”, “rna_calc_rmsd.py -t <native.pdb> *.pdb”, which simplifies composing complex workflows.
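The wrapping strategy described above, running many instances of a single-threaded program at once, can be sketched with the Python standard library. This is only an illustration of the general idea and not the code of the ClaRNA wrapper: the function names run_command and run_in_parallel are hypothetical, and the command is a placeholder that stands in for a call to an external analysis program on one PDB file.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one external command and return its standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def run_in_parallel(commands, max_workers=None):
    """Run independent commands concurrently; results keep the input order.

    Threads are sufficient here because the actual work happens in the
    child processes, one per command.
    """
    workers = max_workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_command, commands))


# Placeholder command: prints the name of the file it was given.
pdb_files = ["model_1.pdb", "model_2.pdb", "model_3.pdb"]
commands = [
    [sys.executable, "-c", "import sys; print('processed', sys.argv[1])", fn]
    for fn in pdb_files
]
outputs = run_in_parallel(commands)
print(outputs)
```

With one worker per processor, the total run time approaches that of the slowest single job multiplied by the number of batches, rather than the sum of all jobs.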

rna-pdb-tools can be used both as command-line tools and in a Jupyter Notebook (https://jupyter.org/) (formerly IPython (Pérez and Granger 2007)). The Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code (in languages such as Python, R, and Scala), equations, visualizations and explanatory text. The functions implemented in rna-pdb-tools can be imported to such notebooks to create reproducible analyses that can be uploaded online and shared with the RNA structural bioinformatics community. One such notebook is uploaded together with the rna-pdb-tools package and illustrates the steps performed for the Bujnicki group to collect information about the RNA-Puzzle 18 problem (https://github.com/mmagnus/rna-pdb-tools/blob/master/rp18.ipynb) (Fig. 5.3.1). The notebook reports the results of various secondary structure prediction methods, and a successful hit for the target sequence in the PDB database. The structure in the PDB database, an Xrn1-resistant RNA from the 3' untranslated region of a flavivirus (PDB ID: 4PQV) (Chapman et al. 2014), turned out to be a homolog of the RNA-Puzzle 18 target and was used for comparative modeling. Given the problem of reproducibility in bioinformatics (Sandve et al. 2013), rna-pdb-tools combined with Jupyter notebooks seems to be a valuable way to help scientists share their analyses, e.g., protocols used for modeling in the RNA-Puzzles challenge, so that they can be later reproduced by others.


Figure 5.3.1: The Jupyter notebook (a part of the whole notebook) for the RNA-Puzzle 18 problem. The notebook reports the steps of a bioinformatics analysis to collect information about the target sequence, such as secondary structure predictions using three different methods and a BLAST search on the PDB database that led to the detection of a homolog used later for comparative modeling.

5.3.1 Future directions

In the future, the rna-pdb-tools package could be merged with BioPython. rna-pdb-tools already uses the Bio.AlignIO module of the BioPython package internally. The next step in the development of rna-pdb-tools would be to bring functions implemented in my package to BioPython to provide a unified package for structural bioinformatics.

Another direction of development would be to modify some functions to also work on mmCIF (macromolecular Crystallographic Information File) files.


rna-pdb-tools also needs even better documentation and better tests. Hopefully, the still small but growing community of rna-pdb-tools users will contribute to improving the documentation and adding new tests.

There is a need for a new comprehensive workflow for RNA structure prediction. Building such workflows is a huge problem in the field because of incompatibilities in input and output data formats. This fragmentation of tools in bioinformatics leads to difficulties in combining them efficiently into a full setup for a complete analysis. Even the PDB formats used by the methods to define models may be very different. An enormous amount of time in the development of mqapRNA was spent on preparing tools that process input files and convert them into formats accepted by the existing scoring methods. The realization of this task led to the development of rna-pdb-tools as a practical converter of one set of formats into another. To build complex workflows, we need well-written wrappers around tools that expose unified interfaces and allow for building complete pipelines. The documentation of rna-pdb-tools could be a place to describe the tools and fill gaps in their original documentation. Such workflows could be implemented as a set of command-line tools or as an IPython Notebook where a Python script controls the flow of programs and data.

“... scientific programming does not compute” (Merali 2010). In my opinion, this is very true. Merali’s paper described cases where wrong implementations caused retractions of publications. What can be done so that scientific programming will compute? Write clean code, document it, and test it. I hope that rna-pdb-tools will serve as an example of scientific code that computes. More about rna-pdb-tools can be found at https://media.readthedocs.org/pdf/rna-pdb-tools/latest/rna-pdb-tools.pdf.


5.4 Potential limitations of the RNA 3D structure prediction methods

Based on the results of RNA 3D structure prediction runs, the potential limitations of the

predictive methods will be discussed in this section.

5.4.1 RNA-ligand interactions

The adenosylcobalamin riboswitch (RNA Puzzle 6) is a riboswitch whose structure was experimentally solved with the ligand bound. Rosetta and SimRNA do not explicitly predict RNA-ligand interactions; therefore, all the predicted models were in an “unfolded” conformation and the RMSDs were high.

Figure 5.4.1. The native structure (PDB ID: 4GXY) solved with the ligand (indicated by the

arrow).

To test whether any interactions that could improve this modeling can be detected, a DCA analysis was conducted (as described in (Weinreb et al. 2016)). A set of interactions was detected (Fig. 5.4.2); however, none of them occurred between the ligand-bound structured core of the RNA and a bent peripheral domain (Fig. 5.4.2, in yellow). This might mean that DCA restraints would not improve a prediction by bringing these two parts closer in space.

Figure 5.4.2: The results of a DCA analysis performed for the adenosylcobalamin riboswitch. The bars represent interactions detected by the DCA analysis (the structure was made transparent to highlight the bars). The red box indicates the interface between the core and the peripheral domain, where no interactions were predicted.

5.4.2 Non-canonical interactions

tRNAs are difficult to model in silico because they form many non-canonical interactions (Fig. 5.4.3), which SimRNA and Rosetta are not able to predict correctly. Moreover, most tRNAs contain modified nucleotides. Since SimRNA and Rosetta do not model modified nucleotides, an “unmodified” sequence of A, G, C, and U residues only was used for the modeling. For these two reasons, the DCA-based modeling was expected to outperform other predictive approaches.


Figure 5.4.3: A network of canonical and non-canonical interactions depicted using the

Leontis/Westhof classification obtained with RNAView (Yang et al. 2003) for the structure

of tRNA (PDB id: 1FIR).

The thiamine pyrophosphate-sensing riboswitch binds directly to thiamine pyrophosphate

(TPP) to regulate gene expression through a variety of mechanisms in archaea, bacteria and

eukaryotes. The high RMSDs of the predicted models can be explained by the lack of key

interactions in generated models.

Figure 5.4.4: Secondary/tertiary structure presentation in the Leontis–Westhof nomenclature.

Two non-canonical interactions A69-C38 and A69-C22 (highlighted in red) were not

predicted by SimRNA or Rosetta (Lang et al. 2007).

The bound ligand keeps two helices (P3 and P5) together, and the binding is stabilized by two non-canonical interactions, A69-C38 and A69-C22 (Fig. 5.4.4) (Lang et al. 2007). At the current stage of the development of SimRNA and Rosetta, this type of interaction is impossible to predict. Hence, the DCA-based modeling in this case outperformed other approaches, as was expected.

5.4.3 Loop modeling

The Pistol ribozyme (RNA Puzzle 17) includes a conserved region with an A-minor motif.

The AAA trinucleotide (Fig. 5.4.5A, red) is interacting with the minor groove of the P1 stem

(Fig. 5.4.5A, green) in the native structure. However, the motif was not formed in any of the

predictions (Fig. 5.4.5B-E, red). The ribozyme cleaves the substrate and there is a very sharp

bend in the backbone involving the G53-U54 cleavage site (Fig. 5.4.5A, yellow). This bend

was not predicted by any of the used methods. The structure includes a six-base-pair

pseudoknot involving complementary loop segments between the hairpin and the internal

loops with the pseudoknot duplex positioned between stems P1 and P3. This pseudoknot was

used as an input for modeling and it was accurately modeled in all the predictions (Fig.

5.4.5A-E, violet).

Figure 5.4.5: Color-coded: G53-U54 cleavage site (yellow), P1 (green), pseudoknot (violet),

P2 (blue), loops (dark blue) (A) the native structure (PDB ID: 5K7C), and models generated

by (B) SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E) EvoClustRNA|Rosetta.


Figure 5.4.6: Superposition of all predicted (A) P1 stems and pseudoknots, (B) P2 stems, (C) P3 stems. All the fragments are of good accuracy (RMSDs up to 3.5 Å).

Figure 5.4.7: Fragments of stems P1 with pseudoknots and single-stranded regions extracted

from all the predictions. A conserved region with the AAA trinucleotide (red) is interacting

with the minor groove of the P1 stem (green) in the native structure. However, the motif was

not formed in any of the predictions.

Interestingly, although the models looked very different from each other, the extracted fragments (pseudoknots and P1 stems, P2 stems, and P3 stems) were very similar when superimposed, with RMSDs up to 3.5 Å (Fig. 5.4.6). Even the P1 stem with the pseudoknot, 22 nucleotides in total, was predicted accurately in all models, with RMSDs between 2.97 Å and 3.41 Å.
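For reference, the RMSD values quoted for the fragments can be understood through the following minimal sketch. It assumes that the two fragments are already superimposed and that their atoms are listed in the same order. The function name rmsd and the toy coordinates are illustrative only, and this is not the code of the RMSD scripts in rna-pdb-tools.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the two coordinate sets are already superimposed and that the
    atoms are in the same order in both lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates (in Å): the atoms are displaced by 3 Å and 4 Å.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(round(rmsd(native, model), 3))
```

In practice the optimal superposition is computed first, and the RMSD is then evaluated on the superimposed atoms.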

The largest deviations were observed in the loop, with RMSDs ranging from 9.97 Å to 11.59 Å (Fig. 5.4.7). Loops are difficult to model because they are usually formed owing to non-canonical interactions and/or RNA-ligand interactions. Moreover, there are fewer loops than helical regions in the PDB database; therefore, statistical potentials might have more problems modeling them correctly. Loops are peculiarities, and with the accumulation of new experimentally solved RNA structures, predictive methods are expected to generate better predictions.

5.4.4 Sampling of conformational space

To test whether the pool of 500 structures of homologs contained structures that shared the same topology as the native structure, the results of clustering were visualized with Clans (Frickey and Lupas 2004) (Fig. 5.4.8). Clans uses a version of the Fruchterman–Reingold graph layout algorithm to visualize pairwise sequence similarities in either two-dimensional or three-dimensional space. The program was designed to calculate pairwise attraction values to compare protein sequences; however, it is possible to load a matrix of precomputed attraction values and thereby display any kind of data based on pairwise interactions. Therefore, the Clanstix program from the rna-pdb-tools package was used to convert the all-vs-all distance (RMSD) matrix, calculated between the fragments selected for clustering from the EvoClustRNA|SimRNAweb run, into an input file for Clans. The results are shown in Fig. 5.4.8. In this clustering visualization, 100 models for each of five homologs are shown (each homolog uniquely colored, models of the target sequence are colored in lime). Models with a pairwise RMSD lower than 6 Å are connected. The native structure was added to this clustering (Fig. 5.4.8A, big dot) to see where it would be mapped. Interestingly, the native structure was mapped to a small cluster. In this cluster, there are three models for the target sequence. The model closest to the center of this cluster (Fig. 5.4.8B) achieved an RMSD of 6.98 Å to the native structure. This clustering visualization showed that models with the correct fold were generated, but none of them was selected as the final prediction. The final prediction was the center of the biggest cluster (Fig. 5.4.8C).
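The way an RMSD cutoff turns an all-vs-all distance matrix into connected groups of models can be sketched as follows. This is an illustration only: Clans uses a force-directed layout rather than connected components, and this is not the clustering procedure of EvoClustRNA. The function names connected_groups and group_center, as well as the toy matrix, are assumptions made for this example.

```python
from collections import deque


def connected_groups(rmsd, cutoff):
    """Group models so that any two models closer than `cutoff` are linked."""
    n = len(rmsd)
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:  # breadth-first search over the "connected" relation
            i = queue.popleft()
            group.append(i)
            for j in range(n):
                if j not in seen and rmsd[i][j] < cutoff:
                    seen.add(j)
                    queue.append(j)
        groups.append(sorted(group))
    return groups


def group_center(rmsd, group):
    """Return the member with the lowest summed RMSD to the other members."""
    return min(group, key=lambda i: sum(rmsd[i][j] for j in group))


# Toy symmetric all-vs-all RMSD matrix (in Å) for five models.
rmsd_matrix = [
    [0.0, 3.0, 4.0, 15.0, 14.0],
    [3.0, 0.0, 5.0, 16.0, 13.0],
    [4.0, 5.0, 0.0, 12.0, 11.0],
    [15.0, 16.0, 12.0, 0.0, 5.5],
    [14.0, 13.0, 11.0, 5.5, 0.0],
]
groups = connected_groups(rmsd_matrix, cutoff=6.0)
biggest = max(groups, key=len)
print(groups, biggest, group_center(rmsd_matrix, biggest))
```

With a 6 Å cutoff the toy matrix splits into a group of three models and a group of two. Returning the center of the biggest group mirrors the selection rule described above, which can miss a smaller group that lies closer to the native structure.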


Figure 5.4.8: Clustering visualized with Clans (A) the native structure, (B) the model with

the close fold to the native, detected in a small cluster, (C) the biggest cluster with the model

that was returned as the final prediction.

An analogous analysis was performed on the results of clustering of the EvoClustRNA|SimRNAweb run for the TPP riboswitch. Models with a pairwise RMSD lower than 9 Å are connected. Interestingly, the native structure (Fig. 5.4.9A, big dot) was mapped to a cluster of models of one of the homologs (Fig. 5.4.9, blue). The center of this cluster (Fig. 5.4.9B) achieved an RMSD (of helical, shared fragments) of 9 Å to the native structure. In this cluster, there were no models for the target sequence. Since SimRNAweb was not able to detect non-canonical interactions, most of the structures were in an “open” conformation and clustered far from the native structure. The final prediction (Fig. 5.4.9C) achieved an RMSD of 24.08 Å with respect to the native structure.



Figure 5.4.9: Clustering visualized with Clans (A) the native structure, (B) the model with the close fold to the native, (C) the biggest cluster with the model that was returned as the final prediction.

These two analyses showed that SimRNAweb was able to sample the conformational space efficiently and that near-native structures were generated during the simulations. Incorrect predictions were made because the energy function failed to score the models properly.


6 Conclusions

RNAs are one of the key molecules of life and are involved in a number of highly important biological processes, from information storage through signaling to enzymatic activity, and many others. RNAs can also serve as excellent tools and targets in medicine (e.g., miRNA therapies, rRNAs as targets for antibiotics) and biotechnology (e.g., gene editing with CRISPR-Cas9).

To perform their function, complex RNA molecules must fold into a specific structure. Since high-resolution experimental techniques are not always applicable, in this study two new methods for computational modeling were developed and their results were investigated.

A new scoring method, mqapRNA, was developed and shown to be relatively efficient in scoring 3D structural models. To provide accurate models for scoring, EvoClustRNA, a new evolutionary approach for RNA 3D structure prediction, was implemented and benchmarked. The realization of these two projects would not have been possible without rna-pdb-tools, a toolbox of various scripts that allows for fast building of new applications and efficient data management.

mqapRNA and EvoClustRNA depend heavily on the initial models and suffer from the limitations of the predictive methods identified in this study: (1) lack of prediction of RNA-ligand interactions (Fig. 5.1A) and (2) non-canonical interactions (Fig. 5.1B), and (3) difficulties in modeling loops (Fig. 5.2C).

The results described in this thesis suggest that there is a need for a more holistic and thoughtful pipeline for RNA structure prediction (Fig. 5.1D), which must include:

methods for homology search and sequence alignment preparation,

methods for secondary structure predictions based on sequence alignments and methods for local motif detection (to use them as fragments in prediction),

methods for RNA 3D structure prediction with the capability to predict RNA-ligand and non-canonical interactions with the aid of experimental or evolutionary restraints, which should be run to predict structures of a few homologs,

methods for scoring the obtained models to generate the final prediction.

In this work, three tools that could be included in the ultimate pipeline for RNA 3D structure prediction were described.

Figure 5.4.1: Limitation of the predictive methods identified based on the results of this

study (A-C) and a description of the ultimate pipeline for RNA 3D structure prediction (D).

By exploring new ideas, developing new tools, and identifying limitations of the current RNA 3D structure prediction methods, this work brings us closer to near-native computational RNA 3D models.


7 Supplementary data

S1. List of all the sequences and secondary structures used in the

benchmark of EvoClustRNA and a list of links to the SimRNAweb

predictions

ade

> 1Y26:X|PDBID|CHAIN|SEQUENCE

CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG

(((((((((...((((((......[[.))))))........((((((]].....))))))..)))))))))

http://genesilico.pl/SimRNAweb/jobs/ade_pk-35b2a2c1/

> AAML04000013.1

UAUAACAUAUAAUUUUGACAAUAUGGGUCAUAAGUUUCUACCGGAAUACCGUAAAUAUUCUGACUAUGUAUA

((((.((((...((.((((.....[[)))).))........(.(((((]].....))))).)..))))))))

http://genesilico.pl/SimRNAweb/jobs/9c6339e0-591c-498d-9745-1a828f9ee81d/

> BA000028.3/1103960-1104044

UUUUCAUAUAAUCGCGGGGAUAUGGCCUGCAAGUUUCUACCGGUUUACCGUAAAUGAACCGACUAUGGAAA

(((.((((...(.(((((.....[[))))).)........(((((.(]].....).)))))..)))).)))

http://genesilico.pl/SimRNAweb/jobs/7bc1d432-eac8-47cf-a42e-aa3c89efc721/

> U51115.1/15606-15691

ACCUCAUAUAAUCUUGGGAAUAUGGCCCAUAAGUUUCUACCCGGCAACCGUAAAUUGCCGGACUAUGCAGG

.(..((((...(..((((.....[[))))..)........(((((((]].....)))))))..))))..).

http://genesilico.pl/SimRNAweb/jobs/e614e4a0-0898-45f2-9964-52db07279965/

> AAFV01000199.1/524-602

((((((((...((((((....[[)))))).........(((((]]....)))))...))))))))

http://genesilico.pl/SimRNAweb/jobs/2e496700-b989-4044-883d-d34257b022ab/

tpp

> tpp

gGACUCGGGGUGCCCUUCUGCGUGAAGGCUGAGAAAUACCCGUAUCACCUGAUCUGGAUAAUGCCAGCGUAGGGAAGUUc

(((((((((.((((.(((.....))))))......)..)))).....(((...((((......))))...)))..)))))

http://genesilico.pl/SimRNAweb/jobs/16662ebf-cf31-42d1-98a3-2aae31f28087/

>CP000050.1/1019813-1019911

CCGCCGAAGUGGGGGUACCACAGCACUGCUGCGGUUGAGAUAGUCCCUUCGAACCUGAUCCGGCUCAUACCGGCGUAGGGAAGCUUCGUUAGA

UGCGCU

.....(((((((((..(((.(.........).))).........)))).....(((...((((......))))...)))...)))))......

......

http://iimcb.genesilico.pl/SimRNAweb/jobs/aed2c40b-bb70-44a7-846d-b133359fc6bd/

>BX248356.1/234808-234920

ACGAGAUGCCCGGGUGCCAUGUGCUUGCUGUACGUGGCUGAGACGGCUGUUUGGCCGAACCGUAGAACCUGAUCUGGGUAAUACCAGCGAUAG

GAAGACUUCAUACUGUGACU

.....(.(.((((..((((...............)))).....((((......))))..))).....(((...((((......))))....))

)..).).)............

http://iimcb.genesilico.pl/SimRNAweb/jobs/0abbb76e-9cda-482f-abb2-94557e91acd8/

>AE017180.1/640928-641029

AUAGUCUGCUGGGGGAGUUCUUGGGAACUGAGACGGGCAACGCCCGAACCCUUUGAACCUGAUCCGGUUUAUACCGGCGUAGGGAAGCGGCCA

GAAACAAUC

.....(.(((.(((..((((....)))).....(((((...)))))..)))......(((...((((......))))...)))..))).)...

.........

http://iimcb.genesilico.pl/SimRNAweb/jobs/6bff10d7-d4ec-43ce-8f79-8f538fa1ae65/

>AL766847.1/75304-75402

CACAAGGGAGUGCCUUGAGCUGAGAUUGCAGAUAUGCAAAAUCCUCUAACCUGAUCUCGUUAGGACGAGCGUAGGAAUUGUG

(((((((((..((.....)).....(((((....)))))..))))....((((..((((......))))..))))..)))))

http://genesilico.pl/SimRNAweb/jobs/d2609d4d-bd6f-49fd-acbe-0ab278e0166b/

tRNA

>1fir

GCCCGGAUAGCUCAGUCGGAGAGCAUCAGACUUUUAAUCUGAGGGUCCAGGGUUCAAGUCCCUGUUCGGGCGCCA

(((((((..((((........))))..(((.........)))......(((((.......))))))))))))....

http://iimcb.genesilico.pl/SimRNAweb/jobs/a9bc516d-e3da-489d-93ef-5eb20e3f13c3/

>AF396436.1/47447-47513

GCCGCUUGGAUGGUUCCGGUGUGGGCUCAUUUCCCAUAACUAUAAAGUUCGAUUCUUUAAAGUGGCU


(((((((...(((..)))..(((((.......))))).....(((((.......)))))))))))).

http://iimcb.genesilico.pl/SimRNAweb/jobs/822df074-320e-4166-9fd1-8fbcf085908a/

>M57527.1/1-70

ACUCUUAUAGCUUAAUAUUAAAGUAUAGCGCUGAAAACGCUAAGAUGAACCCUAAAAAGUUCUAGGGGUA

(((((((..((((.......)))).(((((.......)))))....(((((......)))))))))))).

http://iimcb.genesilico.pl/SimRNAweb/jobs/613bcfcf-f513-4945-9cf4-6df7db04545e/

>AB009835.1/1-71

CAUUAGAUGACUGAAAGCAAGUACUGGUCUCUUAAACCAUUUAAUAGUAAAUUAGCACUUACUUCUAAUGA

(((((((..(((.......)))..((((.......))))......(((.((.......)))))))))))).

http://iimcb.genesilico.pl/SimRNAweb/jobs/cf61bea5-88c4-4e82-8042-dc04ce5cadcf/

>M26977.1/379-453

GGGGCCAUAGGGUAGCCUGGUCUAUCCUUUGGGCUUUGGGAGCCUGAGACCCCGGUUCAAAUCCGGGUGGCCCCA

(((((((..((((...........)))).(((((.......)))))....(((((.......)))))))))))).

http://iimcb.genesilico.pl/SimRNAweb/jobs/8ca21d4d-7ceb-4736-9619-7c1814c75637/

GMP

>gmp

gCGCGGAAACAAUGAUGAAUGGGUUUAAAUUGGGCACUUGACUCAUUUUGAGUUAGUAGUGCAACCGACCGUGCUgg

((((((..((......(((((((((......[[[[[[[.)))))))))...))....]]]]]..]]..))))))...

http://iimcb.genesilico.pl/SimRNAweb/jobs/faa97ed7/

>AE015927.1/474745-474827

AUUUUAAGAGGAAAUUUUGAACUAUAUACUUAUUUGGGCACUUUGUAUAUAGGGAGUUAGUAGUGCAACCGACCUUGAUUAAU

(((....((((.(((......((((((((......[[[[[[[..))))))))...)))...]]]]]..]]..))))....)))

http://genesilico.pl/SimRNAweb/jobs/e59064f8-ef9c-4c2c-864a-e20b4092cb03/

>ABFD02000011.1/154500-154585

AAAUAUUAUAGAGAUGUUGAAGUAUAUUCUAUUAUUGGGCACCUUAUGGAUAUACUGAGUCAGUGGUGCAACCGGCUAUGAAUAUA

.....((((((.(((......((((((((.......[[[[[[[....))))))))...)))...]]]]]..]]..)))))).....

http://genesilico.pl/SimRNAweb/jobs/5c0d22ec-c061-4567-aa68-3f8e5ac9ab46/

>BA000004.3/387918-388001

AAUCAAUAGGGAAGCAACGAAGCAUAGCCUUUAUAUGGACACUUGGGUUAUGUGGAGCUACUAGUGUAACCGGCCCUCCUUUAA

....(..((((.(((......((((((((.......[[[[[[[..))))))))...)))...]]]]]..]]..))))..)....

http://genesilico.pl/SimRNAweb/jobs/e5332c4d-e096-4d01-91f0-6b5ef2f92d37/

>AE000513.1/1919839-1919923

CUGUCGAAGAGACGCGAUGAAUCCCGCCCUGUAAUUCGGGCACCUCGGACGGGAGGAGCAAGUGGUGCGACCGGCUUUUCGUUGG

(((.(((((((..((......(((((.((........[[[[[[[..)).)))))...))....]]]]]..]]..))))))).)))

http://genesilico.pl/SimRNAweb/jobs/e462d8a5-7079-41df-b1bb-25edcb065cca/

THF

>thf

GGAGAGUAGAUGAUUCGCGUUAAGUGUGUGUGAAUGGGAUGUCGUCACACAACGAAGCGAGAGCGCGGUGAAUCAUUGCAUCCGCUCCA

((((....((((((((((((......(((((((...[[[[....))))))).....((....))))).)))))))))..]]]].)))).

http://genesilico.pl/SimRNAweb/jobs/7f0f8826/

>ACCL02000010.1/116901-116991

AGUAGAGUAGGUCUUAUACGUAAAGUGUCAUCGGAUGGGGAGACUUCCGGUGAACGAAGGGUUACCGCGUUAUAUGACCGCUUCCGCUACU

(((((....((((.((((......((.(((.((((..[[[[....)))).)))))...((....))....)))).))))..]]]].)))))

http://iimcb.genesilico.pl/SimRNAweb/jobs/a690ac93-1e57-4f25-9f63-aabf0700574d/

>ACKX01000080.1/10519-10620

UGCAGAGUAGAGAAUAAAGUGGUUAGUGCCCGACACACAGGGAGUUGGUGUCGAGACGAAGAGCCGAAUCGGUUCCCAGUUUUAUUUUCGCAU

CCCGCUGCC

(((((....(((((((((((.....((.((((((((...[[[[....))))))))))...((((.......))))...)))))))))))...]

]]].)))))

http://iimcb.genesilico.pl/SimRNAweb/jobs/cb6e7e4d/

> haq

UGCAAAAUAGGUUUCCAUGCGUCAAGUGUUUUGUGGAUGGGGAGUUGCCACAGAAACGAAAAGUCGGUUCGCGUGCGGACCGGACUUACGAUA

UGGUUACCGCACCCGUUGCA

(((((....(((..(((((......((.((((((((...[[[.....))))))))))...(((.....................)))....))

)))..)))...]]].)))))

http://genesilico.pl/SimRNAweb/jobs/497811c4/

> hcp

GGUAGAGUAGGUGUCUCGCGUUAAGUGCCAAGGGAUGGGACGUUGCCCUUGGACGAAAGCUAUUAAGAGCUGCGUUGGGACAUCGCGUUCGCU

AUC

(((((....((((((((((.....((.(((((((...[[[[....)))))))))...((((......))))...))))))))))..]]]].))

)))

http://iimcb.genesilico.pl/SimRNAweb/jobs/fae110a9/

RNA Puzzle 06

>4gxy AP006840.1/2074430-2074237

cggcaggugcucccgacccugcggucgggaguuaaaagggaagccggugcaaguccggcacggucccgccacugugacggggagucgccccuc

gggaugugccacuggcccgaaggccgggaaggcggaggggcggcgaggauccggagucaggaaaccugccugccg

((((((((((((((((((....))))))))))....(((...(((((.......)))))[[(((...))).(((...(((...((((((((((

......((((.((((((....))))))...))))))))))))))......)))..]])))....)))))))))))

http://genesilico.pl/SimRNAweb/jobs/9d39f986/

>BX571869.1/30799-30632

AUGGUGUGGUUGGGAAGGAGGUGAAAGUCCUCCGCAGCCCCCGCUGCUGUGAUGCUGACAACUCCGCUGAUGCCACUGGUCGGAAAGACUGGG

AAGGUUGCGGGGAAGGGUGACGCUAAGCCAGAAGACCGACCUG


..(((......((...(((((.......)))))[[(((....))).(((....((........((((.....((.(..(((.....)))..).

..))..))))...........))...]])))....)).)))..

http://genesilico.pl/SimRNAweb/jobs/ca9c767d-06b5-494d-841f-f1eb1ed904f1/

>cp771

CUUUGCAUGUUGAAAGGGAAGCCCGGUGAAAAUCCGGCGCGGGGCCGCCACCGUGAGUGGGGACGAAAUUCACAAUAUACCACUGGCCUAAUU

UUGGCUGGGAAGGUGUGAAGAGUAGGAUGAUCCACGAGUCGGGAGACCUAACAUGCAAAG

.((((((((((...(((...(.((((.......)))))[[((.....)).(((...((((........(((((.....(((.((((((.....

..))))))...))))))))............))))..]])))....))))))))))))).

http://genesilico.pl/SimRNAweb/jobs/3bbb8853-dd87-4913-acab-47caaed213ed/

>af193

uuaagguucuuugucauuggcaaagcuaagagggaaacuggugcgaaagaauuuucaaagccagugcugcccccgcaacuguaaacggcgagc

aaagaucaaaaugccacugauauuauuaucgggaaggcugaucggacgcggugacccgucaagucaggagaccugccuuaa

http://genesilico.pl/SimRNAweb/jobs/d752779c-bd51-411c-9716-064bcbd8606e/

> AM406670.1/3903431-3903207

UCAGGUGCCCGAAGGCGGUCCUCGCCCCAGGGUUAAACGGGAAACAGGUGCGCGCCUCCGGCGCAAUGCCUGUGCUGCCCCCGCAACGGUAAG

CGAGUGCAAGGCGCAUCAACAGCCACUGGGUCGUCCCCGGGAAGGCGAUGCGUCGGAGCCGGCCACAGCCGCUCCAGCCCGCGAGCCCGGAUA

CCGGCCCGA

((.(((.(((...((((.....))))...))).....(((...(((((...(((((...)))))....)))))[[(((....))).(((...(

((.........((((((....(((.(((((.....)))))...)))))))))..((((.((((....))))))))....)))..]])))....

)))))).))

http://genesilico.pl/SimRNAweb/jobs/6ab4c5c2-7605-4a81-a8b9-62cda22bb4a6/

RNA Puzzle 13

> zmp

gggucgugacuggcgaacaggugggaaaccaccggggagcgaccccggcaucgauagccgcccgccugggc

(((((((....[[[[....(((((....))))).....)))))))...........(((...]]]]..)))

http://genesilico.pl/SimRNAweb/jobs/175dd34c-100b-4a46-9aaa-e773b1468c39/

>CU234118.1/352539-352459

gcucucgcgcgacuggcgacuuuggauggagcaccaucggggagcgcgggaucgaccgccgugcgccugggc

((((((((((....[[[[......(((((....))))).....))))))))))....(((...]]]]..)))

http://genesilico.pl/SimRNAweb/jobs/0bf5c25e-4936-4da7-b145-928eea4031c7/

>BAAV01000055.1/2897-2982

ugaguuuucugcgacugacggauuauugcagagcacugcaagggaacagaaaaacucuuuuucagccgaccgucugggcacaccug

....(((((((.....[[[[[.....(((((....)))))......)))))))...........(((..]]]]]..))).......

http://genesilico.pl/SimRNAweb/jobs/8a418378-29f5-45df-af4a-5ecac1a5e7a4/

>CP000927.1/5164264-5164343

gcccguucgcgugacuggcgcuagugauggggaaccaucggggagcgcgaaccacaucgccgcgcgccugggcuccucga

....((((((((....[[[[[....(((((....))))).....))))))))......(((..]]]]]..))).......

http://genesilico.pl/SimRNAweb/jobs/d1969c5d-5a55-4025-944e-089de20719cf/

> AP009385.1/718103-718202

ucaccccugcgugacuggcgauagaacccucggguucaagguggagcaucccaccgugaagcgcagggcgccguuuuugccguucgccugggc

agccguu

....((((((((....[[[[[..(((((....)))))..(((((......))))).....))))))))........(((((..]]]]]..)))

)).....

http://genesilico.pl/SimRNAweb/jobs/9ca56ed4-69bb-477b-8ac2-35bfd085685f/

RNA Puzzle 14

>rp14

CGUUGACCCAGGAAACUGGGCGGAAGUAAGGCCCAUUGCACUCCGGGCCUGAAGCAACGCG

(((((.(((((....)))))........((((((..........))))))....)))))..

http://genesilico.pl/SimRNAweb/jobs/1aa9a03c-33e4-4718-899e-54ab3158d64c/

>AJ630128.1

AUCGUUCAUUCGCUAUUCGCAAAUAGCGAACGCAAAAGCCGACUGAAGGAACGGGAC

..(((((.(((((((........)))))))......((....))....)))))....

http://genesilico.pl/SimRNAweb/jobs/r14aj63pk-2f5f0e3d/

>AACY023015051.1

CGUUCAUCUUAUUUUAUUAAAUAGGACGGAAGUAGGAAGAUAGGAAAACCUCUUUCUUUUUUAAAGAAAGGCUAGCAAGUACCGCUUGGGUUA

AUUUAUCUUAGGCGGGAACGAGACCGAAUAUCUGCCGAAGGAACGC

(((((.(((((((......)))))))..[.....((..((((((....)).((((((((...))))))))(((....))).(((((((((...

......)))))))))...((....))..))))..))....))))).

RNA Puzzle 17

>rp17

CGUGGUUAGGGCCACGUUAAAUAGUUGCUUAAGCCCUAAGCGUUGAUAAAUAUCAGGUGCAA

((((..[[[[[.))))........((((.....]]]]]....(((((....)))))..))))

http://iimcb.genesilico.pl/SimRNAweb/jobs/27b5093d/

>hcf

UGCCGUUUGAGCGGCAUUAAACAGGUCUUAAGCUCAAAGCGUCACCGCCUACAAUGCUAGGCGGUGGGUGACA

((((..[[[[[.))))........(((.....]]]]]....((((((((((......))))))))))..))).

http://genesilico.pl/SimRNAweb/jobs/6d8062dd/

>s223


GCUCGUCUGGGCGAGGAUAAAUAGCUGUUAGGCCCAGAGCGGCUCUUCGGAUUGUGUUCCCUCCGCAAUCCGGGGAGCGUCAGC

.(((..[[[[[.)))........((((.....]]]]]....(((((((((((((((.......)))))))))))))))..))))

http://genesilico.pl/SimRNAweb/jobs/36828e10/

>s221

AGCCGUUGCGGCGGCUAUAAAUAGGACAUUAAGCCGCAAGCGUUGCCCGGUAUACCGCCGGGCAGGUUGUC

((((..[[[[[.))))........((((.....]]]]]....(((((((((.....)))))))))..))))

http://genesilico.pl/SimRNAweb/jobs/742b47e6/

>pisol

AGCCGUUCGGGCGGCUAUAAACAGACCUCAGGCCCGAAGCGUGGCGGCGCCGCCGGUGGUA

((((..[[[[[.)))).......((((.....]]]]]....((((((()))))))..))))



http://genesilico.pl/SimRNAweb/jobs/742b47e6/
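
Each record in the list above gives a header, a sequence, and a secondary structure in dot-bracket notation, in which square brackets mark pseudoknotted pairs. As a minimal sketch, and not a part of the tools described in this work, the Python function below (the name dot_bracket_to_pairs is hypothetical) converts a structure line into a list of base pairs and reports unbalanced brackets. It assumes that only ".", "()" and "[]" occur and that structure lines wrapped across two lines have been joined beforehand.

```python
def dot_bracket_to_pairs(structure):
    """Convert a dot-bracket string into a sorted list of 1-based base pairs.

    Round brackets denote nested pairs; square brackets denote
    pseudoknotted pairs. Each bracket type is matched on its own stack.
    """
    closing_to_opening = {")": "(", "]": "["}
    open_positions = {"(": [], "[": []}  # one stack per bracket type
    pairs = []
    for position, symbol in enumerate(structure, start=1):
        if symbol in open_positions:
            open_positions[symbol].append(position)
        elif symbol in closing_to_opening:
            stack = open_positions[closing_to_opening[symbol]]
            if not stack:
                raise ValueError(f"unmatched '{symbol}' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol '{symbol}' at position {position}")
    # Any opening bracket left on a stack has no partner.
    for symbol, stack in open_positions.items():
        if stack:
            raise ValueError(f"unmatched '{symbol}' at position {stack[-1]}")
    return sorted(pairs)


# The rp14 record (RNA Puzzle 14) copied from the list above.
sequence = "CGUUGACCCAGGAAACUGGGCGGAAGUAAGGCCCAUUGCACUCCGGGCCUGAAGCAACGCG"
structure = "(((((.(((((....)))))........((((((..........))))))....))))).."

# A sequence and its structure line must have the same length.
if len(sequence) != len(structure):
    raise ValueError("sequence and structure differ in length")

pairs = dot_bracket_to_pairs(structure)
for i, j in pairs[:3]:
    # Print the positions and the nucleotides of the first few pairs.
    print(i, sequence[i - 1], j, sequence[j - 1])
```

Keeping a separate stack for each bracket type is what allows the square-bracket pairs to cross the round-bracket pairs, as they do in the pseudoknotted records of this list.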



Table of Figures

Figure 1.2.1: Ribonucleotide - a building block of RNA. Source (Wikimedia-Commons) ..... 2

Figure 1.2.2: Leontis/Westhof classification of base pairings. (A) RNA bases - adenine (A),

cytosine (C), guanine (G) and uracil (U) - involve one of three distinct edges: the

Watson–Crick (W) edge, the Hoogsteen (H) edge, and the Sugar (S) edge. (B) Each pair

of bases can interact in either cis or trans orientation with respect to the glycosidic bonds. (C)

For these reasons, all base pairs can be grouped into twelve geometric base pair families

and eighteen pairing relationships (bases are represented as triangles). Each pair is

represented by a symbol that can be used in secondary and tertiary structure

diagrams. Filled symbols denote the cis base pair configuration, and open symbols the trans

configuration. (D) Bases can also form triples, which have their own classification devised

by Leontis and coworkers (Abu Almakarem et al. 2012) (Creative Commons License) ... 5

Figure 1.2.3: Collation of an example secondary (A) and the corresponding tertiary structure

(B) of the Pistol ribozyme (PDB code: 5K7C (Ren et al. 2016)). This ribozyme adopts a

compact tertiary architecture stabilized by an embedded pseudoknot (violet) fold and is

composed of three helical regions, P1 (green), P2 (blue), P3 (orange). This is a self-

cleaving ribozyme that is widely distributed in nature (Jimenez et al. 2015). The

cleavage site is marked in yellow. The secondary structure diagram was generated with

VARNA (Darty et al. 2009), and the tertiary structure figure was generated with

PyMOL (DELANO 2002) ................................................................................................. 8

Figure 1.4.1: RNA families tend to fold into the same 3D shape. Structures of the riboswitch

c-di-AMP solved independently by three groups: for two different sequences obtained

from Thermoanaerobacter pseudethanolicus (PDB id: 4QK8) and Thermovirga lienii

(PDB id: 4QK9) (Gao and Serganov 2014), for a sequence from Thermoanaerobacter

tengcongensis (PDB id: 4QLM) (Ren and Patel 2014) and for a sequence from Bacillus

subtilis (PDB id: 4W90) (the molecule in blue is a protein used to facilitate

crystallization) (Jones and Ferré-D'Amaré 2014). There is some variation between

structures in the peripheral parts (marked with red arrows), but the overall structure of

the core is preserved......................................................................................................... 20


Figure 1.4.2: According to the RNArchitecture database, there are only 3% (70) Rfam

families with known experimentally solved structures, and 97% (2,618 families) without

known structures. ............................................................................................................. 21

Figure 1.5.1: The results of RNA Puzzle 13. The second model in the ranking (sorted

according to RMSD) is a model obtained with a prototype version of EvoClustRNA

developed at Stanford University. There is no single way to sort the models.

Different metrics have unique properties, and a researcher should decide what is useful

for his/her application. RMSD informs about a geometrical similarity between a

prediction and the crystallographic structure (the lower, the better). INF informs about the

similarity of interaction networks and ranges from 0 to 1 (the higher, the better). Several

partial INF can be computed: INF WC (the canonical interactions only), INF NWC (the

non-canonical interactions only), INF stacking (the stacking interactions only). INF

ALL takes into account all the interactions mentioned above. This RNA-Puzzle shows

one of the biggest problems in RNA 3D structure prediction: very low INF NWC in

all submissions, which indicates that non-canonical interactions are not predicted accurately. .. 23

Figure 1.5.2: The detailed view of the results of the ZMP riboswitch (RNA Puzzle 13). For

each submitted model a detailed summary is available online that includes a

superposition of a prediction, in this case, the EvoClustRNA prediction (red), on the

crystallographic structure (green). Various metrics are shown in the result summary. ... 23

Figure 3.6.1: The alignment preparation. The conserved residues are marked with “x” in the

pseudo-sequence “x”. The columns marked as conserved can be inspected

in arc diagrams of RNA secondary structures (Lai et al. 2012) as the pink line (at the

very bottom). .................................................................................................................... 32

Figure 3.6.2: Using the Jalview program, each sequence and its associated secondary structure was

saved ("Save as") to a FASTA file and used at the next stage of modeling. .......... 32

Figure 4.1.1: Graphical diagram of primary methods used by mqapRNA to describe the

analyzed model. (A) other methods for model quality assessment, (B) RNA modeling

software (C) Others. ......................................................................................................... 35


Figure 4.1.2: Example of a decoy set from the RASP dataset of the adenine riboswitch (PDB

ID: 1Y26). (A) The native structure. (B-F) A set of structures (files in the PDB format)

selected from this decoy with increasing deviation from the native (in parentheses are

RMSDs to the native). Files: (B) 1y26X_M100 (RMSD: 1.7Å), (C) 1y26X_M200

(RMSD: 2.49Å), (D) 1y26X_M300 (RMSD: 3.23Å), (E) 1y26X_M400 (RMSD:

3.31Å), (F) 1y26X_M500 (RMSD: 5.12Å). .................................................................... 36

Figure 4.1.3: Histograms of RMSDs [Å] per dataset. In red, the datasets used for training

mqapRNA; in orange, the dataset used only for testing. X: number of structures (not

scaled in the same way for all plots because of the very diverse ranges), Y: RMSDs [Å].

.......................................................................................................................................... 37

Figure 4.1.4: Histograms of Secondary Structure (INFs) per dataset. In red, the datasets used

for training mqapRNA, in orange, the dataset used only for testing. X: number of

structures (not scaled in the same way for all plots because of the very diverse ranges),

Y: Secondary Structure similarity of a given model to a secondary structure of a native

structure (INFs). ............................................................................................................... 37

Figure 4.1.5: mqapRNA is a machine learning based method. (A) First, a statistical model

was built on a training dataset of structures of known RMSD to native structures. Each

structure is described by a list of scores, results of the primary methods. Since this is the

training set, the RMSD of these structures to the native structures is known. This process allows

mqapRNA to learn the correspondence between scores and RMSDs. (B) Next,

the statistical model is applied for new cases, where RMSD is unknown. ...................... 38

Figure 4.1.6: Contribution (“Importance”) of a given subscore (“Variable”) to the final deep

learning model developed for mqapRNA (a plot generated with the H2O flow

Notebook). The higher the importance, the more a given subscore is required for accurate predictions of

the statistical model.......................................................................................................... 39

Figure 4.1.7: Rank correlations for each decoy set and scoring method. mqapRNA (3rd

column) outperformed other scoring functions with a weighted average of rank

correlations of 0.77. ........................................................................................................ 42


Figure 4.1.8: Enrichment Score for each decoy set and scoring method. mqapRNA (3rd

column) is outperformed by SimRNA (10th column) by 0.1 in terms of EC. ................. 43

Figure 4.1.9: Close-up on the RNA-Puzzle 14 results in the form of RMSD [Å] vs Score plots.

The perfect method should follow a diagonal in a plot. mqapRNA achieved an EC of 7.7

and was able to identify a group of the near-native models. Other methods were not able

to rank models properly. .................................................................................................. 44

Figure 4.1.10: The homepage of the mqapRNA web server. ................................................. 46

Figure 4.1.11: A result page of mqapRNA. The page is divided into three panels: a plot of

mqapRNA score, a table of the score, and the restraints editor. The distance restraints

can be easily modified and re-submitted to the server. The results will be immediately

updated which might encourage the user to try different sets of restraints. .................... 47

Figure 4.1.12: Distance restraints editor at the bottom of the result page. The user can upload

a file with distance restraints or use an online editor to modify his/her query. After the

re-submission, the scores are re-calculated, and a new plot is generated. ....................... 48

Figure 4.2.1: The scheme of the proposed methodology. (A) Homologous sequences are

found for the target sequence, and an RNA alignment is created. (B) Using

SimRNA and/or Rosetta, structural models for all sequences are generated. (C) The

conserved regions are cut out and clustered. (D) The final prediction of the method is the

model containing the most commonly preserved structural arrangements in the set of

homologs. ......................................................................................................................... 49

Figure 4.2.2: The RNA-Puzzle 13 - the ZMP riboswitch. The superposition of the native

structure (green) and the EvoClustRNA prediction (blue). The RMSD between

structures is 5.55 Å; the prediction was ranked second in the overall ranking of the

RNA-Puzzles (according to the RMSD values)............................................................... 51

Figure 4.2.3: The RNA Puzzle 14 - L-glutamine riboswitch. The RMSD between the native

structure (green) and the EvoClustRNA prediction (blue) is 5.56 Å. .............................. 52


Figure 4.2.4: (A) The native structure (PDB ID: 1Y26). Models generated by (B) Weinberg et al.

(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)

EvoClustRNA|Rosetta. All models exhibit the native-like fold. However, only models

C and D exhibit a similar orientation of secondary structure elements with respect to the

native structure. ................................................................................................................ 54

Figure 4.2.5: (A) The native structure (PDB ID: 2GDI). Models generated by (B) Weinberg et al.

(C) SimRNAweb (D) EvoClustRNA|SimRNAweb (E) Rosetta (F)


structure, with RMSD of 13.92 Å. ................................................................................... 55

Figure 4.2.6: (A) The native structure (PDB ID: 1FIR). Models generated by (B) Weinberg



structure, with an RMSD of 10.35 Å. .............................................................................. 56

Figure 4.2.7: (A) The native structure (PDB ID: 3Q3Z). Models generated by (B) Weinberg


EvoClustRNA|Rosetta. The RMSDs range from 9.65 Å to 14.53 Å. .............................. 56

Figure 4.2.8: (A) The native structure (PDB ID: 4LVV). Models generated by (B) Weinberg


EvoClustRNA|Rosetta. Model E is the closest to the native structure with an RMSD

4.83 Å. .............................................................................................................................. 57

Figure 4.2.9: (A) The native structure (PDB ID: 4GXY). Models generated by (B)


Due to missing RNA-ligand interactions, none of the models is close to the native

structure (RMSDs range from 31.02 Å to 33.39 Å). ....................................................... 58

Figure 4.2.10: (A) The native structure (PDB ID: 4XW7). Models generated by (B)


.......................................................................................................................................... 58


Figure 4.2.11: (A) The native structure (PDB ID: 5DDO). Models generated by (B)


The most accurate model of this riboswitch was generated with

EvoClustRNA|SimRNAweb (RMSD 4.44 Å). ................................................................ 59

Figure 4.2.12: (A) The native structure (PDB ID: 5K7C). Models generated by (B)


.......................................................................................................................................... 59

Figure 4.3.1: rna-pdb-tools can be run also from Emacs. A researcher can edit a PDB file

using the text-oriented functionality of this editor and then without leaving the editor can

apply the RNApuzzle function to standardize the file. .................................................... 63

Figure 4.3.2: rna_pdb_toolsx.py is able to rebuild a missing base (drawn with a thin line) to

complete a structure. ........................................................................................................ 65

Figure 4.3.3: rna-pdb-tools comes with detailed documentation that can be viewed online

or as a PDF file. ............................................................................................................... 65

Figure 4.3.4: diffpdb.py is a tool to detect differences in formatting between two PDB files.

First, the tool removes columns of coordinates, and next compares only columns with

annotation (atom naming, numbering). ............................................................................ 66

Figure 4.3.5: A fragment of the demo on the RNA alignment functionality implemented in

rna-pdb-tools. Top: a user can load a new alignment and plot an RChie plot, bottom: a

user can also get a secondary structure and a sequence for a row taken from an alignment

(gaps are removed) in the text format or get a visualization using VARNA. The

functions can be imported to a user’s own Python scripts but also to a Jupyter notebook

(as shown in the figure).................................................................................................... 67

Figure 5.3.1: The Jupyter notebook (a part of the whole notebook) for the RNA-Puzzle 18

problem. The notebook reports the steps of a bioinformatic analysis to collect information

about the target sequence, such as: secondary structure predictions using three different

methods and a BLAST search on the PDB database that led to the detection of a

homolog used later for comparative modeling. ................................................................. 75


Figure 5.4.1. The native structure (PDB ID: 4GXY) solved with the ligand (indicated by the

arrow). .............................................................................................................................. 77

Figure 5.4.2: The results of a DCA analysis performed for the adenosylcobalamin

riboswitch. The bars represent interactions detected by DCA analysis (the structure

made transparent to highlight the bars). The red box indicates the interface between the

core and the peripheral domain, where no interactions were predicted. ........................ 78

Figure 5.4.3: A network of canonical and non-canonical interactions depicted using the

Leontis/Westhof classification obtained with RNAView (Yang et al. 2003) for the

structure of tRNA (PDB id: 1FIR). .................................................................................. 79

Figure 5.4.4: Secondary/tertiary structure presentation in the Leontis–Westhof nomenclature.

Two non-canonical interactions A69-C38 and A69-C22 (highlighted in red) were not

predicted by SimRNA or Rosetta (Lang et al. 2007). ...................................................... 79

Figure 5.4.5: Color-coded: G53-U54 cleavage site (yellow), P1 (green), pseudoknot (violet),

P2 (blue), loops (dark blue) (A) the native structure (PDB ID: 5K7C), and models

generated by (B) SimRNAweb (C) EvoClustRNA|SimRNAweb (D) Rosetta (E)

EvoClustRNA|Rosetta. .................................................................................................... 80

Figure 5.4.6: Superposition of all predicted (A) P1 stems and pseudoknots, (B) P2 stems, (C)

P3 stems. All the fragments are of good accuracy (RMSDs up to 3.5 Å). ................ 81

Figure 5.4.7: Fragments of stems P1 with pseudoknots and single-stranded regions extracted

from all the predictions. A conserved region with the AAA trinucleotide (red) is

interacting with the minor groove of the P1 stem (green) in the native structure.

However, the motif was not formed in any of the predictions. ........................................ 81


the close fold to the native, detected in a small cluster, (C) the biggest cluster with the

model that was returned as the final prediction. .............................................................. 83



the close fold to the native (C) the biggest cluster with the model that was returned as

the final prediction. .......................................................................................................... 84

Figure 5.4.1: Limitations of the predictive methods identified based on the results of this

study (A-C) and a description of the ultimate pipeline for RNA 3D structure prediction

(D). ................................................................................................................................... 86


Table of Tables

Table 1.2.1: Computational methods for RNA 3D structure prediction, based on (Magnus et al.

2014). ............................................................................................................................... 11

Table 1.3.1: Low-resolution experimental methods that generate particularly useful data for

computational prediction of RNA 3D structure, based on (Magnus et al. 2014). An

accurate secondary structure or/and distance restraints can be used with mqapRNA to

refine the final ranking. .................................................................................................... 18

Table 3.5.1: A list of subscores extracted from the primary methods used for training and

prediction with mqapRNA. For each analyzed structure, all these scores are provided in

a CSV output file, both in the standalone version and the web servers ........................... 29

Table 4.2.1: The performance of EvoClustRNA on the test dataset. The results for nine

RNAs. Column 1, original numeration. Column 2, RNA type and PDB ID code for each

RNA. Column 3, sequence length. Column 4, RMSD [Å] of models obtained by

Weinreb et al., only for RNAs 1-5. Column 5, RMSD [Å] of the first cluster obtained with

SimRNAweb. Column 6, RMSD [Å] of the first cluster obtained with

EvoClustRNA|SimRNAweb. Column 7, the difference between column 6 and column 5.

Column 8, RMSD [Å] of the first cluster obtained with Rosetta. Column 9, RMSD [Å]

of the first cluster obtained with EvoClustRNA|Rosetta. Column 10, the difference between

column 9 and column 8. The improvements in RMSDs when EvoClustRNA is used are

marked in green, the cases where EvoClustRNA worsened the results are marked in red.



References

Abu Almakarem, Amal S, Anton I Petrov, Jesse Stombaugh, Craig L Zirbel, and Neocles B

Leontis. 2012. “Comprehensive Survey and Geometric Classification of Base Triples in

RNA Structures.” Nucleic Acids Research 40 (4): 1407–23. doi:10.1093/nar/gkr810.

Adams, Paul D, Pavel V Afonine, Gábor Bunkóczi, Vincent B Chen, Ian W Davis, Nathaniel

Echols, Jeffrey J Headd, et al. 2010. “PHENIX: a Comprehensive Python-Based System

for Macromolecular Structure Solution.” Acta Crystallographica. Section D, Biological

Crystallography 66 (Pt 2). International Union of Crystallography: 213–21.

doi:10.1107/S0907444909052925.

Akiyama, Benjamin M, Hannah M Laurence, Aaron R Massey, David A Costantino, Xuping

Xie, Yujiao Yang, Pei-yong Shi, Jay C Nix, J David Beckham, and Jeffrey S Kieft. 2016.

“Zika Virus Produces Noncoding RNAs Using a Multi-Pseudoknot Structure That

Confounds a Cellular Exonuclease.” Science (New York, N.Y.) 354 (6316): 1148–52.

doi:10.1126/science.aah3963.

Albrecht, Mario, Silvio C E Tosatto, Thomas Lengauer, and Giorgio Valle. 2003. “Simple

Consensus Procedures Are Effective and Sufficient in Secondary Structure Prediction.”

Protein Engineering 16 (7): 459–62.

Anfinsen, C B. 1973. “Principles That Govern the Folding of Protein Chains.” Science (New

York, N.Y.) 181 (4096): 223–30.

Aw, Sherry S, Melissa XM Tang, Yin Nah Teo, and Stephen M Cohen. 2016. “A

Conformation-Induced Fluorescence Method for microRNA Detection.” Nucleic Acids

Research 44 (10): e92–e92. doi:10.1093/nar/gkw108.

Berens, Christian, Florian Groher, and Beatrix Suess. 2015. “RNA Aptamers as Genetic

Control Devices: the Potential of Riboswitches as Synthetic Elements for Regulating

Gene Expression.” Biotechnology Journal 10 (2). WILEY‐VCH Verlag: 246–57.

doi:10.1002/biot.201300498.

Berman, H M, J Westbrook, Z Feng, G Gilliland, T N Bhat, H Weissig, I N Shindyalov, and

P E Bourne. 2000. “The Protein Data Bank.” Nucleic Acids Research 28 (1). Oxford

University Press: 235–42.

Bernauer, Julie, Xuhui Huang, Adelene Y L Sim, and Michael Levitt. 2011. “Fully

Differentiable Coarse-Grained and All-Atom Knowledge-Based Potentials for RNA

Structure Evaluation.” RNA (New York, N.Y.) 17 (6): 1066–75.

doi:10.1261/rna.2543711.

Boccaletto, Pietro, Marcin Magnus, Catarina Almeida, Adriana Zyła, Astha, Radosław Pluta,

Blazej Bagiński, et al. 2017. “RNArchitecture: a Database and a Classification System of

RNA Families, with a Focus on Structural Information.” Submitted for Review.

Boniecki, Michal J, Grzegorz Lach, Wayne K Dawson, Konrad Tomala, Pawel Lukasz,

Tomasz Soltysinski, Kristian M Rother, and Janusz M Bujnicki. 2016. “SimRNA: a

Coarse-Grained Method for RNA Folding Simulations and 3D Structure Prediction.”

Nucleic Acids Research 44 (7): e63–e63. doi:10.1093/nar/gkv1479.

Bonneau, Richard, Charlie E M Strauss, and David Baker. 2001. “Improving the Performance

of Rosetta Using Multiple Sequence Alignment Information and Global Measures of

Hydrophobic Core Formation.” Proteins: Structure, Function, and Bioinformatics 43 (1).

John Wiley & Sons, Inc.: 1–11. doi:10.1002/1097-0134(20010401)43:1<1::AID-

PROT1012>3.0.CO;2-A.

Bonneau, Richard, Charlie E M Strauss, Carol A Rohl, Dylan Chivian, Phillip Bradley, Lars


Malmström, Tim Robertson, and David Baker. 2002. “De Novo Prediction of Three-

Dimensional Structures for Major Protein Families.” Journal of Molecular Biology 322

(1): 65–78.

Bottaro, Sandro, Francesco Di Palma, and Giovanni Bussi. 2014. “The Role of Nucleobase

Interactions in RNA Structure and Dynamics.” Nucleic Acids Research 42 (21): 13306–

14. doi:10.1093/nar/gku972.

Brooks, B R, C L Brooks, A D Mackerell, L Nilsson, R J Petrella, B Roux, Y Won, et al.

2009. “CHARMM: the Biomolecular Simulation Program.” Edited by Charles L Brooks

III and David A Case. Journal of Computational Chemistry 30 (10). Wiley Subscription

Services, Inc., A Wiley Company: 1545–1614. doi:10.1002/jcc.21287.

Burks, Jody, Christian Zwieb, Florian Müller, Iwona Wower, and Jacek Wower. 2005.

“Comparative 3-D Modeling of tmRNA.” BMC Molecular Biology 6 (1). BioMed

Central: 14. doi:10.1186/1471-2199-6-14.

Capriotti, Emidio, Tomas Norambuena, Marc A Marti-Renom, and Francisco Melo. 2011.

“All-Atom Knowledge-Based Potential for RNA Structure Prediction and Assessment.”

Bioinformatics (Oxford, England) 27 (8): 1086–93. doi:10.1093/bioinformatics/btr093.

Case, David A, Thomas E Cheatham, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M

Merz, Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J Woods. 2005.

“The Amber Biomolecular Simulation Programs.” Journal of Computational Chemistry

26 (16). Wiley Subscription Services, Inc., A Wiley Company: 1668–88.

doi:10.1002/jcc.20290.

Chapman, Erich G, David A Costantino, Jennifer L Rabe, Stephanie L Moon, Jeffrey Wilusz,

Jay C Nix, and Jeffrey S Kieft. 2014. “The Structural Basis of Pathogenic Subgenomic

Flavivirus RNA (sfRNA) Production.” Science (New York, N.Y.) 344 (6181): 307–10.

doi:10.1126/science.1250897.

Chapman, Michael S, Se Won Suh, Paul M G Curmi, Duilio Cascio, Ward W Smith, and

David S Eisenberg. 1988. “Tertiary Structure of Plant RuBisCO: Domains and Their

Contacts.” Science (New York, N.Y.) 241 (4861). American Association for the

Advancement of Science: 71–75.

Cheng, Clarence Yu, Fang-Chieh Chou, and Rhiju Das. 2015. “Modeling Complex RNA

Tertiary Folds with Rosetta.” In Computational Methods for Understanding

Riboswitches, 553:35–64. Methods in Enzymology. Elsevier.

doi:10.1016/bs.mie.2014.10.051.

Chworos, Arkadiusz, Isil Severcan, Alexey Y Koyfman, Patrick Weinkam, Emin Oroudjev,

Helen G Hansma, and Luc Jaeger. 2004. “Building Programmable Jigsaw Puzzles with

RNA.” Science (New York, N.Y.) 306 (5704). American Association for the

Advancement of Science: 2068–72. doi:10.1126/science.1104686.

Cock, Peter J A, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew

Dalke, Iddo Friedberg, et al. 2009. “Biopython: Freely Available Python Tools for

Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11): 1422–

23. doi:10.1093/bioinformatics/btp163.

Cruz, José Almeida, and Eric Westhof. 2011. “Sequence-Based Identification of 3D

Structural Modules in RNA with RMDetect.” Nature Methods 8 (6): 513–19.

doi:10.1038/nmeth.1603.

Czerwoniec, Anna, Stanislaw Dunin-Horkawicz, Elżbieta Purta, Katarzyna H Kaminska,

Joanna M Kasprzak, Janusz M Bujnicki, Henri Grosjean, and Kristian Rother. 2009.

“MODOMICS: a Database of RNA Modification Pathways. 2008 Update.” Nucleic

Acids Research 37 (Database issue): D118–21. doi:10.1093/nar/gkn710.


Darty, Kévin, Alain Denise, and Yann Ponty. 2009. “VARNA: Interactive Drawing and

Editing of the RNA Secondary Structure.” Bioinformatics (Oxford, England) 25 (15):

1974–75. doi:10.1093/bioinformatics/btp250.

Das, Rhiju, and David Baker. 2007. “Automated De Novo Prediction of Native-Like RNA

Tertiary Structures.” Proceedings of the National Academy of Sciences 104 (37).

National Acad Sciences: 14664–69. doi:10.1073/pnas.0703836104.

Das, Rhiju, John Karanicolas, and David Baker. 2010. “Atomic Accuracy in Predicting and

Designing Noncanonical RNA Structure.” Nature Methods 7 (4): 291–94.

doi:10.1038/nmeth.1433.

Das, Rhiju, Madhuri Kudaravalli, Magdalena Jonikas, Alain Laederach, Robert Fong, Jason P

Schwans, David Baker, Joseph A Piccirilli, Russ B Altman, and Daniel Herschlag. 2008.

“Structural Inference of Native and Partially Folded RNA by High-Throughput Contact

Mapping.” Proceedings of the National Academy of Sciences 105 (11). National Acad

Sciences: 4144–49. doi:10.1073/pnas.0709032105.

De Leonardis, Eleonora, Benjamin Lutz, Sebastian Ratz, Simona Cocco, Rémi Monasson,

Alexander Schug, and Martin Weigt. 2015. “Direct-Coupling Analysis of Nucleotide

Coevolution Facilitates RNA Secondary and Tertiary Structure Prediction.” Nucleic

Acids Research, September, gkv932–12. doi:10.1093/nar/gkv932.

DeLano, W L. 2002. “The PyMOL Molecular Graphics System.” Pymol.org 52 (1).

DeLano Scientific: 62–67. doi:10.5940/jcrsj.52.62.

Ding, Feng, Shantanu Sharma, Poornima Chalasani, Vadim V Demidov, Natalia E Broude,

and Nikolay V Dokholyan. 2008. “Ab Initio RNA Folding by Discrete Molecular

Dynamics: From Structure Prediction to Folding Mechanisms.” RNA 14 (6). Cold Spring

Harbor Lab: 1164–73. doi:10.1261/rna.894608.

Dunbrack, Roland. 2004. Whatcheck. Vol. 13. Chichester, UK: John Wiley & Sons, Ltd.

doi:10.1002/9780471650126.dob0791.pub2.

Dunin-Horkawicz, Stanislaw, Anna Czerwoniec, Michal J Gajda, Marcin Feder, Henri

Grosjean, and Janusz M Bujnicki. 2006. “MODOMICS: a Database of RNA

Modification Pathways.” Nucleic Acids Research 34 (Database issue): D145–49.

doi:10.1093/nar/gkj084.

Eisenberg, David, Roland Lüthy, and James U Bowie. 1997. “VERIFY3D: Assessment of

Protein Models with Three-Dimensional Profiles.” In Macromolecular Crystallography

Part B, 277:396–404. Methods in Enzymology. Elsevier. doi:10.1016/S0076-

6879(97)77022-8.

Eriksson, Emma S E, Lokesh Joshi, Martin Billeter, and Leif A Eriksson. 2014. “De Novo

Tertiary Structure Prediction Using RNA123--Benchmarking and Application to

Macugen.” Journal of Molecular Modeling 20 (8). Springer Berlin Heidelberg: 2389.

doi:10.1007/s00894-014-2389-z.

Flores, Samuel C, Yaqi Wan, Rick Russell, and Russ B Altman. 2010. “Predicting RNA

Structure by Multiple Template Homology Modeling.” Pacific Symposium on

Biocomputing. NIH Public Access, 216–27.

Frickey, T, and A Lupas. 2004. “CLANS: a Java Application for Visualizing Protein Families

Based on Pairwise Similarity.” Bioinformatics 20 (18): 3702–4.

doi:10.1093/bioinformatics/bth444.

Gajda, Michał Jan. 2013. “HPDB-Haskell Library for Processing Atomic Biomolecular

Structures in Protein Data Bank Format.” BMC Research Notes 6 (1). BioMed Central:

483. doi:10.1186/1756-0500-6-483.

Gao, Ang, and Alexander Serganov. 2014. “Structural Insights Into Recognition of C-Di-


AMP by the ydaO Riboswitch.” Nature Chemical Biology 10 (9): 787–92.

doi:10.1038/nchembio.1607.

Ginalski, K, A Elofsson, D Fischer, and L Rychlewski. 2003. “3D-Jury: a Simple Approach

to Improve Protein Structure Predictions.” Bioinformatics 19 (8): 1015–18.

doi:10.1093/bioinformatics/btg124.

Grant, Barry J, Ana P C Rodrigues, Karim M ElSawy, J Andrew McCammon, and Leo S D

Caves. 2006. “Bio3d: an R Package for the Comparative Analysis of Protein Structures.”

Bioinformatics 22 (21). Oxford University Press: 2695–96.

doi:10.1093/bioinformatics/btl461.

Griffiths-Jones, Sam. 2005. “RALEE—RNA ALignment Editor in Emacs.” Bioinformatics

21 (2). Oxford University Press: 257–59. doi:10.1093/bioinformatics/bth489.

Hanson, Robert M, Jaime Prilusky, Zhou Renjian, Takanori Nakane, and Joel L Sussman.

2013. “JSmol and the Next‐Generation Web‐Based Representation of 3D Molecular

Structure as Applied to Proteopedia.” Israel Journal of Chemistry 53 (3‐4). WILEY‐VCH

Verlag: 207–16. doi:10.1002/ijch.201300024.

Hayes, Josie, Pier Paolo Peruzzi, and Sean Lawler. 2014. “MicroRNAs in Cancer:

Biomarkers, Functions and Therapy.” Trends in Molecular Medicine 20 (8): 460–69.

doi:10.1016/j.molmed.2014.06.005.

Hunt, Andrew, and David Thomas. 1999. The Pragmatic Programmer. Addison-Wesley

Professional.

Jimenez, Randi M, Julio A Polanco, and Andrej Lupták. 2015. “Chemistry and Biology of

Self-Cleaving Ribozymes.” Trends in Biochemical Sciences 40 (11): 648–61.

doi:10.1016/j.tibs.2015.09.001.

Jones, Christopher P, and Adrian R Ferré-D'Amaré. 2014. “Crystal Structure of a C-Di-AMP

Riboswitch Reveals an Internally Pseudo-Dimeric RNA.” The EMBO Journal 33 (22).

EMBO Press: 2692–2703. doi:10.15252/embj.201489209.

Jonikas, Magdalena A, Randall J Radmer, and Russ B Altman. 2009. “Knowledge-Based

Instantiation of Full Atomic Detail Into Coarse-Grain RNA 3D Structural Models.”

Bioinformatics 25 (24): 3259–66. doi:10.1093/bioinformatics/btp576.

Jossinet, Fabrice, and Eric Westhof. 2005. “Sequence to Structure (S2S): Display,

Manipulate and Interconnect RNA Data From Sequence to Structure.” Bioinformatics 21

(15): 3320–21. doi:10.1093/bioinformatics/bti504.

Jossinet, Fabrice, Thomas E Ludwig, and Eric Westhof. 2010. “Assemble: an Interactive

Graphical Tool to Analyze and Build RNA Architectures at the 2D and 3D Levels.”

Bioinformatics (Oxford, England) 26 (16): 2057–59. doi:10.1093/bioinformatics/btq321.

Kellenberger, Colleen A, Chen Chen, Aaron T Whiteley, Daniel A Portnoy, and Ming C

Hammond. 2015. “RNA-Based Fluorescent Biosensors for Live Cell Imaging of Second

Messenger Cyclic Di-AMP.” Journal of the American Chemical Society 137 (20).

American Chemical Society: 6432–35. doi:10.1021/jacs.5b00275.

Kerpedjiev, Peter, Christian Höner Zu Siederdissen, and Ivo L Hofacker. 2015. “Predicting

RNA 3D Structure Using a Coarse-Grain Helix-Centered Model.” RNA (New York, N.Y.)

21 (6). Cold Spring Harbor Lab: 1110–21. doi:10.1261/rna.047522.114.

Kieft, Jeffrey S, Kaihong Zhou, Angie Grech, Ronald Jubin, and Jennifer A Doudna. 2002.

“Crystal Structure of an RNA Tertiary Domain Essential to HCV IRES-Mediated

Translation Initiation.” Nature Structural Biology, April. doi:10.1038/nsb781.

Kim, Peter B, James W Nelson, and Ronald R Breaker. 2015. “An Ancient Riboswitch Class

in Bacteria Regulates Purine Biosynthesis and One-Carbon Metabolism.” Molecular


Cell 57 (2): 317–28. doi:10.1016/j.molcel.2015.01.001.

Kirmizialtin, Serdal, Scott P Hennelly, Alexander Schug, Jose N Onuchic, and Karissa Y

Sanbonmatsu. 2015. “Integrating Molecular Dynamics Simulations with Chemical

Probing Experiments Using SHAPE-FIT.” Methods in Enzymology 553. Elsevier: 215–

34. doi:10.1016/bs.mie.2014.10.061.

Kladwang, Wipapat, Christopher C VanLang, Pablo Cordero, and Rhiju Das. 2011. “A Two-

Dimensional Mutate-and-Map Strategy for Non-Coding RNA Structure.” Nature

Chemistry 3 (12): 954–62. doi:10.1038/nchem.1176.

Kladwang, Wipapat, Fang-Chieh Chou, and Rhiju Das. 2012. “Automated RNA Structure

Prediction Uncovers a Kink-Turn Linker in Double Glycine Riboswitches.” Journal of

the American Chemical Society 134 (3): 1404–7. doi:10.1021/ja2093508.

Klostermeier, D, and D P Millar. 2001. “Time-Resolved Fluorescence Resonance Energy

Transfer: a Versatile Tool for the Analysis of Nucleic Acids.” Biopolymers 61 (3). Wiley

Subscription Services, Inc., A Wiley Company: 159–79. doi:10.1002/bip.10146.

Knight, Rob, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett

C Easton, Michael Eaton, et al. 2007. “PyCogent: a Toolkit for Making Sense From

Sequence.” Genome Biology 8 (8). BioMed Central: R171. doi:10.1186/gb-2007-8-8-

r171.

Kryshtafovych, Andriy, Bohdan Monastyrskyy, Krzysztof Fidelis, Torsten Schwede, and

Anna Tramontano. 2017. “Assessment of Model Accuracy Estimations in CASP12.”

Proteins: Structure, Function, and Bioinformatics 84 (Suppl 1): 349.

doi:10.1002/prot.25371.

Kulik, Marta, Anna M Goral, Maciej Jasiński, Paulina M Dominiak, and Joanna Trylska.

2015. “Electrostatic Interactions in Aminoglycoside-RNA Complexes.” Biophysical

Journal 108 (3): 655–65. doi:10.1016/j.bpj.2014.12.020.

Kurowski, Michal A, and Janusz M Bujnicki. 2003. “GeneSilico Protein Structure Prediction

Meta-Server.” Nucleic Acids Research 31 (13). Oxford University Press: 3305–7.

Lai, D, J R Proctor, JYA Zhu, and I M Meyer. 2012. “R-CHIE: a Web Server and R Package

for Visualizing RNA Secondary Structures.” Nucleic Acids Research.

Laing, Christian, and Tamar Schlick. 2010. “Computational Approaches to 3D Modeling of

RNA.” Journal of Physics: Condensed Matter 22 (28): 283101–19. doi:10.1088/0953-

8984/22/28/283101.

Lang, Kathrin, Renate Rieder, and Ronald Micura. 2007. “Ligand-Induced Folding of the

thiM TPP Riboswitch Investigated by a Structure-Based Fluorescence Spectroscopic

Approach.” Nucleic Acids Research 35 (16): 5370–78. doi:10.1093/nar/gkm580.

Laskowski, R A, M W MacArthur, D S Moss, and J M Thornton. 1993. “PROCHECK: a

Program to Check the Stereochemical Quality of Protein Structures.” Journal of Applied

Crystallography 26 (2). International Union of Crystallography: 283–91.

doi:10.1107/S0021889892009944.

Lavender, Christopher A, Feng Ding, Nikolay V Dokholyan, and Kevin M Weeks. 2010.

“Robust and Generic RNA Modeling Using Inferred Constraints: a Structure for the

Hepatitis C Virus IRES Pseudoknot Domain.” Biochemistry 49 (24): 4931–33.

doi:10.1021/bi100142y.

Leaver-Fay, Andrew, Michael Tyka, Steven M Lewis, Oliver F Lange, James Thompson,

Ron Jacak, Kristian Kaufman, et al. 2011. “ROSETTA3: an Object-Oriented Software

Suite for the Simulation and Design of Macromolecules.” Methods in Enzymology 487.

Elsevier: 545–74. doi:10.1016/B978-0-12-381270-4.00019-6.

Leontis, Neocles B, and Eric Westhof. 2001. “Geometric Nomenclature and Classification of


RNA Base Pairs.” RNA 7 (4). Cambridge University Press: 499–512.

Li, He, Si-Qing Ma, Jin Huang, Xiao-Ping Chen, and Hong-Hao Zhou. 2017. “Roles of Long

Noncoding RNAs in Colorectal Cancer Metastasis.” Oncotarget 8 (24). Impact Journals:

39859–76. doi:10.18632/oncotarget.16339.

Liu, Yijin, Timothy J Wilson, and David M J Lilley. 2017. “The Structure of a Nucleolytic

Ribozyme That Employs a Catalytic Metal Ion.” Nature Chemical Biology 13 (5). Nature

Research: 508–13. doi:10.1038/nchembio.2333.

Lu, H, and J Skolnick. 2001. “A Distance-Dependent Atomic Knowledge-Based Potential for

Improved Protein Structure Selection.” Proteins 44 (3): 223–32.

Lundström, Jesper, Leszek Rychlewski, Arne Elofsson, and Janusz M Bujnicki. 2008.

“Pcons: a Neural-Network-Based Consensus Predictor That Improves Fold Recognition.”

Protein Science 10 (11). Cold Spring Harbor Laboratory Press: 2354–62.

doi:10.1110/ps.08501.

Machnicka, Magdalena A, Kaja Milanowska, Okan Osman Oglou, Elżbieta Purta,

Malgorzata Kurkowska, Anna Olchowik, Witold Januszewski, et al. 2013.

“MODOMICS: a Database of RNA Modification Pathways--2013 Update.” Nucleic

Acids Research 41 (Database issue): D262–67. doi:10.1093/nar/gks1007.

Macke, Thomas J, and David A Case. 2009. “Modeling Unusual Nucleic Acid Structures.” In

Molecular Modeling of Nucleic Acids, 682:379–93. ACS Symposium Series.

Washington, DC: American Chemical Society. doi:10.1021/bk-1998-0682.ch024.

Magnus, Marcin, Dorota Matelska, Grzegorz Lach, Grzegorz Chojnowski, Michal J

Boniecki, Elżbieta Purta, Wayne Dawson, Stanislaw Dunin-Horkawicz, and Janusz M

Bujnicki. 2014. “Computational Modeling of RNA 3D Structures, with the Aid of

Experimental Restraints.” RNA Biology 11 (5): 522–36. doi:10.4161/rna.28826.

Magnus, Marcin, Marcin Pawlowski, and Janusz M Bujnicki. 2012. “MetaLocGramN: a

Meta-Predictor of Protein Subcellular Localization for Gram-Negative Bacteria.” BBA -

Proteins and Proteomics 1824 (12). Elsevier B.V.: 1425–33.

doi:10.1016/j.bbapap.2012.05.018.

Magnus, Marcin, Michał J Boniecki, Wayne Dawson, and Janusz M Bujnicki. 2016.

“SimRNAweb: a Web Server for RNA 3D Structure Modeling with Optional

Restraints.” Nucleic Acids Research 44 (W1): W315–19. doi:10.1093/nar/gkw279.

Martin, Robert C. 2008. Clean Code. Pearson Education.

Martinez, Hugo M, Jacob V Maizel, and Bruce A Shapiro. 2008. “RNA2D3D: a Program for

Generating, Viewing, and Comparing 3-Dimensional Models of RNA.” Journal of

Biomolecular Structure & Dynamics 25 (6): 669–83.

doi:10.1080/07391102.2008.10531240.

Massire, C, and E Westhof. 1998. “MANIP: an Interactive Tool for Modelling RNA.”

Journal of Molecular Graphics & Modelling 16 (4-6): 197–205, 255–57.

Mathews, David H, Matthew D Disney, Jessica L Childs, Susan J Schroeder, Michael Zuker,

and Douglas H Turner. 2004. “Incorporating Chemical Modification Constraints Into a

Dynamic Programming Algorithm for Prediction of RNA Secondary Structure.”

Proceedings of the National Academy of Sciences 101 (19): 7287–92.

doi:10.1073/pnas.0401799101.

McCown, Phillip J, Keith A Corbino, Shira Stav, Madeline E Sherlock, and Ronald R

Breaker. 2017. “Riboswitch Diversity and Distribution.” RNA 23 (7): 995–1011.

doi:10.1261/rna.061234.117.

McGuffin, L J. 2008. “The ModFOLD Server for the Quality Assessment of Protein

Structural Models.” Bioinformatics 24 (4): 586–87. doi:10.1093/bioinformatics/btn014.


Merali, Zeeya. 2010. “Computational Science: ...Error.” Nature, October 14.

doi:10.1038/467775a.

Merino, Edward J, Kevin A Wilkinson, Jennifer L Coughlan, and Kevin M Weeks. 2005.

“RNA Structure Analysis at Single Nucleotide Resolution by Selective 2'-Hydroxyl

Acylation and Primer Extension (SHAPE).” Journal of the American Chemical Society

127 (12): 4223–31. doi:10.1021/ja043822v.

Miao, Zhichao, Ryszard W Adamiak, Maciej Antczak, Robert T Batey, Alexander J Becka,

Marcin Biesiada, Michał J Boniecki, et al. 2017. “RNA-Puzzles Round III: 3D RNA

Structure Prediction of Five Riboswitches and One Ribozyme.” RNA (New York, N.Y.)

23 (5): 655–72. doi:10.1261/rna.060368.116.

Miao, Zhichao, Ryszard W Adamiak, Marc-Frédérick Blanchet, Michal Boniecki, Janusz M

Bujnicki, Shi-Jie Chen, Clarence Cheng, et al. 2015. “RNA-Puzzles Round II:

Assessment of RNA Structure Prediction Programs Applied to Three Large RNA

Structures.” RNA (New York, N.Y.) 21 (6). Cold Spring Harbor Lab: 1066–84.

doi:10.1261/rna.049502.114.

Michel, F, and E Westhof. 1990. “Modelling of the Three-Dimensional Architecture of

Group I Catalytic Introns Based on Comparative Sequence Analysis.” Journal of

Molecular Biology 216 (3): 585–610. doi:10.1016/0022-2836(90)90386-Z.

Mlynsky, Vojtech, and Giovanni Bussi. 2017. “Understanding in-Line Probing Experiments

by Modeling Cleavage of Non-Reactive RNA Nucleotides.” RNA 23 (5). Cold Spring

Harbor Lab: rna.060442.116–720. doi:10.1261/rna.060442.116.

Moretti, Rocco, Sergey Lyskov, Rhiju Das, Jens Meiler, and Jeffrey J Gray. 2017. “Web-

Accessible Molecular Modeling with Rosetta: the Rosetta Online Server That Includes

Everyone (ROSIE).” Protein Science: a Publication of the Protein Society, September.

doi:10.1002/pro.3313.

Nahvi, Ali, and Rachel Green. 2013. “Structural Analysis of RNA Backbone Using in-Line

Probing.” Methods in Enzymology 530. Elsevier: 381–97. doi:10.1016/B978-0-12-

420037-1.00022-1.

Nawrocki, E P, S W Burge, A Bateman, J Daub, R Y Eberhardt, S R Eddy, E W Floden, et al.

2015. “Rfam 12.0: Updates to the RNA Families Database.” Nucleic Acids Research 43

(D1): D130–37. doi:10.1093/nar/gku1063.

Nawrocki, Eric P, Diana L Kolbe, and Sean R Eddy. 2009. “Infernal 1.0: Inference of RNA

Alignments.” Bioinformatics (Oxford, England) 25 (10): 1335–37.

doi:10.1093/bioinformatics/btp157.

Norambuena, T, J F Cares, E Capriotti, and F Melo. 2013. “WebRASP: a Server for

Computing Energy Scores to Assess the Accuracy and Stability of RNA 3D Structures.”

Bioinformatics (Oxford, England). doi:10.1093/bioinformatics/btt441.

Nussinov, Ruth, George Pieczenik, Jerrold R Griggs, and Daniel J Kleitman. 1978.

“Algorithms for Loop Matchings.” SIAM Journal on Applied Mathematics 35 (1): 68–82.

doi:10.1137/0135006.

Parisien, Marc, and François Major. 2008. “The MC-Fold and MC-Sym Pipeline Infers RNA

Structure From Sequence Data.” Nature 452 (7183): 51–55. doi:10.1038/nature06684.

Pawlowski, Marcin, Albert Bogdanowicz, and Janusz M Bujnicki. 2013. “QA-RecombineIt:

a Server for Quality Assessment and Recombination of Protein Models.” Nucleic Acids

Research 41 (W1). Oxford University Press: W389–97. doi:10.1093/nar/gkt408.

Pawlowski, Marcin, Michal J Gajda, Ryszard Matlak, and Janusz M Bujnicki. 2008.

“MetaMQAP: a Meta-Server for the Quality Assessment of Protein Models.” BMC

Bioinformatics 9 (1). BioMed Central: 403. doi:10.1186/1471-2105-9-403.


Pennisi, Elizabeth. 2013. “The CRISPR Craze.” Science (New York, N.Y.), August 23.

doi:10.1126/science.341.6148.833.

Pérez, F, and B E Granger. 2007. “IPython: a System for Interactive Scientific Computing.”

Computing in Science & Engineering 9 (3): 21–29.

doi:10.1109/MCSE.2007.53.

Piatkowski, Pawel, Joanna M Kasprzak, Deepak Kumar, Marcin Magnus, Grzegorz

Chojnowski, and Janusz M Bujnicki. 2016. “RNA 3D Structure Modeling by

Combination of Template-Based Method ModeRNA, Template-Free Folding with

SimRNA, and Refinement with QRNAS.” Methods in Molecular Biology (Clifton, N.J.)

1490 (Suppl). New York, NY: Springer New York: 217–35. doi:10.1007/978-1-4939-

6433-8_14.

Popenda, M, M Szachniuk, M Antczak, K J Purzycka, P Lukasiak, N Bartol, J Blazewicz,

and R W Adamiak. 2012. “Automated 3D Structure Composition for Large RNAs.”

Nucleic Acids Research 40 (14): e112–12. doi:10.1093/nar/gks339.

Puton, Tomasz, Lukasz P Kozlowski, Kristian M Rother, and Janusz M Bujnicki. 2013.

“CompaRNA: a Server for Continuous Benchmarking of Automated Methods for RNA

Secondary Structure Prediction.” Nucleic Acids Research 41 (7): 4307–23.

doi:10.1093/nar/gkt101.

Qin, Peter Z, and Thorsten Dieckmann. 2004. “Application of NMR and EPR Methods to the

Study of RNA.” Current Opinion in Structural Biology 14 (3): 350–59.

doi:10.1016/j.sbi.2004.04.002.

Ren, Aiming, and Dinshaw J Patel. 2014. “C-Di-AMP Binds the ydaO Riboswitch in Two

Pseudo-Symmetry-Related Pockets.” Nature Chemical Biology 10 (9): 780–86.

doi:10.1038/nchembio.1606.

Ren, Aiming, Kanagalaghatta R Rajashankar, and Dinshaw J Patel. 2015. “Global RNA Fold

and Molecular Recognition for a Pfl Riboswitch Bound to ZMP, a Master Regulator of

One-Carbon Metabolism.” Structure 23 (8): 1375–81. doi:10.1016/j.str.2015.05.016.

Ren, Aiming, Nikola Vušurović, Jennifer Gebetsberger, Pu Gao, Michael Juen, Christoph

Kreutz, Ronald Micura, and Dinshaw J Patel. 2016. “Pistol Ribozyme Adopts a

Pseudoknot Fold Facilitating Site-Specific in-Line Cleavage.” Nature Chemical Biology

12 (9): 702–8. doi:10.1038/nchembio.2125.

Rivas, Elena, and Sean R Eddy. 1999. “A Dynamic Programming Algorithm for RNA

Structure Prediction Including Pseudoknots.” Journal of

Molecular Biology 285 (5): 2053–68. doi:10.1006/jmbi.1998.2436.

Rother, Kristian. 2017. Pro Python Best Practices. Berkeley, CA: Apress. doi:10.1007/978-1-

4842-2241-6.

Rother, M, K Milanowska, T Puton, J Jeleniewicz, K Rother, and Janusz M Bujnicki. 2011.

“ModeRNA Server: an Online Tool for Modeling RNA 3D Structures.” Bioinformatics

27 (17): 2441–42. doi:10.1093/bioinformatics/btr400.

Rother, Magdalena, Kristian Rother, Tomasz Puton, and Janusz M Bujnicki. 2011.

“ModeRNA: a Tool for Comparative Modeling of RNA 3D Structure.” Nucleic Acids

Research 39 (10). Oxford University Press: 4007–22. doi:10.1093/nar/gkq1320.

Saini, Harpreet Kaur, and Daniel Fischer. 2005. “Meta-DP: Domain Prediction Meta-

Server.” Bioinformatics 21 (12): 2917–20. doi:10.1093/bioinformatics/bti445.

Sali, A, and T L Blundell. 1993. “Comparative Protein Modelling by Satisfaction of Spatial

Restraints.” Journal of Molecular Biology 234 (3): 779–815.

doi:10.1006/jmbi.1993.1626.


Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple

Rules for Reproducible Computational Research.” Edited by Philip E Bourne. PLoS

Computational Biology 9 (10). Public Library of Science: e1003285.

doi:10.1371/journal.pcbi.1003285.

Seemann, Stefan E, Jan Gorodkin, and Rolf Backofen. 2008. “Unifying Evolutionary and

Thermodynamic Information for RNA Folding of Multiple Alignments.” Nucleic Acids

Research 36 (20): 6355–62. doi:10.1093/nar/gkn544.

Siebert, S, and R Backofen. 2005. “MARNA: Multiple Alignment and Consensus Structure

Prediction of RNAs Based on Sequence Structure Comparisons.” Bioinformatics 21 (16):

3352–59. doi:10.1093/bioinformatics/bti550.

Simons, Kim T, Ingo Ruczinski, Charles Kooperberg, Brian A Fox, Chris Bystroff, and

David Baker. 1999. “Improved Recognition of Native‐Like Protein Structures Using a

Combination of Sequence‐Dependent and Sequence‐Independent Features of Proteins.”

Proteins: Structure, Function, and Bioinformatics 34 (1). John Wiley & Sons, Inc.: 82–

95. doi:10.1002/(SICI)1097-0134(19990101)34:1<82::AID-PROT7>3.0.CO;2-A.

Strack, Rita L, Wenjiao Song, and Samie R Jaffrey. 2013. “Using Spinach-Based Sensors for

Fluorescence Imaging of Intracellular Metabolites and Proteins in Living Bacteria.”

Nature Protocols 9 (1): 146–55. doi:10.1038/nprot.2014.001.

Szabo, Linda, and Julia Salzman. 2016. “Detecting Circular RNAs: Bioinformatic and

Experimental Challenges.” Nature Reviews Genetics 17 (11): 679–92.

doi:10.1038/nrg.2016.114.

Trausch, J J, J G Marcano-Velázquez, and M M Matyjasik. 2015. “Metal Ion-Mediated

Nucleobase Recognition by the ZTP Riboswitch.” doi:10.1016/j.chembiol.2015.06.007.

Tuszyńska, Irina, and Janusz M Bujnicki. 2011. “DARS-RNP and QUASI-RNP: New

Statistical Potentials for Protein-RNA Docking.” BMC Bioinformatics 12 (1). BioMed

Central: 348. doi:10.1186/1471-2105-12-348.

Tuszyńska, Irina, Marcin Magnus, Katarzyna Jonak, Wayne Dawson, and Janusz M Bujnicki.

2015. “NPDock: a Web Server for Protein-Nucleic Acid Docking.” Nucleic Acids

Research 43 (W1): W425–30. doi:10.1093/nar/gkv493.

Van Der Spoel, David, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E Mark, and Herman

J C Berendsen. 2005. “GROMACS: Fast, Flexible, and Free.” Journal of Computational

Chemistry 26 (16). Wiley Subscription Services, Inc., A Wiley Company: 1701–18.

doi:10.1002/jcc.20291.

Waleń, Tomasz, Grzegorz Chojnowski, Przemysław Gierski, and Janusz M Bujnicki. 2014.

“ClaRNA: a Classifier of Contacts in RNA 3D Structures Based on a Comparative

Analysis of Various Classification Schemes.” Nucleic Acids Research 42 (19). Oxford

University Press: e151–51. doi:10.1093/nar/gku765.

Wang, J, Y Zhao, C Zhu, and Y Xiao. 2015. “3dRNAscore: a Distance and Torsion Angle

Dependent Evaluation Function of 3D RNA Structures.” Nucleic Acids Research 43 (10):

e63–e63. doi:10.1093/nar/gkv141.

Wang, Jian, and Yi Xiao. 2002. Using 3dRNA for RNA 3-D Structure Prediction and

Evaluation. Vol. 17. Hoboken, NJ, USA: John Wiley & Sons, Inc. doi:10.1002/cpbi.21.

Washietl, Stefan, Ivo L Hofacker, Peter F Stadler, and Manolis Kellis. 2012. “RNA Folding

with Soft Constraints: Reconciliation of Probing Data and Thermodynamic Secondary

Structure Prediction.” Nucleic Acids Research 40 (10): 4261–72.

doi:10.1093/nar/gks009.

Waterhouse, A M, J B Procter, and DMA Martin. 2009. “Jalview Version 2—a Multiple


Sequence Alignment Editor and Analysis Workbench.” ….

Weinreb, Caleb, Adam J Riesselman, John B Ingraham, Torsten Gross, Chris Sander, and

Debora S Marks. 2016. “3D RNA and Functional Interactions From Evolutionary

Couplings.” Cell, October. Elsevier Inc., 1–14. doi:10.1016/j.cell.2016.03.030.

Westhof, Eric. 2010. “The Amazing World of Bacterial Structured RNAs.” Genome Biology

11 (3). BioMed Central: 108. doi:10.1186/gb-2010-11-3-108.

Wikimedia-Commons. 2017. “File:RNA_Chemical_Structure.GIF.”

Commons.Wikimedia.org. Accessed September 16.

https://commons.wikimedia.org/wiki/File:RNA_chemical_structure.GIF.

Xu, Shouping, Dejia Kong, Qianlin Chen, Yanyan Ping, and Da Pang. 2017. “Oncogenic

Long Noncoding RNA Landscape in Breast Cancer.” Molecular Cancer 16 (1). BioMed

Central: 129. doi:10.1186/s12943-017-0696-6.

Yang, Huanwang, Fabrice Jossinet, Neocles Leontis, Li Chen, John Westbrook, Helen

Berman, and Eric Westhof. 2003. “Tools for the Automatic Identification and

Classification of RNA Base Pairs.” Nucleic Acids Research 31 (13). Oxford University

Press: 3450–60.

Zuker, Michael, and Patrick Stiegler. 1981. “Optimal Computer Folding of Large RNA

Sequences Using Thermodynamics and Auxiliary Information.” Nucleic Acids Research

9 (1): 133–48. doi:10.1093/nar/9.1.133.

Zwieb, C, and F Müller. 1997. “Three-Dimensional Comparative Modeling of RNA.”

Nucleic Acids Symposium Series, no. 36: 69–71.
