Welcome to RNAr, a Python library for structural RNA!
The goal of this series of labs is to build a library that allows easy manipulation and study of RNA sequences.
This library can be installed directly from GitHub through pip:
pip install git+https://github.com/rna-oop/2425-m1-geniomhe-group-6.git
or simply clone the repository and install it locally:
git clone https://github.com/rna-oop/2425-m1geniomhe-group-6
cd 2425-m1geniomhe-group-6
pip install .
To make sure it’s installed correctly, you can run the following command in your terminal:
python -c "import RNAr; print(RNAr.__version__)"
Documentation of the various functions and classes can be found on github.io.
description | report | contents
The library is designed to manipulate and study RNA sequences. It provides functionalities for:
Each color represents a different module, and opacity variations indicate different submodules.
The classes are organized in modules and submodules as follows:
.
├── Families
│   ├── __init__.py
│   ├── clan.py
│   ├── family.py
│   ├── species.py
│   └── tree.py
├── IO
│   ├── RNA_IO.py
│   ├── __init__.py
│   ├── parsers
│   │   ├── PDB_Parser.py
│   │   ├── RNA_Parser.py
│   │   └── __init__.py
│   └── visitor_writers
│       ├── __init__.py
│       ├── pdb_visitor.py
│       ├── visitor.py
│       └── xml_visitor.py
├── Processing
│   ├── ArrayBuilder.py
│   ├── Builder.py
│   ├── Director.py
│   ├── ObjectBuilder.py
│   └── __init__.py
├── Structure
│   ├── Atom.py
│   ├── Chain.py
│   ├── Model.py
│   ├── RNA_Molecule.py
│   ├── Residue.py
│   ├── Structure.py
│   └── __init__.py
├── Transformations
│   ├── Pipeline.py
│   ├── __init__.py
│   └── transformers
│       ├── BaseTransformer.py
│       ├── Distogram.py
│       ├── Kmers.py
│       ├── Normalize.py
│       ├── OneHotEncoding.py
│       ├── SecondaryStructure.py
│       ├── TertiaryStructure.py
│       ├── Transformer.py
│       └── __init__.py
├── utils.py
└── viz.py
The Structure module is responsible for representing the RNA molecule and its components. It contains the hierarchical structure of the RNA molecule, including models, chains, residues, and atoms. The classes in this module are designed to work together to provide a comprehensive representation of the RNA structure.
Structure class:
An interface for all the classes in the Structure module. It enforces the implementation of the accept method, which is part of the Visitor design pattern discussed later.
Common Implementation:
- Attributes are accessed through getters and setters, ensuring data integrity and type validation.
- The __repr__ method provides a clear textual representation of objects.
- Children are stored in dictionaries, enabling efficient access and manipulation.
- Each object keeps a reference to its parent, ensuring bidirectional navigation of the structure.
- Methods are provided to add, remove, and get children while maintaining structural consistency.
- Objects can be initialized with existing children, allowing flexible structure creation.
Atom class:
Residue:
Chain:
Model:
RNA_Molecule:
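The common implementation shared by the Structure classes (validated attributes, children kept in dictionaries, a reference to the parent, and add/remove/get methods) can be sketched as follows. This is a minimal illustration under assumptions, not the library's code: ToyNode, ToyResidue and ToyChain are hypothetical names, and the real classes carry many more attributes.

```python
class ToyNode:
    """Hypothetical hierarchy level; stands in for Atom, Residue, Chain, ..."""

    def __init__(self, node_id, children=None):
        self.id = node_id              # validated by the setter below
        self._parent = None
        self._children = {}            # children stored by id for direct access
        for child in children or []:
            self.add_child(child)      # creation with existing children

    @property
    def id(self):
        return self._id

    @id.setter
    def id(self, value):
        if not isinstance(value, (int, str)):
            raise TypeError("id must be an int or a str")
        self._id = value

    @property
    def parent(self):
        return self._parent

    def add_child(self, child):
        child._parent = self           # keeps navigation bidirectional
        self._children[child.id] = child

    def remove_child(self, child_id):
        child = self._children.pop(child_id)
        child._parent = None
        return child

    def get_children(self):
        return dict(self._children)    # copy, so callers go through add/remove

    def __repr__(self):
        return f"{type(self).__name__}(id={self.id!r}, children={len(self._children)})"


class ToyResidue(ToyNode):
    pass


class ToyChain(ToyNode):
    pass


chain = ToyChain("A", children=[ToyResidue(1), ToyResidue(2)])
print(chain)  # ToyChain(id='A', children=2)
```

Keying children by id makes lookups such as "residue 1 of chain A" a single dictionary access, while the parent reference lets any child walk back up the hierarchy.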
The Families module represents the evolutionary and comparative relationships between RNA sequences. It is composed of several submodules, each containing a class of the same name.
Family:
- Uses the rfam api to access information from the Rfam database while creating an object found in the database.
Clan:
PhyloTree:
- Defined in the tree submodule, on top of a helper TreeNode class. It can be created from a Newick string, and api functionality is available to access Rfam's tree of a particular family.
- Several trees can belong to the same family depending on the source/algorithm used, so Family stores its tree attribute as a dict whose values are PhyloTree objects.
TreeNode:
- Defined in the tree submodule, representing a node in the phylogenetic tree. It supports node traversal algorithms and provides the necessary dunder methods for string representation, indexability and equality checks.
Species:
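Returning to the tree submodule: building a tree from a Newick string and traversing its nodes could look roughly like the sketch below. ToyTreeNode and parse_newick are hypothetical names chosen for illustration; the parser handles only simple Newick strings (no quoted labels, comments or whitespace) and is not the library's PhyloTree implementation.

```python
class ToyTreeNode:
    """Hypothetical node type for illustration; not the library's TreeNode."""

    def __init__(self, name="", length=None, children=None):
        self.name = name
        self.length = length
        self.children = children or []

    def __iter__(self):
        # Pre-order traversal: the node itself, then each subtree in turn.
        yield self
        for child in self.children:
            yield from child

    def __getitem__(self, index):
        return self.children[index]

    def __eq__(self, other):
        return (isinstance(other, ToyTreeNode)
                and (self.name, self.length, self.children)
                == (other.name, other.length, other.children))

    def __repr__(self):
        return f"ToyTreeNode({self.name!r})"


def parse_newick(text):
    """Parse a simple, well-formed Newick string into ToyTreeNode objects."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                                  # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1                              # consume ","
                children.append(parse_node())
            pos += 1                                  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1                                  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return ToyTreeNode(name, length, children)

    return parse_node()


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4)root;")
print([node.name for node in tree])  # ['root', 'AB', 'A', 'B', 'C']
```

Indexing follows the children (tree[0] is the AB subtree), and equality compares names, branch lengths and subtrees recursively.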
The IO module is responsible for reading and writing RNA structures from and to various file formats.
RNA_IO class:
- Supports the PDB format for reading and the PDB, XML or PDBML formats for writing.
- The read method reads a file of a specific format and returns either a numpy array or an RNA molecule object, depending on the array parameter.
- The write method writes an RNA molecule object to a file of a specific format.
RNA_Parser class:
- Defines the read method.
PDB_Parser class:
- read(path_to_file, coarse_grained=False, atom_name="C1'", array=True)
- _extract_molecule_info()
- _extract_atom_info()
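To make the parsing step concrete, here is a hedged sketch of reading atom records from PDB text. The column positions follow the fixed-width PDB ATOM record layout; everything else is an assumption for illustration: parse_atom_line, read_coordinates and format_atom_line are hypothetical helpers, and the reading of coarse_grained as "keep only atoms named atom_name" is a guess at the semantics of the read signature above, not confirmed by the source.

```python
def parse_atom_line(line):
    """Split one fixed-width PDB ATOM record into its main fields."""
    return {
        "atom_name": line[12:16].strip(),       # columns 13-16
        "residue_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],                   # column 22
        "residue_number": int(line[22:26]),     # columns 23-26
        "coords": (float(line[30:38]),          # x, columns 31-38
                   float(line[38:46]),          # y, columns 39-46
                   float(line[46:54])),         # z, columns 47-54
    }


def read_coordinates(lines, coarse_grained=False, atom_name="C1'"):
    """Collect coordinates; assumed meaning: coarse_grained keeps one atom name."""
    coords = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        atom = parse_atom_line(line)
        if coarse_grained and atom["atom_name"] != atom_name:
            continue
        coords.append(atom["coords"])
    return coords


def format_atom_line(serial, name, res_name, chain_id, res_seq, x, y, z):
    """Hypothetical helper that lays out a record in the same columns."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


lines = [
    format_atom_line(1, "P", "G", "A", 1, 50.193, 51.190, 50.534),
    format_atom_line(2, "C1'", "G", "A", 1, 1.0, 2.0, 3.0),
]
print(read_coordinates(lines, coarse_grained=True))  # [(1.0, 2.0, 3.0)]
```

Slicing by column rather than splitting on whitespace matters for PDB files, because adjacent fields can touch when values are wide.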
The visitor_writers submodule is home to the Visitor design pattern. It is part of the IO subpackage because of its involvement in writing and exporting files from the RNA_Molecule object.
The Visitor pattern is used to export an RNA molecule object into different file formats:
- PDB
- PDBML/XML (more about the format in the lab3 writing section)
Visitor interface:
- Declares the visit_Atom(), visit_Residue(), visit_Chain(), visit_Model(), and visit_RNA_Molecule() methods to format data.
- The export(rna: RNA_Molecule) method calls the visit methods to write the file.
Structure interface:
- Tied into the Structure module, whose classes implement it.
- accept(visitor) is implemented by Atom, Residue, Chain, Model, and RNA_Molecule.
- Each accept call invokes the corresponding visit_*() method in the visitor.
PDBExportVisitor class:
- Implements the Visitor interface.
- Exports to the PDB format.
- export(rna) writes the PDB file using visit methods.
XMLExportVisitor class:
- Implements the Visitor interface.
- Exports to the PDBML/XML format.
- export(rna) writes the XML file using visit methods.
This design separates export functionality from the RNA_Molecule class, ensuring modularity and flexibility.
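The interplay between accept and the visit methods can be illustrated with a reduced, two-level sketch. All classes here (ToyVisitor, ToyAtom, ToyResidue, ToyTextVisitor, ToyXMLVisitor) are hypothetical stand-ins, the output formats are simplified, and the results are returned as strings rather than written to a file; only the method names visit_Atom, visit_Residue, accept and export come from the description above.

```python
from abc import ABC, abstractmethod


class ToyVisitor(ABC):
    """Hypothetical two-level visitor; the text describes five visit methods."""

    @abstractmethod
    def visit_Atom(self, atom): ...

    @abstractmethod
    def visit_Residue(self, residue): ...

    def export(self, residue):
        return residue.accept(self)


class ToyAtom:
    def __init__(self, name, x, y, z):
        self.name, self.coords = name, (x, y, z)

    def accept(self, visitor):
        return visitor.visit_Atom(self)       # double dispatch on element type


class ToyResidue:
    def __init__(self, name, atoms):
        self.name, self.atoms = name, atoms

    def accept(self, visitor):
        return visitor.visit_Residue(self)


class ToyTextVisitor(ToyVisitor):
    def visit_Atom(self, atom):
        x, y, z = atom.coords
        return f"{atom.name} {x:.3f} {y:.3f} {z:.3f}"

    def visit_Residue(self, residue):
        rows = [atom.accept(self) for atom in residue.atoms]
        return "\n".join([f"RESIDUE {residue.name}"] + rows)


class ToyXMLVisitor(ToyVisitor):
    def visit_Atom(self, atom):
        x, y, z = atom.coords
        return f'<atom name="{atom.name}" x="{x}" y="{y}" z="{z}"/>'

    def visit_Residue(self, residue):
        inner = "".join(atom.accept(self) for atom in residue.atoms)
        return f'<residue name="{residue.name}">{inner}</residue>'


residue = ToyResidue("G", [ToyAtom("P", 1.0, 2.0, 3.0)])
print(ToyTextVisitor().export(residue))
print(ToyXMLVisitor().export(residue))
```

Adding a third format in this sketch means writing one more visitor class; ToyAtom and ToyResidue stay unchanged, which is the separation the design aims for.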
Advantages of the Visitor Pattern:
- Rather than having the RNA_Molecule class handle both data representation and output formatting, the visitor encapsulates format-specific logic, keeping RNA_Molecule focused on molecular structure representation.
- The RNA_IO class manages the writing process directly.
Disadvantages of the Visitor Pattern:
- The earlier approach also kept writing out of RNA_Molecule, similar to the visitor pattern, but in a more direct manner. By flattening the molecule object into a list of atoms and formatting it for output, it kept the molecule representation decoupled from the writing process. The visitor pattern, in contrast, integrates traversal and formatting, making the design more structured but also more intricate.
The Processing module is responsible for building RNA molecules and arrays from PDB files. It uses the Builder design pattern to create complex objects step by step.
The Builder pattern is used to construct different representations of an RNA molecule:
1- Object-Oriented Representation (ObjectBuilder)
2- NumPy Array Representation (ArrayBuilder)
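As a rough sketch of the idea, one director can feed the same atom records to two builders that produce different representations. The classes below are hypothetical simplifications: nested dictionaries stand in for the object hierarchy, nested lists stand in for the NumPy arrays, and the add_atom_info arguments are an assumed unpacking of the atom information.

```python
from abc import ABC, abstractmethod


class ToyBuilder(ABC):
    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def add_residue(self, model_id, residue_id, name): ...

    @abstractmethod
    def add_atom(self, model_id, residue_id, coords): ...

    @property
    @abstractmethod
    def molecule(self): ...


class ToyObjectBuilder(ToyBuilder):
    def __init__(self):
        self.reset()

    def reset(self):
        self._models = {}     # model_id -> residue_id -> residue data

    def add_residue(self, model_id, residue_id, name):
        residues = self._models.setdefault(model_id, {})
        residues.setdefault(residue_id, {"name": name, "atoms": []})

    def add_atom(self, model_id, residue_id, coords):
        self._models[model_id][residue_id]["atoms"].append(coords)

    @property
    def molecule(self):
        return self._models


class ToyArrayBuilder(ToyBuilder):
    def __init__(self):
        self.reset()

    def reset(self):
        self._sequence = {}   # model_id -> list of residue names
        self._coords = {}     # model_id -> per-residue lists of coordinates
        self._index = {}      # (model_id, residue_id) -> position in the lists

    def add_residue(self, model_id, residue_id, name):
        if (model_id, residue_id) in self._index:
            return
        names = self._sequence.setdefault(model_id, [])
        self._index[(model_id, residue_id)] = len(names)
        names.append(name)
        self._coords.setdefault(model_id, []).append([])

    def add_atom(self, model_id, residue_id, coords):
        position = self._index[(model_id, residue_id)]
        self._coords[model_id][position].append(coords)

    @property
    def molecule(self):
        models = sorted(self._sequence)
        return ([self._sequence[m] for m in models],
                [self._coords[m] for m in models])


class ToyDirector:
    def __init__(self):
        self.builder = None   # assigned before use

    def add_atom_info(self, model_id, residue_id, residue_name, coords):
        # The director fixes the order of steps; builders decide what to store.
        self.builder.add_residue(model_id, residue_id, residue_name)
        self.builder.add_atom(model_id, residue_id, coords)


records = [(0, 1, "G", (1.0, 2.0, 3.0)),
           (0, 1, "G", (1.5, 2.5, 3.5)),
           (0, 2, "C", (4.0, 5.0, 6.0))]


def build(builder):
    director = ToyDirector()
    director.builder = builder
    for record in records:
        director.add_atom_info(*record)
    return builder.molecule


print(build(ToyObjectBuilder()))
print(build(ToyArrayBuilder()))
```

The parsing loop never needs to know which representation is being produced; swapping the builder is the only change.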
Director class:
- The Director class serves as a director for the Builder classes.
- __builder: The builder object that will be used to build the object. Initialized to None.
- add_atom_info(model_id, *atom_info) sets the construction steps for the builder classes to follow.
Builder class:
- molecule (property) → Returns the final structure.
- reset() → Resets the builder.
- add_model(), add_chain(), add_residue(), add_atom() → Methods for constructing the hierarchy.
ObjectBuilder class:
- __molecule → Stores the RNA molecule being built.
- __model_id, __chain_id, __residue_id → Track the current model, chain, and residue IDs.
- add_molecule_info(entry_id, experiment, species) → Stores general metadata (entry ID, experiment type, and species).
ArrayBuilder class:
- __array → Stores atom coordinates for each residue.
- __sequence → Stores residue names for sequence representation.
- __model_id, __residue_id → Track the current model and residue IDs.
- __prev_atom → Tracks the last atom name and occupancy to handle alternate locations.
- molecule (property) → Converts stored data into two numpy arrays:
  - (models, max_residues) array storing residue names.
  - (models, max_residues, max_atoms, 3) array storing atom coordinates.
Disadvantages of the Builder Pattern:
Advantages of the Builder Pattern:
The Transformations module is responsible for applying various transformations to RNA sequences and coordinates. It uses the Chain of Responsibility design pattern to handle a series of transformations in a flexible and extensible manner.
The Chain of Responsibility pattern allows multiple handlers to process a request without the sender needing to know which handler will ultimately handle it. In this module, each transformation is represented as a handler in the chain. Each transformer transforms the data and passes it to the next transformer in the chain.
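A compact sketch of such a chain is given below. It is one possible arrangement, not the library's code: the Toy* classes are hypothetical, the forwarding to the next handler is placed in the base class here for brevity, and ToyUppercase and the simplified ToyKmers merely stand in for the real transformers.

```python
from abc import ABC, abstractmethod


class ToyTransformer(ABC):
    def __init__(self):
        self._next = None

    def set_next(self, transformer):
        self._next = transformer
        return transformer                 # lets set_next calls be chained

    def transform(self, X, Y):
        X, Y = self.apply(X, Y)            # this handler's own work
        if self._next is None:
            return X, Y
        return self._next.transform(X, Y)  # hand over to the next handler

    @abstractmethod
    def apply(self, X, Y): ...


class ToyUppercase(ToyTransformer):
    def apply(self, X, Y):
        return [seq.upper() for seq in X], Y


class ToyKmers(ToyTransformer):
    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def apply(self, X, Y):
        k = self.k
        kmers = [[seq[i:i + k] for i in range(len(seq) - k + 1)] for seq in X]
        return kmers, Y


class ToyPipeline:
    def __init__(self, transformers):
        self.first = transformers[0]
        # Link each transformer to the one that follows it.
        for current, following in zip(transformers, transformers[1:]):
            current.set_next(following)

    def transform(self, X, Y=None):
        return self.first.transform(X, Y)


pipeline = ToyPipeline([ToyUppercase(), ToyKmers(k=2)])
print(pipeline.transform(["gauc"]))  # ([['GA', 'AU', 'UC']], None)
```

The caller only talks to the pipeline; each transformer knows nothing beyond its own step and its successor.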
Pipeline class:
- The Pipeline class is the main entry point for applying transformations to RNA sequences and coordinates.
- [...] Normalize transformer (if present).
- The transform method starts the transformation process from the first transformer in the chain, passing the input data (X, Y) through each transformer in sequence.
- The __repr__ method provides a string representation of the pipeline, including the names and parameters of each transformer in the chain.
Transformer class:
- The Transformer class is an interface for all transformers.
- Declares the set_next method to set the next transformer in the chain and the transform method to perform the transformation.
BaseTransformer class:
- The BaseTransformer class is an abstract base class for all transformers.
- Implements the set_next method to set the next transformer in the chain.
- Declares the transform method as an abstract method, which must be implemented by concrete transformer classes.
Concrete Transformers:
Order Constraints:
- [...] Normalize transformer (if present).
- The Kmers transformer must be before the OneHotEncoding transformer.
- The SecondaryStructure transformer must be before the TertiaryMotifs transformer.
- Kmers cannot be before the SecondaryStructure transformer.
For more details on each transformer, please refer to lab4/README.md.
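The pairwise constraints above lend themselves to a simple check on the order of transformer names. The sketch below is an assumption about how such validation could be done, not the library's mechanism: check_order and MUST_PRECEDE are hypothetical, each name is assumed to appear at most once, and the Normalize constraint is left out.

```python
# Each pair (earlier, later) means: when both are present,
# `earlier` has to appear before `later` in the pipeline.
MUST_PRECEDE = [
    ("Kmers", "OneHotEncoding"),
    ("SecondaryStructure", "TertiaryMotifs"),
    ("SecondaryStructure", "Kmers"),   # Kmers cannot be before SecondaryStructure
]


def check_order(names):
    """Return a list of violated constraints for a sequence of names."""
    position = {name: index for index, name in enumerate(names)}
    violations = []
    for earlier, later in MUST_PRECEDE:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                violations.append(f"{earlier} must come before {later}")
    return violations


print(check_order(["SecondaryStructure", "Kmers", "OneHotEncoding"]))  # []
print(check_order(["OneHotEncoding", "Kmers"]))
# ['Kmers must come before OneHotEncoding']
```

Constraints involving a transformer that is absent from the list are skipped, so partial pipelines pass the check.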
Disadvantages of the Chain of Responsibility Pattern:
Advantages of the Chain of Responsibility Pattern:
For a better understanding of the data, its analysis and how it can be used, we have added a viz module at the root of the library. It contains functions to plot different representations of RNA, from the 3D object representation to raw and processed array representations.
We have used more than 5 plotting libraries, mainly matplotlib, plotly, networkx, graphviz and pyvis, and generated different spatial, interactive and network plots.
The latter image opens a new horizon on where this library is heading. Machine learning and deep learning are the current trends in bioinformatics, particularly in structure prediction over the last couple of years, and this library is a step in that direction. RNA structure prediction is still an open challenge today, and many efforts are being made to present different types of features or labels to such models. The transformations in this library provide a good starting point for training models on different kinds of features and for assessing them by how closely the model approaches one of the RNA structural representations.
This project was developed as part of the course OOP2 at Université Paris-Saclay, M1 GENIOMHE 2024/25.
This project is licensed under the MIT License. See the LICENSE file for details. For errors, suggestions, or contributions, please open an issue.