According to this lab description, the first part is about RNA sequences and their spatial conformations. From the details and examples provided, we assumed that the RNA sequence that we want to model has a structure and the purpose is the manipulation of this structure.
- The `root` relationship between the `PhyloTree` class and `TreeNode`, which will be represented as an attribute during implementation.

> [!NOTE]
> This class is a helper class for `PhyloTree`. Although it is not explicitly mentioned in the description, we decided to include it in the model to represent the nodes in the tree. It was a crucial addition for describing the class as a tree data structure and for allowing the implementation of the tree traversal methods.
The `TreeNode` class represents the nodes of a phylogenetic tree within `Phylotree`, which is why `Phylotree` has an attribute of type `TreeNode`. It functions as the fundamental unit of the nodes list attribute, storing information in a graph-based data structure through recursive links to other nodes (its parent and children) held in its attributes. Each node holds an RNA type as its data attribute and maintains a list of child nodes; this is shown by the self-relationship in the class diagram, where a node can have multiple child nodes. A key attribute is `branch_length`, which represents the distance to the parent node.
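The recursive parent/children link described above can be sketched as follows. This is only a minimal illustration, not the project's actual `TreeNode`: the method name `preorder` and the choice of a plain list for the children are assumptions made for the example.

```python
class TreeNode:
    """Illustrative node: a name, a branch length, a parent and a list of children."""

    def __init__(self, name=None, branch_length=0.0):
        self.name = name
        self.branch_length = branch_length  # distance to the parent node
        self.parent = None
        self.children = []  # recursive link: every child is itself a TreeNode

    def add_child(self, child, branch_length):
        """Attach `child` below this node and record its distance to this node."""
        child.parent = self
        child.branch_length = branch_length
        self.children.append(child)
        return child

    def preorder(self, level=0):
        """Return (depth, name, branch_length) tuples, each node before its children."""
        rows = [(level, self.name, self.branch_length)]
        for child in self.children:
            rows.extend(child.preorder(level + 1))
        return rows


root = TreeNode("root")
n1 = root.add_child(TreeNode("n1"), 0.340)
root.add_child(TreeNode("a"), 0.05592)
n1.add_child(TreeNode("c"), 0.11049)
n1.add_child(TreeNode("d"), 0.31409)

visited = root.preorder()
for depth, name, length in visited:
    print("  " * depth + f"{name}:{length}")
```

Because every node only knows its parent and its children, the whole tree is reachable from the root alone, which is why the tree class needs nothing more than a `root` attribute.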
The description states that species are the organisms that contain RNA sequences belonging to a particular RNA family. A family can therefore be distributed across multiple species, and a species can contain multiple families. We did not model species as a separate class but as an attribute of `RNA_Molecule`, so the distribution of species for a particular family can be obtained by looking at the `species` attribute of the `RNA_Molecule` objects that are part of that family. In the same way, the species description can also be used in the phylogenetic tree.
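As a rough illustration of this design choice, the species distribution of a family can be derived from its members alone. The stub classes, the `species_distribution` method and the placeholder values below are hypothetical and exist only for this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MoleculeStub:
    """Hypothetical stand-in for RNA_Molecule: only the fields needed here."""
    entry_id: str
    species: str


@dataclass
class FamilyStub:
    """Hypothetical stand-in for Family: a name and a list of member molecules."""
    name: str
    members: list = field(default_factory=list)

    def species_distribution(self):
        """Count how many member molecules come from each species."""
        return Counter(molecule.species for molecule in self.members)


family = FamilyStub("example family")
family.members.append(MoleculeStub("entry-1", "species A"))
family.members.append(MoleculeStub("entry-2", "species A"))
family.members.append(MoleculeStub("entry-3", "species B"))

distribution = family.species_distribution()
print(distribution.most_common())
```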
This object diagram portrays the example found in here. It shows the following objects:
object | class | description |
---|---|---|
a1 | Atom | An oxygen atom labeled “OP3” with specific coordinates (3rd oxygen of a phosphate molecule from a residue r1, found through the link) |
a2 | Atom | A phosphorus atom labeled “P” with specific coordinates. |
a3 | Atom | An oxygen atom labeled “OP1” with specific coordinates. |
a4 | Atom | An oxygen atom labeled “OP2” with specific coordinates. |
a25 | Atom | A phosphorus atom labeled “P” positioned further in the structure. |
a26 | Atom | An oxygen atom labeled “OP1” linked to atom a25. |
a48 | Atom | A phosphorus atom labeled “P” in another part of the structure. |
a49 | Atom | An oxygen atom labeled “OP1” linked to atom a48. |
r1 | Residue | A guanine residue positioned first in the sequence (has linked atoms a1, a2, a3…) |
r2 | Residue | A guanine residue positioned second in the sequence. |
r3 | Residue | A cytosine residue positioned third in the sequence. |
ch1 | Chain | A chain labeled “X” that links residues together. |
model | Model | The structural model representation, labeled with ID 0 (an X-ray structure normally consists of a single model, numbered 0) |
rna1 | RNA_Molecule | An RNA molecule identified by entry “7EAF,” studied via X-ray diffraction. |
rna2 | RNA_Molecule | Another RNA molecule identified by entry “5KF6,” analyzed similarly. |
sam_fam | Family | A SAM riboswitch family named “SAM”, to which the 7EAF molecule belongs (hence the relationship) |
sam1_4_fam | Family | SAM riboswitch family named “SAM-I/IV”, belonging to SAM clan |
sam4_fam | Family | a riboswitch family named “SAM-IV”, belonging to SAM clan |
sam_clan | Clan | clan of riboswitch families named “SAM”, linked to 3 families that belong to it |
tree1 | Phylotree | phylogenetic tree representation of related sequences for the family SAM (thus the link) |
root | TreeNode | The root node of the phylogenetic tree and an attribute of tree1 (hence the link with the PhyloTree object); it is connected to all other nodes through parent-child links |
n1 | TreeNode | An internal tree node, conceptually a common ancestor, at a specific branch length from the root (since it is linked directly to the root) |
n2 | TreeNode | tree node representing a RNA molecule of a particular species (leaf) |
n3 | TreeNode | Another tree node representing a RNA molecule of a particular species |
n4 | TreeNode | Another leaf representing an RNA sequence labeled with accession “92. CP018551.1” |
List of modules:

- `Atom.py`
- `Residue.py`
- `Chain.py`
- `Model.py`
- `RNA_Molecule.py`
- `Family.py`
- `Clan.py`
- `PhyloTree.py`
- `utils.py`: this includes helper functions for database calls, data extraction, and file handling…

- `AtomName` and `Element` enumerations for validation.
- The `atom_name` and `element` attributes are passed as string values and are converted to the corresponding enumeration values in the setter methods.
- `@attribute.setter` is used to validate the input values for the attributes. For the coordinates it validates that the values are floats, and for `atom_name` and `element` it validates that the values are part of the enumeration. For example:
```python
@element.setter
def element(self, element):
    if not isinstance(element, str):
        raise TypeError(f"element must be a string, got {type(element)}")
    # Check that the string names a valid Element member
    if element not in Element.__members__:
        raise ValueError(f"{element} is not a valid Element value")
    self._element = Element.__members__[element]
```
It is used to validate that the input value is a string and that the string is part of the Element enumeration.
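The float check on the coordinates can follow the same property/setter pattern. The sketch below is an assumption about how it could look, not the project's code: the `Coordinates` class and the `_float_property` helper are hypothetical names, and the check is deliberately strict (an `int` is rejected), matching the statement that the values must be floats.

```python
def _float_property(name):
    """Build a property that stores its value in `_<name>` and only accepts floats."""
    private = "_" + name

    def getter(self):
        return getattr(self, private)

    def setter(self, value):
        if not isinstance(value, float):
            raise TypeError(f"{name} must be a float, got {type(value).__name__}")
        setattr(self, private, value)

    return property(getter, setter)


class Coordinates:
    """Hypothetical holder for the three validated coordinates of an atom."""

    x = _float_property("x")
    y = _float_property("y")
    z = _float_property("z")

    def __init__(self, x, y, z):
        # Assigning through the properties runs the validation at construction time
        self.x, self.y, self.z = x, y, z


point = Coordinates(1.0, 2.0, 3.0)

try:
    point.x = "1.0"  # a string is rejected by the setter
    rejected = False
except TypeError:
    rejected = True
```

Routing the constructor through the same setters means an object can never be created in a state that a later assignment would have rejected.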
- The `@property` decorator is used to define getter and setter methods for the attributes, allowing for encapsulation and validation while providing a simple interface for accessing and modifying the attributes.
- `__repr__` returns the string representation of the object as `atom_name x y z element`.

In the `Residue` class:

- Attributes `type` and `position`, with an enum implementation, `NBase`, for validation of the residue type.
- The `type` attribute is passed as a string value and is converted to the corresponding enumeration value in the setter method. The constructor also includes `atoms=None` as a default value for the atoms attribute, a list that stores the atoms that are part of the residue; it is initialized as an empty list if no atoms are provided.
- `@property` and `@attribute.setter` are used for the attributes, with the setter validating the input values.
- `add_atom()`, which validates that the input is an instance of `Atom` and that it does not already exist in the residue, and `remove_atom()`, which removes atoms from the residue, are included, along with `get_atoms()` to get the list of atoms in the residue.
- `__repr__` returns the string representation of the object as `type position`.

In the `Chain` class:

- Attribute `id`, with `residues=None` as a default value for the residues attribute, a list that stores the residues that are part of the chain; it is initialized as an empty list if no residues are provided.
- `add_residue()`, with validations, and `remove_residue()`, which removes residues from the chain, are also included, along with `get_residues()` to get the list of residues in the chain.
- `__repr__` returns the string representation of the object as `id`.

In the `Model` class:

- Attribute `id`, with `chains=None` as a default value for the chains attribute, a list that stores the chains that are part of the model; it is initialized as an empty list if no chains are provided.
- `add_chain()`, with validations, and `remove_chain()`, which removes chains from the model, are also included, along with `get_chains()` to get the list of chains in the model.
- `__repr__` returns the string representation of the object as `id`.

In the `RNA_Molecule` class:

- Attributes `entry_id`, `experiment` and `species`, with `models=None` as a default value for the models attribute, a list that stores the models that are part of the RNA_Molecule; it is initialized as an empty list if no models are provided.
- `add_model()`, with validations, and `remove_model()`, which removes models from the RNA_Molecule, are also included, along with `get_models()` to get the list of models in the RNA_Molecule.
- `__repr__` returns the string representation of the object as `entry_id experiment species`.
- `@property` and `@attribute.setter` are used for the attributes, with the setter validating the input values.
- The `print_all()` method prints all the models, chains, residues, and atoms that are part of the RNA_Molecule in a layout similar to the PDB file format, and saves the output to a file. The format is as follows: `ATOM <atom_number> <atom_name> <residue_type> <chain_id> <residue_position> <x> <y> <z> <element>`
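The line format above can be produced by walking the chain, residue and atom levels in order while keeping a running atom number. This is a sketch under simplifying assumptions: `format_atom_lines` is a hypothetical helper that takes plain dictionaries and tuples instead of the project's classes, and the fields are separated by single spaces rather than the fixed-width columns of a real PDB file.

```python
def format_atom_lines(chains):
    """Format nested structure data as space-separated ATOM lines.

    `chains` maps a chain id to a list of residues; each residue is a tuple
    (residue_type, position, atoms) and each atom is (name, x, y, z, element).
    """
    lines = []
    atom_number = 1  # running counter across all residues and chains
    for chain_id, residues in chains.items():
        for residue_type, position, atoms in residues:
            for name, x, y, z, element in atoms:
                lines.append(
                    f"ATOM {atom_number} {name} {residue_type} {chain_id} "
                    f"{position} {x} {y} {z} {element}"
                )
                atom_number += 1
    return lines


structure = {
    "A": [
        ("A", 1, [("OP1", 0.1, 0.2, 0.3, "O")]),
        ("U", 2, [("P", 0.4, -0.5, 0.6, "P")]),
        ("C", 3, [("C4", 0.21, 0.76, -0.93, "C"), ("N1", 0.25, 0.54, 0.23, "N")]),
    ]
}

lines = format_atom_lines(structure)
print("\n".join(lines))
```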
`Family` Class Overview:

The `Family` class represents a family of RNA molecules, particularly those in the Rfam database. It ensures that each family is uniquely identified, maintains a list of RNA molecules as its members, and optionally includes a phylogenetic tree representation. The class prevents duplicate instances and provides structured methods for adding, removing, and retrieving RNA families.

Key Features

- […] `utils.py`, using the Rfam API)
- […] (`Phylotree`) to represent evolutionary relationships, which can be retrieved from various data types: Newick, dict and JSON.

Attributes

Class Attributes

- `entries`: List of all created `Family` objects to track and avoid duplicates.

Instance Attributes

- `id` (`str`): Unique identifier for the family.
- `name` (`str`): Name of the RNA family.
- `type` (`str`, optional): Type of RNA (e.g., rRNA, tRNA, miRNA).
- `members` (`list`): List of `RNA_Molecule` objects representing the […]

helper methods (private):

dunders:

- `__init__(id, name, type=None, members=[], from_database=False)`: `from_database` is a flag indicating whether the object is created using the generator function from the database
- `__del__(self)`
- `__eq__(self, other)`
- `__len__(self)`
- `__getitem__(self, key)`
- `__setattr__(self, name, value)`
- `__str__(self)`
- `__repr__(self)`
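The `entries` registry and dunders such as `__eq__` and `__setattr__` could work together roughly as in the sketch below. It is a reduced, hypothetical class (`RegisteredFamily`), not the project's `Family`; in particular, raising `ValueError` on a duplicate and making `id` read-only after construction are assumptions made for illustration.

```python
class RegisteredFamily:
    """Hypothetical reduced family: unique id, class-level registry, read-only id."""

    entries = []  # every instance created so far

    def __init__(self, id, name):
        if any(existing.id == id for existing in RegisteredFamily.entries):
            raise ValueError(f"a family with id {id!r} already exists")
        self.id = id
        self.name = name
        RegisteredFamily.entries.append(self)

    def __setattr__(self, attr, value):
        # The first assignment of `id` is allowed; any later one is refused
        if attr == "id" and "id" in self.__dict__:
            raise AttributeError("id cannot be modified once set")
        if attr in ("id", "name") and not isinstance(value, str):
            raise TypeError(f"{attr} must be a string, got {type(value).__name__}")
        super().__setattr__(attr, value)

    def __eq__(self, other):
        return isinstance(other, RegisteredFamily) and self.id == other.id

    def __hash__(self):
        return hash(self.id)

    def __repr__(self):
        return f"RegisteredFamily(id={self.id!r}, name={self.name!r})"


sam = RegisteredFamily("RF00162", "SAM")

try:
    RegisteredFamily("RF00162", "duplicate")
except ValueError as error:
    print(error)

try:
    sam.id = "RF99999"
except AttributeError as error:
    print(error)
```

Keeping the registry on the class rather than in a module-level variable means the duplicate check travels with the class wherever it is imported.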
`Clan` Class Overview

The `Clan` class represents a group of RNA families that share common ancestry or biological significance. It ensures unique identification of clans, prevents duplicates, and provides structured methods for managing RNA families (`Family` objects).

Key Features

- […] `Family` objects).

Class Attributes

- `entries`: List of all created `Clan` objects to track and avoid duplicates.

Instance Attributes

- `id` (`str`): Unique identifier for the clan (immutable).
- `name` (`str`, optional): Name of the clan.
- `members` (`list`): List of `Family` objects that belong to the clan.

Class Methods

- `get_instances()`: Returns a list of all created `Clan` objects.
- `get_clan(id)`: Retrieves an existing clan by its ID.

Instance Methods

- `add_family(family)`: Adds a `Family` object to the clan.
- `remove_family(family)`: Removes a `Family` object from the clan.

Private Methods

- `__validate_member(member)`: Ensures that only `Family` objects are added to the clan.

Magic Methods

- `__str__()`: Returns a formatted string representation of the clan and its families.
- `__repr__()`: Returns a structured representation of the clan instance.
- `__eq__(other)`: Checks equality based on clan ID.
- `__setattr__(name, value)`: Ensures controlled attribute setting, preventing ID modification and enforcing type validation.

The module `tree.py` defines a `TreeNode` class and a `Phylotree` class for constructing and managing a phylogenetic tree.
In the object diagram, a phylotree is directly linked to one node, the root (portrayed as an attribute), and each node is recursively linked to other nodes through its attributes, as seen in the following implementation:
The `TreeNode` class represents a node in the tree with attributes:

- `name`: stores the RNA type.
- `branch_length`: stores the distance to the parent node.
- `parent`: stores the parent node.
- `children`: stores child nodes as a dictionary.

Methods in `TreeNode`:

- `add_child(child, weight)`: adds a child node with a given branch length.
- `preorder_traversal(level=0)`: performs a preorder traversal and returns a string representation.
- `__repr__` and `__str__`: provide readable representations of the node.
- `__getitem__`: retrieves a child node by name.

The `Phylotree` class represents a phylogenetic tree for RNA sequences, constructed using computational phylogenetics. It consists of:

- a `root` node.

Methods in `Phylotree`:

- `build_tree(tree_dict, parent=None)`: builds a tree from a dictionary.
- `from_dict(tree_dict, parent=None)`: constructs a tree from a dictionary.
- `from_json(json_str)`: constructs a tree from a JSON string or file.
- `from_newick(newick_str)`: parses a Newick-formatted string or file to build the tree.
- `__str__` and `__repr__`: return a string representation of the tree.

The module demonstrates tree construction using different input formats:

Example usage:
```python
tree_dict = {
    "children": [
        {"name": "a", "branch_length": 0.05592},
        {"name": "b", "branch_length": 0.08277},
        {
            "children": [
                {"name": "c", "branch_length": 0.11049},
                {"name": "d", "branch_length": 0.31409}
            ],
            "branch_length": 0.340
        }
    ],
    "branch_length": 0.03601
}
tree = Phylotree.from_dict(tree_dict)
print(tree)
```

or

```python
newick_str = '''
(87.4_AE017263.1/29965-30028_Mesoplasma_florum_L1[265311].1:0.05592,
_URS000080DE91_2151/1-68_Mesoplasma_florum[2151].1:0.08277,
(90_AE017263.1/668937-668875_Mesoplasma_florum_L1[265311].2:0.11049,
81.3_AE017263.1/31976-32038_Mesoplasma_florum_L1[265311].3:0.31409)
0.340:0.03601);
'''
tree = Phylotree.from_newick(newick_str)  # success
```

or from files:

```python
tree = Phylotree.from_newick('lab1/examples/RF00162.nhx')  # newick
tree = Phylotree.from_json('lab1/examples/RF00162.json')  # json
```
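To make the Newick input less opaque, here is a small parser that turns a Newick string into a nested dictionary with `name`, `branch_length` and `children` keys. It is a recursive-descent sketch written for this explanation and is not the parser used by the project; the function name `newick_to_dict` is hypothetical. It assumes labels contain none of the characters `( ) , :`, removes all whitespace (including any inside labels), and keeps the label written after a closing parenthesis as the internal node's `name`.

```python
def newick_to_dict(text):
    """Parse a Newick string into nested dicts (illustrative, simplified grammar)."""
    s = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        node = {}
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume "," and read the next sibling
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ")"
            node["children"] = children
        # Optional label: everything up to the next structural character
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        if pos > start:
            node["name"] = s[start:pos]
        # Optional branch length introduced by ":"
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            node["branch_length"] = float(s[start:pos])
        return node

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing text in Newick string")
    return tree


parsed = newick_to_dict("(a:0.05592,b:0.08277,(c:0.11049,d:0.31409)0.340:0.03601);")
print(parsed)
```

Each opening parenthesis starts a new level of recursion, so the nesting of the resulting dictionaries mirrors the nesting of the parentheses in the input.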
- The `fetch_pdb_file(pdb_entry_id, save_directory=CACHE_DIR)` function, written in `utils.py`, fetches the PDB file from the RCSB PDB database using the PDB entry ID and saves it in the specified directory. It uses the `Biopython` library to fetch the file.
- The `create_RNA_Molecule(pdb_entry_id)` function, written in `utils.py`, creates an RNA_Molecule object from the PDB file obtained for the PDB entry with the first function. It reads the PDB file, first extracts the `experiment` and `species` information needed to create the specific `RNA_molecule` object, and then creates the corresponding objects (models, chains, residues, atoms), adding them in hierarchical order to the RNA_Molecule object. It returns the RNA_Molecule object.

> [!IMPORTANT]
> The `utils.py` file contains helper functions for database calls, data extraction, and file handling. It includes interaction with the Rfam database API to retrieve information about RNA families and phylogenetic trees, automate extraction, and manipulate various file formats such as `newick` trees. In this module, if any files have to be downloaded as intermediary steps, they are saved in a cache directory, which defaults to the hidden directory `.rnalib_cache/` in the working directory, to avoid repeated downloads.

- The `get_rfam(q:str)` function gets the information about an RNA family from the Rfam database using the family ID. It returns the information in JSON format.
- The `get_family_attributes(q:str)` function gets the name, identity, and curation type of an RNA family from the Rfam database using the family ID. It returns the information as a tuple.
- The `get_pdb_ids_from_fam(fam_id)` function gets the PDB IDs associated with an RNA family from the Rfam database using the family ID. It returns the PDB IDs as a list (helpful for automating the extraction of all RNA molecules found on Rfam from the PDB database).
- The `get_tree_newick_from_fam(fam_id)` function gets the Newick tree from the Rfam database given the RNA family ID. It returns the Newick tree as a string.
- The `parse_newick(newick)` function parses a Newick string into a nested dictionary. It is used to parse the Newick tree obtained from the Rfam database, using regex to extract the tree structure.

An example of the implementation of the classes can be found in this notebook.
- in the `Atom` class:

```python
atom = Atom("C1'", 1.0, 2.0, 3.0, "C")
print(atom)  # output: C1' 1.0 2.0 3.0 C
```

- in the `Residue` class:

```python
r = Residue("A", 1)
print(r)  # output: A 1
atom1 = Atom("C1'", 1.0, 2.0, 3.0, "C")
atom2 = Atom("N9", 4.0, 5.0, 6.0, "N")
r.add_atom(atom1)
r.add_atom(atom2)
print(r.get_atoms())  # output: [C1' 1.0 2.0 3.0 C, N9 4.0 5.0 6.0 N]
r.remove_atom(atom1)
print(r.get_atoms())  # output: [N9 4.0 5.0 6.0 N]
atom3 = Atom("C4", 7.0, 8.0, 9.0, "C")
r2 = Residue("G", 2)
r2.add_atom(atom3)
print(r2.get_atoms())  # output: [C4 7.0 8.0 9.0 C]
```

- in the `Chain` class:

```python
c = Chain("A")
print(c)  # output: A
r = Residue("A", 1)
c.add_residue(r)
print(c.get_residues())  # output: [A 1]
c.remove_residue(r)
print(c.get_residues())  # output: []
```

- in the `Model` class:

```python
m = Model(1)
print(m)  # output: 1
c = Chain("A")
m.add_chain(c)
print(m.get_chains())  # output: [A]
m.remove_chain(c)
print(m.get_chains())  # output: []
```

- in the `RNA_Molecule` class:

```python
rna1 = RNA_Molecule("1A9N", "NMR", "Homo sapiens")
print(rna1)  # Output 1A9N NMR Homo sapiens
m1 = Model(1)
m2 = Model(2)
m3 = Model(3)
rna1.add_model(m1)
rna1.add_model(m2)
rna1.add_model(m3)
print(rna1.get_models())  # Output [Model 1, Model 2, Model 3]
rna1.remove_model(m3)
print(rna1.get_models())  # Output [Model 1, Model 2]
rna1.print_all()
# Output 1A9N NMR Homo sapiens
# Model 1
# Model 2
```
`Example.py`:

```python
# Creating an RNA molecule
rna_molecule = RNA_Molecule("1JAT", "X-RAY DIFFRACTION", "Homo sapiens")
# Creating a model
model1 = Model(1)
# Creating a chain
ch1 = Chain('A')
# Adding the model to the RNA molecule
rna_molecule.add_model(model1)
# Adding the chain to the model
model1.add_chain(ch1)
# Creating Residues
res1 = Residue("A", 1)
res2 = Residue("U", 2)
res3 = Residue("C", 3)
# Adding Residues
ch1.add_residue(res1)
ch1.add_residue(res2)
ch1.add_residue(res3)
# Creating Atoms
a1 = Atom("OP1", 0.1, 0.2, 0.3, "O")
a2 = Atom("P", 0.4, -0.5, 0.6, "P")
a3 = Atom("N1", 0.25, 0.54, 0.23, "N")
a4 = Atom("C4", 0.21, 0.76, -0.93, "C")
# Adding Atoms
res1.add_atom(a1)
res2.add_atom(a2)
res3.add_atom(a4)
res3.add_atom(a3)
print(rna_molecule.get_models())
print(rna_molecule.get_models()[0].get_chains())
print(rna_molecule.get_models()[0].get_chains()[0].get_residues())
print(rna_molecule.get_models()[0].get_chains()[0].get_residues()[0].get_atoms())
```

`print_all()` method: `rna_molecule.print_all()` provides as output the following format, saved in a file:

```
1JAT X-RAY DIFFRACTION Homo sapiens
Model 1
ATOM 1 OP1 A A 1 0.1 0.2 0.3 O
ATOM 2 P U A 2 0.4 -0.5 0.6 P
ATOM 3 C4 C A 3 0.21 0.76 -0.93 C
ATOM 4 N1 C A 3 0.25 0.54 0.23 N
```
In this example we proceeded with the RNA molecule `7EAF` and retrieved its RNA family, the `SAM` family, by just providing SAM as the query; all the information is automatically stored within the family object, up to and including its own tree of type PhyloTree (see more in this notebook).

In this example, with only information regarding a family, we retrieved all family attributes from the database and obtained a fully functional Family object with a name, id, type and tree of type PhyloTree. We also automated the extraction of all RNA molecules found for the family from the PDB database, by first retrieving all PDB IDs through the Rfam API, then fetching them through Biopython’s PDB API and creating the RNA_Molecule objects, which we added as members of the family. The result is a fully declared family with all attributes (more in here).