
According to the lab description, the first part concerns RNA sequences and their spatial conformations. From the details and examples provided, we assumed that the RNA sequence to be modeled has a structure, and that the goal is to manipulate this structure.
root relationship in the PhyloTree class with TreeNode, which will be represented as an attribute during implementation.

> [!NOTE]
> This class is a helper class for PhyloTree. Although it is not explicitly mentioned in the description, we included it in the model to represent the nodes of the tree. It was a crucial addition, needed to describe the class as a tree data structure and to allow the implementation of the tree traversal methods.
The TreeNode class represents the nodes of a phylogenetic tree within PhyloTree; hence PhyloTree has an attribute of type TreeNode. It is the fundamental unit of the nodes list attribute, storing information in a graph-based data structure (through recursive links to other nodes, parent and children, in its attributes). Each node holds an RNA type as its data attribute and maintains a list of child nodes; this is shown by the self-relationship in the class diagram, where a node can have multiple child nodes. A key attribute is branch_length, which represents the distance to the parent node.
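The recursive parent/child structure described above can be illustrated with a minimal sketch. The class name `SketchNode` and the helper methods `preorder()` and `distance_to_root()` are assumptions made for this illustration; they are not the project's actual `TreeNode` API.

```python
class SketchNode:
    """Hypothetical stand-in for TreeNode (all names are illustrative)."""

    def __init__(self, name=None, branch_length=None):
        self.name = name
        self.branch_length = branch_length  # distance to the parent node
        self.parent = None                  # recursive link upward
        self.children = []                  # recursive links downward

    def add_child(self, child, branch_length=None):
        """Attach a child and record its distance to this node."""
        child.parent = self
        if branch_length is not None:
            child.branch_length = branch_length
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children

    def preorder(self):
        """Yield this node, then every descendant, parent before children."""
        yield self
        for child in self.children:
            yield from child.preorder()

    def distance_to_root(self):
        """Sum the branch lengths on the path from this node to the root."""
        total, node = 0.0, self
        while node.parent is not None:
            total += node.branch_length or 0.0
            node = node.parent
        return total


# Small tree shaped like the object diagram: root -> n1 -> (n2, n3), root -> n4.
root = SketchNode("root")
n1 = root.add_child(SketchNode("n1"), 0.25)
n2 = n1.add_child(SketchNode("n2"), 0.5)
n3 = n1.add_child(SketchNode("n3"), 0.125)
n4 = root.add_child(SketchNode("n4"), 0.75)
names = [node.name for node in root.preorder()]
```

Because only the root is needed to reach every other node, a tree object can hold just that one reference, which matches the single root relationship in the class diagram.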
Species are described as the organisms that contain RNA sequences belonging to a particular RNA family. A family can therefore be distributed across multiple species, and a species can contain multiple families. We did not model species as a separate class but as an attribute of RNA_Molecule, so the species distribution of a particular family can be obtained from the species attribute of the RNA_Molecule objects that are members of that family. In this way, the species description can also be used in the phylogenetic tree.
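As a rough illustration of this design choice, the species distribution of a family can be derived by counting the `species` attribute of its members. The `MoleculeSketch` class and the `species_distribution` function below are hypothetical names introduced only for this sketch, and the entries are placeholder values rather than real database records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MoleculeSketch:
    """Hypothetical stand-in for RNA_Molecule; only the fields used here."""
    entry_id: str
    species: str


def species_distribution(members):
    """Count how many member molecules come from each species."""
    return Counter(molecule.species for molecule in members)


# Placeholder values, not real database entries.
members = [
    MoleculeSketch("entry_1", "species A"),
    MoleculeSketch("entry_2", "species B"),
    MoleculeSketch("entry_3", "species A"),
]
distribution = species_distribution(members)
```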

This object diagram portrays the example found in here. It shows the following objects:
| object | class | description |
|---|---|---|
| a1 | Atom | An oxygen atom labeled “OP3” with specific coordinates (3rd oxygen of a phosphate molecule from a residue r1, found through the link) |
| a2 | Atom | A phosphorus atom labeled “P” with specific coordinates. |
| a3 | Atom | An oxygen atom labeled “OP1” with specific coordinates. |
| a4 | Atom | An oxygen atom labeled “OP2” with specific coordinates. |
| a25 | Atom | A phosphorus atom labeled “P” positioned further in the structure. |
| a26 | Atom | An oxygen atom labeled “OP1” linked to atom a25. |
| a48 | Atom | A phosphorus atom labeled “P” in another part of the structure. |
| a49 | Atom | An oxygen atom labeled “OP1” linked to atom a48. |
| r1 | Residue | A guanine residue positioned first in the sequence (has linked atoms a1, a2, a3…) |
| r2 | Residue | A guanine residue positioned second in the sequence. |
| r3 | Residue | A cytosine residue positioned third in the sequence. |
| ch1 | Chain | A chain labeled “X” that links residues together. |
| model | Model | The structural model representation, labeled with ID 0 (an X-ray structure normally has a single model, numbered 0) |
| rna1 | RNA_Molecule | An RNA molecule identified by entry “7EAF,” studied via X-ray diffraction. |
| rna2 | RNA_Molecule | Another RNA molecule identified by entry “5KF6,” analyzed similarly. |
| sam_fam | Family | A SAM riboswitch family named “SAM”, which the 7EAF molecule belongs to (hence the relationship) |
| sam1_4_fam | Family | A SAM riboswitch family named “SAM-I/IV”, belonging to the SAM clan |
| sam4_fam | Family | A riboswitch family named “SAM-IV”, belonging to the SAM clan |
| sam_clan | Clan | A clan of riboswitch families named “SAM”, linked to the 3 families that belong to it |
| tree1 | Phylotree | The phylogenetic tree representation of related sequences for the family SAM (hence the link) |
| root | TreeNode | The root node of the phylogenetic tree, an attribute of tree1 (hence the link with the PhyloTree object); it is connected to all other nodes through parent-child links |
| n1 | TreeNode | An internal tree node, conceptually a common ancestor, at a specific branch length from root (since it is linked directly to root) |
| n2 | TreeNode | A tree node representing an RNA molecule of a particular species (leaf) |
| n3 | TreeNode | Another tree node representing an RNA molecule of a particular species |
| n4 | TreeNode | Another leaf representing an RNA sequence labeled with accession “92. CP018551.1” |
List of modules:
- `Atom.py`
- `Residue.py`
- `Chain.py`
- `Model.py`
- `RNA_Molecule.py`
- `Family.py`
- `Clan.py`
- `PhyloTree.py`
- `utils.py`: this includes helper functions for database calls, data extraction, and file handling…

Atom class:
- Enum implementation of `AtomName` and `Element` for validation.
- The `atom_name` and `element` attributes are passed as string values and converted to the corresponding enumeration values in the setter methods.
- `@attribute.setter` is used to validate the input values for the attributes. For the coordinates it validates that the values are floats, and for `atom_name` and `element` it validates that the values are part of the enumeration. For example:
```python
@element.setter
def element(self, element):
    if not isinstance(element, str):
        raise TypeError(f"element must be a string, got {type(element)}")
    # Check that the string is a valid Element member name
    if element not in Element.__members__:
        raise ValueError(f"{element} is not a valid Element value")
    self._element = Element.__members__[element]
```
It is used to validate that the input value is a string and that the string is part of the Element enumeration.
- The `@property` decorator was used to define getter and setter methods for the attributes, allowing for encapsulation and validation while providing a simple interface for accessing and modifying the attributes.
- `__repr__` returns the string representation of the object as: `atom_name x y z element`.

Residue class:
- Attributes `type` and `position`, with an enum implementation of `NBase` for validation of the residue type.
- The `type` attribute is passed as a string value and converted to the corresponding enumeration value in the setter method. The constructor also takes `atoms=None` as the default value for the `atoms` attribute, a list that stores the atoms that are part of the residue; it is initialized as an empty list if no atoms are provided.
- `@property` and `@attribute.setter` are used for the attributes, with the setter used for validation of the input values.
- `add_atom()` validates that the input is an instance of `Atom` and that it does not already exist in the residue; `remove_atom()` removes an atom from the residue, and `get_atoms()` returns the list of atoms in the residue.
- `__repr__` returns the string representation of the object as: `type position`.

Chain class:
- Attribute `id`, with `residues=None` as the default value for the `residues` attribute, a list that stores the residues that are part of the chain; it is initialized as an empty list if no residues are provided.
- `add_residue()` (with validations) and `remove_residue()` add and remove residues from the chain, and `get_residues()` returns the list of residues in the chain.
- `__repr__` returns the string representation of the object as: `id`.

Model class:
- Attribute `id`, with `chains=None` as the default value for the `chains` attribute, a list that stores the chains that are part of the model; it is initialized as an empty list if no chains are provided.
- `add_chain()` (with validations) and `remove_chain()` add and remove chains from the model, and `get_chains()` returns the list of chains in the model.
- `__repr__` returns the string representation of the object as: `id`.

RNA_Molecule class:
- Attributes `entry_id`, `experiment`, and `species`, with `models=None` as the default value for the `models` attribute, a list that stores the models that are part of the RNA_Molecule; it is initialized as an empty list if no models are provided.
- `add_model()` (with validations) and `remove_model()` add and remove models from the RNA_Molecule, and `get_models()` returns the list of models in the RNA_Molecule.
- `__repr__` returns the string representation of the object as: `entry_id experiment species`.
- `@property` and `@attribute.setter` are used for the attributes, with the setter used for validation of the input values.
- The `print_all()` method prints all the models, chains, residues, and atoms that are part of the RNA_Molecule in a layout similar to the PDB file format, and saves the output to a file. The format is as follows: `ATOM <atom_number> <atom_name> <residue_type> <chain_id> <residue_position> <x> <y> <z> <element>`
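The container pattern shared by Residue, Chain, Model, and RNA_Molecule (a `None` default replaced by a fresh list, plus validated add and remove methods) can be sketched as follows. `AtomSketch` and `ResidueSketch` are hypothetical stand-ins, and treating "already exists" as "the same object is already in the list" is an assumption of this sketch, not a statement about the project's code.

```python
class AtomSketch:
    """Hypothetical, simplified atom (no enum validation here)."""

    def __init__(self, atom_name, x, y, z, element):
        self.atom_name, self.x, self.y, self.z = atom_name, x, y, z
        self.element = element

    def __repr__(self):
        return f"{self.atom_name} {self.x} {self.y} {self.z} {self.element}"


class ResidueSketch:
    """Hypothetical container showing the atoms=None default pattern."""

    def __init__(self, type, position, atoms=None):
        self.type = type
        self.position = position
        # A fresh list per instance: a mutable default such as atoms=[]
        # would be shared by every residue created without atoms.
        self._atoms = []
        for atom in atoms or []:
            self.add_atom(atom)

    def add_atom(self, atom):
        if not isinstance(atom, AtomSketch):
            raise TypeError(f"expected AtomSketch, got {type(atom).__name__}")
        if atom in self._atoms:  # assumption: same object means duplicate
            raise ValueError("atom is already part of this residue")
        self._atoms.append(atom)

    def remove_atom(self, atom):
        self._atoms.remove(atom)

    def get_atoms(self):
        # Return a copy so callers cannot bypass the validation in add_atom
        return list(self._atoms)

    def __repr__(self):
        return f"{self.type} {self.position}"


residue = ResidueSketch("A", 1)
atom = AtomSketch("C1'", 1.0, 2.0, 3.0, "C")
residue.add_atom(atom)
```

The same shape applies one level up: a chain validates residues, a model validates chains, and so on.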
Family Class Overview:
The Family class represents a family of RNA molecules, particularly those in the Rfam database. It ensures that each family is uniquely identified, maintains a list of RNA molecules as its members, and optionally includes a phylogenetic tree representation. The class prevents duplicate instances and provides structured methods for adding, removing, and retrieving RNA families.
Key Features
- […] `utils.py`, using the Rfam API)
- […] `Phylotree`) to represent evolutionary relationships, which can be retrieved from various data types: Newick, dict, and JSON.

Attributes
Class Attributes
- `entries`: List of all created `Family` objects to track and avoid duplicates.

Instance Attributes
- `id` (str): Unique identifier for the family.
- `name` (str): Name of the RNA family.
- `type` (str, optional): Type of RNA (e.g., rRNA, tRNA, miRNA).
- `members` (list): List of `RNA_Molecule` objects representing the […]

helper methods (private):
dunders:
- `__init__(id, name, type=None, members=[], from_database=False)`: `from_database` is a flag indicating whether the object is created using the generator function from the database
- `__del__(self)`
- `__eq__(self, other)`
- `__len__(self)`
- `__getitem__(self, key)`
- `__setattr__(self, name, value)`
- `__str__(self)`
- `__repr__(self)`

Clan Class Overview
The Clan class represents a group of RNA families that share common ancestry or biological significance. It ensures unique identification of clans, prevents duplicates, and provides structured methods for managing RNA families (Family objects).
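The uniqueness and immutability rules described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `ClanSketch` is hypothetical, raising `ValueError` on a duplicate ID is one possible policy (the actual class may handle duplicates differently), and `"CL_EXAMPLE"` is a placeholder identifier rather than a real Rfam clan accession.

```python
class ClanSketch:
    """Hypothetical registry-backed class with an immutable id."""

    entries = []  # class-level registry of every created instance

    def __init__(self, id, name=None):
        if ClanSketch.get_clan(id) is not None:
            # Assumed policy for this sketch: reject duplicates outright
            raise ValueError(f"a clan with id {id!r} already exists")
        self.id = id
        self.name = name
        self.members = []
        ClanSketch.entries.append(self)

    def __setattr__(self, name, value):
        # Allow the first assignment in __init__, refuse later changes
        if name == "id" and "id" in self.__dict__:
            raise AttributeError("id is immutable")
        super().__setattr__(name, value)

    @classmethod
    def get_clan(cls, id):
        """Return the registered clan with this id, or None."""
        for clan in cls.entries:
            if clan.id == id:
                return clan
        return None

    def __eq__(self, other):
        return isinstance(other, ClanSketch) and self.id == other.id

    def __hash__(self):
        return hash(self.id)


sam_clan = ClanSketch("CL_EXAMPLE", name="SAM")
```

Keeping the registry on the class rather than in a module-level variable lets class methods such as `get_clan` look up instances without any extra bookkeeping by the caller.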
Key Features
- […] `Family` objects).

Class Attributes
- `entries`: List of all created `Clan` objects to track and avoid duplicates.

Instance Attributes
- `id` (str): Unique identifier for the clan (immutable).
- `name` (str, optional): Name of the clan.
- `members` (list): List of `Family` objects that belong to the clan.

Class Methods
- `get_instances()`: Returns a list of all created `Clan` objects.
- `get_clan(id)`: Retrieves an existing clan by its ID.

Instance Methods
- `add_family(family)`: Adds a `Family` object to the clan.
- `remove_family(family)`: Removes a `Family` object from the clan.

Private Methods
- `__validate_member(member)`: Ensures that only `Family` objects are added to the clan.

Magic Methods
- `__str__()`: Returns a formatted string representation of the clan and its families.
- `__repr__()`: Returns a structured representation of the clan instance.
- `__eq__(other)`: Checks equality based on clan ID.
- `__setattr__(name, value)`: Ensures controlled attribute setting, preventing ID modification and enforcing type validation.

This module `tree.py` defines a `TreeNode` class and a `Phylotree` class for constructing and managing a phylogenetic tree.
In the object diagram, a Phylotree is directly linked to one node, the root (portrayed as an attribute), and each node is recursively linked to other nodes through its attributes, as seen in the following implementation:
The `TreeNode` class represents a node in the tree with attributes:
- `name`: stores the RNA type.
- `branch_length`: stores the distance to the parent node.
- `parent`: stores the parent node.
- `children`: stores child nodes as a dictionary.

Methods in `TreeNode`:
- `add_child(child, weight)`: adds a child node with a given branch length.
- `preorder_traversal(level=0)`: performs a preorder traversal and returns a string representation.
- `__repr__` and `__str__`: provide readable representations of the node.
- `__getitem__`: retrieves a child node by name.

The `Phylotree` class represents a phylogenetic tree for RNA sequences, constructed using computational phylogenetics. It consists of:
- A root node.

Methods in `Phylotree`:
- `build_tree(tree_dict, parent=None)`: builds a tree from a dictionary.
- `from_dict(tree_dict, parent=None)`: constructs a tree from a dictionary.
- `from_json(json_str)`: constructs a tree from a JSON string or file.
- `from_newick(newick_str)`: parses a Newick-formatted string or file to build the tree.
- `__str__` and `__repr__`: return a string representation of the tree.

The module demonstrates tree construction using different input formats:
Example usage:
```python
tree_dict = {
    "children": [
        {"name": "a", "branch_length": 0.05592},
        {"name": "b", "branch_length": 0.08277},
        {
            "children": [
                {"name": "c", "branch_length": 0.11049},
                {"name": "d", "branch_length": 0.31409}
            ],
            "branch_length": 0.340
        }
    ],
    "branch_length": 0.03601
}
tree = Phylotree.from_dict(tree_dict)
print(tree)
```
or
```python
newick_str = '''
(87.4_AE017263.1/29965-30028_Mesoplasma_florum_L1[265311].1:0.05592,
_URS000080DE91_2151/1-68_Mesoplasma_florum[2151].1:0.08277,
(90_AE017263.1/668937-668875_Mesoplasma_florum_L1[265311].2:0.11049,
81.3_AE017263.1/31976-32038_Mesoplasma_florum_L1[265311].3:0.31409)
0.340:0.03601);
'''
tree = Phylotree.from_newick(newick_str)  # success
```
or from files:
```python
tree = Phylotree.from_newick('lab1/examples/RF00162.nhx')  # newick
tree = Phylotree.from_json('lab1/examples/RF00162.json')  # json
```
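To make the Newick input format more concrete, here is a minimal sketch of turning a Newick string into a nested dictionary with `children`, `name`, and `branch_length` keys. It uses a small recursive-descent parser rather than the regex-based approach of the project's `parse_newick`, and the function name `parse_newick_sketch` is hypothetical. It follows the standard Newick reading, in which the text after a closing parenthesis is the internal node's label and the number after the colon is its branch length, so the dictionaries it produces may not match the project's output exactly. Square brackets are treated as ordinary label characters, as in the sequence names above.

```python
def parse_newick_sketch(text):
    """Parse a Newick string into nested dicts (illustrative only)."""
    s = "".join(text.split())  # drop spaces and line breaks
    if s.endswith(";"):
        s = s[:-1]
    pos = 0

    def read_token():
        """Read characters up to the next structural symbol."""
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        return s[start:pos]

    def parse_node():
        nonlocal pos
        node = {}
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            node["children"] = children
        label = read_token()
        if label:
            node["name"] = label
        if pos < len(s) and s[pos] == ":":
            pos += 1  # consume ":"
            node["branch_length"] = float(read_token())
        return node

    tree = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


parsed = parse_newick_sketch(
    "(a:0.05592,b:0.08277,(c:0.11049,d:0.31409)0.340:0.03601);"
)
```

In this reading, the inner node grouping `c` and `d` carries the label `0.340` and the branch length `0.03601`, while the root has neither.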
- `fetch_pdb_file(pdb_entry_id, save_directory=CACHE_DIR)`, written in `utils.py`, fetches the PDB file from the RCSB PDB database using the PDB entry ID and saves it in the specified directory. It uses the Biopython library to fetch the file.
- `create_RNA_Molecule(pdb_entry_id)`, written in `utils.py`, creates an `RNA_Molecule` object from the PDB file retrieved with `fetch_pdb_file`. It reads the PDB file, first extracts the experiment and species information needed to create the specific `RNA_Molecule` object, and then creates the corresponding objects (models, chains, residues, atoms), adding them in hierarchical order to the `RNA_Molecule` object. It returns the `RNA_Molecule` object.

> [!IMPORTANT]
> The `utils.py` file contains helper functions for database calls, data extraction, and file handling. It includes interaction with the Rfam database API to retrieve information about RNA families and phylogenetic trees, automate extraction, and manipulate various file formats like `newick` trees. In this module, if any files have to be downloaded as intermediary steps, they are saved in a cache directory, which defaults to the hidden directory `.rnalib_cache/` in the working directory, to avoid repeated downloads.

- `get_rfam(q: str)` gets the information about an RNA family from the Rfam database using the family ID. It returns the information in JSON format.
- `get_family_attributes(q: str)` gets the name, identity, and curation type of an RNA family from the Rfam database using the family ID. It returns the information as a tuple.
- `get_pdb_ids_from_fam(fam_id)` gets the PDB IDs associated with an RNA family from the Rfam database using the family ID. It returns the PDB IDs as a list (helpful to automate the extraction of all RNA molecules found on Rfam from the PDB database).
- `get_tree_newick_from_fam(fam_id)` gets the Newick tree from the Rfam database given the RNA family ID. It returns the Newick tree as a string.
- `parse_newick(newick)` parses a Newick string into a nested dictionary. It is used to parse the Newick tree obtained from the Rfam database, using regex to extract the tree structure.

Kindly find an example of the implementation of the classes in this notebook.
Atom class:
```python
atom = Atom("C1'", 1.0, 2.0, 3.0, "C")
print(atom)  # output: C1' 1.0 2.0 3.0 C
```
Residue class:
```python
r = Residue("A", 1)
print(r)  # output: A 1
atom1 = Atom("C1'", 1.0, 2.0, 3.0, "C")
atom2 = Atom("N9", 4.0, 5.0, 6.0, "N")
r.add_atom(atom1)
r.add_atom(atom2)
print(r.get_atoms())  # output: [C1' 1.0 2.0 3.0 C, N9 4.0 5.0 6.0 N]
r.remove_atom(atom1)
print(r.get_atoms())  # output: [N9 4.0 5.0 6.0 N]

atom3 = Atom("C4", 7.0, 8.0, 9.0, "C")
r2 = Residue("G", 2)
r2.add_atom(atom3)
print(r2.get_atoms())  # output: [C4 7.0 8.0 9.0 C]
```
Chain class:
```python
c = Chain("A")
print(c)  # output: A
r = Residue("A", 1)
c.add_residue(r)
print(c.get_residues())  # output: [A 1]
c.remove_residue(r)
print(c.get_residues())  # output: []
```
Model class:
```python
m = Model(1)
print(m)  # output: 1
c = Chain("A")
m.add_chain(c)
print(m.get_chains())  # output: [A]
m.remove_chain(c)
print(m.get_chains())  # output: []
```
RNA_Molecule class:
```python
rna1 = RNA_Molecule("1A9N", "NMR", "Homo sapiens")
print(rna1)  # output: 1A9N NMR Homo sapiens
m1 = Model(1)
m2 = Model(2)
m3 = Model(3)
rna1.add_model(m1)
rna1.add_model(m2)
rna1.add_model(m3)
print(rna1.get_models())  # output: [Model 1, Model 2, Model 3]
rna1.remove_model(m3)
print(rna1.get_models())  # output: [Model 1, Model 2]
rna1.print_all()
# output: 1A9N NMR Homo sapiens
# Model 1
# Model 2
```
Example.py:
```python
# Creating an RNA molecule
rna_molecule = RNA_Molecule("1JAT", "X-RAY DIFFRACTION", "Homo sapiens")
# Creating a model
model1 = Model(1)
# Creating a chain
ch1 = Chain('A')
# Adding the model to the RNA molecule
rna_molecule.add_model(model1)
# Adding the chain to the model
model1.add_chain(ch1)
# Creating residues
res1 = Residue("A", 1)
res2 = Residue("U", 2)
res3 = Residue("C", 3)
# Adding residues
ch1.add_residue(res1)
ch1.add_residue(res2)
ch1.add_residue(res3)
# Creating atoms
a1 = Atom("OP1", 0.1, 0.2, 0.3, "O")
a2 = Atom("P", 0.4, -0.5, 0.6, "P")
a3 = Atom("N1", 0.25, 0.54, 0.23, "N")
a4 = Atom("C4", 0.21, 0.76, -0.93, "C")
# Adding atoms
res1.add_atom(a1)
res2.add_atom(a2)
res3.add_atom(a4)
res3.add_atom(a3)
print(rna_molecule.get_models())
print(rna_molecule.get_models()[0].get_chains())
print(rna_molecule.get_models()[0].get_chains()[0].get_residues())
print(rna_molecule.get_models()[0].get_chains()[0].get_residues()[0].get_atoms())
```
`print_all()` method:
```python
rna_molecule.print_all()
```
provides as output the following format, saved in a file:
```
1JAT X-RAY DIFFRACTION Homo sapiens
Model 1
ATOM 1 OP1 A A 1 0.1 0.2 0.3 O
ATOM 2 P U A 2 0.4 -0.5 0.6 P
ATOM 3 C4 C A 3 0.21 0.76 -0.93 C
ATOM 4 N1 C A 3 0.25 0.54 0.23 N
```
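The output above can be reproduced with a short sketch that walks the chain, residue, and atom levels while keeping a running atom number. The function `format_atom_lines` and the plain dict and tuple layout of its input are assumptions made for this illustration; the actual `print_all()` works on the class hierarchy. Fields are separated by single spaces as in the format shown here, not aligned to the fixed columns of real PDB files.

```python
def format_atom_lines(chains):
    """Build 'ATOM ...' lines from nested plain data (illustrative).

    chains maps chain_id -> list of (residue_type, position, atoms),
    where atoms is a list of (atom_name, x, y, z, element) tuples.
    """
    lines = []
    atom_number = 1  # running counter across all residues and chains
    for chain_id, residues in chains.items():
        for residue_type, position, atoms in residues:
            for atom_name, x, y, z, element in atoms:
                lines.append(
                    f"ATOM {atom_number} {atom_name} {residue_type} "
                    f"{chain_id} {position} {x} {y} {z} {element}"
                )
                atom_number += 1
    return lines


# Same data as the 1JAT example, written as plain tuples.
chains = {
    "A": [
        ("A", 1, [("OP1", 0.1, 0.2, 0.3, "O")]),
        ("U", 2, [("P", 0.4, -0.5, 0.6, "P")]),
        ("C", 3, [("C4", 0.21, 0.76, -0.93, "C"),
                  ("N1", 0.25, 0.54, 0.23, "N")]),
    ]
}
atom_lines = format_atom_lines(chains)
```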
In this example we proceeded with the RNA molecule 7EAF and retrieved its RNA family, the SAM family, by providing only SAM as the query; all information is then automatically stored within the family object, including its own tree of type PhyloTree (see more in this notebook).
In this example, with only the information identifying a family, we retrieved all family attributes from the database and obtained a fully functional Family object with name, id, type, and tree of type PhyloTree. We also automated the extraction of all RNA molecules found for the family from the PDB database, by first retrieving all PDB IDs through the Rfam API, then fetching them through Biopython’s PDB API and creating the RNA_Molecule objects, which we added as members of the family. The result is a fully declared family with all attributes; more in here.