This demo shows how the user will use the code.
Demo directory:
lab2/
└── demo
├── demo.ipynb # main demo demonstrating reading and writing
└── demo-extensions.ipynb # demo for extension features
As a minor enhancement to the previous lab design, we added Species
entity to represent a class Species that is associated with RNA_Molecule
. Instead of using attribute species
in RNA_Molecule
class as string type, it is now of Species
type. An RNA_Molecule
can have 1 Species
or none (e.g., if it is synthetic). A Species
can have many RNA_Molecule
instances.
For the purpose of this lab (reading/writing to a file), new classes have been introduced in yellow in this diagram:
RNA_IO
(User Interface for I/O Operations)
read(path, format, coarse_grained=False, atom_name=None)
→ RNA_Molecule
coarse_grained
: If True, extracts only a subset of atoms for a simplified representation.atom_name
: Allows specifying a particular atom type to extract.write(rna_molecule, file_path, format)
RNA_Molecule
instance to a file.parsers
and writers
for format-specific processing → can have many parsers and writers.RNA_Parser
(Abstract Class)
read()
, enforcing child classes to implement format-specific parsing.PDB_Parser
(Concrete Class)
read()
, processing PDB files to create an RNA_Molecule instance.RNA_Writer
(Abstract Class)
write()
, ensuring all writers implement format-specific writing.PDB_Writer
(Concrete Class)
write()
, converting an RNA_Molecule instance into a PDB file.Processor
(RNA Structure Representation Handler)
PDB_Parser
uses a Processor
to construct RNA_Molecule
.PDB_Writer
uses a Processor
to extract relevant data for writing.RNA_Molecule
can be associated with multiple Processor
instances.Processor
can belong to at most one parser
or one writer
(0..1 relationship).
rna_io
object instantiated by the user to read and write.pdb_parser
object created by rna_io
to parse PDB files.pdb_writer
object created by rna_io
to write PDB files.p1
object of class Processor
created by pdb_parser
to handle RNA structure representation and create an RNA_Molecule
object.p2
object of class Processor
created by pdb_writer
to extract atoms list from RNA_Molecule
for writing (by handling the RNA structure representation).The implementation of the classes is available in the src directory.
The classes are organized in modules and submodules as follows:
src/
├── Families
│ ├── __init__.py
│ ├── clan.py
│ ├── family.py
│ ├── species.py
│ └── tree.py
├── IO
│ ├── RNA_IO.py
│ ├── __init__.py
│ ├── parsers
│ │ ├── PDB_Parser.py
│ │ ├── RNA_Parser.py
│ │ └── __init__.py
│ └── writers
│ ├── PDB_Writer.py
│ ├── RNA_Writer.py
│ └── __init__.py
├── Structure
│ ├── Atom.py
│ ├── Chain.py
│ ├── Model.py
│ ├── RNA_Molecule.py
│ ├── Residue.py
│ └── __init__.py
├── processor.py
└── utils.py
Some enhancements on the library design that are worth of mention:
In lab1, we had a flat directory structure. In lab2, we have introduced a new directory structure to better organize the code into modules and submodules, as seen in Implementation Section. This structure helps in managing the codebase effectively and allows for better organization of related classes and functionalities.
The interdependencies between modules have been handled by appending the src
directory to the pythonpath and importing the modules using absolute imports. Another alternative during the development stage (not a deployable library yet) is tu sue the set-pythonpath.sh
script in dev/
directory.
Originally, if we have a 1-N relationship between two classes, e.g., one Family have many RNA Molecules, we would store a list of RNA Molecules in the Family class. This is a simple and straightforward approach. However, it has some drawbacks, as it makes us unable to tag each RNA Molecule with the Family it belongs to. To address this issue, we have added an attribute “family” to the RNA Molecule class, that the user has no interaction whatsoever with, but is rather set automatically through the code when the RNA Molecule is added to a Family.
[!IMPORTANT] To ensure this behavior we have either not provided a setter for this attribute or raised a warning message if the user tries to set it manually. A private method has been implemented that acts as a setter for this attribute in the other class (e.g., RNA_Molecule’s
_add_family()
method will be used in Family’s add_RNA() method through:fam1.add_RNA(rna1)
; behind the scenes:rna1._add_family(self)
).
This was done in all classes that have a 1-N relationship with another class.
In order to account for instanciated species, and because we dont wanna lose track of what we have created and the connection to each species with their types, we have implemenented a class attribute declared_species
that ensures:
This is particularly helpful with the 1-N relationship Species
have with RNA_Molecule
instances. 1 RNA-Molecule instance can be found in 1 Species, but 1 Species can have multiple RNA_Molecule instances. Thus we have added an attribute in Species that stores a dictionary of RNA_Molecule instances, with the key being the RNA_Molecule instance’s id (to enhance lookup speed using the dict’s inherent hashing compared to looping through a list). And for the relationship to be well implemented, we have added a species
attribute in the RNA_Molecule class that points to the Species instance it belongs to. The latter is only set by appending an RNA_Molecule instance to a Species instance, and not by setting it directly (automatic handling). In this way, the declared_species
class attribute ensures that one species instance can have multiple RNA_Molecule instances.
e.g., rna molecule 1 associated with E.coli species, rna molecule 2 associated with E.coli species, we would find only one E.coli species instance in the declared_species attribute, with the two RNA_Molecule instances in its rna molecules dictionary corresponding to rna1 and rna2
A species class have been implemented to account for them as entities that might have several RNA moelcules associated with them. To keep track of previously created species, we have implemented a class attribute declared_species
that ensures unicity of species instances and ability to access and point to equivalent species instances. This is particularly helpful with the 1-N relationship between Species and RNA molecule. 1 RNA molecule instance can be found in 1 Species, but 1 Species can have multiple RNA molecule instances, so an instanciated species would have a dictionary of RNA molecule instances, with the key being the RNA molecule instance’s id, and value of type RNA_Molecule
. This way, we can easily access the RNA molecule instances associated with a species instance ( a dict instead of a list to enhance lookup speed using the dict’s inherent hashing compared to looping through a list).
And for the relationship to be well implemented, we have added a species
attribute in the RNA_Molecule class that points to the Species instance it belongs to. The latter is only set by appending an RNA_Molecule instance to a Species instance, and not by setting it directly (automatic handling). In this way, the declared_species
class attribute ensures that one species instance can have multiple RNA_Molecule instances.
e.g., rna molecule 1 associated with E.coli species, rna molecule 2 associated with E.coli species, we would find only one E.coli species instance in the declared_species attribute, with the two RNA_Molecule instances in its rna molecules dictionary corresponding to rna1 and rna2
Since Species has a relationship with RNA_Molecule, and Family has a relationship with RNA_Molecule, an additional functionality that is now avaliable to teh user is the ability to see the distribution of species in a family. This is done by calling the distribution()
method in the Family class, which returns a data frame of species and the number of RNA_Molecule instances associated with them. For visualiztion, the user can call the plot_distribution()
method in the Family class, which plots a pie plot of the species distribution.
fam=Family(id='SAM',name='SAM')
fam.plot_distribution()
RNA_IO
ClassResponsible for managing the input and output operations of RNA molecule data. It provides methods for reading RNA molecule representations from files and writing them back to files in various formats.
This is the class that the user interacts with to read and write RNA molecule data.
Constructor:
The class is initialized with two private dictionaries:
__parsers
: Contains instances of parsers for different file formats. Currently, it includes the PDB format parser.__writers
: Contains instances of writers for different file formats. Currently, it includes the PDB format writer. def __init__(self):
self.__parsers = {"PDB": PDB_Parser()}
self.__writers = {"PDB": PDB_Writer()}
Methods:
read(path_to_file, format, coarse_grained=False, atom_name="C1'")
RNA_Molecule
object.path_to_file
: The path to the file to be read.format
: The format of the file being read (e.g., “PDB”).coarse_grained
: A boolean flag indicating whether to use a coarse-grained representation, defaults to False
.atom_name
: The name of the atom to be read, defaults to "C1'"
because:
RNA_Molecule
instance.ValueError
if the specified format is not supported. def read(self, path_to_file, format, coarse_grained=False, atom_name="C1'"):
if format not in self.__parsers:
raise ValueError(f"Format {format} is not supported.")
parser = self.__parsers[format]
return parser.read(path_to_file, coarse_grained, atom_name)
The method first checks if the specified format is supported by the RNA_IO instance. If the format is supported, it retrieves the corresponding parser from the __parsers
dictionary and calls its read
method to parse the file and return an RNA_Molecule
instance.
write(rna_molecule, path_to_file, format)
RNA_Molecule
object to a file of the specified format.rna_molecule
: The RNA molecule object to be written to the file.path_to_file
: The path where the file will be written.format
: The format of the file to be written (e.g., “PDB”).ValueError
if the specified format is not supported.write
method of the corresponding writer for the specified format to write the RNA molecule to the file.RNA_Parser
and RNA_Writer
ClassesThese are abstract classes that define the interface for parsers and writers, respectively. They enforce the implementation of the read
and write
methods in concrete subclasses.
class RNA_Parser(ABC):
@abstractmethod
def read(self, path_to_file):
pass
class RNA_Writer(ABC):
@abstractmethod
def write(self, rna_molecule, path_to_file):
pass
PDB_Parser
ClassConcrete subclass of RNA_Parser
that implements the read
method for parsing PDB files and creating an RNA_Molecule
instance.
read
method:
def read(self, path_to_file, coarse_grained=False, atom_name="C1'"):
"""
Reads a PDB file and returns the RNA molecule object.
"""
processor=Processor() #To handle the molecule representation in the processor class
#Extract RNA_Molecule Attributes and store them in the processor object
molecule_info = self._extract_molecule_info(path_to_file)
processor.molecule_info(*molecule_info)
#Extract the atoms and store them in the processor object
with open(path_to_file, 'r') as pdb_file:
model_id = 0
for line in pdb_file:
if line.startswith("MODEL"):
model_id = int(line.split()[1]) #Extract model ID
elif line.startswith("ATOM"):
if coarse_grained:
if line[12:16].strip() == atom_name:
atom_info = self._extract_atom_info(line)
if atom_info is not None:
processor.atom_info(*atom_info, model_id)
else:
atom_info = self._extract_atom_info(line)
if atom_info is not None: #It is None if the residue is not a nucleotide
processor.atom_info(*atom_info, model_id)
return processor.createMolecule() #Create the RNA_Molecule object
Processor
instance is created to handle the molecule representation._extract_molecule_info
is called to extract the relevant information about the RNA molecule from the PDB file.Processor
object.coarse_grained
is True
, only atoms with the specified atom_name
are extracted.Processor
object.Processor
object is used to create an RNA_Molecule
instance.RNA_Molecule
instance is returned._extract_molecule_info
private method:
None
.read
method to extract the molecule information before extracting the atoms.Processor
object for creating the RNA_Molecule
instance.RNA_Molecule
object.read
method._extract_atom_info
private method:
def _extract_atom_info(self, line):
residue_name = line[17:20].strip()
if residue_name not in ['A', 'C', 'G', 'U']:
return None #Not a nucleotide
residue_id = int(line[22:26].strip())
i_code = line[26:27].strip()
atom_name = line[12:16].strip()
altloc = line[16:17].strip()
x, y, z = map(float, [line[30:38], line[38:46], line[46:54]])
occupancy = float(line[54:60].strip())
temp_factor = float(line[60:66].strip()) if line[60:66].strip() else None
element = line[76:78].strip()
charge = line[78:80].strip()
chain_id = line[21]
return atom_name, x, y, z, element, residue_name, residue_id, chain_id, altloc, occupancy, temp_factor, i_code, charge
None
.read
method to extract atom information for creating the RNA_Molecule
instance.read
method.PDB_Writer
ClassConcrete subclass of RNA_Writer
that implements the write
method for writing RNA_Molecule
objects to PDB files.
write
method:
def write(self, rna_molecule, path_to_file):
"""
Writes the RNA molecule object to a PDB-like file.
Format:
<Record name> <Serial> <Atom name> <AltLoc> <Residue name> <ChainID> <Residue sequence number> <ICode> <X> <Y> <Z> <Occupancy> <TempFactor> <Element> <Charge>
"""
processor=Processor()
atoms = processor.flattenMolecule(rna_molecule) #Get a flat list of atoms
with open(path_to_file, "w") as f:
#Write molecule information
molecule_info = self._format_molecule_info(rna_molecule)
f.write(molecule_info)
#Write atom information
current_model = None
for model_id, *atom_info in atoms:
#Write MODEL record when a new model starts
#If the model ID is 0, it means that there is only one model and no MODEL record is needed
if model_id !=0 and model_id != current_model:
if current_model is not None:
f.write("ENDMDL\n") #Close previous model
f.write(f"MODEL {model_id}\n")
current_model = model_id
#Write the formatted atom line
pdb_line = self._format_atom_info(*atom_info)
f.write(pdb_line)
if model_id!=0:
f.write("ENDMDL\n") #Close the last model
f.write("END\n") #End of PDB file
print(f"RNA molecule written to {path_to_file}")
Processor
instance is created because it handles the molecule representation.Processor
object is used to flatten the RNA_Molecule
object into a list of atoms._format_molecule_info
is called to format the molecule information for writing to the PDB file._format_atom_info
for writing to the PDB file._format_molecule_info
private method:
RNA_Molecule
object.write
method to write the molecule information to the PDB file.write
method._format_atom_info
private method:
write
method to format the atom information for writing to the PDB file.write
method.Processor
ClassProcessor
class acts as an intermediary between the parsers/writers and the RNA_Molecule
class.molecule_info
and atom_info
.RNA_Molecule
instance: createMolecule()
method.RNA_Molecule
into a list of atoms for writing: flattenMolecule(rna_molecule)
.The decoupling of parsing/writing from the RNA structure representation was demonstrated in the design and the implementation.