2425-m1-geniomhe-group-6

Lab 2 Report

Demo test on python notebook

This demo shows how the user will use the code.

Demo directory:

lab2/
└── demo
    ├── demo.ipynb              # main demo demonstrating reading and writing
    └── demo-extensions.ipynb   # demo for extension features

Class Diagram

[Figure: class diagram]

As a minor enhancement to the previous lab's design, we added a Species class that is associated with RNA_Molecule. The species attribute of RNA_Molecule is now of type Species rather than a string. An RNA_Molecule has at most one Species (none if, for example, it is synthetic), and a Species can have many RNA_Molecule instances.

For the purpose of this lab (reading from and writing to a file), new classes have been introduced; they are shown in yellow in the diagram:

  1. RNA_IO (User Interface for I/O Operations)
    • Serves as the interface for reading and writing RNA sequence files.
    • Provides two methods:
      • read(path, format, coarse_grained=False, atom_name=None) → RNA_Molecule
        • Parses a file and returns an RNA_Molecule instance.
        • Optional parameters:
          • coarse_grained: If True, extracts only a subset of atoms for a simplified representation.
          • atom_name: Allows specifying a particular atom type to extract.
      • write(rna_molecule, file_path, format)
        • Writes an RNA_Molecule instance to a file.
    • Handles multiple file formats by relying on specialized parsers and writers for format-specific processing → can have many parsers and writers.
  2. Parsing
    • RNA_Parser (Abstract Class)
      • Defines the abstract method read(), enforcing child classes to implement format-specific parsing.
    • PDB_Parser (Concrete Class)
      • Implements read(), processing PDB files to create an RNA_Molecule instance.
  3. Writing
    • RNA_Writer (Abstract Class)
      • Defines the abstract method write(), ensuring all writers implement format-specific writing.
    • PDB_Writer (Concrete Class)
      • Implements write(), converting an RNA_Molecule instance into a PDB file.
  4. Processor (RNA Structure Representation Handler)
    • An intermediary between parsers/writers and RNA_Molecule.
    • Converts parsed content into an RNA_Molecule instance.
    • Flattens an RNA_Molecule into a list of atoms for writing.
    • Associations:
      • PDB_Parser uses a Processor to construct RNA_Molecule.
      • PDB_Writer uses a Processor to extract relevant data for writing.
      • An RNA_Molecule can be associated with multiple Processor instances.
      • A Processor can belong to at most one parser or one writer (0..1 relationship).
  5. Design Choice
    • Decoupling:
      • RNA_IO provides a simple interface for users.
      • Parsers and Writers handle format-specific operations.
      • Processor ensures proper RNA representation.
    • Extensibility:
      • New formats (e.g., FASTA) can be supported by adding corresponding RNA_Parser and RNA_Writer subclasses.
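
The read-side layering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the class bodies, the `_parsers` registry, the `Toy_Parser` class, and its one-record-per-line format (`atom_name x y z`) are all hypothetical, and real PDB parsing is column-based and considerably more involved. The default coarse-grained atom (C1') is also an assumption.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class RNA_Molecule:
    """Stand-in for the real structure class; here it only holds a flat atom list."""

    def __init__(self, entry_id, atoms):
        self.entry_id = entry_id
        self.atoms = atoms


class Processor:
    """Collects parsed atom records and builds an RNA_Molecule from them."""

    def __init__(self):
        self.atoms = []

    def add_atom(self, name, x, y, z):
        self.atoms.append((name, x, y, z))

    def create_molecule(self, entry_id):
        return RNA_Molecule(entry_id, list(self.atoms))


class RNA_Parser(ABC):
    @abstractmethod
    def read(self, path, coarse_grained=False, atom_name=None):
        """Parse the file at `path` and return an RNA_Molecule."""


class Toy_Parser(RNA_Parser):
    """Parses a made-up format with one 'atom_name x y z' record per line."""

    def read(self, path, coarse_grained=False, atom_name=None):
        # Assumption: coarse-graining keeps one atom type, defaulting to C1'.
        keep = (atom_name or "C1'") if coarse_grained else None
        processor = Processor()
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) != 4:
                    continue  # skip blank or malformed lines
                name, x, y, z = fields
                if keep is None or name == keep:
                    processor.add_atom(name, float(x), float(y), float(z))
        return processor.create_molecule(entry_id=os.path.basename(path))


class RNA_IO:
    """User-facing entry point that dispatches on the format name."""

    def __init__(self):
        self._parsers = {"toy": Toy_Parser()}

    def read(self, path, format, coarse_grained=False, atom_name=None):
        if format not in self._parsers:
            raise ValueError(f"Unsupported format: {format}")
        return self._parsers[format].read(path, coarse_grained, atom_name)


# Usage: write a tiny toy file, then read it back in full and coarse-grained.
with tempfile.NamedTemporaryFile("w", suffix=".toy", delete=False) as tmp:
    tmp.write("P 1.0 2.0 3.0\nC1' 4.0 5.0 6.0\n")
rna_io = RNA_IO()
full = rna_io.read(tmp.name, "toy")
coarse = rna_io.read(tmp.name, "toy", coarse_grained=True)
os.remove(tmp.name)
```

In this sketch, supporting another format would mean adding one RNA_Parser subclass and one registry entry; RNA_IO.read() and Processor stay unchanged.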

Object Diagram

[Figure: object diagram]


Implementation

The implementation of the classes is available in the src directory.

The classes are organized in modules and submodules as follows:

src/
├── Families
│   ├── __init__.py
│   ├── clan.py
│   ├── family.py
│   ├── species.py
│   └── tree.py
├── IO
│   ├── RNA_IO.py
│   ├── __init__.py
│   ├── parsers
│   │   ├── PDB_Parser.py
│   │   ├── RNA_Parser.py
│   │   └── __init__.py
│   └── writers
│       ├── PDB_Writer.py
│       ├── RNA_Writer.py
│       └── __init__.py
├── Structure
│   ├── Atom.py
│   ├── Chain.py
│   ├── Model.py
│   ├── RNA_Molecule.py
│   ├── Residue.py
│   └── __init__.py 
├── processor.py
└── utils.py

Extensions

Some enhancements to the library design are worth mentioning:

Directory Structure

In lab1, we had a flat directory structure. In lab2, we introduced a new directory structure that organizes the code into modules and submodules, as shown in the Implementation section. Grouping related classes and functionality this way makes the codebase easier to manage.

The interdependencies between modules are handled by appending the src directory to the Python path and importing the modules with absolute imports. During development (the library is not yet deployable), an alternative is to use the set-pythonpath.sh script in the dev/ directory.
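
Appending src to the Python path at runtime can be sketched as below. The helper name `add_src_to_path` is hypothetical, and the import shown in the final comment assumes that IO/RNA_IO.py defines a class named RNA_IO, which the directory tree suggests but does not confirm.

```python
import sys
from pathlib import Path


def add_src_to_path(repo_root):
    """Append <repo_root>/src to sys.path once and return the path that was added."""
    src_dir = str(Path(repo_root).resolve() / "src")
    if src_dir not in sys.path:
        sys.path.append(src_dir)
    return src_dir


src_dir = add_src_to_path(".")
# After this call, an absolute import such as
#   from IO.RNA_IO import RNA_IO
# would resolve, assuming the layout shown in the Implementation section.
```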

Handling 1-N Relationships

Originally, a 1-N relationship between two classes (e.g., one Family has many RNA molecules) was stored as a list of RNA molecules in the Family class. This approach is simple and straightforward, but it leaves each RNA molecule with no record of the Family it belongs to. To address this, we added a family attribute to the RNA_Molecule class. The user never interacts with this attribute; it is set automatically by the code when the RNA molecule is added to a Family.

[!IMPORTANT] To ensure this behavior, we either provide no setter for this attribute or raise a warning if the user tries to set it manually. A private method acts as the setter and is called from the other class: for example, Family's add_RNA() method calls RNA_Molecule's _add_family() method, so fam1.add_RNA(rna1) runs rna1._add_family(self) behind the scenes.

This was done in all classes that have a 1-N relationship with another class.
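
A minimal sketch of this pattern is shown below. Only add_RNA() and _add_family() are named in the text; the `members` list, the `entry_id` attribute, and the warning message are assumptions made for illustration.

```python
import warnings


class RNA_Molecule:
    def __init__(self, entry_id):
        self.entry_id = entry_id
        self._family = None

    @property
    def family(self):
        return self._family

    @family.setter
    def family(self, value):
        # Manual assignment is rejected with a warning; the value is ignored.
        warnings.warn("family is set automatically by Family.add_RNA()")

    def _add_family(self, family):
        # Private setter, intended to be called only from Family.add_RNA().
        self._family = family


class Family:
    def __init__(self, id, name):
        self.id = id
        self.name = name
        self.members = []  # hypothetical attribute name

    def add_RNA(self, rna):
        if rna not in self.members:
            self.members.append(rna)
        rna._add_family(self)  # keeps both sides of the relationship in sync


fam1 = Family(id="SAM", name="SAM")
rna1 = RNA_Molecule("rna1")
fam1.add_RNA(rna1)
```

With this arrangement, rna1.family points back to fam1 without the user ever assigning it.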

Species class

A Species class has been implemented to represent species as entities that may have several RNA molecules associated with them. To keep track of previously created species, we implemented a class attribute declared_species that ensures each species is instantiated only once and that equivalent species instances can be accessed and shared.

This is particularly helpful for the 1-N relationship between Species and RNA_Molecule: one RNA_Molecule instance belongs to at most one Species, but one Species can have multiple RNA_Molecule instances. Each Species instance therefore stores a dictionary of RNA_Molecule instances, with the RNA_Molecule instance's id as key and the RNA_Molecule as value (a dict rather than a list, so that lookup uses the dict's hashing instead of looping through a list).

To complete the relationship, the RNA_Molecule class has a species attribute that points to the Species instance it belongs to. It is set only by adding an RNA_Molecule instance to a Species instance, never by assigning it directly (automatic handling).

For example, if RNA molecule 1 and RNA molecule 2 are both associated with the E. coli species, declared_species contains a single E. coli Species instance, whose RNA molecule dictionary holds both rna1 and rna2.
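
One way to obtain this behavior is sketched below, using `__new__` to return the already-declared instance. This is an assumption about the mechanism, not a description of the actual implementation; the names `rna_molecules`, `add_RNA`, `_add_species`, and `entry_id` are likewise hypothetical.

```python
class RNA_Molecule:
    def __init__(self, entry_id):
        self.entry_id = entry_id
        self.species = None

    def _add_species(self, species):
        # Private setter, intended to be called only from Species.add_RNA().
        self.species = species


class Species:
    declared_species = {}  # species name -> the single Species instance

    def __new__(cls, name):
        # Reuse the existing instance when this species was already declared.
        if name in cls.declared_species:
            return cls.declared_species[name]
        instance = super().__new__(cls)
        cls.declared_species[name] = instance
        return instance

    def __init__(self, name):
        # __init__ runs on every call, so skip it for reused instances.
        if hasattr(self, "rna_molecules"):
            return
        self.name = name
        self.rna_molecules = {}  # RNA id -> RNA_Molecule

    def add_RNA(self, rna):
        self.rna_molecules[rna.entry_id] = rna
        rna._add_species(self)


rna1, rna2 = RNA_Molecule("rna1"), RNA_Molecule("rna2")
Species("E. coli").add_RNA(rna1)
Species("E. coli").add_RNA(rna2)  # resolves to the same instance
```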

Since both Species and Family have a relationship with RNA_Molecule, the user can now view the distribution of species in a family. Calling the distribution() method of the Family class returns a data frame of species and the number of RNA_Molecule instances associated with each. For visualization, the plot_distribution() method of the Family class draws a pie chart of the species distribution.

fam = Family(id='SAM', name='SAM')
fam.plot_distribution()  # pie chart of the species distribution

[Figure: species distribution pie chart]
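
The counting behind distribution() can be sketched as follows. The real method returns a data frame; to stay dependency-free, this sketch returns a list of (species, count) pairs instead. The function name `species_distribution` and the "unassigned" label for molecules without a species are assumptions.

```python
from collections import Counter


def species_distribution(species_names):
    """Count RNA molecules per species name; None marks a molecule with no species."""
    counts = Counter(
        name if name is not None else "unassigned" for name in species_names
    )
    return counts.most_common()  # most frequent species first


dist = species_distribution(["E. coli", "H. sapiens", "E. coli", None])
```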

Code Explanation

RNA_IO Class


RNA_Parser and RNA_Writer Classes

PDB_Parser Class


PDB_Writer Class


Processor Class


Decoupling Analysis

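Parsing and writing are decoupled from the RNA structure representation in both the design and the implementation: parsers and writers handle only format-specific operations, while the Processor converts between parsed content and RNA_Molecule.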
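
The write side of this decoupling can be illustrated with a small sketch in which two writers share one flattening step. Everything here is hypothetical: the nested-dict stand-in for an RNA molecule, the `flatten` function (playing the Processor's role), and the `CSV_Writer` and `XYZ_Writer` classes, which write to a text stream rather than a file path and are not part of the library.

```python
import io
from abc import ABC, abstractmethod


def flatten(molecule):
    """Return one (model, chain, residue, atom_name, x, y, z) tuple per atom."""
    rows = []
    for model_id, chains in molecule.items():
        for chain_id, residues in chains.items():
            for res_num, atoms in residues.items():
                for name, x, y, z in atoms:
                    rows.append((model_id, chain_id, res_num, name, x, y, z))
    return rows


class RNA_Writer(ABC):
    @abstractmethod
    def write(self, molecule, stream):
        """Serialize `molecule` to the text stream in a specific format."""


class CSV_Writer(RNA_Writer):
    def write(self, molecule, stream):
        stream.write("model,chain,residue,atom,x,y,z\n")
        for row in flatten(molecule):
            stream.write(",".join(str(value) for value in row) + "\n")


class XYZ_Writer(RNA_Writer):
    def write(self, molecule, stream):
        rows = flatten(molecule)
        stream.write(f"{len(rows)}\n\n")  # atom count, then an empty comment line
        for *_, name, x, y, z in rows:
            stream.write(f"{name} {x} {y} {z}\n")


# Stand-in structure: model -> chain -> residue number -> list of atoms.
molecule = {0: {"A": {1: [("P", 1.0, 2.0, 3.0), ("C1'", 4.0, 5.0, 6.0)]}}}
csv_out, xyz_out = io.StringIO(), io.StringIO()
CSV_Writer().write(molecule, csv_out)
XYZ_Writer().write(molecule, xyz_out)
```

Neither writer knows how the structure is nested; only flatten() does. A change to the structure classes would therefore touch the flattening step alone, and a new output format would touch only a new writer.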