Simplifying research code with modern python features
Cyril Matthey-Doret January 13, 2023 [programming] #python #genomicsMost scientific software is developed by researchers whose main area of expertise is not software engineering, and who may not have much time to dedicate to development. This often results in a tendency to ignore more advanced language features, such as failing to create custom types and instead over-rely on the language's built-in types.
While it is arguably better to avoid over-engineering to keep the code simple and accessible to a wider audience, using more advanced features can also improve readability and maintainability in larger codebases. In such situations, sticking to the built-ins can result in a code smell known as primitive obsession.
Fortunately, python provides several features to easily take advantage of custom types and interfaces with little overhead. In this post, I'll be showcasing a few modern(ish) python features that can help keep your code readable, and save you time in the long term.
I thank Fabian Egli for his feedback, which helped improve this post.
🪆 The problem
A few issues may arise when over-relying on built-ins to represent complex domain-specific models:
- Lack of custom interfaces can force more complex logic or repetitions
- Code becomes harder to read due to the absence of explicitly named fields and methods
- Maintainability suffers: changing a field may require replacing it at multiple places, as opposed to a single definition.
Let's take a look at an example that uses only built-ins organized into nested structures:
=
=
=
= <= <=
= <= <=
= ==
return True
return False
The code in this post was tested under python 3.10.4
We define a genome as a complex nested structure of built-in types (list, tuple, str, int). Then we write a function which will check whether an input read falls on a gene from a given genome. All intervals are 0-based and open.
Even with intuitive variable names (gene
and not g
) it takes a lot of mental effort to understand the nesting, and translate the numeric indices (e.g. gene[1][1]
) into their meaning 🪆. This can make the code harder to understand and maintain.
🥉 Type hints
TypeAlias was introduced by PEP-613 in python 3.10 (2020)
Despite being a dynamically typed language, python provides type hints to specify the types of variables. Type hints are not enforced at runtime, but they go a long way to help the user (or yourself) understand how to use your code and structures. They are also well integrated with common IDEs, which will display hints and warn you in cases of inconsistencies.
Tip: If you use the Pylance extension in VScode, static analysis does not use type checking by default. To enable it, go to
settings > extensions > Pylance
and setAnalysis: Type checking mode
to "strict" or "basic".
Type hints can be provided to specify the function signature. In the example above, this would be very hard to read due to the nesting:
...
Moreover, if we decided to make a change in one of those structures, we would need to edit the type hints in the signature of every function that uses them. This is where TypeAlias
and NewType
can be useful:
# Use NewType to denote the meaning of a type
=
=
# Use TypeAlias to define equivalents to complex signatures
: =
: =
: =
...
The function signature is much easier to read, and we know that the genome is represented as a list of chromosomes right away. In an IDE such as VScode or PyCharm, placing the mouse over Chrom
will show the alias definition:
(type alias) Chrom: Type[tuple[Name, Size, list[tuple[Name, tuple[str, int, int, Literal['+', '-']]]]]]
If we needed to change some data structures, we would only need to edit the hints in the alias definition.
Type hints are a great way of documenting code, and they can even be combined with tools like mypy or pyright to check for errors! In this example, however, the hints themselves are also hard to read because of the heavily nested structures, we need a better solution.
🥈 Named tuples
NamedTuple was introduced by PEP-526 in python 3.6 (2016)
Named tuples improve code readability by allowing name-based field access in your tuples. They also inherit regular tuple features:
=
=
# Allows standard tuple operations
, , =
# Adds name-based access semantics
-
typing.NamedTuple
is a typed variant of namedtuple
which can be subclassed to take advantage of type annotations mentioned earlier. The definitions are more verbose, but much easier to understand:
"""An optionally stranded, 0-based, open genomic interval."""
:
:
:
: = None
"""A gene, consisting of a named entity with genomic coordinates."""
:
:
"""A chromosome represented as a collection of genes"""
:
:
:
=
=
=
In this case, named tuple semantics also make it easier to figure out what's going on in the logic, but it still feels verbose and repetitive:
= <= <=
= <= <=
= ==
return True
return False
🥇 Data classes
dataclass was introduced by PEP-557 in python 3.7 (2017)
Dataclasses offer the flexibility of python classes with minimal administrative overhead.
First, we can define an interval dataclass. The simplest definition we can write looks almost identical to the previous NamedTuple
subclass:
"""An optionally stranded, 0-based, open genomic interval."""
:
:
:
: = None
Again, placing the mouse over any occurrence of Interval
should show the type hints, along with the docstring annotation:
`(class) Interval(chrom: str, start: int, end: int, sign: Literal['+', '-'] | None = None)`
`An optionally stranded genomic interval.`
Now, let's pretend that we want the ability to sort intervals (i.e. define interval1 > interval2
). With a traditional class, this would require writing dunder methods __lt__
and/or __gt__
to define the behaviours of <
and >
operators.
With dataclasses, most dunder methods (__init__
, __repr__
, __eq__
, ...) are automatically generated. Their behavior can be customized with arguments to the @dataclass
decorator and the field()
helper function.
For example, below, we use order=True
to say that instances of Interval
can be ordered. This will automatically define dunder methods __lt__
and __gt__
. We then use field(compare=True)
to specify which fields should be used in the comparison.
"""An optionally stranded, 0-based, open genomic interval."""
: =
: =
: =
: =
Dunder methods can also be manually implemented for specific operators. For the sake of this example, we are going to define the in
operator by defining __contains__
. This will be used to define whether another interval (at least partially) falls in the current interval.
For the sake of showing dataclasses, I overrid the
in
operator, but in practice, this can be confusing. A method likeInterval.overlaps(self, other)
would be a better choice.
It makes sense to move this logic into the Interval
definition, because we will likely need to reuse it in many different functions.
"""An optionally stranded, 0-based, open genomic interval."""
: =
: =
: =
: =
"""Checks if another interval overlaps with self."""
= <= <=
= <= <=
= ==
return and
Now we can sort intervals and check for overlap!
>>> =
>>> =
>>> in
True
>>>
We can reuse this Interval
to define Gene
and Chrom
. Since a gene is just a named interval here, we can subclass Interval
to inherit all its attributes and methods, and simply add an optional name
attribute (or any gene-specific behavior we may want):
Since a chromosome contains a set of genes (which are a subclass of interval), we can easily define a helper method to check if a given interval overlaps any gene in a chromosome. This is shown below in Chromosome.gene_overlaps()
.
"""A named genomic interval."""
: =
"""A chromosome, represented as a collection of named genomic intervals."""
:
:
:
return True
return False
With the main overlap logic taken out, this function is much easier to read than the original version.
=
=
=
=
return True
return False
dataclasses are more flexible than named tuples. However, they are not as fast and don't inherit the default tuple behaviors (hashable, iterable, unpackable). For a more detailed comparison, see this stackoverflow answer by Oleksandr Yarushevski's.
Here is the original code for comparison:
original:
=
=
=
= <= <=
= <= <=
= ==
return True
return False
In summary, the complicated logic has been moved into the structure definitions, and for the better: each function is more modular (easier to read), and we can reuse them anytime we need new functions, instead of rewriting similar logic.
Using these features introduces the overhead of defining data structures. In this example, named tuples would probably be a good middle ground. For larger projects and algorithms, these definitions and the associated documentation (docstrings and type hints) represent a small cost compared to the long term gain in efficiency and clarity.
🍒 pydantic
pydantic (1.10.4) compatible with python 3.7-3.11
This package is not part of python's standard library, but I really like it, and it is a very useful package if you need to serialize your classes.
It provides a BaseModel
class with serialization methods such as .dict()
, .json()
, as well as additional constrained types to validate instances.
In the example below, we define Interval
as a subclass of BaseModel
to access its neat features:
- Constrained types: note
start
andend
are of typepos
, which we define as a constrained non-negative integer type withconint
. - Instance validation: Until now, type hints were only informative. pydantic actually enforces type hints at instantiation time. If we attempt to create a gene with a negative coordinates, or add genes in a chromosome with a different
chrom
attribute than the chromosomename
, it will raise an exception with a clear message.
Here is the full example rewritten with pydantic:
# Genomic positions can not be negative
=
"""An optionally stranded, 0-based, open genomic interval."""
:
:
:
:
"""Checks if another interval overlaps with self."""
= <= <=
= <= <=
= ==
return or and
"""A named genomic interval."""
:
"""A chromosome, represented as a collection of named genomic intervals."""
:
:
: =
"""Checks that input genes are on the right chromosome."""
=
=
return
return
If we try to enter a gene with negative coordinates, we get a clear error:
>>> =
: 1
is or 0
Similarly, we cannot enter a gene from the chr1 into chr2:
>>> =
>>> =
: 1
is
Finally, pydantic allows serializing the instance to json. This is great for interoperability, as many tools are compatible with this format, and it is also simple to convert common genomic formats to or from json:
>>> =
>>> =
>>> =
>>>
Similarly, the parse_file method can instantiate the class directly from a json file!
>>>
The object can also be converted into a json schema using Chrom.schema_json(). This can be very useful for validating data in other tools.
json schema