Introducing BioCantor, a Python Library to Navigate Data Coordinate Systems

We’re excited to release a new tool designed by Inscripta’s scientists to help bioinformaticians and genomic scientists work more easily with DNA, RNA, and protein data. It demonstrates the work of our stellar bioinformatics team and our relentless quest to make genomic research accessible to more scientists.

The Inscripta team faced a challenge common across the genomics community: navigating multiple coordinate systems to match genomic sequence data with transcripts and proteins. No widely available free solution existed for this. So we built one, and in keeping with our commitment to publicly releasing tools that enable the research community, we have now shared it through a preprint with anyone who needs it.

BioCantor is a Python library designed to make it easier to navigate a diverse set of coordinate systems: for example, when converting between genomic, transcript, and protein coordinates. The problem, as the authors Pam Russell and Ian Fiddes summarize, is that “no publicly available software library offers fully featured interoperable support for multiple coordinate systems.”

Now, BioCantor can fill that role for scientists who need integrated library support and rich operations on genomic features across a variety of file formats. With this tool, programmers no longer need to track many different coordinate systems by hand, a task that often requires repeated calculations to keep positions aligned. “BioCantor enables elegant genomic workflows to be expressed in custom Python code through full end-to-end support of rich feature operations,” the authors write.
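
To make that bookkeeping concrete, here is a minimal sketch in plain Python of the kind of manual conversion a library like this is meant to replace. It does not use BioCantor's API: the function names, the exon layout, and the 0-based, half-open interval convention are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical exon representation (an assumption, not BioCantor's model):
# 0-based, half-open (start, end) genomic intervals.
Exon = Tuple[int, int]


def _in_transcript_order(exons: List[Exon], strand: str) -> List[Exon]:
    """Order exons 5' to 3' along the transcript."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return sorted(exons, reverse=(strand == "-"))


def transcript_to_genomic(exons: List[Exon], pos: int, strand: str = "+") -> int:
    """Map a 0-based transcript position to a 0-based genomic position."""
    if pos < 0:
        raise ValueError("position must be non-negative")
    remaining = pos
    for start, end in _in_transcript_order(exons, strand):
        length = end - start
        if remaining < length:
            # On the minus strand the transcript runs right to left.
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


def genomic_to_transcript(exons: List[Exon], pos: int, strand: str = "+") -> int:
    """Map a 0-based genomic position to a 0-based transcript position."""
    offset = 0
    for start, end in _in_transcript_order(exons, strand):
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    raise ValueError("position is intronic or outside the transcript")


def transcript_to_codon_index(pos: int, cds_start: int = 0) -> int:
    """Map a transcript position to a 0-based codon (amino acid) index.

    Assumes the coding sequence begins at transcript position cds_start.
    """
    if pos < cds_start:
        raise ValueError("position lies upstream of the coding sequence")
    return (pos - cds_start) // 3


if __name__ == "__main__":
    exons = [(100, 110), (200, 220)]  # two exons, 30 transcribed bases
    t = genomic_to_transcript(exons, 205)
    print(t, transcript_to_genomic(exons, t), transcript_to_codon_index(t))
```

Even this toy version has to handle strand orientation, introns, and reading frame separately, and every additional coordinate system multiplies the number of such conversions a programmer must get right.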

With BioCantor, data structures can be represented in JSON format, and the library includes parsers for GenBank and GFF3(+FASTA) files, Russell and Fiddes note. “BioCantor data models can also be exported to GFF3, GenBank, BED, and NCBI TBL format.” In short, this tool is flexible enough for use in any situation where genomic feature arithmetic is required.