/ Projects2018.txt


8 file(s)


hackseq18 project list

{hackseq18’s bio-hackers supreme: Team beRi}

You can also watch the video-stream of the final presentations.

  1. hackseq18 project list
    1. Project 1: REUSE
    2. Project 2: Genome Assembler Components
    3. Project 3: TaraCyc
    4. Project 4: beRi: Virtual Environments for R
    5. Project 5: Alignment-free Pathogen Genomics
    6. Project 6: Simulating Transcriptome Structural Variants
    7. Project 7: Blockchain for Infectious Disease
    8. Project 8: Anatomy of Morbidity

Project 1: REUSE

Team Lead: Sam Chorlton (sam.chorlton [AT] pm.me)

Help change the world by filtering unneeded sequences from a next-generation sequencing dataset, enriching signal from noise and enabling rapid pathogen discovery, isolation of sequence types (eg. rRNA), contaminant removal and more.


Rapid Elimination of Useless SEquences

REUSE is a k-mer based tool for filtration of reads in sequencing datasets that match a reference sequence. reuse build takes a reference fasta file as input, and ouputs a hashed index file. reuse filter takes FASTA/FASTQ file inputs, along with the hashed index file, and outputs k-mer filtered reads in a user-specified format. Common applications of REUSE include filtration of host, contamination, PhiX or ribosomal sequences.

Project 2: Genome Assembler Components

Team Leads: Shaun Jackman, Rayan Chikhi and Paul Medvedev

Our hackseq project aims to create a better assembly tool by mixing and matching components of various assembly tools. We will pit these Frankenstein assembly pipelines against each other in a fantastic assembler bake off!


Genome sequencing yields many short reads of DNA from a genome. Genome assembly attempts to reconstruct the original genome from these reads. Most genome assembler software tools are pipelines of many stages. These tools all work towards the same goal using different methods, and new tools allow for interoperability between all the different stages of different assemblers.

Problem: It is not common to mix and match stages from all the different assembly tools into one and understand the benefits or the disadvantages of doing that, specifically towards understanding the improvement of the assembly quality. Therefore our Hackseq project aims to create a better assembly tool by mixing and matching components of various assembly tools, and visualising the results of this work. The assembly tools that we are considering are displayed here.

Solution: We have attempted to solve this problem by creating modular assembler components for AssemblerFlow, a tool which builds pipelines of tools for genome assembly using Nextflow, and we have created components for a number of assembly tools.

Project 3: TaraCyc

Team Lead: Arjun Baghela (ECOSCOPE)

This project will aim to construct a website to represent thousands of ocean virus genomes. The goal of the website is to allow world citizens to play, explore and interact with the genome dataset to gain understanding into the microscopic life living in Earth’s oceans.


TaraCyc contains the first ever large-scale mapping of viral metabolomics data collected from ocean environments around the world. The Tara Oceans Project travelled worldwide collecting water samples for shotgun metagenomic sequencing. This genomic data has been mapped to metabolomic pathways. Recently, viruses have been shown to encode proteins that interact with host metabolic pathways in ways not previously explored, such as a set of cyanophage that modified photosynthetic activity in infected cyanobacteria in order to promote viral replication.

The TaraCyc website contains all of this analysed metaviriome data with an interactive, user-friendly interface. The website can be accessed at: Tara Viral Voyager.

Project 4: beRi: Virtual Environments for R

Team Lead: Rob Gilmore (robgilmore127 [AT] gmail.com)

Are you a bioinformatician frustrated by R’s inability to resolve common dependency issues? Help develop beRi, a package management, reproducible workflow, and installation toolkit for the R programming language.


The R programming language is a popular choice for statisticians, bioinformaticians, and researchers performing data analysis work. However, it lacks management tools that ensure reproducibility and integration with command line interfaces (CLI).

Problem: R lacks proper management tools such as a project manager, a virtual environment manager, or a package manager. Many available tools are R packages, making it laborious to manage R from the command line. While Rstudio has developed packrat, a project management tool, it has performance issues including lengthy build times. packrat also works poorly with Bioconductor, one of the most widely used R package repositories outside of CRAN. Building an R CLI is also a daunting task even for advanced R users.

Solution: We propose the beRi suite of tools for managing R. beRi is a set of Python packages which is composed of renv - an R virtual environment manager, rinse - an R installation manager and utility tool for installing packages, and rut - a manager for native R configuration files and local CRAN-like repositories. Since these packages will be developed as standalone or fully independent CLI’s, beRi will integrate these three packages and complement their functionalities in a separate higher function interface.

Project 5: Alignment-free Pathogen Genomics

Team Lead: Alex Sweeten (alex.sweeten [AT] gmail.com)

Interested in developing a novel bioinformatics pipeline? Given the unprecedented size of genomic information, new methods are required to organize and manage this data. The goal of this project will be to develop and test an alignment-free method, and apply it towards datasets of pathogenic organisms.


Determining similarity between bacterial isolates is an important requirement for epidemiological analysis. Alignment-based genomic methods are commonly used to tackle this problem. However, in cases of low sequence homology, horizontal gene transfer, or lack of a priori information, as is common when dealing with pathogenic bacteria, alignment-based methods pose significant problems. The normalized compression distance (NCD) is a parameter and alignment-free distance metric, which has shown recent success in genomics, specifically in classifying viral sequences. Here, I propose to develop a pipeline for allowing users to apply NCD to their genomic data, as well as for users to visualize their results in a presentable and easy to read format. This application will be built using a Python framework and developed into a Conda package and/or Galaxy workflow (depending on time required & participant experience). Furthermore, I propose to test/benchmark our pipeline using datasets of pathogenic bacteria and viral sequences. This project will be a novel community-based application, potentially facilitating further research into alignment-free similarity methods.

Project 6: Simulating Transcriptome Structural Variants

Team Leads: Cara Reisle (caralynreisle [AT] gmail.com) & Morgan Bye (morgan [AT] morganbye.com)

While there are tools to simulate structural variants in genomic data, currently no such applications exist for transcriptomes. Simulated data is required to properly evaluate structural variant callers.


In multicellular organisms, nearly every cell contains the same genome and thus the same genes. However, different cells show different gene expression underlying the different functions and tissues. Recent work in the cancer field, has shown that gene expression is more significant to disease outcomes than the genetic mutation 1.

Structural variants of genetic information are primary drivers of many diseases 2, making them a prime target for bioinformatic tools, but new tools require test datasets. Datasets are frequently large, ambiguous and subject to restrictive sharing permissions. However, sufficient numbers of publicly available genomes has allowed for the establishment of simulation tools of SVs.

Transcriptome sequencing is a much newer field, with a very low number of publicly available datasets — particularly in rare variants, significant to disease 3. Here we present a tool which injects SVs into a variety of different expression profiles to allow for the training and testing of bioinformatic tools. Simulating the transcriptome reads allows for a known ground truth so that we accurately quantify performance of tools not only in terms of sensitivity but also specificity.

Project 7: Blockchain for Infectious Disease

Team Lead: Veena Ghorakavi (vghorakavi [AT] gmail.com)

Infectious diseases are widespread and are difficult to track. Blockchain has the capabilities to track infectious diseases without revealing information about the people impacted and with location data.


Understanding how to track and prevent the spread of infectious diseases is a task that needs to both protect the identity of the individual impacted and understand the prognosis of the spread. As a result, combining the tracking of infectious diseases with blockchain technology is an important albeit unexplored ideal. The idea is to allow people to mark themselves as infected and for data scientists and medical professionals to use this information.

Project 8: Anatomy of Morbidity

Team Lead: Noushin Nabavi (nabavinoushin [AT] gmail.com)

We are going to work together in order to analyze an open Government dataset released by Canadian Vital Statistics. We will make hypotheses about the causes of death across Canada over time and test our hypotheses through generating informative plots. This is a great opportunity for those who are new to computational biology to learn some new skills and meet other scientists in the field.


Since 2016, the world has been implementing United Nation’s Agenda for Sustainable Development, by addressing urgent global challenges over the next 15 years. This mandate is undertaken differently in various countries. In Canada, data on deaths and causes of death are collected by the Canadian Vital Statistics and released as a ‘Death Database’. This database is an administrative survey that collects demographic and medical (cause of death) information annually from all provincial vital statistics registries on all deaths in Canada. The problem with the existing open datasets, however, are that they are too large and the categories are not well defined nor useful for non-data experts. Our hypothesis-driven solution enables interactive visualizations so that the information within the databases are more readily accessible, user-friendly, and interactive. We employ the free and open-source integrated development environment for R as part of our workflow to categorize diseases based on gender and age in time. This data is then summarized in the form of static and dynamic plots as well as interactive maps.