hackseq18 project list
- hackseq18 project list
* : “I can work with this langauge, if I have access to Stack Overflow …“
* * : “I know this language and use it regularily.”
* * * : “I am one with the language.”
Team Lead: Sam Chorlton (sam.chorlton [AT] pm.me)
Help change the world by filtering unneeded sequences from a next-generation sequencing dataset, enriching signal from noise and enabling rapid pathogen discovery, isolation of sequence types (eg. rRNA), contaminant removal and more.
- C/C++: * *
- (or) Python: * *
- Next-Gen Sequencing: *
- Fun: * * *
- Evaluate Software
- Writing and/or science
Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (eg. PhiX), discover known and novel pathogens, isolate nucleic acid types (eg rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses.
Here I propose an end-to-end pipeline (REUSE) for Rapidly Eliminating Unwanted SEquences from large sequencing datasets. The result of REUSE will be sequences that do not belong to a reference sequence. This pipeline will be based on previously established techniques for isolating known and novel pathogens among sequencing data. It will seek to dramatically speed up the process, optimize flaws in other pipelines, and automate it from start to finish. It will likely include a k-mer filter, read alignment, read assembly, and contig alignment. Some of these steps will be based on publicly available tools, such as RNA-STAR and Trinity, whereas others will need to be programmed from the ground up.
The work at hackseq18 will focus on development of the most novel and needed module, the k-mer filter (k-REUSE). Previous evidence indicates that k-mers can be used to rapidly screen and filter sequences, and that a k-mer of 21 basepairs is sufficient to discriminate between unrelated species.(1) Currently published applications, such as Kontaminant(1), Cookiecutter(2), BBDuk(3) and others have several limitations, including lack of parallelization, high memory requirements (>50gb for the human genome), and lack of ability to save the reference index to disk. Other techniques, such a read alignment, are too slow to use on large datasets.
The goal of HackSeq18 will be the development of k-REUSE and comparison to other filters. Further development will likely be needed after the hackathon for integration of k-REUSE into the complete REUSE pipeline and ultimate application to extremely large datasets.
Team Leads: Shaun Jackman, Rayan Chikhi and Paul Medvedev
Our hackseq project aims to create a better assembly tool by mixing and matching components of various assembly tools. We will pit these Frankenstein assembly pipelines against each other in a fantastic assembler bake off!
- Scripting: * *
- R / Python: *
- Genome Assembly
- AssemblerFlow / Nextflow Language
Genome sequencing yields many short reads of DNA from a genome. Genome assembly attempts to reconstruct the original genome from these reads. Most genome assembler software tools are pipelines of many stages. It’s not however typical to mix and match stages from different tools. Our Hackseq project aims to create a better assembly tool by mixing and matching components of various assembly tools.
We will work together to create modular assembler components for AssemblerFlow, a tool which builds pipelines of tools for genome assembly using Nextflow. Each participant will create and run a genome assembly pipeline using AssemblerFlow. We will assess the quality of each assembly, using Quast. Finally, we’ll create a leader board of awesomeness to compare the assembly results!
Project 3: TaraCyc
Team Lead: Arjun Baghela (ECOSCOPE)
This project will aim to construct a website to represent thousands of ocean virus genomes. The goal of the website is to allow world citizens to play, explore and interact with the genome dataset to gain understanding into the microscopic life living in Earth’s oceans.
- Web development
- Gene pathway analysis
- Environmental genomics
TARA Oceans represents one of the world’s greatest coordinated research and sampling efforts and has collected years of standardized and detailed information about Earth’s largest cohesive eco-system providing insights crucial for the preservation of mankind our planet. Of particular focus is microbial community metabolism, which drives biogeochemical cycling in the ocean through networks of metabolite exchange and gene transfer. Viruses can reprogram these networks by co-opting host metabolism during infection using auxiliary metabolic genes (AMGs) encoded in viral genomes. Metabolic reprogramming has the potential to alter metabolic flux at the individual, population and community levels with resulting feedback on biogeochemical cycling at ecosystem scales. While previous gene-centric studies have identified viral AMGs associated with carbon and energy metabolism in the sunlit and dark ocean, the extent to which reprogramming targets the metabolic repertoire of marine microbial communities remains unconstrained. Here we use a pathway-centric approach to identify known and novel metabolic hotspots associated with global AMG distribution patterns and bound viral reprogramming potential in the ocean.
We used MetaPathways v2.5 in combination with Pathway Tools and the MetaCyc hierarchy of metabolic pathways to construct an environmental pathway genome database (ePGDBs) from hundreds of thousand viral contigs assembled from the Tara Oceans transect . This global ocean virome (GOV) was assembled from a subset of 90 viral fractions from a total of 288 size-fractioned samples spanning the entire transect. For Hackseq, we wish to construct a website called TaraCyc, which would represent a compendium of these ePGDBs and provide a comprehensive collection of MetaPathways output and extracted Environmental Pathway/Genome Databases (ePGDBs). Further, the TaraCyc website would allow world citizens to play, explore and interact with the dataset and gain understanding into the microscopic life living in Earth’s ocean.
Team Lead: Rob Gilmore (robgilmore127 [AT] gmail.com)
Are you a bioinformatician frustrated by R’s inability to resolve common dependency issues? Help develop beRi, a package management, reproducible workflow, and installation toolkit for the R programming language.
- Communication: * *
- Python3: * * +
- R: *
- Bash: *
- PyCharm / Rstudio / Atom IDE
- R Functions: devtools, packrat, miniCRAN, tidyverse
- Python Libs: click, yaml, cookiecutter
- Knowledge of local craft beers
Project 5: Alignment-free Pathogen Genomics
Team Lead: Alex Sweeten (alex.sweeten [AT] gmail.com)
Interested in developing a novel bioinformatics pipeline? Given the unprecedented size of genomic information, new methods are required to organize and manage this data. The goal of this project will be to develop and test an alignment-free method, and apply it towards datasets of pathogenic organisms.
- Python : * *
- GitHub: *
- Bash: *
- Data visualization
- Parallel programming
- Conda package development
- Galaxy package development
- Clustering algorithms
Determining similarity between bacterial isolates is an important requirement for epidemiological analysis. Alignment-based genomic methods are commonly used to tackle this problem. However, in cases of low sequence homology, horizontal gene transfer, or lack of a priori information, as is common when dealing with pathogenic bacteria, alignment-based methods pose significant problems. The normalized compression distance (NCD) is a parameter and alignment-free distance metric, which has shown recent success in genomics, specifically in classifying viral sequences. Here, I propose to develop a pipeline for allowing users to apply NCD to their genomic data, as well as for users to visualize their results in a presentable and easy to read format. This application will be built using a Python framework and developed into a Conda package and/or Galaxy workflow (depending on time required & participant experience). Furthermore, I propose to test/benchmark our pipeline using datasets of pathogenic bacteria and viral sequences. This project will be a novel community-based application, potentially facilitating further research into alignment-free similarity methods.
Project 6: Simulating Transcriptome Structural Variants
Team Leads: Cara Reisle (caralynreisle [AT] gmail.com) & Morgan Bye (morgan [AT] morganbye.com)
While there are tools to simulate structural variants in genomic data, currently no such applications exist for transcriptomes. Simulated data is required to properly evaluate structural variant callers.
- Programming: * *
- Python : *
- Genetics/Sequencing: *
- Structural Variation
- BAM/SAM file formats
- Pysam library
The goal of this project is to produce an application that is able to simulate structural variants in transcriptome data. The application will be written in python. It will be a pipeline consisting of several steps:
- Specification of the structural variants to be simulated
- Generation of the resulting transcriptome sequence
- Simulating paired-end reads covering the structural variant breakpoints
- Alignment of the simulated reads to the reference genome to produce a BAM file
If you’re new to structural variants or working with sequencing data, it would be a good idea to do a bit of background reading. I’ve listed a some potentially relevant literature below. deFuse (PMID 21625565), Chimerascan (PMID 21840877), STAR (PMID 26334920), BAMSurgeon (PMID 25984700), Flux Simulator (PMID 22962361), MAVIS (PMID 30016509), SQUID (PMID 29650026)
Project 7: Blockchain for Infectious Disease
Team Lead: Veena Ghorakavi (vghorakavi [AT] gmail.com)
Infectious diseases are widespread and are difficult to track. Blockchain has the capabilities to track infectious diseases without revealing information about the people impacted and with location data.
- Python : *
- Epidemiology: *
Understanding how to track and prevent the spread of infectious diseases is a task that needs to both protect the identity of the individual impacted and understand the prognosis of the spread. As a result, combining the tracking of infectious diseases with blockchain technology is an important albeit unexplored ideal. The idea is to allow people to mark themselves as infected and for data scientists and medical professionals to use this information.
Team Lead: Noushin Nabavi (nabavinoushin [AT] gmail.com)
We are going to work together in order to analyze an open Government dataset released by Canadian Vital Statistics. We will make hypotheses about the causes of death across Canada over time and test our hypotheses through generating informative plots. This is a great opportunity for those who are new to computational biology to learn some new skills and meet other scientists in the field.
- none, this project is great to learn about R programming
- Epidemiology / Biostatistics
- Public Health
- Familiarity with basic programming
With the rise of aging population and chronic diseases, Canada continues to suffer from the high costs of disease management [1-3]. Additionally, studies on prevalence and patterns of causes of morbidity or co-morbidity and associated determinants in Canada remain scarce and outdated . Canadian Vital Statistics has recently released a comprehensive and well-annotated dataset on causes of death and mortality of all Canadians from 2012 until 2016. We hypothesize that understanding the major causes of mortality (i.e. various diseases) among males and females in different age categories across all provinces can inform health care and governmental policies. In this project, we will formulate hypotheses about what we most die of as a nation and test them by exploring these datasets. The overarching aim is to generate a final report on the most common causes of death across Canada that can at its best inform health care professionals and government agencies in developing evidence-base policies.