Computer scientists have developed Emu, an algorithm that uses long reads of genomes to identify the species of bacteria in a community. The programme could simplify sorting harmful from helpful bacteria in microbiomes like those in the gut or in agriculture and the environment. The microbial community profiling software identifies bacterial species by leveraging long DNA sequences that span the entire length of the gene under study.
The Emu project, led by computer scientist Todd Treangen and graduate student Kristen Curry of Rice’s George R. Brown School of Engineering, facilitates the analysis of a key gene that microbiome researchers use to sort out species of bacteria that could be harmful or helpful to humans and the environment. Their target, 16S, is a subunit of the rRNA (ribosomal ribonucleic acid) gene, whose usage was pioneered by Carl Woese in 1977. This region is highly conserved in bacteria and archaea and also contains variable regions that are critical for separating distinct genera and species.
“It’s commonly used for microbiome analysis because it’s present in all bacteria and most archaea,” said Curry, in her third year in the Treangen group. “Because of that, there are regions that have been conserved over the years that make it easy to target. In DNA sequencing, we need parts of it to be the same in all bacteria so we know what to look for, and then we need parts to be different so we can tell bacteria apart.”
“Years ago, we tended to focus on bad bacteria or what we thought was bad, and we didn’t care about the others,” Curry said. “But there’s been a shift in the last 20 years to where we think maybe some of those other bacteria hanging out mean something.”
“Commonly studied environments include water, soil, and the intestinal tract, and microbes have been shown to affect crops, carbon sequestration, and human health.”
Emu, the name drawn from its task of “expectation-maximization,” analyses full-length 16S sequences from bacteria processed by an Oxford Nanopore MinION handheld sequencer and uses sophisticated error correction to identify species based upon nine distinct “hypervariable regions.”
“With previous technology, we could only read part of the 16S gene,” Curry explained. “It has roughly 1,500 base pairs, and with short-read sequencing, you can only sequence up to 25%–30% of this gene. However, you need the full-length gene to attain species-level precision.”
But even the newest technology isn’t perfect, allowing errors to slip into sequences.
“While error rates have dropped in recent years, they can still have up to 10% error inside an individual DNA sequence, while species can be separated by a handful of differences in their 16S gene,” said Treangen, an assistant professor of computer science who specialises in tracking infectious diseases.
Distinguishing sequencing errors from true differences was the main computational challenge of this research project.
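The expectation-maximization idea behind Emu’s name can be illustrated with a minimal Python sketch. It is not Emu’s actual code: the function name `em_abundances`, the input format and the toy likelihood values are all assumptions made for illustration. Given, for each read, the probability of observing that read if it came from each candidate species, the loop alternates between softly assigning reads to species and re-estimating each species’ share of the community.

```python
def em_abundances(likelihoods, n_iter=200, tol=1e-9):
    """Estimate species abundances with a basic expectation-maximization loop.

    likelihoods[r][s] is an assumed, precomputed probability of seeing
    read r if it came from species s (hypothetical input, not Emu's format).
    Returns a list of abundances that sums to 1.
    """
    n_species = len(likelihoods[0])
    # Start from a uniform guess: every species equally abundant.
    abundances = [1.0 / n_species] * n_species

    for _ in range(n_iter):
        totals = [0.0] * n_species
        # E-step: split each read across species in proportion to
        # (current abundance) x (likelihood of the read under that species).
        for row in likelihoods:
            weighted = [a * p for a, p in zip(abundances, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read matches no candidate species; skip it
            for s, w in enumerate(weighted):
                totals[s] += w / norm

        total = sum(totals)
        if total == 0.0:
            return abundances  # no usable reads; keep the current estimate

        # M-step: new abundance = share of fractional read assignments.
        updated = [t / total for t in totals]
        change = max(abs(u - a) for u, a in zip(updated, abundances))
        abundances = updated
        if change < tol:
            break

    return abundances


# Toy data (made up): four reads scored against two candidate species.
toy_reads = [
    [0.9, 0.1],  # clearly species 0
    [0.8, 0.2],  # probably species 0
    [0.5, 0.5],  # ambiguous, e.g. errors hide the distinguishing bases
    [0.1, 0.9],  # clearly species 1
]
toy_estimate = em_abundances(toy_reads)
print(toy_estimate)
```

Because every read contributes fractionally to each species it could plausibly match, a read made ambiguous by sequencing errors does not have to be forced onto a single species; the community-wide estimate decides how much weight it lends to each candidate.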