Computational tool enables scientists to evaluate and enhance quality of viral genomes | AGÊNCIA FAPESP

The program can be used to control the quality of data obtained by sequencing environmental samples with a technique known as metagenomics. It shows how complete the mapped sequences are and whether they are contaminated by genetic material from other microorganisms (image: Zosia Rostomian/Berkeley Lab).

January 26, 2022

By Karina Toledo | Agência FAPESP – Researchers working in the United States have developed a computational tool that can be used to evaluate and enhance the quality of viral genomes obtained by metagenomics, a technique that sequences all the genetic material present in an environmental sample (of soil or water, for example) and then uses bioinformatics to identify the organisms to which the sequences belong.

An article published in Nature Biotechnology reports the findings of the research, which was conducted at the US Department of Energy’s Joint Genome Institute and led by Nikos Kyrpides and Stephen Nayfach. One of its co-authors was Antônio Pedro Camargo, a doctoral fellow at the State University of Campinas’s Institute of Biology (IB-UNICAMP) in Brazil with a scholarship from FAPESP. The tool is called CheckV, and can be downloaded free of charge from

“Metagenomics has permitted a huge advance in the knowledge of microbial biodiversity,” Camargo told Agência FAPESP. “More than 40% of known bacterial species have been described recently thanks to this type of approach. However, we still know very little about viruses. Most of the genomes sequenced are from human pathogens. We want to know more about bacteriophages [viruses that infect bacteria]. They’re the most abundant biological entities on Earth and play an important role in regulating ecosystems.”

A large proportion of the microorganisms that participate in the cycling of carbon, nitrogen, sulfur, and other nutrients are regulated by viruses in their natural environments. Metagenomics can help scientists retrieve the genomes of these viruses and associate them with their microbial hosts.

“However, the technique is subject to errors and incompleteness,” Camargo said. “The sequences obtained for the analysis of environmental samples are often fragmented, incomplete, or contaminated by genetic material from host bacteria. For this reason, for several years scientists have been looking for ways to control the quality of the results obtained by metagenomics.”

Viruses evolve far more quickly than cellular organisms and have highly plastic genomes, he added. A bacteriophage will often “steal” genes from its host bacterium or from another virus by which the host is also infected.

“Most bacteriophages have a circular genome that’s integrated into the host bacterium’s genome,” he said. “The tool can measure the degree to which the viral sequences are contaminated and separate the genetic material of the virus from that of the bacterium by analyzing an environmental sample using metagenomics.”
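
The idea of measuring contamination and separating viral from host material can be illustrated with a minimal Python sketch. It is not CheckV's actual method: it assumes each gene on an assembled sequence has already been labeled as "viral" or "host" by some upstream step, and the `Gene` class and both function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Gene:
    """A gene on an assembled sequence (hypothetical structure, 1-based inclusive coordinates)."""
    start: int
    end: int
    origin: str  # assumed label from an upstream annotation step: "viral" or "host"


def host_contamination_percent(genes: List[Gene], contig_length: int) -> float:
    """Percentage of the sequence covered by genes labeled as host."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    host_bp = sum(g.end - g.start + 1 for g in genes if g.origin == "host")
    return 100.0 * host_bp / contig_length


def viral_region(genes: List[Gene]) -> Optional[Tuple[int, int]]:
    """Span from the first to the last viral gene, or None if no gene is viral."""
    viral = [g for g in genes if g.origin == "viral"]
    if not viral:
        return None
    return min(g.start for g in viral), max(g.end for g in viral)
```

For a 4,000-base sequence with one host gene at positions 1 to 1,000 and one viral gene at positions 1,001 to 4,000, this sketch would report 25% host contamination and a viral region spanning positions 1,001 to 4,000.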

CheckV can also show how complete the sequence obtained is. According to Camargo, this is possible because genome size is relatively conserved among similar viruses, as it is limited by the capsid, the protein shell that protects their genetic material.

“The tool compares the size of the sequence obtained by means of metagenomics with the expected template for the group. If half of it is detected, for example, the tool shows us that it’s 50% complete,” he explained.
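
The comparison Camargo describes can be sketched in a few lines of Python. This is a simplified illustration of the arithmetic only, not CheckV's implementation: the function name and its parameters are hypothetical, and it assumes the expected genome length for the viral group is already known.

```python
def estimate_completeness_percent(observed_length: int, expected_length: int) -> float:
    """Estimate completeness as observed length over expected length, capped at 100%.

    Both lengths are in bases. `expected_length` is assumed to come from
    related reference genomes; how it is obtained is outside this sketch.
    """
    if observed_length <= 0 or expected_length <= 0:
        raise ValueError("lengths must be positive")
    return min(100.0, 100.0 * observed_length / expected_length)


# A 20,000-base sequence from a group whose genomes are expected
# to be about 40,000 bases long is estimated to be 50% complete.
print(estimate_completeness_percent(20_000, 40_000))  # 50.0
```

The cap at 100% is a design choice in this sketch: a sequence longer than expected may point to contamination or a duplicated region rather than to a genome that is "more than complete".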

To validate CheckV, the group led by Kyrpides and Nayfach used two databases containing sequences for viruses not cultured in a laboratory but obtained by sequencing environmental samples: Integrated Microbial Genomes (IMG/VR), and Global Ocean Virome 2.0. The tool identified a total of 44,652 complete or almost complete viral genomes in both collections (3.6% of the volume analyzed), separating them from the vast majority of other sequences, which were fragmented or incomplete.

CheckV also identified some 17,000 viral sequences in IMG/VR that were integrated into the genomes of host bacteria. Separating the viral genetic material from that of the host made it possible to find out which virus-encoded genes modulated bacterial metabolism.

Beneficial microbes

In his Ph.D. research, Camargo has focused on studying bacteria that both thrive in adverse environments and somehow help plants survive in situations where nutrients are limited. The principal investigators for the project are Marcelo Falsarella Carazzolle and Paulo Arruda, professors at IB-UNICAMP and members of the Genomics for Climate Change Research Center (GCCRC), an Engineering Research Center (ERC) established by FAPESP and the Brazilian Agricultural Research Corporation (EMBRAPA) at UNICAMP.

“Plants and animals are full of microorganisms that create a functional link with their hosts and can benefit the individual’s defense system, nutrition, and overall health. One of the main challenges of researching the microbiome is identifying the mechanisms whereby beneficial microbial communities interact with host organisms to modulate their performance,” Arruda told Agência FAPESP.

Mass sequencing of metagenomes and metatranscriptomes (sets of RNA molecules expressed in environmental samples) has revealed a significant diversity of DNA and RNA viruses that belong to the microbiota of plants and humans, as well as many new functions regulated by the microbial communities of their hosts, Arruda explained.

“Understanding the role of these viral communities in the functioning and evolution of organisms is the next challenge. New computational tools that enable us to analyze organisms’ viromes more accurately and efficiently are fundamental to the development of this vital area of the life sciences,” said Arruda, who is Project Leader at GCCRC.

The article “CheckV assesses the quality and completeness of metagenome-assembled viral genomes” is at:



Agência FAPESP licenses news reports under Creative Commons license CC-BY-NC-ND so that they can be republished free of charge and in a straightforward manner by other digital media or by print media. The name of the author or reporter (when applied) must be cited, as must the source (Agência FAPESP).