Metagenomic analysis and functional characterization of the biogas microbiome using high throughput shotgun sequencing and a novel binning strategy

Background Biogas production is an economically attractive technology that has gained momentum worldwide over the past years. Biogas is produced by a biologically mediated process, widely known as “anaerobic digestion.” This process is performed by a specialized and complex microbial community, in which different members have distinct roles in the establishment of a collective organization. Deciphering the complex microbial community engaged in this process is interesting both for unraveling the network of bacterial interactions and for applicability potential to the derived knowledge. Results In this study, we dissect the bioma involved in anaerobic digestion by means of high throughput Illumina sequencing (~51 gigabases of sequence data), disclosing nearly one million genes and extracting 106 microbial genomes by a novel strategy combining two binning processes. Microbial phylogeny and putative taxonomy performed using >400 proteins revealed that the biogas community is a trove of new species. A new approach based on functional properties as per network representation was developed to assign roles to the microbial species. The organization of the anaerobic digestion microbiome is resembled by a funnel concept, in which the microbial consortium presents a progressive functional specialization while reaching the final step of the process (i.e., methanogenesis). Key microbial genomes encoding enzymes involved in specific metabolic pathways, such as carbohydrates utilization, fatty acids degradation, amino acids fermentation, and syntrophic acetate oxidation, were identified. Additionally, the analysis identified a new uncultured archaeon that was putatively related to Methanomassiliicoccales but surprisingly having a methylotrophic methanogenic pathway. Conclusion This study is a pioneer research on the phylogenetic and functional characterization of the microbial community populating biogas reactors. By applying for the first time high-throughput sequencing and a novel binning strategy, the identified genes were anchored to single genomes providing a clear understanding of their metabolic pathways and highlighting their involvement in anaerobic digestion. The overall research established a reference catalog of biogas microbial genomes that will greatly simplify future genomic studies. Electronic supplementary material The online version of this article (doi:10.1186/s13068-016-0441-1) contains supplementary material, which is available to authorized users.

. Number of reads obtained from the TruSeq and Nextera libraries for each sample after quality filtering. All the "TruSeq reads" and the "Nextera" paired reads combined with Flash software of the samples CSTR01a, CSTR01b CSTR02a, CSTR02b, CSTR03a and CSTR03b were used to generate the assembly. All the other "Nextera reads" were used for the binning process.

Comparison between number of genes belonging to each KEGG pathways module and coverage.
Genes belonging to each KEGG pathways module are reported in the Additional File 2: Table  S1 and details regarding the genes annotation procedure are reported in the Materials and methods section. For each KEGG pathways the total number of genes was determined. The average coverage for each class was determined for CSTR01a, CSTR02a and CSTR03a samples (chosen because they had similar manure composition and because they were operated for the same time before sampling) and was obtained aligning their sequencing reads on the global assembly using Bowtie2 software [1]; the coverage was calculated using the genomecov software of the bedtools package [2]. The average coverage was determined taking into account the total number of genes for each KEGG pathways module. In Figure S2 only the more relevant KEGG pathways modules are reported and they were ranked considering the ratio "coverage/number of genes". Classes with the higher ratio are those more represented in the most abundant GBs.

Figure S2
Graph reporting the comparison between number of genes for each KEGG pathway module and their coverage. Histogram reports the number of genes (grey bars) and coverage is represented by the red line. Coverage is the average among samples obtained from bioreactors CSTR01a-03a. Only some selected KEGG pathways modules are reported.

Binning strategy.
Identification of the GBs was performed with a "two-steps strategy" ( Figure S3). All the extracted GBs and the scripts used for the binning procedure can be downloaded from http://www.biogasmicrobiome.com. In the first step, GBs were separated considering both their tetranucleotide composition and also a pairwise comparison of the scaffolds coverage; this procedure was performed by means of the "metagenome.workflow" script [3] (metagenome-workflow strategy). In this step, the scaffolds coverage determined in the first three CSTRs (samples 01a, 02a, 03a, 01b, 02b, 03b) were those considered for extraction of the GBs. In this procedure, different pairwise comparisons were considered for the binning. Using the procedure described by Albertsen et al. [3], the essential genes were identified and taxonomically assigned considering the BLASTP results filtered using MEGAN software [4]. The second step of the clustering strategy was based on binning co-abundant scaffolds across all the metagenomic samples collected ( Figure S1) (named "coverage strategy"). This procedure was implemented due to difficulties in the application of the "metagenomeworkflow strategy" to numerous different samples. The main difficulty was due to the binning of the same GBs in different pairwise comparisons but also to the excessive time requested by the procedure. Despite these limitations, the GBs extracted from the metagenome workflow strategy are of high quality and useful to verify the rest of the binning procedure based on the coverage strategy. All the scaffolds encoding essential genes were extracted and their coverage in the 18 samples analyzed was recovered. To verify if the scaffolds assigned to each genome bin using the same procedure have similar behavior, the file was uploaded in MeV software [5] and clustered (Euclidean distance and single linkage). Clustering procedure cannot be applied to the entire population of 409,831 scaffolds because this would require an enormous computational effort. For this reason, the clustering procedure was applied only to the 12,345 scaffolds taxonomically assigned. Using MeV, a color was assigned to each GB (extracted in the first step) and to each of the 12,345 taxonomically assigned scaffolds. After the hierarchical clustering procedure, these colors allowed a simple verification of the correctness of the GBs extracted in the first step and of the presence of GBs not previously extracted with the metagenome workflow strategy. With this "coverage strategy" 61 additional GBs were identified. The script "extract_scaffold_euclidean.pl" was used to extract from the entire list those belonging to the GBs identified. This script calculates the average coverage profile (the "canopy profile") and the SD of the Euclidean distance for each GB identified with MeV software and then it extracts from the entire scaffold list those having Euclidean distance smaller than 1 SD from the canopy profile. After these two steps, the paired-ends (PEs) connecting other scaffolds to those identified were recovered using a procedure previously proposed [3] (PE strategy). In this final process, the binned scaffolds (named "baits") allowed the recovery of others scaffolds (named "preys") using "recover_interacting_scaffold.pl". The preys can be recovered considering the PE connections, only if they have a coverage similar to the baits (ratio <= 3 folds or >= 0.333 folds) and if the number of PEs was >= to at least 1/3 rd All the parameters were chosen checking the procedure on the same GBs identified with the metagenome-workflow strategy and testing the completeness and the number of duplicated essential genes identified with the new procedure. At each step of the procedure, genome size of the bins, number of the essential genes and number of duplicated essential genes was of the coverage of the nodes (i.e. at least 10 "PE connections" between a "bait scaffold" having coverage 30 and a prey scaffold are needed to collect the "prey" scaffold). The last parameter was introduced to avoid the recovery of high number of scaffolds connected by very low number of PE connections to the scaffolds with very high coverage (spurious connections). estimated using "extract_data_from_contigs_list.pl" in order to verify differences between the new strategy and the metagenome-workflow strategy. The final parameters used were those generating the best results in terms of GBs completeness and number of duplicated essential genes. At the end of the binning all the scaffold lists (one for each bin) were merged into one single table using "compare_lists.pl", and each scaffold was labeled as "cyto" (if it was recovered by the paired-end strategy) or "binned" (if it was recovered with the metagenome-workflow strategy or with the coverage strategy). The scaffolds assigned to more than one GB were discarded or retained on the basis of three rules. (1) They were removed from all the GBs if they were binned considering co-abundant genes coverage in two or more GBs, because they cannot be assigned due to this unreliable coverage. (2) The scaffolds were also removed if they were recovered by the paired-end strategy in more than one GB. (3) They were maintained if the scaffolds were assigned to two different GBs with different strategies (paired-end and coverage profile) but they were assigned only according to the coverage profile strategy (which is considered more reliable). The selection process was performed using "separate_table_binning.pl" script which also generates multiple lists with scaffolds IDs (one for each GB) that can be used to obtain multifasta files used in the next steps. Some GBs having high (>1000x), medium (500-1000x) and low (<500) coverage were selected for reassembly using only reads belonging to the GB scaffolds. Only in a few cases the reassembly process resulted in a reduction of the scaffold numbers; for this reason GBs were not reassembled at the end of the binning pipeline. The entire binning procedure resulted in the identification of 115 GB; 9 were discarded due to the low level of estimated completeness or to the high number of duplicated genes. The completeness of the GBs and the amount of duplication was defined comparing the univocal and the total number of essential genes present on each GB. It was found that only 31 out of 107 essential genes are present in the archaeal GBs identified, but of course this is not due to incompleteness but to the absence of some essential genes in archaea or to the high level of divergence. To better evaluate the completeness of the five archaeal GBs a further analysis was performed with CheckM software [6] and it was found that only Eu05 has a completeness around 78%, all the others are more than 90% complete (Additional File 3: Table S2). It was also considered that not all the bacterial phyla harbor all the 107 essential genes [3] and this allowed a better evaluation of the GBs completeness. The average completeness of 83% (Additional File 3: Table S2) is slightly underestimated due to difficulties in estimating this value for archaea on the basis of the the107 essential genes. By using CheckM software, the completeness of the archaeal GBs was better evaluated, and the average completeness of the GBs was estimated around 85%. For each GB the assembly result was checked by using the Human Microbiome Project assembly criteria [7]. The threshold used for each HMP criteria validation are: (1) contig N90 > = 500 bp (all GBs meet this criteria); (2) 90% or more of the essential genes are found (60 GBs meet this criteria, but also 4 Euryarchaeota are nearly complete); (3) contig N50 > = 5 kb (84 GBs meet this criteria); (4) scaffold N50 > = 20 kb (46 GBs meet this criteria); (5) average contig length > = 5 kb (57 GBs meet this criteria). The 35 GBs of high quality that meet all the criteria were highlighted in bold in Additional File 3: Table S2; 20 GBs meet at least 4 criteria, 14 GBs meet at least 3 criteria, 16 GBs meet at least 2 criteria and 21 GBs meet only one criteria. The 53 GBs satisfying at least 4 assembly criteria and the 5 archaeal GBs were submitted to NCBI WGS database. In the x and y axes are reported respectively the scaffold coverage in two of the samples analyzed (CSTR01a and CSTR01b). In figure are reported some of the GBs identified using this procedure. (4) After scaffold coverage calculation in all the samples, the taxonomically assigned scaffolds were clustered considering the similarity of the coverage profile. At the right of the cluster image, the taxonomic assignment of the scaffolds and some of the GBs identified are reported. (5) After extraction of the taxonomically-assigned scaffolds from MeV clustering, a "canopy profile" was calculated and the scaffolds having similar coverage profiles were extracted from the entire scaffold list with a procedure similar to that proposed by Nielsen et al. [8]. (6) Scaffolds extracted for each GB and PE connections were analyzed and used to complete the GBs. (7) GBs completeness was estimated determining the presence in the scaffolds binned of 107 essential genes.

Taxonomic assignment of the GBs.
Taxonomic assignment of the GBs was performed with four different methods; results were then compared to extract the best possible one.
(1) The essential genes associated to each GB were checked by sequence similarity to the NR database using BLASTN, with e-value threshold 1e-5. Sequence similarity search of the 107 essential genes sequences was performed on the NR database and the taxonomic assignment of the best match was recovered. Sequence similarity of 95%, 85% and 75% or better was used for species, genus and phylum level taxonomical assignment [8], respectively. According to these thresholds, only 10 GBs were assigned to genus level (Additional File 3: Table S2) and no GBs were assigned at species level.
(2) The essential genes associated to each genome bin were checked by sequence similarity to the NR database using BLASTP, with e-value threshold 1e-5. Results obtained were analyzed and the most similar species are reported in Additional File 3: Table S2. Despite for all the GBs the most similar species was however distantly related, these results helped in the assignment of a putative functional role to the GBs.
(3) The scaffolds of each genome bin were analyzed using Phylopythia (http://phylopythias.cs.uni-duesseldorf.de/) with standard parameters [9]. For this specific analysis were considered only the results where more than 50% of the genome was assigned to the same taxonomy. (4) Approximately 400 broadly conserved proteins were used to extract phylogenetic signal using Phylophlan [10] with standard parameters. Results were separated in "high", "medium", "low" and "incomplete" confidence (Additional File 3: Table S2). The same method provided the high-resolution microbial tree of life with taxonomic annotations (Figure 1). Considering results from (3) and (4), GBs were manually taxonomically assigned according to the concordant result at the lower taxonomical level (Additional File 2: Table S2). Only 16 GBs were assigned to genus level, 21 at family level, 20 at class 47 at order and 2 at phylum level.

Functional roles of the microbial species.
The annotation (COG, KEGG, Pfam) of the genes assigned to the GBs can help to define the functional roles of the species identified. This analysis has been improved performing annotation of each GB using the SEED annotation subsystem [11]. Some categories relevant for functional characterization are reported in Additional File 8: Table S7. Hypergeometric analysis (Materials and methods) allowed the identification of the GBs with different functional roles. In Figure 3 only the GBs having P value lower than 0.05 in hypergeometric analysis and present in the top one eighth of each SEED/KEGG/COG functional category were reported (Additional Files 4-6: Tables S3-S5).

COG analysis.
Despite that COG classes describe very general functions, the analysis of the percentage of each COG class on each GB can highlight the functional roles of the species ( Figure S4). These findings can be further confirmed and more specific functions can be identified by means of metabolic reconstruction (KEGG, SEED) [11,12] and functional domains identification (Pfam) [13]. Considering the steps of the biogas production process, the most relevant COG classes to define functional roles are: (C) "energy production and conversion", (E) "amino acid transport and metabolism", (G) "carbohydrate transport and metabolism", (H) "coenzyme metabolism" and (I) "lipid metabolism". Class "C" has a crucial importance in the biogas microbial community, and GBs Eu05, Pr11, Sy01, Sy03 and Fi09 have a consistent fraction of their genes devoted to energy production ( Figure S4). Of course, for Methanothermobacter sp. DTU051 (Eu05) and for other Euryarchaeota, this genome characteristic is due to their fundamental role in methanogenesis, while for other GBs this result is less obvious considering also their low abundance in the microbial community. Category "C" can be considered as strictly related to "G" (carbohydrate transport and metabolism), where the genomes having the higher number of genes are Clostridia (Fi15, Fi49) and Sy04. This result was confirmed by SEED subsystem where analysis of "carbohydrates" category revealed also that Fi09 is the genome with the highest number of subsystem feature counts in this category (Additional File 5: Table S4 and Figure S7). Category "H" (coenzyme metabolism) is dominated by the archaeal species, but obviously this result is determined by the high number of genes involved in the biosynthesis of the methanogenesis coenzymes. Interestingly, by analyzing the functional properties of the Synergistetes (Sy02, Sy03, Sy06), it was found that a high fraction of genes are related to amino acid transport and metabolism (category "E") ( Figure S4). This can be intriguingly linked to their suggested role in amino acids degradation, that resulted in the production of short-chain fatty acids and sulfate used by terminal degraders such as the methanogens and sulphate-reducing bacteria [14]. The presence of numerous features annotated as branched chain amino acids ABC-transporters in SEED supports this evidence at least for Sy01, Sy02, Sy03 and Sy06. Class "I" (lipid metabolism) is dominated by Syntrophomonadaceae (Fi07, Fi08, Fi09), Pr01 and Alcaligenaceae (Pr05, Pr06, Pr10).

Figure S4
Genes' fraction belonging to COG classes on each GB. For each functional COG class the percentage of genes identified for each GB is represented (y axes). Acronyms are reported for GBs having functional enrichments, colored by taxonomic assignment to Phylum level.
Boxes are colored according to COG representation (ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/static/lists/homeCOGs.html). The bottom and top of each box refers to the first and third quartiles, the band inside the box is the median, the whiskers are 1.5 fold the inner quartile range, dots represent the outliers.

SEED analysis.
SEED analysis was performed on each GB in order to tentatively assign them a functional role within the biogas community. In the SEED subsystem, the "subsystem category distribution" (feature counts) can be investigated at three levels; in the description we refer to the first two levels as "general" and "sub-categories". Results revealed that species with the higher number of genes in the general "carbohydrates" category (those in the first quartile) (Additional File 5: Table S4) can have 160 or more feature counts in this category and these are distributed in an heterogeneous way in the SEED sub-categories (i.e. some species possess a higher number of feature counts in some sub-categories and lower in others). This suggests specialized roles for groups of species in carbohydrate utilization and metabolism. The number of feature counts in the "sub categories" was used to generate the Network Representation of the Biogas Functional Organization (NRBFO) described in the main text ( Figure 2). Briefly, this analysis converts the number of features in the SEED sub-categories (for each GB) in a network representation, and classifies the GBs considering their functional properties in "specialized" and "multifunctional". In the NRBFO the nodes represent the SEED classes and have a size proportional to the number of GBs present in the top oneeighth of the class. In some functional classes numerous GBs have no features and for this reason their size in the network is smaller.
As reported in the main text, most of the SEED categories for carbohydrate utilization (azure nodes in the network of Figure 2) are weakly connected suggesting specialized functions for the bacterial species. On the contrary, genomes having high number of genes in more SEED categories are included in nodes connected by thick edges in the network (i.e. "fermentation" and "fatty acids") which "share" 10 common genomes (Fi07, Fi08, Fi09, Fi12, Fi62, Fi68, Pr01, Pr02, Pr05, Pr10) (Additional File 5: Table S4). Among these, Fi07, Fi08 and Fi09 have numerous genes involved in "acetyl-CoA fermentation to butyrate", "butanol biosynthesis" (pathways included in the general SEED category "fermentation") and some are shared by the polyhydroxybutyrate metabolism" (a pathway included in the general SEED category "fermentation"). Diverse functions characterize the single species, for example only Fi07, Fi09, Fi12 and Pr05 have numerous genes in the "serine-glyoxylate cycle" suggesting that C1 assimilation involves butyryl-CoA and propionyl-CoA as intermediates (coming from polyxydroxybutyrate degradation). GBs on each SEED sub-class were ranked considering their number of features. In this way it is possible to verify if the GBs present in the first 1/8 th of each SEED category are also present in other SEED sub-categories. The GBs ranked in the first 1/8 th of numerous categories have a "multifunctional purpose" in the microbial community. This characteristic is common in Proteobacteria, in fact considering the "multifunctional purpose" GBs, we found three members of the Gammaproteobacteria (Pr01, Pr02, Pr04), one of Alcaligenaceae (Pr10), but also two Syntrophomonadaceae (Fi07, Fi09), the remaining belong to some different groups (Thermoanaerobacterales Fi34, Clostridiaceae Fi40, Peptococcaceae Fi65). This is at a certain extent due to their slightly higher number of genes but this is not the only reason, since the range of genome sizes in the microbial community is quite small and 80% of the genomes have a reduced genome size ranging between 1.35 and 2.57 Mbp (Additional File 3: Table S2). Other microbial groups, for example Clostridiaceae comprise both members with a "multifunctional behavior" (i.e. Fi40, Fi48, Fi51, Fi30) and others more "specialized" (i.e . Fi23, Fi58, Fi50, Fi14). There is not a correlation between abundance in the microbial community and "behavior", considering the ten most abundant genomes (in CSTR01a, 02a and 03a), Fi07 is "multifunctional" and Fi02, Fi06, Fi16 are more specialized.