Figure 1 | Biotechnology for Biofuels

Figure 1

From: Inference of phenotype-defining functional modules of protein families for microbial plant biomass degraders

Identifying phenotype-related functional modules. We used protein sequences from 2,884 prokaryotic isolate species and 18 microbial communities, some of which are known to be involved in lignocellulose degradation. Known lignocellulose degradation abilities are indicated by phenotype labels (positive/negative: +/-). For the metagenomes, we considered only protein-coding sequences with predicted taxonomic origins assigned by a taxonomic binning method (PhyloPythia or PhyloPythiaS). We used HMMER to assign protein family annotations from Pfam and CAZy to all input sequences, and summarized the set of (meta-)genome annotations as a document collection for LDA (1). Each document represented a single genome or a metagenome bin, and was composed of protein family identifiers from a controlled vocabulary (Pfam, CAZy). We then inferred a probabilistic topic model (2). The topic variables of the model can be interpreted as potential functional modules, that is, sets of functionally coupled protein families [42]. We obtained 400 modules with diverse biochemical functions. Next, we defined genome-specific weights of the modules, and used these weights in conjunction with the phenotype labels to rank the modules according to their estimated relevance for the phenotype of lignocellulose degradation (3). As weights, we used the fraction of protein families in a module that were present in a certain genome or metagenome bin (completeness scores). We identified stable, high-ranking modules from independent repetitions of the analysis, and constructed consensus modules, which we named "plant biomass degradation modules" (PDMs) (4). These PDMs were found to cover different aspects of plant biomass degradation, such as degradation of cellulose, hemicellulose, and pectin. 
Moreover, the weights of the PDMs could be used to predict the biomass degradation abilities of organisms, and we were able to identify specific gene clusters in the input set of (meta-)genomes that reflected the protein family content of individual modules (5). The clusters thus provided evidence for the functional coherence of the modules by gene neighborhood.
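The completeness score used as the module weight in step (3) can be illustrated with a minimal sketch. It assumes that a module and a genome (or metagenome bin) are each available as a collection of protein family identifiers. The function name, the toy module, and the two example documents are hypothetical and serve only to show the arithmetic; they are not taken from the study.

```python
def completeness_score(module_families, genome_families):
    """Fraction of a module's protein families that are present in a genome.

    Both arguments are iterables of protein family identifiers
    (for example Pfam or CAZy identifiers). Duplicates are ignored.
    """
    module = set(module_families)
    if not module:
        # An empty module has no families to find; 0.0 is an assumed convention.
        return 0.0
    present = module & set(genome_families)
    return len(present) / len(module)


# Hypothetical toy module and documents, for illustration only.
module = {"GH5", "GH9", "CBM2", "PF00150"}
documents = {
    "genome_A": ["GH5", "GH9", "PF00150", "PF00005"],
    "bin_B": ["PF00005", "CBM2"],
}

# One weight per genome or metagenome bin for this module.
scores = {
    name: completeness_score(module, families)
    for name, families in documents.items()
}
```

In this toy case, genome_A contains three of the four module families and bin_B contains one, so the weights are 0.75 and 0.25. How such weights are combined with the phenotype labels to rank the modules is not specified in the caption and is therefore not sketched here.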
