Table 4 Association with lignocellulose degradation based on different performance measures for the consensus PDMs M1 to M5

From: Inference of phenotype-defining functional modules of protein families for microbial plant biomass degraders

Module                                                  M1              M2              M3              M4              M5
Set of recurring modules (18 repetitions of analyses)
  Number of modules in set                              18              18              18              18              16
  Average F0.5-score in rankings, %                     95.2 ± 1.7      92.5 ± 1.1      88.9 ± 2.1      85.8 ± 1.3      84.9 ± 5.3
  Average rank                                          1.3 ± 0.57      2.4 ± 0.61      4.2 ± 1.5       6 ± 1.6         7.5 ± 3.4
Consensus PDM
  Size                                                  18              22              23              25              13
  Weight threshold used for classification, %           66.67           50.00           73.91           72.00           38.46
Performance evaluation
  LOO F0.5-score, %                                     96.2            94.1            89.6            82.5            82.1
  LOO recall, %                                         92.1            84.2            63.2            84.2            57.9
  LOO precision, %                                      97.2            97.0            100.0           82.1            91.7
  CV accuracy, %                                        96.7            93.8            87.7            89.6            84.3
  Estimated 95% confidence interval for CV accuracy, %  [91.69, 99.08]  [87.82, 97.35]  [80.42, 92.96]  [82.68, 94.42]  [76.57, 90.32]
  CV-MAC, %                                             95.4            91.3            81.2            88.2            77.2
  1. CV, cross-validation; LOO, leave-one-out; MAC, macro-accuracy; PDM, plant biomass degradation module.
  2. Each consensus PDM represents a set of recurring modules from 18 independent repetitions of our analysis (Figure 1) and contains all families that occurred in at least nine of these modules. The recurring modules used to build the PDMs were identified by finding modules with minimal pairwise distances from each other (see Methods). We report the average rank and average F0.5-score of these module sets (F0.5 puts stronger emphasis on precision; that is, it weights recall half as strongly as precision [54]; see Additional file 3: Section 3). "Size" gives the number of Pfam and/or CAZy families contained in a PDM. We computed recall, precision, and F0.5-scores for the individual PDMs in LOO validation. In addition, accuracies and estimated confidence intervals for 10-fold cross-validation (CV) were used to assess the generalization error more accurately. Following our previous study [28], we also computed the cross-validation macro-accuracy (CV-MAC) as the average of the true-positive (TP) and true-negative (TN) rates.
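
The measures described in the footnote can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`f_beta`, `macro_accuracy`, `consensus_families`) and the `min_count` parameter are hypothetical, the F-score uses the standard F-beta definition with beta = 0.5, and the consensus step only mirrors the "at least nine of these modules" rule stated above.

```python
from collections import Counter


def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_accuracy(tp, fn, tn, fp):
    """Average of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)  # fraction of positives classified correctly
    tnr = tn / (tn + fp)  # fraction of negatives classified correctly
    return (tpr + tnr) / 2


def consensus_families(modules, min_count=9):
    """Families occurring in at least `min_count` of the given modules.

    `modules` is an iterable of collections of family identifiers
    (hypothetical input format, assumed for this sketch).
    """
    counts = Counter(fam for module in modules for fam in set(module))
    return {fam for fam, n in counts.items() if n >= min_count}


# LOO precision and recall of M3 from the table, given as fractions.
print(round(100 * f_beta(1.000, 0.632), 1))  # 89.6
```

Applied to the rounded LOO precision and recall in the table, `f_beta` agrees with the reported LOO F0.5-scores to within rounding (for example, 89.6 for M3). The confidence intervals are not reproduced here because the footnote does not state how they were estimated.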