
Table 4 Association with lignocellulose degradation based on different performance measures for the consensus PDMs M1 to M5

From: Inference of phenotype-defining functional modules of protein families for microbial plant biomass degraders

 

| Module | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|
| Set of recurring modules (18 repetitions of analyses) | | | | | |
| Number of modules in set | 18 | 18 | 18 | 18 | 16 |
| Average F0.5-score in rankings, % | 95.2 ± 1.7 | 92.5 ± 1.1 | 88.9 ± 2.1 | 85.8 ± 1.3 | 84.9 ± 5.3 |
| Average rank | 1.3 ± 0.57 | 2.4 ± 0.61 | 4.2 ± 1.5 | 6 ± 1.6 | 7.5 ± 3.4 |
| Consensus PDM | | | | | |
| Size | 18 | 22 | 23 | 25 | 13 |
| Weight threshold used for classification, % | 66.67 | 50.00 | 73.91 | 72.00 | 38.46 |
| Performance evaluation | | | | | |
| LOO F0.5-score, % | 96.2 | 94.1 | 89.6 | 82.5 | 82.1 |
| LOO recall, % | 92.1 | 84.2 | 63.2 | 84.2 | 57.9 |
| LOO precision, % | 97.2 | 97.0 | 100.0 | 82.1 | 91.7 |
| CV accuracy, % | 96.7 | 93.8 | 87.7 | 89.6 | 84.3 |
| Estimated 95% confidence interval for CV accuracy, % | [91.69, 99.08] | [87.82, 97.35] | [80.42, 92.96] | [82.68, 94.42] | [76.57, 90.32] |
| CV-MAC, % | 95.4 | 91.3 | 81.2 | 88.2 | 77.2 |

  1. CV, cross-validation; LOO, leave-one-out; MAC, macro-accuracy; PDM, plant biomass degradation module.
  2. Each consensus PDM represents a set of recurring modules from 18 independent repetitions of our analysis (Figure 1) and contains all families that occurred in at least nine of these modules. The recurring modules used to build the PDMs were identified by finding modules with minimal pairwise distances from each other (see Methods). We report the average rank and average F0.5-score of these module sets (the F0.5-score puts stronger emphasis on precision; that is, it weights recall half as strongly as precision [54]; see Additional file 3: Section 3). "Size" gives the number of Pfam and/or CAZy families contained in a PDM. We computed recall, precision, and the F0.5-score for the individual PDMs in LOO validation. In addition, accuracies and estimated confidence intervals from 10-fold cross-validation (CV) were used to assess the generalization error more accurately. Following our previous study [28], we also computed the cross-validation macro-accuracy (CV-MAC) as the average of the true-positive (TP) and true-negative (TN) rates.
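
The two derived measures in the footnotes can be illustrated with a minimal Python sketch. It assumes the standard F-beta definition with beta = 0.5 and takes the macro-accuracy as the mean of the true-positive and true-negative rates, as described above. The function names `f_beta` and `macro_accuracy` are illustrative and not taken from the paper, and the confusion-matrix counts in the last line are hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom


def macro_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Average of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# LOO precision, LOO recall and reported LOO F0.5-score (all in %),
# copied from the table above.
table = {
    "M1": (97.2, 92.1, 96.2),
    "M2": (97.0, 84.2, 94.1),
    "M3": (100.0, 63.2, 89.6),
    "M4": (82.1, 84.2, 82.5),
    "M5": (91.7, 57.9, 82.1),
}

for name, (p, r, reported) in table.items():
    recomputed = 100 * f_beta(p / 100, r / 100)
    print(f"{name}: recomputed {recomputed:.1f}, reported {reported}")

# Hypothetical confusion-matrix counts, not taken from the paper.
print(macro_accuracy(tp=8, fn=2, tn=9, fp=1))
```

Recomputing the F0.5-score from the rounded precision and recall values in the table agrees with the reported LOO F0.5-scores to within about 0.1 percentage points. The small gaps are consistent with the inputs being rounded to one decimal place.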