Inference of phenotype-defining functional modules of protein families for microbial plant biomass degraders

Table 4 Association with lignocellulose degradation based on different performance measures for the consensus PDMs M1 to M5

	Module
	M1	M2	M3	M4	M5
Set of recurring modules (18 repetitions of analyses)
Number of modules in set	18	18	18	18	16
Average F_0.5-score in rankings, %	95.2 ± 1.7	92.5 ± 1.1	88.9 ± 2.1	85.8 ± 1.3	84.9 ± 5.3
Average rank	1.3 ± 0.57	2.4 ± 0.61	4.2 ± 1.5	6 ± 1.6	7.5 ± 3.4
Consensus PDM
Size	18	22	23	25	13
Weight threshold used for classification, %	66.67	50.00	73.91	72.00	38.46
Performance evaluation
LOO F_0.5-score, %	96.2	94.1	89.6	82.5	82.1
LOO recall, %	92.1	84.2	63.2	84.2	57.9
LOO precision, %	97.2	97.0	100.0	82.1	91.7
CV accuracy, %	96.7	93.8	87.7	89.6	84.3
Estimated 95% confidence interval for CV accuracy	[91.69, 99.08]	[87.82, 97.35]	[80.42, 92.96]	[82.68, 94.42]	[76.57, 90.32]
CV-MAC, %	95.4	91.3	81.2	88.2	77.2

CV, cross-validation; LOO, leave-one-out; MAC, macro-accuracy; PDM, plant biomass degradation module.
Each consensus PDM represents a set of recurring modules from 18 independent repetitions of our analysis (Figure 1) and contains all families that occurred in at least nine of these modules. The recurring modules used to build the PDMs were identified as the modules with minimal pairwise distances from each other (see Methods). We report the average rank and average F_0.5-score of these module sets (F_0.5 puts stronger emphasis on precision; that is, it weights recall half as strongly as precision [54]; see Additional file 3: Section 3). "Size" gives the number of Pfam and/or CAZy families contained in a PDM. We computed recall, precision, and F_0.5-scores for the individual PDMs in LOO validation. In addition, accuracies and estimated confidence intervals from 10-fold cross-validation (CV) were used to assess the generalization error more accurately. Following our previous study [28], we also computed the cross-validation macro-accuracy (CV-MAC) as the average of the true-positive (TP) and true-negative (TN) rates.
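
The measures described in this legend can be illustrated with a minimal Python sketch. It is not the code used in the study. It assumes the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) with beta = 0.5, reads macro-accuracy as the mean of the TP and TN rates, and treats each module as a set of family identifiers. The function names `f_beta`, `macro_accuracy`, and `consensus_families` are hypothetical.

```python
from collections import Counter


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more strongly than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_accuracy(tp, fn, tn, fp):
    """Average of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return 0.5 * (tpr + tnr)


def consensus_families(modules, min_count=9):
    """Families that occur in at least `min_count` of the given modules.

    `modules` is an iterable of collections of family identifiers
    (an assumed representation, chosen for illustration only).
    """
    counts = Counter(fam for module in modules for fam in set(module))
    return {fam for fam, n in counts.items() if n >= min_count}


# LOO precision and recall for M2, taken from the table (as fractions).
print(round(100 * f_beta(0.970, 0.842), 1))
```

With the rounded LOO precision and recall for M2 (97.0% and 84.2%), this reproduces the tabulated F_0.5-score of 94.1%. Because the inputs are already rounded to one decimal place, recomputed scores for other modules may differ from the table in the last digit.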

ISSN: 2731-3654