An accurate and complete representation of an organism’s genome sequence and its functional annotation is requisite for systems biology studies and genome-scale engineering for synthetic biology. New technologies (for example, DNA sequencing), algorithms (for example, Prodigal), and biological features (for example, sRNA) expand our knowledge of genomes. However, the majority of genome sequences and annotations are rarely updated. Re-annotation has been suggested as an essential component for assaying and understanding systems biology data, and wiki-based solutions have been recommended to facilitate genome updates. In this study, we used the gene prediction program Prodigal to update the C. thermocellum ATCC 27405 gene models. The methodology, accuracy, and specificity improvements incorporated into Prodigal have been described. RNA-seq analysis and proteomic analysis performed using two-dimensional liquid chromatography (LC)-tandem mass spectrometry (MS/MS) offer the possibility of searching continuously updated genome databases with previously obtained information. This is an important advantage since it is likely that further improvements will be made to C. thermocellum gene models and annotations in the future.
For the first time, we were able to develop a protocol to obtain high-quality RNA from C. thermocellum grown on biomass and to enrich mRNA by subtractive hybridization so that greater than 99.6% of the reads did not map to the 5S, 16S, and 23S rRNA gene sequences. This protocol development opens up new possibilities for future RNA-seq studies of industrially relevant biomass fermentations. In our transition to a transcriptomic analytical platform based on RNA-seq, we sought to compare and contrast the relatively new technology of RNA-seq with an established custom-designed microarray. The cross-platform comparisons described here are among the best that we are aware of, with Spearman correlation coefficients ranging from 0.83 to 0.88 (Additional file 11).
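To illustrate how a cross-platform comparison of this kind can be computed, the following Python sketch calculates a Spearman rank correlation between two sets of per-gene expression estimates. It is not the analysis code used in this study: the function names and the example values are hypothetical, and ties are handled by assigning average ranks, which is a common convention assumed here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log2 expression estimates for five genes on each platform.
array_log2 = [8.1, 9.4, 7.2, 11.0, 10.3]
rnaseq_log2 = [5.9, 7.7, 6.1, 9.8, 8.2]
print(f"Spearman rho = {spearman(array_log2, rnaseq_log2):.2f}")
```

Because the coefficient depends only on ranks, it is insensitive to the different measurement scales of hybridization intensities and read counts, which is one reason it suits comparisons between array and RNA-seq data.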
Normalization strategies remove experimental noise from transcriptomic datasets prior to analyses used to determine biological differences in samples of interest. In microarray analyses, known biases include variation in dye incorporation rates and hybridization of material to the platform. In RNA-seq analyses, distinct biases relate to the depth of sequencing, the length and GC content of genes, and the mapping approach [39–42]. We found that the choice of normalization method for the RNA-seq data had dramatic effects on the final results (Figure 1, Additional file 12). KDMM and UQS gave similar distribution and clustering profiles. The KDMM normalization method was the preferred regime in this study as it provided more results in common with the array data. The KDMM method uses a scaling factor based on the geometric mean of the mapped reads, and the UQS method scales read count distributions so that the 75th percentiles are consistent after normalization. Both TMM and RPM performed poorly with our dataset. TMM gave the fewest genes (10) identified in the analysis of variance (ANOVA) as significantly differentially expressed, which was likely due to greater variation post-normalization (Additional file 12). TMM is a conservative normalization method that performs well where datasets have a consistent number of mapped reads across samples. The number of reads that mapped uniquely for given samples differed by as much as approximately 2-fold between the largest and smallest totals (Additional file 7). The C. thermocellum sample that was run with the PhiX sequencing control had the fewest reads that mapped to the genome, and inconsistencies in the number of mapped reads are likely to explain why the other methods performed better than TMM in this instance. Although the RPKM method is widely used, there are reports that it can bias estimates of differential expression [40, 43].
In this study, many genes that were identified as having the largest expression differences in the array and KDMM-normalized RNA-seq data, such as phosphate and sulfate transport genes, were not identified in significance testing using data normalized by the RPM method (Figure 2) or the similar RPKM method (Additional file 15).
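The scaling-based methods discussed above can be sketched in a few lines of Python. The example below is illustrative only and is not the pipeline used in this study: the function names are hypothetical, the upper-quartile step assumes that zero counts are excluded before the 75th percentile is taken, and the common target value (the mean of the per-sample upper quartiles) is likewise an assumption. KDMM and TMM are not shown.

```python
from statistics import mean, quantiles


def rpm(counts):
    """Reads per million: divide each gene count by the sample total, times 1e6."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads for a single gene."""
    return count * 1e9 / (total_mapped_reads * gene_length_bp)


def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample (assumed convention)."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_scale(samples):
    """Rescale each sample so that all samples share the same 75th percentile."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    target = mean(uq.values())  # assumed common target for every sample
    return {
        name: [c * target / uq[name] for c in counts]
        for name, counts in samples.items()
    }


# Hypothetical raw counts for five genes; sample "B" was sequenced about 2-fold deeper.
samples = {"A": [10, 20, 30, 40, 50], "B": [20, 40, 60, 80, 100]}
scaled = upper_quartile_scale(samples)
for name, values in scaled.items():
    print(name, values)
```

RPM and RPKM divide by the total number of mapped reads, so a small number of very highly expressed genes can shift the scaling factor for every other gene in a sample, whereas upper-quartile scaling depends on a single percentile of the count distribution and is less affected by those extremes.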
A number of studies have investigated RNA-seq, mapping methods, technical variability and reproducibility, normalization, and statistical testing methods. However, the field of RNA-seq is still relatively new and rapidly evolving. Differential expression measurements cannot be estimated with any confidence if a single biological replicate is employed. We employed two biological replicate fermentations on each biomass with samples taken at two time points, 12 hours and 37 hours postinoculation, but we expect that as sequencing costs continue to decrease, more biological replicates will be used to increase statistical power. This will allow for greater confidence in RNA-seq differential expression estimates. We used the NimbleGen call files for the microarray data, in which outlier detection is applied and unique probe intensity values are then summarized into one value for three technical array replicates for each biological replicate. We also employed the Kenward-Roger method to estimate the degrees of freedom in the mixed model analyses of the array data. The array analysis had considerably more statistical power (six expression estimates per gene per condition) compared to the RNA-seq dataset (two expression estimates per gene per condition). Our array data and RNA-seq data generally agreed, although different genes were categorized as significant or did not meet criteria for certain comparisons (Table 3, Additional file 15). We have made the datasets available so that others may compare and contrast different methods and analyses.
The yields of the major fermentation products were approximately 1.4-fold higher after 37 hours on Populus compared to switchgrass with normalization to the original biomass loading. The results of this study suggest more favorable growth of C. thermocellum when pretreated Populus was the substrate. Hemicelluloses present in these two lignocellulosic substrates differ, with glucuronoxylan found in hardwoods such as Populus, whereas grasses have predominantly arabinoxylans [44, 45]. The dilute acid pretreatment of each of the biomass substrates should solubilize the majority of hemicelluloses from the biomass, which are then removed by numerous wash steps. It is likely, however, that residual material is left, as well as remaining quantities of inhibiting compounds derived from the pretreatment and breakdown of the hemicelluloses. Examples of inhibitor byproducts from pretreatment include vanillin, hydroxymethylfurfural (HMF), furfural, and syringic acid. Lignin remains after pretreatment and can influence the accessibility of C. thermocellum to cellulose in the biomass substrate. The degree of cellulose polymerization after pretreatment may be another factor that differs between the two biomasses that could influence the fermentation performance [47, 48]. ICP-ES analysis also revealed differences in calcium removal efficiency (Table 3), with the majority of calcium removed during pretreatment of Populus while two-thirds remained after pretreatment of switchgrass. Alternatively, the biomass of the two species may have differed in the proportion of bound versus free calcium. The data suggest that under the pretreatment and process conditions used in this study the dilute acid pretreated Populus was a more accessible substrate for C. thermocellum fermentation compared to the pretreated switchgrass. Nonetheless, different pretreatment strategies and process conditions will be required for optimal conversion of different biomass feedstocks into different biofuels.
From both the microarray and the RNA-seq data we could identify C. thermocellum genes that were highly expressed during growth on these two complex biomass substrates. The cellotriose transport system (Cthe_0391-0393) was among the genes that were highly expressed on both substrates. Dextrins of length 3 to 5 are the preferred substrate of C. thermocellum, and this particular transporter is one of five involved in carbohydrate transport and the only one with a specificity for cellotriose. Three other systems transport glucans ranging from one to five glucose subunits with variable substrate affinities, and the last is specific for laminaribiose. High-level expression of the cellotriose transport system on Populus and switchgrass suggests the majority of the cellulose in these biomasses is processed by the C. thermocellum cellulosome into cellotriose. Other highly expressed genes included cellulosomal genes such as CipA (primary non-catalytic scaffoldin unit) and CelS (exoglucanase) (Table 2), which is in agreement with earlier data. Identifying highly expressed genes on various substrates is useful for strain engineering as it can expand the repertoire of available promoter sequences to facilitate enhanced cellulosic conversion.
More than 70 dockerin-containing proteins and potential cellulosome-related subunits have been identified in the C. thermocellum ATCC 27405 genome. Of interest in the current study were those genes that encode enzymes or proteins with functions related to cellulosome degradation of biomass and that were differentially regulated when C. thermocellum was grown on switchgrass compared to Populus (Additional file 15). For example, the genomic locus Cthe_1256-1257, which encodes a glycoside hydrolase and a carbohydrate-binding protein, exhibited higher expression on Populus at 12 hours compared to switchgrass (Table 4). Cthe_1257 may encode a protein with potential for cellulose binding, while Cthe_1256 lacks a signal peptide and is predicted to function as a β-glucosidase cleaving imported dextrins to yield β-D-glucose. These gene expression differences indicate a degree of specificity of the C. thermocellum response to different substrate availability while growing on the two biomasses. A glycoside hydrolase (Cthe_0624) was upregulated at 12 hours on switchgrass compared to 37 hours on switchgrass, with no differences identified on Populus. The Cthe_0624 amino acid sequence includes a signal peptide, and the enzyme has xylan and lichenan hydrolase activities as well as activity against crystalline cellulose.
Cellulosomes are naturally shed at the end of C. thermocellum growth, which was exploited by an affinity purification method and proteomics approach to show that C. thermocellum cellulosomal compositional changes occurred in response to different carbon sources. One surprising aspect of the current study was that larger differences in cellulosomal genes were not observed at the level of transcription for the two biomasses, which may be a reflection of the pretreatment procedure efficiently homogenizing the carbohydrate components of the two biomasses. Although C. thermocellum cannot use xylose, we observed that cellulosomal xylanases (Cthe_1398, Cthe_1838, Cthe_1963, Cthe_2590, and Cthe_2972) were among the most highly expressed genes (top 10%), suggesting this activity is important to access its preferred substrates. Xylanases showed little to no differential expression under the conditions assayed in this study despite bulk differences in xylose content of the two biomass substrates. An earlier study also reported highly expressed xylanase proteins on switchgrass, but high-level expression was not found for chemostat growth on purified cellulose, which shows the value in exploring a range of substrates and including those of industrial relevance. It is worth noting that the growth conditions, ‘omic’ level, and detection technologies were quite different between the current transcriptomic and earlier proteomic studies. Further systematic, integrated omic studies will be required to reveal more of this organism’s complex regulatory control mechanisms.
A putative Pst high-affinity phosphate transport system was expressed at a higher level on switchgrass compared to Populus 12 hours postinoculation, while one member of a sulfate transport system was upregulated on Populus. Other members of the sulfate transport system were highly differentially expressed in both the RNA-seq and array data; however, they did not pass the significance threshold for the RNA-seq. Differences in phosphorus and sulfur contents for pretreated biomasses were observed (Additional file 7); however, the defined medium (MTC) used to suspend each biomass substrate was identical and replete for phosphate and sulfate for pure cellulose fermentations. Phosphate and sulfate uptake genes were not upregulated during growth on pure cellulose or cellobiose. The corresponding binding proteins for ABC transporters often have high degrees of specificity that can distinguish the phosphate and sulfate oxyanions despite their similarities, although there is little data on these systems for C. thermocellum. Phosphate is required for C. thermocellum carbohydrate breakdown as the bacterium favors transport of cellodextrins over monomeric sugars. Cellodextrins enter C. thermocellum cells via ATP-dependent ABC transport systems and, once inside, a phosphate anion acts as a nucleophile for phosphorolytic cleavage [53, 54]. Multiple uncharacterized phosphate transport systems exist in the ATCC 27405 genome, including two putative Na+/Pi co-transporters (Cthe_0064 and Cthe_2810), a putative Pit transporter (Cthe_3000), as well as the Pst system differentially expressed between the two biomass substrates. The Pst transporter is typically only induced under conditions of phosphate starvation [55–58], which would indicate that cells in the switchgrass fermentations were limited in phosphate despite sufficient phosphate being provided in the MTC medium for growth of this organism on pure cellulose or cellobiose.
We observed a greater amount of divalent cations in the switchgrass compared to Populus, but at levels relatively insignificant compared to those provided in the MTC medium. Differences in medium ion composition may have influenced chemical speciation through formation of compounds such as insoluble metallophosphates, or disruption of ion exchange. Alternatively, one or more compounds generated during the switchgrass fermentation may have interfered with the C. thermocellum Na+/Pi symporter, leading to upregulation of the energetically more expensive high-affinity phosphate transport system. We observed approximately twice as much molybdenum in pretreated Populus versus switchgrass (Additional file 7), and factors such as this may have interfered with sulfate uptake and/or iron-sulfur proteins involved in metabolism. Differences in the expression of C. thermocellum anion transporters (phosphate and sulfate) may indicate part of a coordinated system for osmoadaptation and/or pH stasis, with variation in the ash composition of the two biomasses influencing the osmotic balance of the cell [59, 60]. Further studies are required to investigate the physiological status of C. thermocellum during industrially relevant fermentations.
Much higher expression from gene locus Cthe_1479-1481 occurred on switchgrass relative to Populus at both sampling time points. These genes are well conserved in bacteria and are currently annotated as members of the RND exporter family. This type of transport system is typically associated with Gram-negative bacteria, where it acts to remove toxic compounds from the cell. Inhibitory compounds are generated from the pretreatment processing of biomass substrates, and despite extensive washing of the pretreated biomass, residual compounds are likely to remain in low quantities. Thus, it is conceivable that a toxic compound liberated solely from switchgrass is removed from the cell via this efflux system, and this could be a possible target for strain development. A recent study identified arabitol, a putative fermentation inhibitor, as liberated during C. thermocellum fermentation on switchgrass. We also observed greater expression of genes related to urea uptake and metabolism at 37 hours compared to 12 hours on Populus (switchgrass failed to meet one or both of the threshold criteria), which coincided with increases in ethanol concentrations. A previous study showed that the largest response of C. thermocellum to ethanol shock treatment was in genes and proteins related to nitrogen uptake and metabolism.
Three spore-related genes that were upregulated at 37 hours compared to 12 hours on both biomasses indicated that cells were priming for transition to stationary phase. C. thermocellum ATCC 27405 is inefficient at sporulation, converting between 0 and 7% of resting cells into spores after stressor application. An agr-dependent quorum sensing mechanism for Clostridium acetobutylicum sporulation regulation and granulose formation has been recently described. However, early signal sensing and transduction mechanisms for sporulation in Clostridia are not as well defined as for Bacillus subtilis. Cthe_3383, among the most highly expressed of C. thermocellum genes during growth on biomass substrates (Additional files 14 and 15), is a newly predicted gene that encodes a small (40 aa) putative hypothetical protein (putative autoinducer prepeptide) and is adjacent to genes annotated as having roles in sporulation. At a separate genomic locus, we observed differential expression of two genes on the different biomass substrates (Cthe_1309 and Cthe_1310) (Additional file 15), with higher expression occurring during fermentation on Populus at 12 hours postinoculation. The latter gene is predicted to encode an accessory gene regulator B. Interestingly, a new addition to the genome, Cthe_3348, is directly downstream of Cthe_1310 and is predicted to encode a 54 amino acid AgrD-like peptide. The agrD gene was highly expressed but, unlike the two upstream genes, was not considered differentially expressed. The role, if any, that Cthe_3383 and Cthe_3348 play in signaling and the C. thermocellum sporulation regulatory cascade remains to be elucidated (for alignment see Additional file 14).