Global transcriptome analysis of Clostridium thermocellum ATCC 27405 during growth on dilute acid pretreated Populus and switchgrass
Biotechnology for Biofuels volume 6, Article number: 179 (2013)
The thermophilic anaerobe Clostridium thermocellum is a candidate consolidated bioprocessing (CBP) biocatalyst for cellulosic ethanol production. The aim of this study was to investigate C. thermocellum genes required to ferment biomass substrates and to conduct a robust comparison of DNA microarray and RNA sequencing (RNA-seq) analytical platforms.
C. thermocellum ATCC 27405 fermentations were conducted with a 5 g/L solid substrate loading of either pretreated switchgrass or Populus. Quantitative saccharification and inductively coupled plasma emission spectroscopy (ICP-ES) for elemental analysis revealed composition differences between biomass substrates, which may have influenced growth and transcriptomic profiles. High quality RNA was prepared for C. thermocellum grown on solid substrates and transcriptome profiles were obtained for two time points during active growth (12 hours and 37 hours postinoculation). A comparison of two transcriptomic analytical techniques, microarray and RNA-seq, was performed and the data analyzed for statistical significance. Large expression differences for cellulosomal genes were not observed. We updated gene predictions for the strain and a small novel gene, Cthe_3383, with a putative AgrD peptide quorum sensing function was among the most highly expressed genes. RNA-seq data also supported different small regulatory RNA predictions over others. The DNA microarray gave a greater number (2,351) of significant genes relative to RNA-seq (280 genes when normalized by the kernel density mean of M component (KDMM) method) in an analysis of variance (ANOVA) testing method with a 5% false discovery rate (FDR). When a 2-fold difference in expression threshold was applied, 73 genes were significantly differentially expressed in common between the two techniques. Sulfate and phosphate uptake/utilization genes, along with genes for a putative efflux pump system were some of the most differentially regulated transcripts when profiles for C. thermocellum grown on either pretreated switchgrass or Populus were compared.
Our results suggest that a high degree of agreement in differential gene expression measurements between transcriptomic platforms is possible, but choosing an appropriate normalization regime is essential.
Clostridium thermocellum exhibits one of the highest rates of degradation of cellulosic substrates, which is facilitated by large extracellular multi-subunit enzyme systems termed cellulosomes [1–3]. It also has productivity advantages associated with thermophilic growth conditions. The bacterium has many attributes that are of interest for fundamental research. It also has the potential to be used in industrial-scale consolidated bioprocessing (CBP) (without added enzymes) of lignocellulosic biomass into ethanol for the displacement of petroleum products [4–8].
The C. thermocellum ATCC 27405 genome was originally submitted to the US Department of Energy (DOE) Joint Genome Institute (JGI; Walnut Creek, CA, USA) for sequencing by JHD Wu (University of Rochester, Rochester, NY, USA) and ME Himmel (National Renewable Energy Laboratory (NREL), Golden, CO, USA). The genome was sequenced using the Sanger method, made available in November 2003 [GenBank:CP000568], and represented the first genome sequence for this species. Repetitive sequences such as transposases and those present in cohesin domains made closing this genome challenging and the genome sequence was not finished until 2007. The C. thermocellum ATCC 27405 genes were originally predicted using two gene modeling programs, Glimmer and Critica, as part of a JGI annotation pipeline. The gene prediction program Prodigal was developed at Oak Ridge National Laboratory (ORNL; Oak Ridge, TN, USA) and incorporated into the JGI annotation pipeline after the initial ATCC 27405 genome annotation. We have found that its use has improved the gene prediction models for several bacteria [12, 13]. As a result, we applied Prodigal to the C. thermocellum genome sequence and report an update to the C. thermocellum ATCC 27405 genome annotation in this study.
Previous studies have suggested that C. thermocellum coordinates its cellulosomal subunit composition depending on the growth substrate [14, 15] and growth rates. Such studies are important for designer cellulosome engineering studies, developing efficient industrial enzyme cocktails, metabolic engineering, and synthetic biology endeavors. The monocot switchgrass (Panicum virgatum) and the woody dicot black cottonwood (Populus trichocarpa) have been proposed as model bioenergy crops for the USA. To gain insights into the C. thermocellum genes required for growth on either pretreated switchgrass or Populus, we generated whole genome DNA microarray profiles for its growth on biomass for the first time. We have also developed an effective method to isolate high quality RNA from C. thermocellum during these biomass fermentations with initial solid substrate loadings of 5 g/L.
RNA sequencing (RNA-seq) has recently been used for prokaryotic transcriptome analysis [19–21]. It has several advantages over a microarray platform such as greater dynamic range of reads relative to the intensity of probe signal on a microarray platform. The technology allows for the identification of new transcripts and transcriptional start sites at a higher resolution than would be available on a tiling array. RNA-seq technologies and statistical approaches for transcriptome analyses are developing rapidly [22–26], and debate remains over the ideal methods for data normalization and which statistical methods are most useful to help identify biologically-relevant effects.
A comprehensive comparison of different normalization methods for Illumina data has been reported previously. We tested five RNA-seq normalization strategies: trimmed mean of M component (TMM); reads per million (RPM) scaling; reads per kilobase per million (RPKM); upper quartile scaling (UQS); and a newly developed method called kernel density mean of M component (KDMM). Each is a scaling-type method, with scaling factors calculated from the geometric mean for KDMM, the arithmetic mean for RPM, the geometric mean divided by the arithmetic mean for TMM, and the 75th percentile for UQS. We compared the results from these different normalization methods with microarray data derived from the same cDNA using an established expression microarray platform to offer useful suggestions for future RNA-seq studies.
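As an illustration of what a scaling-type normalization does, the following is a minimal sketch of RPM, RPKM, and upper quartile scaling on toy read counts. The function names, gene names, and counts are hypothetical, and the UQS variant shown (rescaling each sample so that the 75th percentile of its nonzero counts matches the mean 75th percentile across samples) is an assumption, since implementations differ. KDMM and TMM are not reproduced here because their details are not given in this section.

```python
from statistics import quantiles


def rpm(counts):
    """Reads per million: scale each count by the sample's total mapped reads."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: RPM divided by gene length in kb."""
    per_million = rpm(counts)
    return {gene: per_million[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}


def upper_quartile(values):
    """75th percentile of the nonzero values (needs at least two of them)."""
    nonzero = [v for v in values if v > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]


def uqs(samples):
    """Upper quartile scaling (assumed variant): rescale every sample so that
    all samples share the same 75th percentile after normalization."""
    uq = {name: upper_quartile(counts.values()) for name, counts in samples.items()}
    target = sum(uq.values()) / len(uq)
    return {
        name: {gene: c * target / uq[name] for gene, c in counts.items()}
        for name, counts in samples.items()
    }


# Toy data: sample "b" was sequenced twice as deeply as sample "a".
toy = {
    "a": {"geneA": 100, "geneB": 300, "geneC": 600},
    "b": {"geneA": 200, "geneB": 600, "geneC": 1200},
}
print(rpm(toy["a"]))
print(uqs(toy))
```

In this toy case the two samples differ only in sequencing depth, so either scaling makes them identical. The methods diverge on real data, where a few highly expressed genes can dominate the total read count used by RPM and RPKM.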
Genome reannotation and updated microarray probe sequences
Improvements in DNA sequencing technologies, assembly, and gene prediction algorithms have facilitated continuous updates to sequenced genomes [12, 13, 27–29]. The latest annotation of the C. thermocellum ATCC 27405 genome has 3,175 candidate protein coding sequences (CDSs) predicted using Prodigal [GenBank:CP000568.2]. Previously reported proteomics data was used to confirm predicted gene models (see Additional file 1 for all peptides used for annotation confirmation and Additional file 2 for peptides used to update open reading frame (ORF) start sites and include new genes). Compared to the primary C. thermocellum ATCC 27405 annotation, 130 CDSs have been added or converted from pseudo genes into genes and 65 former CDSs were deleted or converted into pseudo genes (see Additional file 3 for examples of peptide hits used to update the genome annotation). Other modifications include the merging of two former genes into a single ORF and the modification of transcriptional start sites. A comparison of the annotation versions can be found at: http://genome.ornl.gov/microbial/cthe/. We have updated our microarray dataset to reflect the new gene numbers where probes originally designed to intergenic regions are now acknowledged to target a newly annotated gene (see Additional file 4 for microarray probe gene assignment update and Additional files 5 and 6 for details).
Of interest to us were any inherent compositional differences between the two biomasses. Quantitative saccharification of pretreated biomass samples revealed that there was more glucose in the Populus biomass (646 mg/g of biomass SD ± 13.6) compared to the switchgrass pretreated biomass (522.5 mg/g of biomass SD ± 9.3), which reflects the cellulose component of the two biomasses. The levels of xylose and arabinose differed between the biomasses, with almost four times the amount in switchgrass (xylose: 72.5 mg/g of biomass SD ± 0.4; arabinose: 7.1 mg/g of biomass SD ± 1.0) relative to Populus (xylose: 19.4 mg/g of biomass SD ± 1.6; arabinose: 1.6 mg/g of biomass SD ± 0.2). This is a reflection of the hemicellulose compositional differences, in particular the arabinoxylan component that predominates in the cell wall of switchgrass.
Samples of the pretreated biomasses used as substrates for the fermentations were analyzed by inductively coupled plasma emission spectroscopy (ICP-ES) for elemental compositional differences that could influence the fermentation performance. The pretreated material was also compared to untreated biomass to identify any elemental differences associated with the pretreatment procedure. In both biomasses the pretreatment procedure appeared to introduce chromium, molybdenum, and titanium, which were significantly (P <0.001) different between pretreated and unpretreated biomass (Additional file 7).
Calcium was present in the untreated material at levels of 1,388 mg/kg and 2,868 mg/kg of Populus and switchgrass, respectively. The calcium was removed more efficiently from the Populus biomass with the amount in the pretreated biomass decreasing to 34.3 mg/kg, whereas levels remained high after pretreatment in the switchgrass biomass (1,918 mg/kg) (Additional file 4). Pretreatment efficiently reduced the levels of potassium, magnesium, manganese, phosphorus, strontium, and zinc from both biomasses. The divalent cations barium, calcium, copper, iron, manganese, nickel, strontium, and zinc as well as the phosphorus and sulfur content were higher in pretreated switchgrass compared to Populus (Additional file 7). The only significantly different element that was higher in pretreated Populus relative to switchgrass was molybdenum, which was likely introduced during the pretreatment procedure (Additional file 7).
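The calcium figures above can be turned into a simple retention estimate. The sketch below divides the pretreated concentration by the untreated concentration for each biomass. The function name is hypothetical, and the calculation assumes the two concentrations are directly comparable, ignoring any change in total biomass mass during pretreatment.

```python
# Calcium concentrations (mg/kg) as reported in the text.
calcium_mg_per_kg = {
    "Populus": {"untreated": 1388.0, "pretreated": 34.3},
    "switchgrass": {"untreated": 2868.0, "pretreated": 1918.0},
}


def fraction_retained(biomass):
    """Pretreated concentration as a fraction of the untreated concentration."""
    values = calcium_mg_per_kg[biomass]
    return values["pretreated"] / values["untreated"]


for biomass in calcium_mg_per_kg:
    print(f"{biomass}: {fraction_retained(biomass):.1%} of calcium retained")
```

On these numbers roughly 2.5% of the calcium concentration remains in Populus and roughly 67% in switchgrass.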
Growth characterization on biomass
Inocula were similar at the beginning of the experiment, and cell count data taken at 12 hours and 37 hours postinoculation confirmed that the cultures were actively growing (Additional file 8). C. thermocellum doubled approximately 2.7 times (SD ± 0.8) and 4.4 times (SD ± 1.3) when grown on Populus at 12 hours and 37 hours postinoculation, respectively. Similarly, cell doubling data from switchgrass fermentations showed C. thermocellum doubled 3.6 times (SD ± 1.2) and 5.6 times (SD ± 0.90) at 12 hours and 37 hours postinoculation, respectively. These time points were chosen for analysis as they correlate with exponential and early stationary phase based on the fermentation product formation and cell counts (Additional file 8). Analysis of the fermentation medium over time revealed that C. thermocellum grown on pretreated Populus substrate had greater concentrations of the major fermentation products, ethanol and acetic acid, compared to growth on switchgrass, with approximately 1.6 times greater yields on the former substrate (Table 1). Ratios of the major fermentation products (acetic acid:ethanol) were 2.20 and 2.05 for Populus and switchgrass, respectively. Lactic acid is typically a minor fermentation product, and was present at less than 0.06 g/L in each of the fermentations. Quantitative saccharification revealed that between 58% and 64% of glucose present in the Populus biomass was utilized during the 37-hour fermentation compared to the range of approximately 43% to 49% glucose conversion that occurred during the fermentation using switchgrass as the substrate (Table 1).
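For readers who want to relate doublings to cell counts, the usual conversion is the base-2 logarithm of the ratio of later to initial counts. The sketch below assumes that this is how the reported doublings were derived, which the text does not state. The function names and the example counts are hypothetical.

```python
import math


def doublings(initial_count, later_count):
    """Number of population doublings between two cell counts."""
    if initial_count <= 0 or later_count <= 0:
        raise ValueError("cell counts must be positive")
    return math.log2(later_count / initial_count)


def fold_increase(n_doublings):
    """Fold increase in cell number implied by a number of doublings."""
    return 2.0 ** n_doublings


# Hypothetical counts: an 8-fold increase corresponds to 3 doublings.
print(doublings(1e6, 8e6))

# Reported mean doublings on Populus at 12 and 37 hours.
for n in (2.7, 4.4):
    print(f"{n} doublings is roughly a {fold_increase(n):.1f}-fold increase")
```

Under this assumption, 2.7 doublings corresponds to about a 6.5-fold increase in cell number and 4.4 doublings to about a 21-fold increase.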
Normalization and transcriptome analysis
RNA-seq is an alternative technology to microarrays in transcriptome analysis. This study sought to identify changes in the transcript profile of C. thermocellum ATCC 27405 grown on the substrates of pretreated Populus and switchgrass and whether these profiles were maintained across the two gene expression analytical platforms. RNA-seq reads gave a genome depth coverage of at least 580× (Additional file 9) and gave data for 3,370 genes (98.4% of the annotated genes). Fluorescence intensity values from the microarrays gave data on 3,157 genes (92.2% of the annotated genes). Data was collected for 3,088 genes on both platforms, constituting 90% of the 3,424 predicted genes (both protein coding and non-protein coding) in the latest version of the C. thermocellum ATCC 27405 genome. Correlations of log2 transformed fluorescent intensity counts for the array or log2 transformed read counts for the RNA-seq of the biological replicates for each condition gave Pearson R correlations ranging from 0.93 to 0.97 in the array and 0.94 to 0.98 in the RNA-seq (Additional file 10). A comparison of the array intensity values and RNA-seq read counts across the two transcriptomic techniques gave Spearman correlation coefficients ranging from 0.83 to 0.88 for each of the growth and substrate comparisons (Additional file 11).
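The two kinds of correlation used above can be sketched as follows: Pearson correlation on log2 transformed values for replicate agreement, and Spearman rank correlation for cross-platform agreement, where intensities and read counts are on different scales. This is a minimal standard-library illustration with made-up values. The helper names and the pseudocount of 1 added before the log transform are assumptions, not details taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def log2_counts(counts, pseudocount=1):
    """log2 transform with a pseudocount (an assumption) to tolerate zeros."""
    return [math.log2(c + pseudocount) for c in counts]


# Made-up values for five genes.
replicate_1 = [12, 250, 4000, 35, 900]
replicate_2 = [15, 230, 4400, 28, 1000]
array_intensity = [310.0, 2100.0, 15000.0, 500.0, 6000.0]

print(pearson(log2_counts(replicate_1), log2_counts(replicate_2)))
print(spearman(array_intensity, replicate_1))
```

Spearman correlation depends only on the ordering of genes, so it is unaffected by the different dynamic ranges of the two platforms.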
While microarray data normalization strategies are well established, an ideal method for RNA-seq normalization has yet to be defined. A comprehensive comparison of different normalization methods for Illumina data has been reported previously. In this study, we tested five RNA-seq normalization strategies, KDMM, TMM, RPM, RPKM, and UQS, and compared the results of differential gene expression to microarray data obtained from the same cDNA (Additional file 12). We found normalization had significant effects on the distribution of the read counts (Additional file 12). Expression profiles from the UQS and KDMM normalization schemes were almost indistinguishable and replicates had similar RNA-seq distributions (Additional file 12). The TMM normalization method appeared to introduce greater variation into this RNA-seq dataset compared to the pre-normalized data (Additional file 12). Both RPM and RPKM shifted the distribution of reads markedly, which influenced the final results by dramatically reducing their overall expression values (Figure 1, Additional file 12). The other three strategies had less of an effect in terms of shifting the overall distributions (Figure 1, Additional file 12).
Normalized intensity values were used to identify highly expressed genes (Additional file 13). A subset of cellulosomal and cellulose utilization-related genes with a range of expression levels from the array and RNA-seq data normalized with the KDMM strategy are given in Table 2. Featured in this list are the glycoside hydrolase Cel48S (Cthe_2089) and the scaffoldin CipA (Cthe_3077), which are known to be abundant proteins in the cellulosome. A gene (Cthe_0271) was highly expressed on both biomasses and is predicted to encode a protein with a putative function as a type 3A cellulose-binding protein. Cthe_0271 was identified in a recent study as the most highly expressed gene when C. thermocellum was grown on both cellulose and cellobiose, indicating that the data generated in this study is consistent with published reports of C. thermocellum grown on various substrates [14, 16]. Also highly expressed on both biomass substrates and at both time points was a transport system (Cthe_0391-0393), recently identified as specific for cellotriose transport. A non-cellulosomal highly expressed gene was Cthe_3383, which has a putative AgrD function (Additional file 13). This gene was a new addition to the C. thermocellum ATCC 27405 genome annotation and reflects the necessity of updating genomes as annotation algorithms improve and knowledge expands. We also compared mapped reads to bioinformatic predictions for small RNAs and in several cases found experimental data supported one model over another (Additional file 14). We expect these data will be useful to refine future sRNA models.
Altered gene regulation and validation of expression differences
A summary of genes that passed the significance threshold of a false discovery rate (FDR) of <0.05 in one of the comparisons is shown in Table 3. A complete list of altered gene regulation in each of the conditions is given in Additional file 15. We found that 2,351 genes were considered significantly different by microarray based on a threshold of an FDR of <0.05 in any one of the four growth or substrate comparisons. A 2-fold filter for differential gene expression narrows the differences between the technologies in terms of the numbers of genes identified as significantly differentially expressed (Table 3, Additional file 15). TMM normalization performed poorly based on statistical testing of the RNA-seq data, with only ten genes considered significantly differentially expressed and only five of these overlapping with the array. This is likely due in part to the greater variation seen post-normalization compared to the pre-normalized data (Additional file 12).
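The two significance criteria (FDR <0.05 and a 2-fold expression difference) can be sketched in code. This section does not name the FDR procedure applied to the ANOVA P values, so the Benjamini-Hochberg step-up adjustment used below is an assumption, and the function names, gene names, and values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(results, fdr=0.05, min_fold=2.0):
    """Genes passing both the FDR threshold and the fold-change filter.

    results maps gene -> (raw P value, log2 fold change).
    """
    genes = list(results)
    q_values = benjamini_hochberg([results[g][0] for g in genes])
    cutoff = math.log2(min_fold)
    return {
        g
        for g, q in zip(genes, q_values)
        if q < fdr and abs(results[g][1]) >= cutoff
    }


# Hypothetical test results for four genes.
toy_results = {
    "gene_a": (0.01, 1.5),
    "gene_b": (0.04, -1.2),
    "gene_c": (0.03, 0.5),   # significant, but less than 2-fold
    "gene_d": (0.005, -3.0),
}
print(sorted(significant_genes(toy_results)))
```

Applying the fold-change filter after the FDR step, as here, removes genes that are statistically significant but change by less than 2-fold, which is why it narrows the gap between a high-powered and a lower-powered platform.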
RNA-seq data normalized by RPM, UQS, or KDMM identified 117, 104, and 192 significantly differentially expressed genes, respectively. Of the significantly differentially expressed genes from the RPM method, 50 were in common with the array; however, genes in the array that had the greatest expression differences were not detected in the RPM normalized data (Figure 2). UQS normalization gave 104 genes that were differentially expressed. Forty-one of these genes were in common with the array. RNA-seq data normalized with the KDMM strategy had the highest number of genes (73) in common with the previously validated array (Table 4). Six genes exhibiting a broad expression range from samples harvested 12 hours postinoculation were selected for confirmation by RT-qPCR. Expression data from the array or KDMM normalized RNA-seq data compared to RT-qPCR data had correlation coefficient values of R² = 0.92 and 0.95, respectively (Additional file 16), thus confirming that the data from both analytical platforms were of high quality.
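Counting the genes shared between platforms amounts to a set intersection over the lists of significant genes. The following is a minimal sketch with placeholder gene names. The function name and the returned fields are hypothetical and do not reflect the study's actual gene lists.

```python
def platform_overlap(array_genes, rnaseq_genes):
    """Compare two collections of significant genes from different platforms."""
    array_genes, rnaseq_genes = set(array_genes), set(rnaseq_genes)
    shared = array_genes & rnaseq_genes
    return {
        "shared": shared,
        "array_only": array_genes - rnaseq_genes,
        "rnaseq_only": rnaseq_genes - array_genes,
        # Share of the RNA-seq calls that the array also made.
        "fraction_of_rnaseq_shared": (
            len(shared) / len(rnaseq_genes) if rnaseq_genes else 0.0
        ),
    }


# Placeholder gene lists for illustration only.
array_hits = ["gene_1", "gene_2", "gene_3", "gene_4"]
rnaseq_hits = ["gene_2", "gene_4", "gene_5"]
summary = platform_overlap(array_hits, rnaseq_hits)
print(sorted(summary["shared"]), summary["fraction_of_rnaseq_shared"])
```

Reporting the shared fraction alongside the raw count helps when the normalization methods being compared return different numbers of significant genes.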
Growth stage-specific changes in gene expression
C. thermocellum expression profiles can vary based on growth rate. No genes showed consistent patterns of regulation at 12 hours relative to 37 hours postinoculation on both substrates using stringent criteria, which may reflect relative differences in growth (Additional file 8). By 37 hours there were eight genes consistently expressed at levels 2-fold or greater than at the earlier sampling time point, irrespective of the substrate. These eight genes included those encoding proteins related to spore formation (Cthe_0964 (also lysine biosynthesis), Cthe_1084, and Cthe_1759), a glycosyltransferase (Cthe_1085), and genes involved in nucleotide and amino sugar metabolism (Cthe_2642 and Cthe_2644) (Table 4). Other genes affected in the growth stage comparison include an anti-sigma factor (Cthe_1437) and a putative ABC transporter subunit (Cthe_2573). These genes potentially contribute to the transition of the cells from log to stationary phase.
Substrate-specific gene expression
Comparison of differentially expressed genes permitted the identification of genes that were only affected on one of the biomass substrates. Six genes were upregulated during growth on Populus relative to switchgrass 12 hours after inoculation, with the patterns of expression consistent across the two analytical platforms. These genes met the FDR <0.05 and ≥2-fold difference in gene expression requirements, and included genes encoding glycoside hydrolase and CenC carbohydrate-binding proteins (Cthe_1256 and Cthe_1257) (Table 4). Two genes at a genomic locus, encoding a predicted Radical SAM domain protein and an AgrB protein (Cthe_1309 and Cthe_1310), were upregulated on Populus at 12 hours relative to switchgrass. Interestingly, these two genes are upstream of a new addition to the C. thermocellum genome with predicted AgrD functions (Cthe_3348), suggesting signaling or bacteriocin-like production specific to the substrate. Gene Cthe_2531 is predicted to be involved in sulfate transport and was upregulated when C. thermocellum was grown on Populus. Three other genes from this cluster were also upregulated but did not pass the significance threshold in the RNA-seq analysis. Conversely, on switchgrass, three genes related to phosphate transport (Cthe_1603, Cthe_1604, and Cthe_1605) were upregulated. These genes are part of a putative high affinity phosphate transport system that we have identified only in strain ATCC 27405, and this system is distinct from the common Na/Pi symporters found in all C. thermocellum strains examined to date. One Na/Pi symporter (Cthe_0064) in C. thermocellum ATCC 27405 was among the top 5% most highly expressed genes on both biomasses (Additional file 9).
Two genes (Cthe_1480 and Cthe_1481) with hypothetical function annotations were upregulated on switchgrass and met the significance criteria. The expression patterns of these genes were maintained in the comparison at 37 hours postinoculation. They have a general function prediction as members of the RND family of exporters and are well conserved in bacteria. Interestingly, none of these genes were identified in a study of C. thermocellum ATCC 27405 grown on pure cellulose or pure cellobiose, suggesting that the regulation of these genes was specific to the lignocellulosic biomasses used in the current study.
Differential expression of cellulosome genes and central carbon metabolism
Consistent expression patterns for cellulosomal-related genes identified in both the RNA-seq (KDMM) and array included two known cellulosome genes. Cthe_0624 (CelJ), encoding a glycoside hydrolase family 9 enzyme with predicted endoglucanase functions, was upregulated in early growth stages on switchgrass relative to the later growth stage, while no differences were identified on Populus. This protein was reported as highly abundant in a proteome study of C. thermocellum grown on cellobiose, cellulose, and switchgrass. Cthe_1890, encoding a protein with a type 1 dockerin domain, had higher expression in the later growth stage on switchgrass relative to the 12-hour sample. A gene (Cthe_1256), predicted to encode a glycoside hydrolase family 3 enzyme that converts a variety of glucans into β-D glucose, was upregulated on Populus relative to switchgrass at 12 hours postinoculation.
An accurate and complete representation of an organism’s genome sequence and its functional annotation is requisite for systems biology studies and genome-scale engineering for synthetic biology. New technologies (for example DNA sequencing), algorithms (for example Prodigal), and biological features (for example sRNA) expand our knowledge of genomes. However, the majority of genome sequences and annotations are rarely updated. Re-annotation has been suggested as an essential component for assaying and understanding systems biology data and wiki-based solutions have been recommended to facilitate genome updates. In this study, we used the gene prediction program Prodigal to update the C. thermocellum ATCC 27405 gene models. The methodology, accuracy, and specificity improvements incorporated into Prodigal have been described. RNA-seq analysis and proteomic analysis performed using two-dimensional liquid chromatography (LC)-tandem mass spectrometry (MS/MS) offers the possibility of searching continuously updated genome databases with previously obtained information. This is an important advantage since it is likely that further improvements will be made to C. thermocellum gene models and annotations in the future.
We were able to develop a protocol to obtain high quality RNA from C. thermocellum grown on biomass for the first time and to enrich mRNA by subtractive hybridization so that greater than 99.6% of the reads did not map to the 5S, 16S, and 23S rRNA gene sequences. This protocol development opens up new possibilities for future RNA-seq studies of industrially-relevant biomass fermentations. In our transition to a transcriptomic analytical platform based on RNA-seq we sought to compare and contrast the relatively new technology of RNA-seq to an established custom designed microarray. The cross-platform comparisons described here are among the best that we are aware of, with Spearman correlation coefficients ranging from 0.83 to 0.88 (Additional file 11).
Normalization strategies remove experimental noise from transcriptomic datasets prior to analyses used to determine biological differences in samples of interest. In microarray analyses, known biases include variation in dye incorporation rates and hybridization of material to the platform. In RNA-seq analyses distinct biases relate to the depth of sequencing, the length and GC content of genes, and mapping approach [39–42]. We found that normalization of the RNA-seq data had dramatic effects on the final results of our data (Figure 1, Additional file 12). KDMM and UQS gave similar distribution and clustering profiles. The KDMM normalization method was the preferred regime in this study as it provided more results in common with the array data. The KDMM method uses a scaling factor based on the geometric mean of the mapped reads and the UQS method scales read count distributions so that the 75th percentiles are consistent after normalization. Both TMM and RPM performed poorly with our dataset. TMM gave the fewest genes (10) identified in the analysis of variance (ANOVA) as significantly differentially expressed, which was likely due to greater variation post-normalization (Additional file 12). TMM is a conservative normalization method that performs well where datasets have a consistent number of mapped reads across samples. The number of reads that mapped uniquely for given samples differed as much as approximately 2-fold between the largest and smallest totals (Additional file 7). The C. thermocellum sample that was run with the PhiX sequencing control had the fewest reads that mapped to the genome, and inconsistencies in the number of mapped reads are likely to explain why the other methods performed better than TMM in this instance. Although the RPKM method is widely used, there are reports that it can bias estimates of differential expression [40, 43].
In this study, many genes which were identified as having the largest expression differences in the array and KDMM normalized RNA-seq data, such as phosphate and sulfate transport genes, were not identified in significance testing using data normalized by the RPM (Figure 2) or similar RPKM method (Additional file 15).
A number of studies have investigated RNA-seq, mapping methods, technical variability and reproducibility, normalization, and statistical testing methods. However, the field of RNA-seq is still relatively new and rapidly evolving. Differential expression measurements cannot be estimated with any confidence if a single biological replicate is employed. We employed two biological replicate fermentations on each biomass with samples taken at two time points, 12 hours and 37 hours postinoculation, but we expect that as sequencing costs continue to decrease, more biological replicates will be used to increase statistical power. This will allow for greater confidence in RNA-seq differential expression estimates. We used the NimbleGen call files for the microarray data, which use outlier detection and then summarize unique probe intensity values into one value for each of three technical array replicates per biological replicate. We also employed the Kenward-Roger method to estimate the degrees of freedom in the mixed model analyses of the array data. The array analysis had considerably more statistical power (six expression estimates per gene per condition) compared to the RNA-seq dataset (two expression estimates per gene per condition). Our array data and RNA-seq data generally agreed, although different genes were categorized as significant or did not meet criteria for certain comparisons (Table 3, Additional file 15). We have made the datasets available so that others may compare and contrast different methods and analyses.
The yields of the major fermentation products were approximately 1.4-fold higher after 37 hours on Populus compared to switchgrass with normalization to the original biomass loading. The results of this study suggest more favorable growth of C. thermocellum when pretreated Populus was the substrate. Hemicelluloses present in these two lignocellulosic substrates differ, with glucuronoxylan in hardwoods such as Populus while grasses have predominantly arabinoxylans [44, 45]. The dilute acid pretreatment of each of the biomass substrates should solubilize the majority of hemicelluloses from the biomass, which are then removed by numerous wash steps. It is likely, however, that residual material is left, as well as remaining quantities of inhibiting compounds derived from the pretreatment and breakdown of the hemicelluloses. Examples of inhibitor byproducts from pretreatment include vanillin, hydroxymethylfurfural (HMF), furfural, and syringic acid. Lignin remains after pretreatment and can influence the accessibility of C. thermocellum to cellulose in the biomass substrate. The degree of cellulose polymerization after pretreatment may be another factor that differs between the two biomasses that could influence the fermentation performance [47, 48]. ICP-ES analysis also revealed differences in calcium removal efficiency (Table 3), with the majority of calcium removed during pretreatment of Populus while two-thirds remained after pretreatment of switchgrass. The data suggests that under the pretreatment and process conditions used in this study the dilute acid pretreated Populus was a more accessible substrate for C. thermocellum fermentation compared to the pretreated switchgrass. Alternatively, the species biomass may have differed in the proportion of bound versus free calcium. Nonetheless, different pretreatment strategies and process conditions will be required for optimal conversion of different biomass feedstocks into different biofuels.
From both the microarray and the RNA-seq data we could identify C. thermocellum genes that were highly expressed when grown on these two complex biomass substrates. The cellotriose transport system (Cthe_0391-0393) was among genes that were highly expressed on both substrates. Dextrins of length 3 to 5 are the preferred substrate of C. thermocellum, and this particular transporter is one of five involved in carbohydrate transport and the only one with a specificity for cellotriose. Three other systems transport glucans ranging from one to five glucose subunits with variable substrate affinities and the last is specific for laminaribiose. High-level expression of the cellotriose transport system on Populus and switchgrass suggests the majority of the cellulose in these biomasses is processed by the C. thermocellum cellulosome into cellotriose. Other highly expressed genes included cellulosomal genes such as CipA (primary non-catalytic scaffoldin unit) and CelS (exoglucanase) (Table 2), which is in agreement with earlier data. Identifying highly expressed genes on various substrates is useful for strain engineering as it can expand the repertoire of available promoter sequences to facilitate enhanced cellulosic conversion.
More than 70 dockerin-containing proteins and potential cellulosome-related subunits have been identified in the C. thermocellum ATCC 27405 genome. Of interest in the current study were genes that encode enzymes or proteins with functions related to cellulosome degradation of biomass and that were differentially regulated when C. thermocellum was grown on switchgrass compared to Populus (Additional file 15). For example, the genomic locus Cthe_1256-1257 that encodes a glycoside hydrolase and a carbohydrate-binding protein exhibited higher expression on Populus at 12 hours compared to switchgrass (Table 4). Cthe_1257 may encode a protein with potential for cellulose binding, while Cthe_1256 lacks a signal peptide and is predicted to function as a β-glucosidase cleaving imported dextrins to yield β-D glucose. These gene expression differences indicate a degree of specificity of the C. thermocellum response to different substrate availability while growing on the two biomasses. A glycoside hydrolase (Cthe_0624) was upregulated at 12 hours on switchgrass compared to 37 hours on switchgrass with no differences identified on Populus. The glycoside hydrolase (Cthe_0624) amino acid sequence includes a signal peptide and has xylan and lichenan hydrolase activities as well as activity against crystalline cellulose.
Cellulosomes are naturally shed at the end of C. thermocellum growth, a property that was exploited by an affinity purification method and proteomics approach to show that C. thermocellum cellulosomal composition changed in response to different carbon sources. One surprising aspect of the current study was that larger differences in cellulosomal genes were not observed at the level of transcription for the two biomasses, which may be a reflection of the pretreatment procedure efficiently homogenizing the carbohydrate components of the two biomasses. Although C. thermocellum cannot use xylose, we observed that cellulosomal xylanases (Cthe_1398, Cthe_1838, Cthe_1963, Cthe_2590, and Cthe_2972) were among the most highly expressed genes (top 10%), suggesting this activity is important to access its preferred substrates. Xylanases showed little to no differential expression under the conditions assayed in this study despite bulk differences in xylose content of the two biomass substrates. An earlier study also reported highly expressed xylanase proteins on switchgrass, but high-level expression was not found for chemostat growth on purified cellulose, which shows the value in exploring a range of substrates and including those of industrial relevance. It is worth noting that the growth conditions, ‘omic’ level, and detection technologies were quite different between the current transcriptomic and earlier proteomic studies. Further systematic, integrated omic studies will be required to reveal more of this organism’s complex regulatory control mechanisms.
A putative Pst high-affinity phosphate transport system was expressed at a higher level on switchgrass compared to Populus 12 hours postinoculation, while one member of a sulfate transport system was upregulated on Populus. Other members of the sulfate transport system were highly differentially expressed in both the RNA-seq and array; however, they did not pass the significance threshold for the RNA-seq. Differences in phosphorus and sulfur contents for pretreated biomasses were observed (Additional file 7); however, the defined medium (MTC) used to suspend each biomass substrate was identical and replete for phosphate and sulfate for pure cellulose fermentations. Phosphate and sulfate uptake genes were not upregulated during growth on pure cellulose or cellobiose. The corresponding binding proteins for ABC transporters often have high degrees of specificity that can distinguish the phosphate and sulfate oxyanions despite their similarities, although there is little data on these systems for C. thermocellum. Phosphate is required for C. thermocellum carbohydrate breakdown as the bacterium favors transport of cellodextrins over monomeric sugars. Cellodextrins enter C. thermocellum cells via ATP-dependent ABC transport systems and, once inside, a phosphate anion acts as a nucleophile for phosphorolytic cleavage [53, 54]. Multiple uncharacterized phosphate transport systems exist in the ATCC 27405 genome including two putative Na+/Pi co-transporters (Cthe_0064 and Cthe_2810), a putative Pit transporter (Cthe_3000), as well as the Pst system differentially expressed between the two biomass substrates. The Pst transporter is typically only induced under conditions of phosphate starvation [55–58], which would indicate that cells in the switchgrass fermentations were limited in phosphate despite sufficient phosphate being provided in the MTC medium for growth of this organism on pure cellulose or cellobiose.
We observed a greater amount of divalent cations in the switchgrass compared to Populus, but at levels relatively insignificant compared to those provided in the MTC medium. Differences in medium ion composition may have influenced chemical speciation through formation of compounds such as insoluble metallophosphates, or disruption of ion exchange. Alternatively, one or more compounds generated during the switchgrass fermentation may have interfered with the C. thermocellum Na+/Pi symporter, leading to upregulation of the energetically more expensive high-affinity phosphate transport system. We observed approximately twice as much molybdenum in pretreated Populus versus switchgrass (Additional file 7), and factors such as this may have interfered with sulfate uptake and/or iron-sulfur proteins involved in metabolism. Differences in the expression of C. thermocellum anion transporters (phosphate and sulfate) may indicate part of a coordinated system for osmoadaptation and/or pH stasis, with variation in the ash composition of the two biomasses influencing the osmotic balance of the cell [59, 60]. Further studies are required to investigate the physiological status of C. thermocellum during industrially relevant fermentations.
Much higher expression from gene locus Cthe_1479-1481 occurred on switchgrass relative to Populus at both sampling time points. These genes are well conserved in bacteria and are currently annotated as members of the RND exporter family. This type of transport system is typically associated with Gram-negative bacteria, where it acts to remove toxic compounds from the cell. Inhibitory compounds are generated from the pretreatment processing of biomass substrates, and despite extensive washing of the pretreated biomass, residual compounds are likely to remain in low quantities. Thus it is conceivable that a toxic compound liberated solely from switchgrass is removed from the cell via this efflux system, and this could be a possible target for strain development. A recent study identified arabitol, a putative fermentation inhibitor, as liberated during C. thermocellum fermentation on switchgrass. We also observed greater expression in genes related to urea uptake and metabolism at 37 hours compared to 12 hours on Populus (switchgrass failed to meet one or both of the threshold criteria), which coincided with increases in ethanol concentrations. A previous study showed that the largest response of C. thermocellum to ethanol shock treatment was in genes and proteins related to nitrogen uptake and metabolism.
Three spore-related genes upregulated at 37 hours compared to 12 hours on both biomasses indicated that cells were priming for transition to stationary phase. C. thermocellum ATCC 27405 is inefficient at sporulation, converting between 0 and 7% of resting cells into spores after stressor application. An agr-dependent quorum sensing mechanism for Clostridium acetobutylicum sporulation regulation and granulose formation has been recently described. However, early signal sensing and transduction mechanisms for sporulation in Clostridia are not as well defined as for Bacillus subtilis. Cthe_3383, among the most highly expressed of C. thermocellum genes during growth on biomass substrates (Additional files 14 and 15), is a newly predicted gene that encodes a small (40 aa) putative hypothetical protein (putative autoinducer prepeptide) and is adjacent to genes annotated as having roles in sporulation. At a separate genomic locus we observed differential gene expression for two genes on the different biomass substrates (Cthe_1309 and Cthe_1310) (Additional file 15), with higher expression occurring during fermentation on Populus at 12 hours postinoculation. The latter gene is predicted to encode an accessory gene regulator B. Interestingly, a new addition to the genome, Cthe_3348, is directly downstream of Cthe_1310 and is predicted to encode a 54 amino acid AgrD-like peptide. The agrD gene was highly expressed but, unlike the two upstream genes, was not considered differentially expressed. The role, if any, that Cthe_3383 and Cthe_3348 play in signaling and the C. thermocellum sporulation regulatory cascade remains to be elucidated (for alignment see Additional file 14).
The results suggest a high degree of concordance in differential gene expression measurements between the two transcriptomic platforms. We observed few transcriptomic differences for C. thermocellum cellulosome-related genes for cells fermenting either dilute acid pretreated Populus or switchgrass, which may indicate that under this pretreatment regime they sense and respond to similar carbohydrate profiles during active growth. We observed differential expression of sulfate- and phosphate-related genes, which may point to aspects of metabolism that merit more consideration during industrially relevant fermentations. We have identified new and highly expressed genes, and our update to the ATCC 27405 genome will be useful for follow-on studies.
Microarrays and RNA-seq each have biases that can interfere with differential expression determinations, and in this study RNA-seq normalization methods dramatically affected downstream analyses. RNA-seq offers important advantages for transcriptomic profiling and it will likely replace microarrays as a preferred method. However, DNA microarray testing and analysis has evolved over many years through studies such as the MicroArray Quality Control (MAQC) project [65, 66], and further studies and cost reductions in sequencing are similarly required to develop RNA-seq analyses.
A gene modeling program termed Prodigal was applied to the C. thermocellum ATCC 27405 genome sequence, followed by a round of manual curation in combination with proteomics data analysis to ensure no peptide evidence existed for any deleted genes (data derived from Yang et al. and reported in Additional files 1,2,3). A six-frame translation generated predicted ORFs, and a search of available peptide data against these ORFs resulted in three groups: 1) peptides that fall within an existing gene call; 2) those that have one end within an existing gene call and the other outside, which were used to correct the start and end coordinates for a gene; and 3) those that were not within an existing gene and were used to add a new gene. In addition, the following criteria were assessed: whether the peptide hit is unique or matches several places in the genome, number of times the peptide was detected, peptide BLAST percent identity and length of match, transcription level via RNA-seq data from this study at the start of a gene/ORF, 100 bp upstream and average coverage, Prodigal score for coding potential, start codon used, Prodigal score for ribosome binding site (RBS), manually checked RBS, and similar sequences and their start sites identified by searching the ORF with BLAST against the National Center for Biotechnology Information (NCBI) non-redundant database. Predicted genes were annotated using an automated annotation pipeline, as described previously. The current annotation and a comparison to the earlier versions can be found at http://genome.ornl.gov/microbial/cthe/.
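The three-way grouping of peptide evidence described above can be illustrated with a minimal sketch. It assumes that peptides and gene calls are reduced to inclusive coordinate intervals on a single strand; the function name `classify_peptide`, the group labels, and the example coordinates are hypothetical and are not taken from the annotation pipeline used in this study, which also weighed reading frame, peptide uniqueness, and the other criteria listed.

```python
from typing import List, Tuple

# (start, end) in inclusive genome coordinates; strand and frame are ignored
# here, which is a simplifying assumption of this sketch.
Interval = Tuple[int, int]


def classify_peptide(peptide: Interval, genes: List[Interval]) -> str:
    """Assign a mapped peptide to one of the three groups described in the text."""
    p_start, p_end = peptide
    one_end_inside = False
    for g_start, g_end in genes:
        start_inside = g_start <= p_start <= g_end
        end_inside = g_start <= p_end <= g_end
        if start_inside and end_inside:
            return "within_gene"        # group 1: supports an existing gene call
        if start_inside != end_inside:
            one_end_inside = True       # group 2 candidate: crosses a gene boundary
    # group 2 suggests corrected start/end coordinates; group 3 suggests a new gene
    return "extends_gene" if one_end_inside else "novel"


# Hypothetical gene calls and peptides, for illustration only.
gene_calls = [(100, 400), (600, 900)]
for pep in [(150, 180), (380, 420), (450, 500)]:
    print(pep, classify_peptide(pep, gene_calls))
```

In practice such a classification would only nominate candidates; the manual curation steps described above decide whether a coordinate change or new gene is accepted.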
The biomass substrates used in the fermentations were dilute acid pretreated switchgrass (Panicum virgatum cultivar Alamo; SWG) and dilute acid pretreated Populus (Populus trichocarpa x Populus deltoides F1 hybrid; POP). The biomasses were milled to -20/+80 mesh size and pretreated with dilute sulfuric acid at 0.050 g/g of dry biomass at 190°C for 1 minute residence time (flow-through mode) and 25% (w/w) total solids using a Sunds reactor at the NREL [14, 67]. The pretreated biomasses were washed with Milli-Q H2O (Millipore, Billerica, MA, USA) until less than 0.1 g/L glucose could be detected in the wash eluent, and dried prior to fermentations.
Compositional analysis of biomass
Trace elements were determined by ICP-ES. The samples for ICP-ES were prepared using a method based on the United States Environmental Protection Agency (USEPA) SW-846 Method 3050A. Pretreated and unpretreated biomass samples were oven dried and a 2 g sample digested by sequentially heating in nitric acid, hydrogen peroxide, and hydrochloric acid. The samples were filtered through Whatman 41 filter paper (Whatman, Maidstone, UK) and the volume made up to 50 mL with deionized (DI) water. Aliquots (5 mL) were subjected to ICP-ES analysis in an Optima 3000 DV ICP Emission Spectrometer (PerkinElmer, Waltham, MA, USA) with yttrium used as an internal standard .
Overnight inoculum cultures of C. thermocellum 27405 were grown anaerobically in 50 mL bottles. Five 40 mL aliquots of cultures grown on 5 g/L Avicel in MTC in 50 mL serum bottles were used to inoculate the 5-L Twin BIOSTAT B plus fermenters (Sartorius Stedim Biotech, Göttingen, Germany) (total volume 2 L) for a final inoculum of 10%. Two replicate fermentations were performed for each biomass. The biomass loading in each fermenter was 5 g/L (dry weight basis) in MTC medium. The fermenters were run at 58°C, 250 rpm, and pH-controlled at 7.0 with 3 N NaOH. Time = 0 samples were taken immediately postinoculation of the fermenter vessels. At 12 hours and 37 hours postinoculation, 50 mL samples were removed for transcriptomic analyses.
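The inoculum and loading figures above can be checked with a short calculation. This is a minimal sketch that assumes the stated 2 L total volume already includes the pooled inoculum; the function names are hypothetical.

```python
def inoculum_fraction(n_aliquots: int, aliquot_ml: float, total_volume_ml: float) -> float:
    """Fraction of the fermenter volume contributed by the pooled inoculum."""
    return n_aliquots * aliquot_ml / total_volume_ml


def biomass_mass_g(loading_g_per_l: float, total_volume_l: float) -> float:
    """Dry biomass needed per vessel for a given solids loading."""
    return loading_g_per_l * total_volume_l


# Five 40 mL aliquots into a 2 L fermentation, at 5 g/L dry solids.
fraction = inoculum_fraction(5, 40.0, 2000.0)
mass = biomass_mass_g(5.0, 2.0)
print(f"inoculum: {fraction:.0%}, dry biomass per vessel: {mass:.0f} g")
```

Under these assumptions the pooled 200 mL corresponds to the reported 10% inoculum, and each vessel holds 10 g of dry pretreated biomass.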
Samples were removed periodically from the fermenter vessel to determine cell counts and monitor fermentation product formation and residual carbohydrates (Additional file 8). Samples for cell counts were diluted with Milli-Q H2O when necessary and a 10 μL aliquot was loaded onto a hemocytometer counting chamber for counting. Cell counts were performed in triplicate for each fermenter at a given time point.
Fermentation residues were analyzed for carbohydrate composition using quantitative saccharification assay ASTM E 1758–01 (ASTM 2003), NREL/TP 510–42618, and HPLC method NREL/TP 510–42623. Cell-free samples from the fermenters were analyzed for metabolites (acetic acid, lactic acid, and ethanol) and residual carbohydrates (cellobiose, glucose, xylose, and arabinose) using a LaChrom Elite HPLC System (Hitachi High Technologies America, Pleasanton, CA, USA) equipped with a refractive index detector (model L-2490), as previously described.
Cells pelleted from an 8 mL sample drawn from each fermenter were resuspended in 1.5 mL of TRIzol (Invitrogen, Carlsbad, CA, USA) and lysed by bead beating with 0.8 g of 0.1 mm glass beads (BioSpec Products, Bartlesville, OK, USA) with 3 × 20 second bead beating treatments at 6,500 rpm in a Precellys 24 high-throughput tissue homogenizer (Bertin Technologies, Montigny-le-Bretonneux, France). The RNA from each cell lysate was purified, DNaseI-treated, and quantity and quality assessed, as previously described. Purified RNA of high quality (RIN >8) was pooled from the same fermentation samples and depleted of rRNA using Ribo-Zero rRNA Removal Kit for Gram-positive bacteria (Epicentre, Madison, WI, USA). The sample was then concentrated with RNA Clean & Concentrate-5 (Zymo Research, Irvine, CA, USA) following the manufacturer’s protocol.
Depleted RNA was used as the starting material for the Epicentre ScriptSeq mRNA-Seq Library Preparation Kit (Illumina-compatible) utilizing the FailSafe PCR Enzyme Mix (Epicentre) and following the manufacturer’s protocol. cDNA tagged with standard adaptors was eluted with 20 μL of Buffer EB provided in the MinElute PCR Purification Kit (Qiagen, Venlo, Netherlands) according to the ScriptSeq protocol. Cycles were increased to 14 during amplification and samples were purified using the MinElute PCR Purification Kit and eluted with 20 μL of Buffer EB. The final mRNA-seq library was quantified with a Qubit fluorometer (Invitrogen) and library quality was assessed with Bioanalyzer High Sensitivity DNA Chip (Agilent, Santa Clara, CA, USA).
Samples were diluted to 2 nM, denatured, and further diluted to 6 pM. These were run on cBot (Illumina, San Diego, CA, USA) (SR_Amp_Lin_Block_Hyb_V7) overnight to cluster on version 1.5 Flow Cell. The mRNA-seq libraries were analyzed on a HiSeq 2000 (Illumina) platform with a SR50 sequencing kit for a single read of 51 cycles. The lane containing the F188 12-hour Populus sample included the control of phiX DNA.
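The overall dilution from the 2 nM library to the 6 pM loading concentration follows from C1·V1 = C2·V2. The sketch below is illustrative only: the 1,000 μL final volume is a hypothetical value not stated in the text, the function name is an assumption, and any intermediate dilution during denaturation is ignored.

```python
def stock_volume_ul(stock_pm: float, target_pm: float, final_volume_ul: float) -> float:
    """Volume of stock needed so that the final mix has the target concentration."""
    if target_pm > stock_pm:
        raise ValueError("target concentration cannot exceed stock concentration")
    return target_pm * final_volume_ul / stock_pm


stock_pm = 2.0 * 1000.0      # 2 nM expressed in pM
target_pm = 6.0              # loading concentration from the text
final_volume_ul = 1000.0     # hypothetical final volume, not from the source

volume = stock_volume_ul(stock_pm, target_pm, final_volume_ul)
print(f"dilution factor: {stock_pm / target_pm:.1f}x, stock needed: {volume:.1f} uL")
```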
Raw reads were mapped to the genome [GenBank:CP000568.1] using CLC Genomics Workbench version 5.5.1 (CLC bio, Aarhus, Denmark) with the default settings for prokaryote genomes. Uniquely mapped reads were log2 transformed on importation into JMP Genomics version 6 (SAS Institute, Cary, NC, USA). Data were normalized using default settings for each of the four normalization strategies (see Additional file 12 for pre- and post-normalization distribution curves) and any genes with no read counts were removed prior to ANOVA analysis. Filtering was applied to identify those genes with an FDR <0.05 and a log2 fold change in expression greater than 1 or less than -1. Raw RNA-seq data have been deposited in the NCBI Sequence Read Archive (SRA) [SRA:060947] and we have made mapped reads and data available through the BioEnergy Science Center (BESC) KnowledgeBase http://bobcat.ornl.gov/besc/index.jsp. Samples in the SRA series [SRA:060947] are labeled accordingly with the accession number given in square brackets. C. thermocellum harvested after growth on Populus for 12 hours: F185_Ctherm_Pop_12 hr [SRR:620218] and F188_Ctherm_Pop_12 hr [SRR:620325]. C. thermocellum harvested after growth on Populus for 37 hours: F185_Ctherm_Pop_37 hr [SRR:620219] and F188_Ctherm_Pop_37 hr [SRR:620327]. C. thermocellum harvested after growth on switchgrass for 12 hours: F186_Ctherm_Swg_12 hr [SRR:620229] and F187_Ctherm_Swg_12 hr [SRR:620532]. C. thermocellum harvested after growth on switchgrass for 37 hours: F186_Ctherm_Swg_37 hr [SRR:620238] and F187_Ctherm_Swg_37 hr [SRR:620324]. Note that the same nomenclature of fermenter number (F185, F186, F187, and F188), biomass substrate (Pop and Swg), and time point of sampling (12 hours and 37 hours) is used for naming the samples in the microarray Gene Expression Omnibus (GEO) submission; see details below.
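The filtering step described above (an FDR below 0.05 combined with a two-fold change threshold) can be sketched as follows. This is not the JMP Genomics implementation: the Benjamini-Hochberg adjustment is assumed here because the FDR procedure is not named in the text, and the function names, gene labels, and values are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(results: Dict[str, Tuple[float, float]],
                      fdr_cutoff: float = 0.05,
                      log2fc_cutoff: float = 1.0) -> List[str]:
    """Keep genes with adjusted p < fdr_cutoff and |log2 fold change| > log2fc_cutoff."""
    genes = list(results)
    qvalues = benjamini_hochberg([results[g][1] for g in genes])
    return [g for g, q in zip(genes, qvalues)
            if q < fdr_cutoff and abs(results[g][0]) > log2fc_cutoff]


# Hypothetical (log2 fold change, raw p-value) pairs, for illustration only.
example = {
    "gene_A": (2.3, 0.001),
    "gene_B": (0.4, 0.002),
    "gene_C": (-1.8, 0.01),
    "gene_D": (1.5, 0.6),
}
print(significant_genes(example))
```

In this toy input, gene_B passes the FDR cutoff but not the fold-change cutoff, and gene_D shows the reverse, so only gene_A and gene_C are retained.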
Microarray sample labeling, hybridization, scan, and statistical analysis of array data
RNA-seq libraries were also used for hybridization to the microarray. Beginning with 100 ng of cDNA, half volume Cy3 labeling reactions were undertaken for all eight samples according to the manufacturer’s protocols. Cy3 labeling efficiency was assessed by NanoDrop ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and determined to fall within the range of 20 to 24 pmol/μg. Hybridizations were conducted using a 12-bay hybridization station (BioMicro Systems, Salt Lake City, UT, USA) and the arrays dried using a MAUI Wash System (BioMicro Systems). Microarrays were scanned with a SureScan High-Resolution DNA Microarray Scanner (5 μm) (Agilent), and the images were quantified using NimbleScan software (Roche NimbleGen, Madison, WI, USA).
Raw data were log2 transformed and imported into the statistical analysis software JMP Genomics 6.0 (SAS Institute). The data were normalized together using a single round of the LOESS normalization algorithm within JMP Genomics, and distribution analyses conducted before and after normalization were used as a quality control step. An ANOVA was performed in JMP Genomics to determine differential gene expression levels via a direct comparison of the two biomasses and time points using the FDR testing method (P <0.05) and Kenward-Roger degrees of freedom method. Microarray data have been deposited in the NCBI GEO database [GSE:47010]. Samples in the GEO series [GSE:47010] are labeled accordingly with the specific GEO sample accession number given in square brackets. C. thermocellum harvested after growth on Populus for 12 hours: F185_Pop_12 hr_rep1 [GSM:1142896] and F188_Pop_12 hr_rep1 [GSM:1142902]. C. thermocellum harvested after growth on Populus for 37 hours: F185_Pop_37 hr_rep1 [GSM:1142897] and F188_Pop_37 hr_rep1 [GSM:1142903]. C. thermocellum harvested after growth on switchgrass for 12 hours: F186_Swg_12 hr_rep1 [GSM:1142898] and F187_Swg_12 hr_rep1 [GSM:1142900]. C. thermocellum harvested after growth on switchgrass for 37 hours: F186_Swg_37 hr_rep1 [GSM:1142899] and F187_Swg_37 hr_rep1 [GSM:1142901].
Microarray data were validated using RT-qPCR, as described previously . Six genes representing a range of gene expression values based on microarray hybridizations were analyzed using qPCR from cDNA derived from different time point samples. Oligonucleotide sequences of the primers targeting the six genes selected for qPCR analysis were: Cthe_0344_F CGACTTCCCGAACCAGATAA, Cthe_0344_R GCAGCGGCTATCTTCATTTC; Cthe_0482_F GAGCAGGGATTGGTAATGGA, Cthe_0482_R TACCGCAAGACCTACAAGCA; Cthe_1481_F AGTCATATCCGAAAACATGG, Cthe_1481_R TTGTAGTCGTCAAGGGAAGT; Cthe_1604_F GTGTCCCCGCTATTGCTAAA, Cthe_1604_R ATGGGTAAAATGCCGAATGA; Cthe_1951_F AAAATAAAAGCCCAGGATTC, Cthe_1951_R GCATTATCCTGAAGTTCGTC; and Cthe_2531_F CGGAAAGGACATTGTCATCC, Cthe_2531_R CAAAGCCAGGGTTACGACAT.
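As a quick sanity check on primer sequences such as those listed above, GC content and a rough melting temperature can be computed directly from the sequence. The sketch below uses the Wallace rule (2 °C per A/T and 4 °C per G/C), which is only a coarse approximation and is an assumption of this example; the text does not state how the primers were designed or what their melting temperatures were. The function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule: 2(A+T) + 4(G+C)."""
    seq = seq.upper()
    at = sum(base in "AT" for base in seq)
    gc = sum(base in "GC" for base in seq)
    return 2 * at + 4 * gc


# Primer pair for Cthe_0344, copied from the list above.
primers = {
    "Cthe_0344_F": "CGACTTCCCGAACCAGATAA",
    "Cthe_0344_R": "GCAGCGGCTATCTTCATTTC",
}
for name, seq in primers.items():
    print(name, len(seq), f"GC={gc_fraction(seq):.0%}", f"Tm~{wallace_tm(seq)} C")
```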
ANOVA: Analysis of variance
BESC: BioEnergy Science Center
BLAST: Basic Local Alignment Search Tool
DOE: Department of Energy
FDR: False discovery rate
GEO: Gene Expression Omnibus
HPLC: High performance liquid chromatography
ICP-ES: Inductively coupled plasma emission spectroscopy
JGI: Joint Genome Institute
KDMM: Kernel density mean of M component
MAQC: MicroArray Quality Control
MS/MS: Tandem mass spectrometry
MTC: Medium for Thermophilic Clostridia
NCBI: National Center for Biotechnology Information
NREL: National Renewable Energy Laboratory
ORF: Open reading frame
ORNL: Oak Ridge National Laboratory
PCR: Polymerase chain reaction
RBS: Ribosome binding site
RIN: RNA integrity number
RPKM: Reads per kilobase per million
RPM: Reads per million
SRA: Sequence Read Archive
TMM: Trimmed mean of M component
UQS: Upper quartile scaling
USEPA: United States Environmental Protection Agency.
Lynd LR, Van Zyl WH, McBride JE, Laser M: Consolidated bioprocessing of cellulosic biomass: an update. Curr Opin Biotechnol. 2005, 16: 577-583. 10.1016/j.copbio.2005.08.009.
Lynd LR, Weimer PJ, Van Zyl WH, Pretorius IS: Microbial cellulose utilization: Fundamentals and biotechnology. Microbiol Mol Biol Rev. 2002, 66: 506-577. 10.1128/MMBR.66.3.506-577.2002.
Demain AL, Newcomb M, Wu JH: Cellulase, clostridia, and ethanol. Microbiol Mol Biol Rev. 2005, 69: 124-154. 10.1128/MMBR.69.1.124-154.2005.
Alper H, Stephanopoulos G: Engineering for biofuels: exploiting innate microbial capacity or importing biosynthetic potential?. Nat Rev Microbiol. 2009, 7: 715-723. 10.1038/nrmicro2186.
Farrell AE, Plevin RJ, Turner BT, Jones AD, O’Hare M, Kammen DM: Ethanol can contribute to energy and environmental goals. Science. 2006, 311: 506-508. 10.1126/science.1121416.
Hahn-Hagerdal B, Galbe M, Gorwa-Grauslund MF, Liden G, Zacchi G: Bio-ethanol - the fuel of tomorrow from the residues of today. Trends Biotechnol. 2006, 24: 549-556. 10.1016/j.tibtech.2006.10.004.
Himmel ME, Ding S-Y, Johnson DK, Adney WS, Nimlos MR, Brady JW, Foust TD: Biomass recalcitrance: Engineering plants and enzymes for biofuels production. Science. 2007, 315: 804-807. 10.1126/science.1137016.
Stephanopoulos G: Challenges in engineering microbes for biofuels production. Science. 2007, 315: 801-804. 10.1126/science.1139612.
Delcher A, Bratke K, Powers E, Salzberg S: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007, 23: 673-679. 10.1093/bioinformatics/btm009.
Badger J, Olsen G: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999, 16: 512-524. 10.1093/oxfordjournals.molbev.a026133.
Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 2010, 11: 119-10.1186/1471-2105-11-119.
Hauser LJ, Land ML, Brown SD, Larimer F, Keller KL, Rapp-Giles BJ, Price MN, Lin M, Bruce DC, Detter JC, Tapia R, Han CS, Goodwin LA, Cheng JF, Pitluck S, Copeland A, Lucas S, Nolan M, Lapidus AL, Palumbo AV, Wall JD: The complete genome sequence and updated annotation of Desulfovibrio alaskensis G20. J Bacteriol. 2011, 193: 4268-4269. 10.1128/JB.05400-11.
Yang S, Pappas KM, Hauser LJ, Land ML, Chen GL, Hurst GB, Pan C, Kouvelis VN, Typas MA, Pelletier DA, Klingeman DM, Chang YJ, Samatova NF, Brown SD: Improved genome annotation for Zymomonas mobilis. Nat Biotechnol. 2009, 27: 893-894. 10.1038/nbt1009-893.
Raman B, Pan C, Hurst GB, Rodriguez M, McKeown CK, Lankford PK, Samatova NF, Mielenz JR: Impact of pretreated switchgrass and biomass carbohydrates on Clostridium thermocellum ATCC 27405 cellulosome composition: A quantitative proteomic analysis. PLoS One. 2009, 4: e5271-10.1371/journal.pone.0005271.
Gold ND, Martin VJJ: Global view of the Clostridium thermocellum cellulosome revealed by quantitative proteomic analysis. J Bacteriol. 2007, 189: 6787-6795. 10.1128/JB.00882-07.
Riederer A, Takasuka TE, Makino S-I, Stevenson DM, Bukhman YV, Elsen NL, Fox BG: Global gene expression patterns in Clostridium thermocellum as determined by microarray analysis of chemostat cultures on cellulose or cellobiose. Appl Environ Microbiol. 2011, 77: 1243-1253. 10.1128/AEM.02008-10.
Fontes CM, Gilbert HJ: Cellulosomes: Highly efficient nanomachines designed to deconstruct plant cell wall complex carbohydrates. Ann Rev Biochem. 2010, 79: 655-681. 10.1146/annurev-biochem-091208-085603.
Department of Energy (DOE), Office of Energy Efficiency and Renewable Energy (EERE), Office of the Biomass Program: U.S. Billion-Ton Update: Biomass Supply for a Bioenergy and Bioproducts Industry. 2011, Oak Ridge, TN: Oak Ridge National Laboratory, 227-
Leimena MM, Wels M, Bongers RS, Smid EJ, Zoetendal EG, Kleerebezem M: Comparative analysis of Lactobacillus plantarum WCFS1 transcriptomes by using DNA microarray and next-generation sequencing technologies. Appl Environ Microbiol. 2012, 78: 4141-4148. 10.1128/AEM.00470-12.
Passalacqua KD, Varadarajan A, Ondov BD, Okou DT, Zwick ME, Bergman NH: Structure and complexity of a bacterial transcriptome. J Bacteriol. 2009, 191: 3203-3211. 10.1128/JB.00122-09.
Yoder-Himes DR, Chain PS, Zhu Y, Wurtzel O, Rubin EM, Tiedje JM, Sorek R: Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc Natl Acad Sci U S A. 2009, 106: 3976-3981. 10.1073/pnas.0813403106.
Jourdren L, Bernard M, Dillies MA, Le Crom S: Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics. 2012, 28: 1542-1543. 10.1093/bioinformatics/bts165.
Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26: 1135-1145. 10.1038/nbt1486.
Oshlack A, Robinson MD, Young MD: From RNA-seq reads to differential expression results. Genome Biol. 2010, 11: 220-10.1186/gb-2010-11-12-220.
Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinforma. 2013, 14: 91-10.1186/1471-2105-14-91.
Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359.
Luo C, Hu G-Q, Zhu H: Genome reannotation of Escherichia coli CFT073 with new insights into virulence. BMC Genomics. 2009, 10: 552-10.1186/1471-2164-10-552.
Siezen RJ, Francke C, Renckens B, Boekhorst J, Wels M, Kleerebezem M, Van Hijum SAFT: Complete resequencing and reannotation of the Lactobacillus plantarum WCFS1 genome. J Bacteriol. 2012, 194: 195-196. 10.1128/JB.06275-11.
Wood V, Rutherford KM, Ivens A, Rajandream MA, Barrell B: A re-annotation of the Saccharomyces cerevisiae genome. Comp Funct Genomics. 2001, 2: 143-154. 10.1002/cfg.86.
Yang S, Giannone RJ, Dice L, Yang ZK, Engle NL, Tschaplinski TJ, Hettich RL, Brown SD: Elucidation of the Clostridium thermocellum ATCC27405 ethanol shock responses using an integrated transcriptomic, proteomic and metabolomic profiling approach. BMC Genomics. 2012, 13: 336-10.1186/1471-2164-13-336.
Mazumder K, York WS: Structural analysis of arabinoxylans isolated from ball-milled switchgrass biomass. Carbohydr Res. 2010, 345: 2183-2193. 10.1016/j.carres.2010.07.034.
Nataf Y, Bahari L, Kahel-Raifer H, Borovok I, Lamed R, Bayer EA, Sonenshein AL, Shoham Y: Clostridium thermocellum cellulosomal genes are regulated by extracytoplasmic polysaccharides via alternative sigma factors. Proc Natl Acad Sci U S A. 2010, 107: 18646-18651. 10.1073/pnas.1012175107.
Nataf Y, Yaron S, Stahl F, Lamed R, Bayer EA, Scheper TH, Sonenshein AL, Shoham Y: Cellodextrin and laminaribiose ABC transporters in Clostridium thermocellum. J Bacteriol. 2009, 191: 203-209. 10.1128/JB.01190-08.
Yang S, Giannone RJ, Dice L, Yang ZK, Engle NL, Tschaplinski TJ, Hettich RL, Brown SD: Clostridium thermocellum ATCC27405 transcriptomic, metabolomic and proteomic profiles after ethanol stress. BMC Genomics. 2012, 13: 336-10.1186/1471-2164-13-336.
Esvelt KM, Wang HH: Genome-scale engineering for systems and synthetic biology. Mol Syst Biol. 2013, 9: 641-
Chen Y, Indurthi DC, Jones SW, Papoutsakis ET: Small RNAs in the Genus Clostridium. mBio. 2011, 2: e00340-10-doi:10.1128/mBio.00340-10
van den Berg BH, McCarthy FM, Lamont SJ, Burgess SC: Re-annotation is an essential step in systems biology modeling of functional genomics data. PLoS One. 2010, 5: e10642-10.1371/journal.pone.0010642.
Salzberg SL: Genome re-annotation: a wiki solution?. Genome Biol. 2007, 8: 102-10.1186/gb-2007-8-1-102.
Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, French StatOmique Consortium: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013, 14: 671-683. 10.1093/bib/bbs046.
Oshlack A, Wakefield MJ: Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009, 4: 14-10.1186/1745-6150-4-14.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.
Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinforma. 2010, 11: 94-10.1186/1471-2105-11-94.
Ebringerova A: Structural diversity and application potential of hemicelluloses. Macromol Symp. 2005, 232: 1-12. 10.1002/masy.200551401.
Izydorczyk MS, Biliaderis CG: Cereal arabinoxylans: Advances in structure and physicochemical properties. Carbohyd Polym. 1995, 28: 33-48. 10.1016/0144-8617(95)00077-1.
Klinke HB, Thomsen AB, Ahring BK: Inhibition of ethanol-producing yeast and bacteria by degradation products produced during pre-treatment of biomass. Appl Microbiol Biotechnol. 2004, 66: 10-26. 10.1007/s00253-004-1642-2.
Yee KL, Rodriguez MJ, Tschaplinski TJ, Engle NL, Martin MZ, Fu C, Wang ZY, Hamilton-Brehm SD, Mielenz JR: Evaluation of the bioconversion of genetically modified switchgrass using simultaneous saccharification and fermentation and a consolidated bioprocessing approach. Biotechnol Biofuels. 2012, 5: 81-10.1186/1754-6834-5-81.
Pu Y, Hu F, Huang F, Davison BH, Ragauskas AJ: Assessing the molecular structure basis for biomass recalcitrance during dilute acid and hydrothermal pretreatments. Biotechnol Biofuels. 2013, 6: 15-10.1186/1754-6834-6-15.
DeMartini JD, Pattathil S, Miller JS, Li H, Hahn MG, Wyman CE: Investigating plant cell wall components that affect biomass recalcitrance in poplar and switchgrass. Energy Environ Sci. 2013, 6: 898-909.
Zhang YH, Lynd LR: Cellulose utilization by Clostridium thermocellum: bioenergetics and hydrolysis product assimilation. Proc Natl Acad Sci U S A. 2005, 102: 7321-7325. 10.1073/pnas.0408734102.
Ahsan M, Matsumoto M, Karita S, Kimura T, Sakka K, Ohmiya K: Purification and characterization of the family J catalytic domain derived from the Clostridium thermocellum endoglucanase CelJ. Biosci Biotechnol Biochem. 1997, 61: 427-431. 10.1271/bbb.61.427.
Davidson AL, Dassa E, Orelle C, Chen J: Structure, function, and evolution of bacterial ATP-binding cassette systems. Microbiol Mol Biol Rev. 2008, 72: 317-364. 10.1128/MMBR.00031-07.
Strobel HJ, Caldwell FC, Dawson KA: Carbohydrate transport by the anaerobic thermophile Clostridium thermocellum LQRI. Appl Environ Microbiol. 1995, 61: 4012-4015.
Alexander JK: Purification and specificity of cellobiose phosphorylase from Clostridium thermocellum. J Biol Chem. 1968, 243: 2899-2904.
Fischer RJ, Oehmcke S, Meyer U, Mix M, Schwarz K, Fiedler T, Bahl H: Transcription of the pst operon of Clostridium acetobutylicum is dependent on phosphate concentration and pH. J Bacteriol. 2006, 188: 5469-5478. 10.1128/JB.00491-06.
Gebhard S, Tran SL, Cook GM: The Phn system of Mycobacterium smegmatis: a second high-affinity ABC-transporter for phosphate. Microbiology. 2006, 152: 3453-3465. 10.1099/mic.0.29201-0.
Antelmann H, Scharf C, Hecker M: Phosphate starvation-inducible proteins of Bacillus subtilis: proteomics and transcriptional analysis. J Bacteriol. 2000, 182: 4478-4490. 10.1128/JB.182.16.4478-4490.2000.
Aguena M, Yagil M, Spira B: Transcriptional analysis of the pst operon of Escherichia coli. Mol Genet Genomics. 2002, 268: 518-524. 10.1007/s00438-002-0764-4.
El-Nashaar HM, Banowetz GM, Griffith SM, Casler MD, Vogel KP: Genotypic variability in mineral composition of switchgrass. Bioresour Technol. 2009, 100: 1809-1814. 10.1016/j.biortech.2008.09.058.
Diaz-Ramirez M, Boman C, Sebastian F, Royo J, Xiong SJ, Bostrom D: Ash characterization and transformation behavior of the fixed-bed combustion of novel crops: poplar, brassica, and cassava fuels. Energ Fuel. 2012, 26: 3218-3229. 10.1021/ef2018622.
Alvarez-Ortega C, Olivares J, Martinez JL: RND multidrug efflux pumps: what are they good for?. Front Microbiol. 2013, 4: 7.
Mearls EB, Izquierdo JA, Lynd LR: Formation and characterization of non-growth states in Clostridium thermocellum: spores and L-forms. BMC Microbiol. 2012, 12: 180-10.1186/1471-2180-12-180.
Steiner E, Scott J, Minton NP, Winzer K: An agr quorum sensing system regulates granulose formation and sporulation in Clostridium acetobutylicum. Appl Environ Microbiol. 2012, 78: 1113-1122. 10.1128/AEM.06376-11.
Paredes CJ, Alsaker KV, Papoutsakis ET: A comparative genomic view of clostridial sporulation and physiology. Nat Rev Microbiol. 2005, 3: 969-978. 10.1038/nrmicro1288.
Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu TM, Goodsaid FM, Pusztai L, Shaughnessy JD, Oberthuer A, Thomas RS, Paules RS, Fielden M, Barlogie B, Chen W, Du P, Fischer M, Furlanello C, Gallas BD, Ge X, Megherbi DB, Symmans WF, Wang MD, Zhang J, Bitter H, Brors B, Bushel PR, Bylesjo M, et al: The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol. 2010, 28: 827-838. 10.1038/nbt.1665.
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, MAQC Consortium, et al: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006, 24: 1151-1161. 10.1038/nbt1239.
Schell DJ, Farmer J, Newman M, McMillan JD: Dilute-sulfuric acid pretreatment of corn stover in pilot-scale reactor: investigation of yields, kinetics, and enzymatic digestibilities of solids. Appl Biochem Biotechnol. 2003, 105–108: 69-85.
Sannigrahi P, Ragauskas AJ: Characterization of fermentation residues from the production of bio-ethanol from lignocellulosic feedstocks. J Biobased Mater Bio. 2011, 5: 514-519. 10.1166/jbmb.2011.1170.
Kridelbaugh DM, Nelson J, Engle NL, Tschaplinski TJ, Graham DE: Nitrogen and sulfur requirements for Clostridium thermocellum and Caldicellulosiruptor bescii on cellulosic substrates in minimal nutrient media. Bioresour Technol. 2013, 130: 125-135.
Syed MH, Karpinets TV, Parang M, Leuze MR, Park BH, Hyatt D, Brown SD, Moulton S, Galloway MD, Uberbacher EC: BESC knowledgebase public portal. Bioinformatics. 2012, 28: 750-751. 10.1093/bioinformatics/bts016.
The authors gratefully acknowledge Brian Davison (ORNL) for critical review of the manuscript. The authors thank Kelsey Yee (ORNL), Janet Westpheling (University of Georgia, Athens, GA, USA), Lee Lynd (Dartmouth College, Hanover, NH, USA), and Edward Bayer (Weizmann Institute of Science, Rehovot, Israel) for helpful discussions. Sagar Utturkar (University of Tennessee, Knoxville, TN, USA) provided technical assistance with sequence data. This work was supported by the Office of Biological and Environmental Research in the DOE Office of Science through the BESC, a DOE Bioenergy Research Center. ORNL is managed by UT-Battelle, LLC, Oak Ridge, TN, USA, for the DOE under contract DE-AC05-00OR22725.
SLM, TMC, and RDW are employees of the SAS Institute and developers of JMP Genomics.
SDB, MR, JRM, and CMW designed the experiments. MR, CMJ, DMK, AJR, and CMW carried out the experiments. CMW, SLM, TMC, RDW, MR, JRM, LJH, MLL, MHS, AJR, TJT, and SDB analyzed the data. CMW and SDB wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Peptides BLAST output. Complete output from BLAST search of peptides against the [GenBank:CP000568.1] version of the C. thermocellum ATCC 27405 genome. The query name given in the first column includes the ORF name, the genome coordinates of the ORF (ORF start to ORF stop), the peptide ID, and the spectral counts of each mapped peptide. The subject is the [GenBank:CP000568.1] version of the C. thermocellum ATCC 27405 genome. The remaining columns are standard output from the BLAST search. (XLSX 3 MB)
Additional file 2: Peptides used to manually curate the C. thermocellum genome. A subset of Additional file 1 that includes those peptides used to update ORF start sites and check for new genes. False positives were common (see column labeled Comments) and were due to peptides hitting multiple locations in the genome. (XLSX 14 KB)
Additional file 3: Peptide support for updates to the C. thermocellum genome. Examples of where peptides were used to update the C. thermocellum ATCC 27405 genome annotation. (A) Illustration of where peptide hits were used to update the predicted start site of an ORF; (B) illustration of peptide support for the addition of a new gene; and (C) illustration of peptide support for the expression of an existing pseudogene. Within each image: 1, genome coordinates; 2, RNA-seq data from one replicate of C. thermocellum grown on Populus for 12 hours; 3, existing gene coding sequence; 4, updated ORF; and 5, mapped peptides. (PDF 293 KB)
Additional file 4: Microarray probe assignment update. The methods and results from the update to the microarray probe gene assignment. (DOCX 14 KB)
Additional file 5: Table of BLAST results for the new probe assignment. Dataset of results from a BLAST search of probes (60 bp in length) from the microarray platform (GEO platform GPL15992). The best hit against the C. thermocellum ATCC 27405 genome [GenBank:CP000568.1] is given in the column Gene, with the percentage of identical nucleotides and the alignment length between the query and result sequence given in the ID column and Alignment column, respectively. The proportion of the alignment length, a measure of alignment accuracy, is given in the column Proportion of alignment length and is calculated as ID/100 × Alignment length for those alignments greater than 36. (XLSX 2 MB)
Additional file 6: New probe assignments. Dataset containing a subset of probes from Additional file 5. These sequences were originally designed as probes targeting non-coding regions of the C. thermocellum ATCC 27405 genome. Results of BLAST search of probes (60 bp in length) from the microarray platform (GEO platform GPL15992). The best hit against the C. thermocellum ATCC 27405 genome [GenBank:CP000568.1] is given in the column Gene, with the percentage of identical nucleotides and alignment between the query and result sequence given in the ID column and Alignment column, respectively. (XLSX 71 KB)
Additional file 7: ICP-ES elemental analysis results. Table of results from the compositional analysis of the pretreated and unpretreated biomass substrates. Samples of dried biomass substrates were analyzed for elemental composition (mg/kg) by ICP-ES. (DOCX 18 KB)
Additional file 8: Fermentation products and cell counts. Fermentation products and cell counts of C. thermocellum grown in duplicate batch fermenters. Arrows correspond to time points sampled for transcriptomic analyses. Fermentation products were determined by HPLC. (PDF 176 KB)
Additional file 9: Summary of RNA-seq reads. Table summarizing the RNA-seq reads mapped to the C. thermocellum ATCC 27405 genome [GenBank:CP000568.1] using CLC Genomics Workbench version 5.5.1 (CLC bio) with the default settings for prokaryote genomes. Reads that mapped uniquely to a single locus in the genome [GenBank:CP000568.1] were used in further analyses. (DOCX 16 KB)
Additional file 10: Correlation curves of biological replicates. Figure of the gene-wise correlation of transcriptome data of pre-normalized reads (RNA-seq) or pre-normalized intensity values (microarray) of biological replicates log2 transformed and plotted against each other; each axis corresponds to a single biological replicate for each condition. Pearson R values are given for each correlation. If values for the RNA-seq were missing, that is, no reads for a particular gene, values were estimated by the REML method in JMP Genomics 6. (PDF 42 KB)
Additional file 11: Spearman correlation of RNA-seq and array for each averaged sample. Figure showing the gene-wise correlation of transcriptome data from averaged biological duplicates of pre-normalized microarray log2 transformed intensity values and pre-normalized RNA-seq log2 transformed reads. The color intensities (scale given) indicate the level of Spearman correlation coefficients of the sets of data. (PDF 117 KB)
Additional file 12: Pre- and post-normalization distribution curves. Figure of the distribution curves of pre- and post-normalization log2 transformed intensity values or reads (x-axis displays minimum and maximum values) of each gene for the microarray and RNA-seq, respectively. (PDF 467 KB)
Additional file 13: Hierarchical clustering of gene abundance profiles. Dataset of the abundance profiles of C. thermocellum ATCC 27405 genes detected in both the microarray and RNA-seq datasets. Given are log2 transformed values of normalized data for each gene. The cluster in which each gene was grouped in Figure 1 is indicated. (XLSX 808 KB)
Additional file 14: RNA-seq reads mapped to sRNA and 3383. Figure showing the RNA-seq reads from a representative of each biomass fermentation mapped to the updated C. thermocellum genome [GenBank:CP000568.1]. (A) Rfam and mBio predictions for sRNA gene structure; blue indicates high levels of gene expression. (B) High levels of expression from a newly annotated gene, Cthe_3383 (black arrow), with predicted functions as an AgrD-like signaling peptide. (C) Multiple sequence alignments of small newly predicted C. thermocellum proteins, Cthe_3383 and Cthe_3348, against C. acetobutylicum ATCC 824 and Staphylococcus aureus AgrD sequences. (D) Pairwise percent identical residue comparisons. CLC Genomics Workbench (version 6.0.1) was used to create alignments and comparisons. (PDF 232 KB)
Additional file 15: Significantly differentially expressed genes. Dataset of differential gene expression, given as a ratio between the stated conditions. The FDR adjusted P value is included for each gene comparison; an FDR adjusted P value <0.05 together with a log2 transformed ratio greater than 1 or less than −1 between the conditions is indicative of altered gene regulation. (XLSX 2 MB)
Additional file 16: qPCR validation of microarray and RNA-seq expression data. Figure of the RT-qPCR confirmation of differential gene regulation when C. thermocellum ATCC 27405 was harvested at 12 hours postinoculation on the biomass substrates Populus and switchgrass. R² values are given for the RT-qPCR correlation with both the array and RNA-seq analytical platforms. (PDF 87 KB)
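The selection rule described for Additional file 15 (an FDR adjusted P value below 0.05 combined with a log2 transformed ratio beyond ±1, that is, more than a 2-fold change) can be illustrated with a minimal Python sketch. This is not the authors' JMP Genomics workflow; the `GeneResult` record, the function name, and the example gene labels and values are hypothetical and serve only to show how the two thresholds combine.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """One gene comparison (hypothetical record layout, not the file's actual columns)."""
    gene: str
    log2_ratio: float  # log2 transformed expression ratio between two conditions
    fdr_p: float       # FDR adjusted P value for the comparison


def is_differentially_expressed(result, fdr_cutoff=0.05, log2_cutoff=1.0):
    """Return True when both thresholds stated for Additional file 15 are met."""
    return result.fdr_p < fdr_cutoff and abs(result.log2_ratio) > log2_cutoff


# Made-up values for illustration only; these are not data from the study.
results = [
    GeneResult("gene_A", 2.3, 0.001),   # passes both thresholds
    GeneResult("gene_B", 0.4, 0.001),   # significant, but less than 2-fold
    GeneResult("gene_C", -1.8, 0.20),   # more than 2-fold, but not significant
    GeneResult("gene_D", -1.5, 0.01),   # passes both thresholds (lower expression)
]

hits = [r.gene for r in results if is_differentially_expressed(r)]
print(hits)
```

Taking the absolute value of the log2 ratio treats higher and lower expression symmetrically, so a gene with a ratio of −1.5 is retained in the same way as one with +1.5.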
Cite this article
Wilson, C.M., Rodriguez, M., Johnson, C.M. et al. Global transcriptome analysis of Clostridium thermocellum ATCC 27405 during growth on dilute acid pretreated Populus and switchgrass. Biotechnol Biofuels 6, 179 (2013). https://doi.org/10.1186/1754-6834-6-179