Correlation of structure, function and protein dynamics in GH7 cellobiohydrolases from Trichoderma atroviride, T. reesei and T. harzianum

Background The ascomycete fungus Trichoderma reesei is the predominant source of enzymes for industrial conversion of lignocellulose. Its glycoside hydrolase family 7 cellobiohydrolase (GH7 CBH) TreCel7A constitutes nearly half of the enzyme cocktail by weight and is the major workhorse in the cellulose hydrolysis process. The orthologs from Trichoderma atroviride (TatCel7A) and Trichoderma harzianum (ThaCel7A) show high sequence identity with TreCel7A, ~ 80%, and represent naturally evolved combinations of cellulose-binding tunnel-enclosing loop motifs, which have been suggested to influence intrinsic cellobiohydrolase properties, such as endo-initiation, processivity, and off-rate. Results The TatCel7A, ThaCel7A, and TreCel7A enzymes were characterized for comparison of function. The catalytic domain of TatCel7A was crystallized, and two structures were determined: without ligand and with thio-cellotriose in the active site. Initial hydrolysis of bacterial cellulose was faster with TatCel7A than either ThaCel7A or TreCel7A. In synergistic saccharification of pretreated corn stover, both TatCel7A and ThaCel7A were more efficient than TreCel7A, although TatCel7A was more sensitive to thermal inactivation. Structural analyses and molecular dynamics (MD) simulations were performed to elucidate important structure/function correlations. Moreover, reverse conservation analysis (RCA) of sequence diversity revealed divergent regions of interest located outside the cellulose-binding tunnel of Trichoderma spp. GH7 CBHs. Conclusions We hypothesize that the combination of loop motifs is the main determinant for the observed differences in Cel7A activity on cellulosic substrates. Fine-tuning of the loop flexibility appears to be an important evolutionary target in Trichoderma spp., a conclusion supported by the RCA data. Our results indicate that, for industrial use, it would be beneficial to combine loop motifs from TatCel7A with the thermostability features of TreCel7A. Furthermore, one region implicated in thermal unfolding is suggested as a primary target for protein engineering. Electronic supplementary material The online version of this article (10.1186/s13068-017-1006-7) contains supplementary material, which is available to authorized users.


Background
Cellulolytic fungi are responsible for the majority of the degradation of terrestrial plants, which in turn accounts for most of the Earth's biomass. In many of these fungi, the major secreted enzymes are glycoside hydrolase family 7 (GH7) cellobiohydrolases (CBH) [1]. GH7 CBHs are the workhorses of cellulose degradation and, thus, play a key role in the recycling of the biosphere. Lignocellulosic biomass is also by far the most abundant renewable carbon source available to humanity for transition from fossil-based to sustainable production of fuels and chemicals. As central as these enzymes are to biomass degradation, they have become the cornerstone of modern industrial enzyme formulations for biofuel processes [2]. As such, GH7 CBHs are the target of intense structural, mechanistic, and engineering studies [3][4][5][6][7][8][9].
The ascomycete fungus T. reesei is the predominant source of enzymes for industrial lignocellulosic ethanol production, largely because of the development of hyperproducing strains capable of secreting over 50 g/L of protein [2,10]. The major component, GH7 CBH from T. reesei (TreCel7A), constitutes nearly half of the total protein in the secretome [11]. TreCel7A is the most extensively studied of GH7s and serves as a model enzyme for GH7 CBHs. About one-third of the known GH7 members are bimodular, having a family 1 carbohydrate-binding module (CBM1) linked to the catalytic domain (CD) by a glycosylated, flexible peptide comprised of about 30 amino acids [12][13][14]. The first crystal structure of a GH7 catalytic domain (CD) was obtained from TreCel7A in 1994 [15], and the first structure of a fungal CBM1 was determined by NMR in 1989 [16].
Structurally, GH7 proteins share a β-jelly roll fold with two β-sheets packing face-to-face into a curved β-sandwich. Loop regions extend the edges of the β-sandwich to form a 45 Å-long groove along the entire catalytic domain. CBHs within GH7 are readily distinguished because several loops are further elongated, effectively enclosing the active site in a tunnel. This enables the CBHs to act processively along a cellulose chain and cleave off numerous cellobiose units before detachment from the substrate, which is believed to be key to their efficiency on highly crystalline cellulose [7]. Although often referred to as exoglucanases, CBHs are not true exo-enzymes, in the sense that they do not seem to be exclusively restricted to chain initiation by threading of a chain end through the tunnel. Experiments with TreCel7A and Phanerochaete chrysosporium Cel7D (Pch-Cel7D) reveal substantial ratios of endo-initiation (40-80%; [17]), suggesting that the tunnel-enclosing loops in these enzymes are sufficiently flexible to open occasionally.
Compared to other GH families, the degree of conservation of GH7 CBHs through evolution is remarkably high. In addition to fungi, GH7 encoding genes are found in very distant branches of the eukaryotic tree of life, such as Amoebozoa, Oomycetes, Dinoflagellates and Crustaceans. GH7 encoding genes have not been found in any prokaryote so far [18,19]. The sequence identity is over 40% between organisms that diverged more than 1 billion years ago, suggesting that GH7 CBHs cannot accommodate a broad sequence space for primary function [18]. Differences are primarily found in loops and surface regions distant from the active site. However, there are also small variations in the length and sequence of loop regions along the cellulose-binding path that will affect the dynamics of loop regions and the accessibility of the active site [3,20,21]. These variations may in turn influence key enzymatic properties, such as processivity, product inhibition, endoinitiation and rate of substrate dissociation [8,17].
Trichoderma species have attracted attention as alternative enzyme sources [22][23][24][25], among other reasons. For example, whereas T. reesei is a weak mycoparasite and is adapted to a saprotrophic lifestyle as a wood degrader [26], most Trichoderma spp. are described as mycoparasitic fungi and several have garnered interest as powerful biocontrol agents (BCA) against pathogenic fungi [27]. Such BCA fungi include T. harzianum and T. atroviride. These fungi have a cosmopolitan distribution and are commonly found in soil in both tropical and temperate climates. Both are considered to have a broad ecological opportunistic lifestyle, where they can live as saprotrophs of dead organic matter (plant, fungal, and animal) but also interact with plants and other fungi as mutualistic symbionts and necrotrophic mycoparasites [28]. They are widely studied for their capacity to produce antibiotics, parasitize other fungi, and control diseases caused by plant pathogenic microorganisms [29]. The cellulolytic secretomes of both T. harzianum and T. atroviride have higher β-glucosidase activity and are competitive with that of T. reesei on lignocellulose [22,24,30].
The GH7 CBHs of T. harzianum (ThaCel7A) and T. atroviride (TatCel7A) share the GH7_CD-linker-CBM1 bimodular organization and are very similar to each other and to TreCel7A, with ~ 80% pairwise sequence identities. Whereas the characterization of TatCel7A has not previously been reported, the ThaCel7A enzyme has previously been isolated and characterized, and the crystal structure has been determined [21]. MD simulations demonstrated that a single mutation (Tyr371 in TreCel7A to Ala in ThaCel7A) at the tip of one loop drastically increased the flexibility of an opposing loop across the active site and, thereby, the solvent exposure of the catalytic center [21].
In contrast to distantly related homologs, GH7 CBHs from closely related species have obtained a limited number of evolutionary-driven mutations. The limited set of differences between the enzymes can give important insights to correlate sequence differences with enzyme function and performance. The ThaCel7A and TreCel7A enzymes can be regarded as naturally occurring variants of TatCel7A, particularly in terms of the combination of tunnel-enclosing loop motifs. With this view, TatCel7A is 'intermediate' between TreCel7A and ThaCel7A, combining loop motifs present in either ThaCel7A or TreCel7A. In this study, we report the biochemical and structural characterization of TatCel7A. We further compare three cellobiohydrolases, TatCel7A, ThaCel7A, and TreCel7A, side-by-side using enzyme activity and performance assays, structural analysis, MD simulation, and reversed conservation analysis (RCA) to correlate differences in sequence with differences in function.

Preparation of Cel7A enzymes
The Cel7A enzymes were purified from culture filtrates of T. atroviride IOC 4503, T. harzianum IOC 3844, and T. reesei QM9414 grown in submerged culture under cellulase inducing conditions. Extracellular protein production appeared to be slightly lower for T. atroviride than T. harzianum, and less protein was obtained than from T. reesei. In all cases, Cel7A is the major protein (see Additional file 1: Figure S1; [11,21]). The yield of purified enzyme per liter of culture was 70 mg for TatCel7A and 85 mg for ThaCel7A, compared to typical yields of 200-700 mg/L TreCel7A from T. reesei QM9414 [31]. Partial proteolysis with papain [3] could be used to remove the CBM-linker portion from the full-length enzyme and prepare isolated catalytic domains, TatCel7A_CD, ThaCel7A_CD and TreCel7A_CD.

Temperature and pH dependence of enzyme activity and stability
The soluble chromogenic substrate p-nitrophenyl-βlactoside (pNP-Lac) was used to monitor and compare the dependence of activity and stability on temperature and pH for the catalytic domains of the Cel7 enzymes. The specific activity of TatCel7A_CD on pNP-Lac is less than half compared to ThaCel7A_CD and TreCel7A_CD. The pH profiles are rather similar with a pH optimum around pH 4.0-4.5, although ThaCel7A_CD seems to exhibit a slight shift in the alkaline direction and more pronounced drop in activity below pH 4 (Fig. 1a). All three enzymes are essentially inactive from pH 7 and upward. To assess pH dependence of irreversible inactivation, the enzymes were first incubated at different pHs (ranging from pH 3 to 9.5) for 20 h at 40 °C and then assayed at pH 4.5 for residual activity on pNP-Lac (Fig. 1b). All three enzymes retained full activity between pH 4 and pH 6 but showed a slight loss of activity at pH 3 and substantial loss from pH 7 and upward. The TreCel7A_CD seems to be slightly more stable at higher pHs than the other two enzymes. After 20 h incubation at 40 °C and pH 7, the activity dropped to 80, 72, and 52% for TreCel7A_CD, ThaCel7A_CD, and TatCel7A_CD, respectively, and at pH 8, to 43, 20, and 24%, respectively. The pronounced sensitivity of ThaCel7A to pH above neutral was observed already during initial attempts of purification. At one stage, the protein was exposed to pH 8, and the activity was lost over time. Consequently, the Error bars indicate the standard deviation of triplicate measurements purification procedures were adapted to avoid exposure of the Cel7A enzymes to pH > 6.
The temperature dependence plots show that Tat-Cel7A_CD has a lower optimum temperature, 55 °C, whereas the other enzymes exhibit highest activity at 60 °C (Fig. 2a). The observation that TatCel7A_CD is most temperature sensitive and TreCel7A_CD is most thermostable was confirmed by monitoring thermal inactivation over time (Fig. 2b-d). At 60 °C, TatCel7A_CD is inactivated after 30 min, while ThaCel7A_CD and TreCel7A_CD retain 30 and 90% activity, respectively, after 90 min. At 70 °C, all three enzymes were inactivated within minutes.

Enzyme kinetics and cellobiose inhibition
Kinetic properties on pNP-Lac and product inhibition by cellobiose was compared for TreCel7A_CD, Tat-Cel7A_CD, and ThaCel7A_CD (Table 1). In all cases, Michaelis-Menten kinetics apply, and cellobiose acts as a competitive inhibitor (see Additional file 1: Figure S2). The K M values are rather similar for all three, and k cat /K M is practically the same for TreCel7A_CD and ThaCel7A_ CD. However, the catalytic rate constant (k cat ) is significantly lower for TatCel7A_CD, and its catalytic efficiency (k cat /K M ) on pNP-Lac is only about 25% compared to the others. All three enzymes are highly sensitive to product inhibition, but ThaCel7A_CD and TatCel7A_CD exhibit Thereafter, the increase in pNP concentration was measured and divided by the incubation time (60 min) to yield the hydrolysis rates plotted on the y-axis. Error bars indicate the standard deviation of triplicate measurements somewhat weaker cellobiose binding with 2 and 3 times higher K i , respectively, than TreCel7A_CD (Table 1).

Initial cellulose hydrolysis
The initial production of cellobiose from bacterial microcrystalline cellulose (BMCC) was monitored in real time using an amperometric cellobiose dehydrogenase (CDH) enzyme biosensor [32,33]. Figure 3 shows the progress curves for full-length TatCel7A, TreCel7A, and ThaCel7A and the TatCel7A_CD and TreCel7A_CD catalytic domains.
The experimental data in Fig. 3 were fit to a processive model [33,34]. The model consists of three distinct steps: enzyme-substrate association, processive catalysis, and enzyme dissociation, governed by the rate constants k on , k cat , and k off , respectively. Further, the model contains an apparent processivity parameter, n, which represents the average number of sequential catalytic cycles. For further details, refer to Additional file 1. The kinetic parameters derived from non-linear regression are given in Table 2. The apparent processivity of all the analyzed enzymes was similar; however, TatCel7A exhibited 20% higher apparent processivity than ThaCel7A and 9% higher than TreCel7A. The catalytic domains have approximately the same processivity number as the corresponding full-length variants. The main difference in function between the full length enzymes (TreCel7A and TatCel7A) and catalytic domains (TreCel7A_CD and Tat-Cel7A_CD) manifested in k on , which was significantly lower for each of the catalytic domains. When comparing full-length enzymes, the kinetic parameters were similar, except for a significantly higher k cat for TatCel7A. Since all the enzymes were purified from the native host, a test of endoglucanase (EG) activity in the CBH samples was performed, using AZCL-HE-cellulose (Megazyme) as substrate. The estimated amount of EG was negligible and was not considered to affect the kinetic parameters.

Enzyme performance on pretreated biomass
The efficiency of the Cel7A enzymes in synergistic lignocellulose saccharification was assessed by performance assays on dilute acid-pretreated corn stover (PCS) as substrate. Full-length GH7 CBHs, together with a GH7 endoglucanase (Trichoderma longibrachiatum Cel7B/EG I) and a β-glucosidase, were incubated with PCS at pH 5 and 40 °C, and the release of soluble sugar was followed for 96 h (Fig. 4). The highest conversion was obtained with TatCel7A, closely followed by ThaCel7A, both of which appeared more efficient than TreCel7A. The conversion after 47 h was 84, 83, and 76% for TatCel7A, ThaCel7A, and TreCel7A, respectively.

Sequence comparison of Trichoderma spp. Cel7A orthologs
The protein sequences of TatCel7A, TreCel7A, and ThaCel7A all contain a signal peptide for secretion Table 1 Enzyme kinetics parameters with pNP-Lac as substrate and inhibition constants for cellobiose, at pH 4.5 and 30 °C K i is the competitive inhibition constant with 100 µM cellobiose in the reactions. The RMSD between calculated and experimental reaction rates was 3.4, 4.0, and 2.4% for TreCel7A_CD, TatCel7A_CD, and ThaCel7A_CD, respectively  pathway targeting in their N-termini, followed by a GH7 catalytic domain and a C-terminal, fungal-type cellulose-binding module (CBM1). The linker region is significantly shorter in TatCel7A (22 residues), containing mostly glycine residues, whereas TreCel7A has the longest linker (30 residues) of the three enzymes (Fig. 5). A structure-based sequence alignment (excluding the signal peptide) confirms that the sequence identity is high (~ 80%; Fig. 5). All previously described loop regions are conserved in length [20], with the exception of loop A1, where three residues at the tip of the loop are missing in ThaCel7A compared to TreCel7A and TatCel7A.
In addition, there are three more indels in the catalytic domains, represented by one residue deletion (between Gly298 and Ile299) and one residue insertion (Gly317) in TatCel7A, and one residue deletion in ThaCel7A (Ser24 in TatCel7A). Two N-glycosylation sites are conserved in all three enzymes (i.e., Asn270 and Asn384 in TatCel7A).

Crystal structures of TatCel7A_CD
The TatCel7A_CD protein was successfully crystallized, and two structures were solved, one apo structure without sugars bound (APO) and one thio-cellotrioside complex (SG3). Both structures were refined at 1.7 Å resolution in space group P21 with two protein chains, A and B, in the asymmetric unit. X-ray diffraction data and structure refinement statistics are summarized in Table 3.
The APO structure was obtained from co-crystallization of TatCel7A_CD with thio-linked cellobiose, but no cellobiose is seen in the structure. In the SG3 structure, from co-crystallization with thio-linked cellotriose, there are two cellotrioside molecules bound to each protein chain, in subsites − 6/− 5/− 4 and + 1/+ 2/+ 3, respectively (Figs. 6, 7). In both protein chains of both structures, all amino acids from 1 to 430 of TatCel7A_CD could be included in the structure model. The N-terminal glutamine residue is cyclized to pyroglutamate (PCA1), the C-terminal residue is Gly430, and all the 20 cysteines form disulfide bridges. N-glycosylation is visible at one site, with one GlcNAc residue attached to Asn270. In the APO structure, there is distinct density for a Bis-Tris molecule bound to the catalytic residues, Glu212, Asp214, and Glu271, at the catalytic center of the active site, whereas glycerol is found in a similar position in the SG3 structure.
The position and conformation of the thio-cellotrioside ligands in the SG3 structure are well defined by the electron density, as shown in Fig. 7. For the ligand in the − 6/− 5/− 4 position at the tunnel entrance, all glucose residues adopt the 4 C 1 chair conformation, but the C1 hydroxyl at the reducing end in subsite − 4 is predominantly in the α-position, with very weak density for a β-hydroxyl. Interestingly, the sugar ring at each site is flipped upside down compared to the orientation in the Michaelis complex of TreCel7A (Fig. 7a). The binding may represent a sliding intermediate during processive cellulose hydrolysis. The flipped orientation could also be a consequence of the slight difference in geometry of the thio-ether linkage compared to that of a standard O-glycosidic bond. On the other hand, another Cel7A structure (4ZZT) shows two thio-cellotrioside molecules bound in subsites − 4/− 3/− 2 and − 1/+ 1/+ 2 in the normal orientation (Fig. 8). The other thio-cellotrioside molecule in the SG3 structure, at + 1/+ 2/+ 3, is in register and aligns well with the glycan binding at the product sites of the TreCel7A Michaelis complex. The + 1 and + 2 glucosides are in 4 C 1 chair conformations, whereas the + 3 unit at the reducing end adopts a 1 S 3 skew conformation, again with α-hydroxyl predominance at the anomeric carbon. The sugar ring distortion is not induced by crystal contacts, since there are no interactions with any neighbor protein in this region.
Overall, the TatCel7A_CD structures are very similar (0.18 Å root mean square deviation, RMSD), although there is a general 'tightening' of the protein around the active site in SG3 compared to APO, which is most pronounced at the A4-loop near the product sites (Fig. 9). The fold is very similar to that of TreCel7A_CD and ThaCel7A_CD (RMSD 0.54 and 0.44 Å, respectively), as Performance assay on pretreated corn stover (PCS). Synergistic conversion of PCS (5.0 g glucan/L) to soluble sugar at 40 °C and pH 5.0 was monitored using full-length GH7 CBH enzymes (~ 2.5 µM), together with a GH7 endoglucanase and a β-glucosidase (28, 1.9, and 0.5 mg enzyme per gram glucan, respectively). In addition to the three Trichoderma CBHs, Cel7A from Scytalidium sp. (ScyCel7A; identical to the enzyme called Geotrichum candidum Cel7A in [3]) and Cel7D from Phanerochaete chrysosporium were also analyzed at the same time; the results are shown for comparison. Experiments were performed in duplicate expected from the high sequence identity (80 and 82%, respectively).
The lining of the cellulose-binding path is identical in the three enzymes except at two locations, loop A1 at the entrance to the tunnel and loop A3 near the catalytic center (Fig. 6). In loop A1, Glu101 in TreCel7A binds the 6-hydroxyl of the glucose unit in subsite − 6. This residue is replaced by a shorter sidechain, Asn101, in TatCel7A. Nevertheless, Asn101 may still bind to the cellulose chain but probably results in a weaker interaction. In ThaCel7A, a corresponding interaction is completely absent, since the tip of the A1 loop is three residues shorter (99-101 in TatCel7A and TreCel7A) [21]. The shorter A1 loop does not reach over subsite − 6, making the entrance to the tunnel more open in ThaCel7A ( Fig. 6). At the second location, loop A3, Tyr371 in TreCel7A interacts with Tyr247 at the tip of the opposing B3 loop. Tyr371 is replaced by an alanine in both Tat-Cel7A and ThaCel7A, and there are no direct interactions between loops A3 and B3 across the tunnel. Also, loop B3 has moved outwards in TatCel7A and ThaCel7A compared to the TreCel7A structure (5.5 Å distance between Tyr247 OH of TatCel7A and TreCel7A). At the next position in loop A3, Ala372 of TatCel7A and TreCel7A is replaced by a valine in ThaCel7A, which appears to influence the dynamics of the adjacent A4 loop near the product site.
The overall backbone structures of the three enzymes are very similar, apart from small deviations in loops and turns at the surface of the protein. However, there is one Fig. 5 Alignment of the GH7 TreCel7A, ThaCel7A, and TatCel7A protein sequences. Secondary structural elements of the TreCel7A structure are indicated above the alignment (β-strand arrows and α-helices) based on the 3D structure of the catalytic domain (1CEL). Strictly identical residues are marked in white letters on a black background. Regions of conserved, highly similar residues are framed in thin-lined boxes with bold letters. The figure was prepared using the ESPript web server with default parameters (http://espript.ibcp.fr; [82]). Red frames indicate loop regions of interest, with loop nomenclature underneath. The regions highlighted in blue denote N-glycosylation motifs; the colored triangles indicate the N-glycosylated asparagine residues observed in structures, where gray triangles correspond to TreCel7A, green triangles to ThaCel7A, and the blue triangle to TatCel7A. Sections of interest defined by RCA (reverse conservation analysis) are marked with blue lines and corresponding identifiers (I-IV), and residues with high S scores are marked in yellow small, but significant, difference in TatCel7A that locally affects protein folding. A one-residue insertion, Gly317, is present in the region 316-319 (section III in the sequence alignment, Figs. 5,9), where there is a β-strand-turn in TreCel7A and ThaCel7A, which supports the A4 loop at the product end of the active site. Also, a glutamine (TreCel7A) or glutamic acid (ThaCel7A) is substituted by Ser316 in TatCel7A, thus introducing a shorter side chain. This substitution, followed by the glycine insertion, disrupts β-strand interactions, and the 315-317 residues bulge outwards in TatCel7A. Furthermore, Ser316 introduces an N-glycosylation motif (at Asn314) not present in the two other enzymes (Fig. 5). This glycosylation site is close in space to the Asn270 N-glycosylation. Although glycans potentially attached to Asn314 could not be observed in the TatCel7A structures, a glycan present at this site would be in direct contact with the one at Asn270.
Overall, TatCel7A appears to have fewer secondary structure interactions compared to TreCel7A and ThaCel7A, with shorter β-strands and α-helixes at several locations (see overlay of the structures in Fig. 9). This suggestion is corroborated by a lower number of total native contacts found from the MD simulations (see below).

Molecular dynamics (MD)
We performed MD simulations to understand how structural differences in TatCel7A_CD, TreCel7A_CD, and ThaCel7A_CD manifest in protein dynamics. We also examined the effect of bound substrate on protein dynamics for each of the three cellulases, conducting    12). TatCel7A is stabilized when complexed with the cellulose microfibril, which is indicated by a decrease in overall fluctuations. The decrease in overall fluctuation is much less pronounced in the other two enzymes (Figs. 11,12). Inside the binding tunnel, at subsites − 5 to − 2, the cellononaose ligand and the chain of the microfibril fluctuate very little in either of the three cellulases; however, towards the ends of the active site, subsites − 7/− 6 and + 1/+ 2, the ligands naturally fluctuate more, as the ligands are more solvent exposed (Fig. 11d). Higher RMSF in ThaCel7A subsites − 7/− 6 correlate well with the shorter A1 loop, which makes the entrance to the tunnel wider in this enzyme. Interestingly, this difference is only seen in the simulations with the cellononaose ligand. When complexed with the cellulose microfibril, the ligand fluctuations are nearly the same for the three enzymes along the length of the active site. Throughout all of the MD simulations, either in presence of the crystalline cellulose or while complexed with the cellononaose oligomer in solution, the reducing end of the ligand (+ 2 site) in the active site remained in the β-anomeric configuration. This configuration arises as a requirement of the implemented carbohydrate force field used for all of our MD simulations, which restricts the While it is feasible that the pyranose rings could temporarily occupy an α-anomeric configuration, the energy barrier to do so is quite large.
Despite high sequence and structural similarity, MD simulations reveal distinct differences in terms of loop dynamics between TatCel7A, ThaCel7A, and TreCel7A, illustrated by histograms of distances between tunnelenclosing loops, A3-B2, A3-B3, and B2-B3, respectively,  Illustration of cellulose active site occupancies examined using MD simulation. Three simulation cases were conducted for each of the three cellulose catalytic domains, including the ligand-free state (no ligand), the cellononaose-bound state (ligand), and the cellulose Iβ microfibrilbound state (microfibril). Each simulation was conducted in the NVT ensemble at 300 K for 100 ns using explicit solvent (not shown for clarity). The catalytic domain of the protein is shown in gray cartoon. Cellononaose and the cellulose microfibril are shown in red and yellow stick over the course of simulation (Fig. 13). Most notably, the active site loops of TreCel7A appear to be significantly more stationary than either ThaCel7A or TatCel7A. One conformational state is strongly preferred, as demonstrated by small fluctuations (within 1-2 Å) of the inter-loop distances around a single peak. Both the A3/ B2 loops and the A3/B3 loops, ' A' and 'B' designating opposite sides of the active site, remain in direct contact during most of the simulation. In this configuration, the tunnel is physically closed, in the sense that a cellulose chain would only be able to enter or exit through either end of the active site and not "sideways". Occasionally, the loops do separate enough (> 6 Å) to allow a cellulose chain to pass (Fig. 13a).
The TatCel7A and ThaCel7A enzymes behave similar to TreCel7A with respect to the A3/B2 loop distances, exhibiting small fluctuations around a single peak. However, the A3/B3 and B2/B3 loop distances are much more variable than in TreCel7A, likely due to the increased flexibility of loop B3 over B2. In TatCel7A the A3/B3 loop distance was most frequently between 3.5 and 4 Å, though the distance could also hover between 5 and 7 Å, indicating that loop B3 moves smoothly between a closed and an open conformation, without any major energy barrier (Fig. 13b). The B3 loop behavior is similar regardless of the presence or absence of ligand/microfibril in the active site, which is a contrast to the B3 loop behavior in TreCel7A where the loop is most often in a closed conformation. Yet another behavior was observed in ThaCel7A, where the B3 loop exhibits a bimodal distribution in the absence of ligand; the B3 loop seemingly flips between two closed conformations. The A3/B3 distance is very short for the primary conformation. With a bound ligand or microfibril, there is no evidence of the short-distance state, and loop B3 appears to fluctuate over a larger range of more open conformations.
To monitor and compare the unfolding process of the Cel7A proteins, MD simulations were also conducted at TreCel7A was reported previously and is shown here for comparison [3] elevated temperature (475 K) for 15 ns, which was a sufficient length to observe initial unfolding events. From these simulations, we determined the total number of native contacts formed within the protein as a function of time and compared to the number of native contacts formed at 300 K (Fig. 14). As expected, the total number of native contacts formed in each cellulase was roughly constant at 300 K, i.e., not unfolding. TatCel7A exhibits a lower number of native contacts than either TreCel7A or ThaCel7A (Fig. 14). When the temperature was elevated to 475 K, the number of contacts decreased at about the same rate for all three proteins as they unfolded, suggesting that they are equally sensitive to thermal unfolding.
Additional movie files show the initial unfolding of the protein structure during the 475 K simulations (see Additional files 2, 3 and, 4 for TatCel7A, TreCel7A, and ThaCel7A, respectively). One region, residues 380-410 containing loops A4 and A2, was among the first parts of the protein to unfold, suggesting that this region may be an important 'hotspot' for initiation of protein unfolding.
In loop A2, there is a short, surface exposed α-helix that maintained the helical structure longer in ThaCel7A than in either TreCel7A or TatCel7A, suggesting higher regional thermal stability around loop A2 in ThaCel7A. Another interesting observation is that the B3 loop in TreCel7A and ThaCel7A (not TatCel7A) transiently flipped ~ 180° and adopted a conformation where the tip of the loop pointed towards subsite + 2, similar to the conformation observed in the structure of Cel7A from Humicola grisea var. thermoidea (PDB code 4CSI; [5]).

Molecular evolution of Cel7A
When comparing closely related orthologs, amino acid residues that modulate functional properties of an enzyme are expected to display higher diversity than other positions due to adaptation [35]. Therefore, distribution of amino acid variation was analyzed using RCA [35] of GH7 CBH sequences from two groups of related fungi within the order Hypocreales: Trichoderma spp. (11 sequences) versus Fusarium spp. and Clonostachys rosea Fig. 12 RMSF comparison of the protein backbone at 300 K. The backbone is colored by a gradient from blue to white to red, representing lower to higher fluctuations. Red regions indicate larger fluctuations (RMSF > 4 Å). The cellooligosaccharide chain is colored by atom; oxygen is red, and carbon is cyan (6 sequences) (Additional file 1: Figure S8). The orthologous status of the selected sequences was confirmed by a phylogenetic analysis (Additional file 1: Figure S9). We specifically analyzed the alignment for regions displaying signs of type 1 functional divergence (i.e., site conserved in one lineage but variable in the other [36]) in Trichoderma spp. Four sections were identified in the Cel7A alignment-I, II, III and IV ( Fig. 12; (Additional file 1: Table S3)-that fulfilled the criteria (W mean score ≥ 1 in Trichoderma spp. and W mean score ≤ 1 in Fusarium spp.). All sites in sections I, II and III are located at the surface of the protein (Fig. 9). Section I is at a β-turn at one edge of the outer β-sheet, near the attachment of the linker. Section II comprises a short β-strand followed by a turn before loop B3 in the sequence and is located at the interface where the B2 and B3 loops are anchored. Section III includes the Gly317 insertion near the product-binding region mentioned above. Section IV includes an amphiphilic α-helix, where one side is buried; interestingly, three of the residue positions that display high amino acid diversity [S score ≥ 1, Met352, Ala355, Leu356 in TatCel7A, (Additional file 1: Table S3)] point into the hydrophobic core, just underneath the β-strand that carries the catalytic residues. In comparison with the Cel7A alignment of C. rosea and five Fusarium species, these sections (I, II, III, and IV) display signs of type 1 functional divergence, with the position conserved in Fusarium spp. but variable in Trichoderma spp. (Fig. 15) [36]. Figure 15 shows the W mean scores from RCA for Trichoderma spp. and Fusarium spp., plotted against residue number (in TatCel7A) (Fig. 11). The A4 loop is located at the exit of the tunnel and may affect release of the product, having contacts with section III (Fig. 9). Also noteworthy, loop A4 carries an N-glycosylation site near the product sites that is conserved in all three enzymes. Glycosylation at this site has been observed in structures of TreCel7A (Asn384) and ThaCel7A (Asn380) but was not visible in the TatCel7A structures (Asn384).

Discussion
Based on observations from initial hydrolysis of BMCC, synergistic conversion of pretreated biomass, and MD simulation, we suggest that the combination of the A1 and A3 loop motifs is the primary determinant for the observed differences in activity on cellulosic substrates. Initial hydrolysis of BMCC was most rapid with Tat-Cel7A followed by TreCel7A and then ThaCel7A, suggesting that the longer A1 loop, together with the weaker Fig. 13 Active site loop distances from molecular simulation. Histograms of the minimum distances between a loops A3 and B2, b loops A3 and B3, and c loops B2 and B3, evaluated from 100-ns simulation trajectories of TreCel7A, ThaCel7A, and TatCel7A A3-B3 loop interaction (Tyr-Ala rather than Tyr-Tyr at the tip of loop A3), may be superior under these experimental conditions. Polikarpov et al. showed that deletion of three residues at the tip of the A1 loop in ThaCel7A makes the entrance to the tunnel more open, and the replacement of one Tyr with Ala at the tip of loop A3 (relative to TreCel7A) increases the flexibility of the opposing B3 loop [21]. Our MD simulations of TreCel7A and ThaCel7A, conducted here for comparison to Tat-Cel7A, are in good agreement with those results. As with ThaCel7A, the B3 loop of TatCel7A is more flexible and opens more frequently than in TreCel7A (Fig. 13). ThaCel7A and TatCel7A share A3 loop features, whereas the A1 loop is similar in TatCel7A and TreCel7A. Thus, the most likely major determinant of the observed functional differences in initial crystalline cellulose hydrolysis is the longer A1 loop present in TatCel7A and TreCel7A.
Although TatCel7A showed higher k cat in initial hydrolysis of BMCC, the processive model kinetic parameter fit to the progress curves did not reveal clear differences that could be readily correlated with protein structure and dynamics. The apparent processivity values  are in the same range, but somewhat higher than found by alternative methods (66)(67)(68)(69)(70) [17,37,38]. Also, both k on and k off values are higher in our case. However, the , and ThaCel7A at 300 and 475 K. The total number of native contacts was determined as an average of three independent MD simulations at two temperatures, 300 K (solid lines) and 475 K (dashed lines). The high temperature simulations were performed for 15 ns, whereas the triplicate 300 K simulations were conducted for 50 ns; only 15 ns of the 300 K trajectories are shown here for comparison. In each case, the simulation was conducted without a ligand, in explicit solvent. To determine the total number of native contacts of each trajectory, the number of native contacts formed by each residue was first evaluated. Here, a native contact was defined as any amino acid whose side chain center of geometry was within 6.5 Å of the reference amino residue's Cα. The total number of native contacts is then the sum of the native contacts formed by all residues in the protein  Fig. 5). The scale for W mean scores from RCA analysis is on the right side of the graph. Sections of interest defined by RCA (reverse conservation analysis) are marked with blue lines and corresponding identifiers (I-IV), and residues with high S scores are marked in yellow studies cited above employed other cellulose substrate preparations (e.g., reduced bacterial cellulose and Avicel) and may not be directly comparable. In our case, the k off value is comparable or lower for TreCel7A_CD compared to the full-length enzyme. This is in contrast to Kont et al. who reported the opposite [38]. However, in a study by Cruys-Bagger et al. [32,33], similar k off values were found for TreCel7A and TreCel7A_CD. In that study, the authors suggested that the main energy barrier for dissociation is the release of the cellulose strand from the catalytic domain. In our current study, we find a slightly lower k off value for the examined enzymes without a linker and CBM. We cannot explain this observation, but suspect that it may be related to the low DP of the substrate. With an estimated DP of around 120 glucose units and an apparent processivity of 70-90, the enzyme would, in most cases, 'fall off ' when the entire cellulose chain has been hydrolyzed, rather than dissociating from the chain along the process.
When comparing the performance of the Cel7s in synergistic conversion of pretreated biomass to soluble sugar, both TatCel7A and ThaCel7A gave higher yields of soluble sugar than TreCel7A, indicating that weaker A3-B3 interaction and, hence, higher B3 loop flexibility is beneficial to conversion. The length of the A1 loop may be of less importance, although TatCel7A, with the longer A1 loop, appears to be slightly more efficient than ThaCel7A. Though our data implicates loops A1 and A3 in variable efficiency, we cannot rule out that other differences between the three enzymes may also influence their performance, such as linker length, glycosylation, or other residue substitutions that may affect the protein dynamics.
The B3 loop is anchored by disulfide bridges at both ends (Cys243 and Cys256 in TatCel7A) and is almost identical in sequence in the three enzymes, except for Asp-Asn conservative replacements at positions 249 and 250. In TatCel7A (and similarly in ThaCel7A), Asn249 hydrogen bonds to the nearby Asp241, which would stabilize the loop and reduce fluctuations. Asn249 is replaced by Asp249 in TreCel7A, which, in some structures, forms a short distance, low-barrier hydrogen bond with Asp241 called an acid pair [39]. Such acid pair interactions are pH dependent and more distance restrained [40], which may contribute further to the restriction of B3 loop mobility in TreCel7A. Interestingly, the variable RCA defined section II before loop B3 includes the hydrogen-bonding partner, Asp241, and the nearby Ser239. The latter is replaced by Glu239 in TreCel7A, which is stabilized in turn by metal ion coordination together with His206. This may indicate that fine-tuning of B3 loop flexibility represents an important evolutionary target in Trichoderma spp. Cel7 proteins.
We were surprised to find that TatCel7A exhibits significantly lower activity against pNP-Lac, while it was about the same for TreCel7A and ThaCel7A. The enzyme kinetics results show that this is mainly due to a significantly lower k cat (Table 1). Also, the K M value is slightly higher, giving a catalytic efficiency (k cat /K M ) for TatCel7A of only about 25% compared to TreCel7A and ThaCel7A. No obvious clues are evident from structural comparison, though, as to why that is the case. The three structures are practically identical at the subsites (− 2/− 1/+ 1) where pNP-Lac should bind for hydrolysis. However, pNP-Lac is an artificial chromogenic model substrate and may be a poor representative of function in Nature. Interestingly, a similar discrepancy in pNP-Lac activity has been reported previously for two close GH7 CBH orthologs from Amoebozoa [18]. Cel7A from Dictyostelium discoideum exhibited lower thermal stability and about half of the specific activity against pNP-Lac compared to D. purpureum Cel7A, despite 80% sequence identity.
The three enzymes showed similar pH dependence, with activity optimum around pH 4.5 and sensitivity to inactivation above neutral pH. This indicates that all the three species, T. atroviride, T. reesei, and T. harzianum, are adapted to biomass degradation at rather acidic conditions, without strong evolutionary pressure on their Cel7A enzymes towards action at higher pHs.
TatCel7A appears to be more temperature sensitive than either TreCel7A or ThaCel7A, with a slightly lower temperature optimum and more rapid irreversible inactivation at elevated temperature. This is likely a function of fewer secondary structure interactions in TatCel7A relative to TreCel7A and ThaCel7A, as observed by structural comparison and a lower number of native contacts found in the MD simulations. In particular, the A2-A4 region that appears to be a hotspot for initiation of unfolding seems to unfold faster in TatCel7A in the high-temperature MD simulations (see Additional file 2: Movie S1). Notably, though, the two regions on the backside of the protein where TatCel7A deviates structurally, i.e., near the linker attachment (13)(14)(15)(16)(17)(28)(29)(30) and around the Gly317 insertion (420-422), did not show any clear signs of unfolding more readily. Overall, the backside of the proteins remained remarkably stable throughout the high-temperature simulations, in contrast to large mobility of the extended loops along the active site.
The higher yield of soluble sugar obtained for TatCel7A vs. TreCel7A in the experiments on pretreated biomass suggests that this enzyme may be useful for industrial conversion of biomass. The lower temperature stability could be addressed by engineering a more stable variant inspired by TreCel7A or any other more thermostable GH7 CBH [5,9]. The improvement of thermal stability of TreCel7A by directed evolution has recently been reported, where the most stable variant contains 18 mutations and exhibited a 10.4 °C increase in protein melting temperature [41]. Based on that study and the results herein, we propose that the primary region to target would be the A2-A4 region in order to stabilize the α-helix of the A2 loop while taking into account product -enzyme interactions at the exit of the tunnel. It should be noted, though, that irreversible inactivation depends not only on protein unfolding, but also on the exposure and aggregation of hydrophobic regions of the protein, which is difficult to predict.

Conclusions
We have determined the three-dimensional structure and analyzed the properties of TatCel7A, the major secreted protein from T. atroviride, and compared these results to the close orthologs: ThaCel7A and TreCel7A. All three proteins are very similar in sequence, structure, and several other aspects, yet, subtle differences are manifested in terms of stability, activity, and protein dynamics. Such differences, for example, in initial hydrolysis rates of BMCC and synergistic conversion of pretreated biomass, may lead to significant effects in the large-scale process applied for biomass conversion.

Preparation of Trichoderma Cel7 enzymes
The fungal strains T. atroviride IOC 4503 and T. harzianum IOC 3844 were obtained from the Culture Collection of Filamentous Fungi at the Oswaldo Cruz Institute (CCFF/IOC) in Brazil. They were grown on potato dextrose agar plates at 25 °C until dense sporulation developed (about 1 week) to produce fresh spores for culture inoculation. Submerged cultivation in distiller's spent grain medium [42] with 1% w/v Avicel cellulose as a carbon source was undertaken for 6 days at 30 °C in a rotary incubator at 80 rpm; the cultivation took place in 2.8 L side-baffled Fernbach flasks (Bellco Glass Inc., Vineland, NJ, USA), each with 0.6 L medium containing: 6 g dry distillers spent grain, 9 g KH 2 PO 4 , 3 g (NH 4 ) 2 PO 4 , 0.36 g MgSO 4 , and 0.36 g CaSO 4 . The pH was measured daily. On day 2, the pH dropped to around pH 3.5-3.8 for both fungi and was adjusted to pH 5 by addition of 2 g K 2 HPO 4 to each flask. Upon harvest, the cultures were filtrated on Whatman GF/B glass fiber filters (~ 1 µm pore size) followed by 0.45 and 0.2 µm sterile filtration.
The culture filtrate was desalted on Bio-Gel P-6DG (BioRad; 500 mL column) to 10 mM potassium phosphate buffer, pH 6.0, then applied to a DEAE Sepharose Fast Flow column (GE Healthcare; CV = 200 mL) and eluted with a gradient up to 0.5 M NaCl in the same buffer. Fractions containing pNP-Lac activity were pooled, desalted, and applied to a SOURCE 30Q column (GE Healthcare; CV = 25 mL) eluted with a 10-500 mM potassium phosphate, pH 6.0, gradient. Fractions with activity against pNP-Lac were collected and subjected to SDS-PAGE analysis to estimate the purity of the Cel7 protein. The yield of purified enzyme per liter of culture was 70 mg for TatCel7A and 85 mg for ThaCel7A.
TatCel7A_CD used for crystallization was prepared from the T. atroviride strain IMI 206040, kindly donated by Dr. Alexander Golubev (Petersburg Nuclear Physics Institute, Gatchina, Russia). Cultivation, protein purification, domain cleavage with papain and enzymatic N-deglycosylation were performed as previously described [3]. The solved crystal structure confirms that the protein sequence is identical to that of TatCel7A from T. atroviride strain IOC 4503, at least in the catalytic domain.
For all TreCel7A experiments except the PCS hydrolysis experiments, TreCel7A was obtained from T. reesei strain QM9414 as described [31,43]. For the PCS hydrolysis experiments, TreCel7A was recombinantly produced in the T. reesei AST1116 constitutive expression system and purified to homogeneity as detailed in [44].
For preparation of the Cel7 catalytic domains, the CBM-linker portion of the full-length enzymes were removed by partial proteolysis using papain as previously described [3], followed by size-exclusion chromatography on a HiLoad Superdex 75 16

Temperature and pH dependence, enzyme kinetics and cellobiose inhibition
Hydrolytic activity measurements were carried out in triplicate in 96-well microtiter plates using pNP-Lac as substrate. Reaction mixtures of 150 µL contained 50 mM buffer (pH 3-7, phosphate-citrate; pH 7-8, potassium phosphate; pH 8-9, sodium borate), 2 mM of pNP-Lac and 0.15 µM of the enzyme (TreCel7A_CD, ThaCel7A_ CD, or TatCel7A_CD). The reaction was quenched by adding 150 µL of 0.5 M sodium carbonate, followed by measurement of absorbance at 405 nm using an Eon Multiplate Reader. The rate of pNP release was calculated using an extinction coefficient of 18.3 mM −1 cm −1 .
The pH dependence of hydrolytic activity was determined in the range of pH 3.0-8.0. The reactions were incubated at 30 °C for 30 min. In pH stability experiments, the enzymes were pre-incubated at 40 °C at pHs from pH 3.0-9.5 for 20 h, followed by pNP-Lac activity measurement at 30 °C and pH 4.5 using a 30-min incubation.
For temperature dependence of activity, the reactions were incubated in the temperature range of 20-75 °C for 1 h at pH 4.5. The reaction components were pre-cooled and mixed on ice, then transferred into the thermostat equilibrated at the desired temperature. For assessment of thermal inactivation, the enzymes were pre-incubated at 60, 65, and 70 °C at pH 4.5. Aliquots were taken at indicated time points up to 90 min and cooled on ice, followed by determination of residual hydrolytic activity against pNP-Lac at 30 °C, pH 4.5, and 1 h incubation time.
Experiments for determination of enzyme kinetics parameters V max and K M for pNP-Lac as substrate and inhibition constants K i for cellobiose were done in 96-well microtiter plates as described above. Reaction mixtures containing TreCel7A_CD, ThaCel7A_CD, or TatCel7A_CD (0.12, 0.22, 0.12 µM, respectively), 50 mM sodium phosphate citrate buffer, pH 4.5, and pNP-Lac at 0.1, 0.2, 0.4, 0.67, 1.2, 2, 3, 4, 5, and 6.7 mM concentration, without and with 100 µM cellobiose, were incubated for 1 h at 30 °C. Nonlinear regression fitting was accomplished using the Excel Solver add-in (Microsoft, Richmond, WA, USA). Weighted squared residuals were calculated for each data point using a statistical weighting scheme, [(v obs − v calc ) 2 /v calc ], where v obs is the observed reaction rate, and v calc is the rate calculated from kinetic parameters (V max , K M , K i ). The kinetic parameters were fit towards the minimized sum of residuals using the GRG nonlinear solving method within Solver. Mixed inhibition was first evaluated. In all cases, the uncompetitive K i was more than an order of magnitude higher than the competitive K i , indicating that cellobiose acted as a competitive inhibitor. Therefore, the final values shown in Table 1 were derived by fitting the data to the Michaelis-Menten expression for competitive inhibition (see Additional file 1: Figure S2). The RMSD between v calc and v obs was used as indicator of experimental error (3.4, 4.0 and 2.4% for TreCel7A_CD, TatCel7A_CD, and ThaCel7A_CD, respectively).

Initial hydrolysis of cellulose
The initial hydrolysis of cellulose was measured using Biosensor equipment at Roskilde University, Denmark. Bacterial microcrystalline cellulose (BMCC) from Acetobacter xylinum was prepared from bacterial cellulose (BC) extracted from commercially available Nata de Coco as described [45]. The degree of polymerization of such BMCC has been determined at 114 glucose units [45]. Hydrolysis of BMCC was monitored by cellobiose product formation. The concentration of cellobiose was measured in real time with cellobiose dehydrogenasemodified carbon paste electrodes as described in detail by Cruys-Bagger et al. [32,46]. The sensor had a response time and lower detection limit of 4 s and 60 nM, respectively. All reactions were carried out in 50 mM sodium acetate pH 5.0 at 25 °C with stirring. The reaction mixture contained 3.3 g/L of BMCC and 50 nM enzyme (TreCel7A, TreCel7A_CD, ThaCel7A, TatCel7A, and TatCel7A_CD). The experimental data (time interval 0-200 s) was fit to the processive model shown in Additional file 1: Figure S4A. The model consists of three rateconstants, k on , k cat , and k off , and an apparent processivity parameter, n. For further detail, see Additional file 1.

Pretreated corn stover (PCS) hydrolysis
Corn stover was harvested in 2009 in Hurley County, SD, USA, and was knife milled to pass a 19 mm (0.75 in) round screen and stored indoors in 200 kg lots at NREL (National Renewable Energy Laboratory, Golden, CO, USA). The compositional analysis of the native corn stover is given by Chen et al. [47]. Dilute acid pretreated corn stover (PCS) was prepared and analyzed by NREL standard laboratory analytical procedures [48], with PCS composed of 64.2% dry weight glucan. The PCS substrate was suspended in 20 mM sodium acetate buffer at pH 5.0. Digestions were conducted at 40 °C in high-performance liquid chromatography (HPLC) vials placed in a rotator at 10 rpm up to 96 h. An amount of PCS substrate equivalent to 8.5 mg of glucan was added to the enzymatic cocktail consisting of each of the GH7 CBHs, endoglucanase I from T. longibrachiatum (Megazyme Co., Bray, Ireland), and β-glucosidase from Aspergillus niger (Megazyme Co., Bray, Ireland) at a concentration of 28, 1.9, and 0.5 mg protein/g of glucan, respectively. The ratio and dosage of enzymes used here represent one of the standard conditions developed and used at NREL to assay the performance of Cel7 enzymes in NREL PCS conversion [49,50]. Adjustment of the biomass assay aliquots to 1.7 mL final volume resulted in a cellulose concentration of 5.0 mg/mL and a GH7 CBH concentration of 0.14 mg/mL, corresponding to 2.5 µM for TreCel7A. Sugar analyses were performed by HPLC as reported in [44]. Experiments were performed in duplicate.

X-ray crystallography
Crystallization experiments were carried out with the deglycosylated catalytic domain TatCel7A_CD. Screening for crystallization conditions was performed in 96-well sitting drop trays using a Mosquito crystallization robot (TTP Labtech, UK). The most promising crystallization hits were obtained at room temperature with Hampton polyethylene glycol (PEG)/Ion screen. The final optimized conditions contained 5 mM NiCl 2 , 0.1 M HEPES pH 7.0, and 20% w/v PEG 3350 as a precipitant. Crystals used for data collection were grown by sitting drop vapor diffusion under the same conditions after 1:1 mixing of precipitant with 4.8 mg/mL TatCel7A_CD in 20 mM Bis-Tris buffer, pH 7.0. Cellobiose was added to the crystallization drops for the APO structure but is not seen in the structure. The SG3 structure complex was obtained from co-crystallization drops with 5 mM 4,4′-dithio-cellotriose.
X-ray diffraction data were collected at 100 K at the synchrotron beamline ID23-1, ESRF, Grenoble, France, as indicated in Table 3. The data were integrated with XDS [51] and scaled using the programs Scala and Aimless in the CCP4 suite [52]. The initial TatCel7A_CD structure model was solved by molecular replacement using PHASER [53] and a structure of TreCel7A_CD as the search model (PDB code 1CEL).
REFMAC5 [54] was used for structure model refinements, and manual model rebuilding was performed with Coot [55,56] using maximum likelihood sigma-averageweighted 2F o -F c electron density maps [56]. For crossvalidation by R and R free calculations, 5% of the data were excluded from the structure refinement [57]. Solvent molecules were automatically added using the automatic water picking function in the ARP/wARP package [58]. Picked water molecules were selected or discarded manually by visual inspection of 2F o -F c and F o -F c electron density maps. The coordinates for the two final Tat-Cel7A_CD structure models and the structure factors have been deposited in the Protein Data Bank (http:// wwpdb.org/) with accession codes 5O5D and 5O59.

Molecular dynamics simulations
For the catalytic domain of each enzyme (TatCel7A, TreCel7A, and ThaCel7A), three ligand-bound states were modeled: without a ligand (no ligand), bound to cellononaose (ligand), and bound to a cellulose Iβ microfibril (microfibril) (Fig. 10). The cellulase structures used for MD simulations were obtained from crystal structures deposited in the Protein Data Bank: PDB ID 4C4C for TreCel7A [59], 2YOK for ThaCel7A [21], and 5O5D for TatCel7A. The three simulations of TreCel7A at 300 K have been previously reported [3] and are presented here again for direct comparison to ThaCel7A and TatCel7A dynamics. Additionally, we carried out a set of MD simulations at an elevated temperature, 475 K, considering each cellulase in the ligand-free "Apo" state in solution, to examine the unfolding process of the enzymes and to locate regions vulnerable to increased temperature (hotspots).
To build the TreCel7A apo simulation, the cellononaose ligand was removed from the active site of the catalytic domain. For the cellononaose-bound state, the cellononaose ligand from 4C4C was retained from the crystal structure (4C4C), occupying the active site from − 7 to + 2 sites (Fig. 10). The TreCel7A microfibril complex was constructed by docking the cellononaose-bound catalytic domain on the hydrophobic face of the cellulose 1β crystal matrix, where a single chain had been decrystallized as previously described [3]. In each TreCel7A case, the mutated Gln217 was reverted to the wild-type glutamic acid. Additional details of the modeling procedure for the TreCel7A simulations can be found in our previous work [3]. The ThaCel7A and TatCel7A ligand-free simulation sets were constructed from the apo crystal structures. The cellononaose-bound ThaCel7A and TatCel7A models were constructed by aligning the protein backbone with TreCel7A (4C4C) and adopting the coordinates of the 4C4C cellononaose; structural alignment was performed using PyMOL [60]. The ThaCel7A and TatCel7A microfibril complexes were constructed as described for TreCel7A above and previously [3].
In each model, only the catalytic domains of cellulases were simulated, excluding the glycosylated linker and the carbohydrate-binding module. Additionally, the glycans attached to the catalytic domains were omitted from the models, as they have relatively limited effects on the protein dynamics over MD-simulation time scales [61]. pKa calculations, using the H++ webserver, and visual inspection were used to determine the protonation states of the titratable residues at pH 5.0 with internal and external dielectrics of 10 and 80, respectively [62][63][64]. Disulfide bonds were defined according the PDB structures. CHARMM was used to construct and explicitly solvate the systems with the water molecules (80 Å × 80 Å × 80 Å for no ligand and ligand systems; 135 Å × 100 Å × 90 Å for the microfibril complexes) [65]. Na + ions were added to ensure the charge neutrality of the system, avoiding the self-energy artifact [66,67].
Minimization and equilibration simulations were conducted in CHARMM using the CHARMM36 force field to define the protein and carbohydrate behavior and the modified TIP3P force field for water [68][69][70][71][72][73]. Minimization of each system was conducted in three steps: (1) keeping the protein, the ligand (if present), and the microfibril (if present) fixed and allowing the water molecules to move freely, then (2) keeping only the protein fixed, allowing the remainder of the system to move freely, and (3) allowing every atom in the system to move freely without any restraint. Each of the three minimization steps used 1000 steps of steepest decent (SD) minimization. Following minimization, the systems were heated from 100 to 300 K in the NVE ensemble for 20 ps using 50 K temperature increments every 4 ps. The systems were then density equilibrated in the NPT ensemble at 300 K for 100 ps. Data collection simulations of 100 ns were conducted using NAMD in the NVT ensemble at 300 K with a time step of 2 fs [65,74]. Evaluation of the RMSD of the protein backbones, compared to their positions following density equilibration, indicates 100 ns is sufficient to reach a local equilibrium (Additional file 1: Figure S10). Long-range electrostatic calculations used a non-bonded cutoff distance of 10 Å, a switching distance of 9 Å, and a non-bonded pair list distance of 12 Å. The SHAKE algorithm was used to fix the hydrogen distances during all simulations. For microfibril complexes, during heating, density equilibration and production simulation, the bottom layer of the cellulose crystal was harmonically restrained with a force constant of 1 kcal/mol/Å 2 to prevent twisting of the microfibril, which occurs when the degree of polymerization is low.
To initiate the high temperature simulations, we first conducted three independent 50-ns MD simulations of each apo enzyme (9 total simulations) at 300 K in the NVT ensemble using NAMD. The high temperature simulations were started from 10 ns, 300 K equilibrated snapshots of each enzyme. High-temperature simulations were conducted in NAMD at 475 K for 15 ns each; all other simulation parameters were as described above. Again, three independent simulations of each enzyme were performed to obtain statistically meaningful structural insight. VMD was used to visualize the trajectories of the high temperature simulations and define the thermally unstable regions of the enzymes. The native contact analysis described above was conducted in CHARMM using the COORdinate DMAT (distance matrix) command.

Phylogenetic analysis
GH7 protein sequences were retrieved by pBLAST search with the TreCel7A full-length sequence (UniProtKB-P62694) in NCBI and individual species genome databases. Available sequences of both CBHs and EGs from Trichoderma spp., Fusarium spp. and C. rosea were selected, and one sequence from Acremonium strictum was included as an outgroup, resulting in a set of 28 GH7 orthologs. The amino acid sequences were aligned by ClustalW using MEGA7 software [75], and regions flanking the GH7 domain were trimmed off (signal peptide, before Gln 1 of TatCel7A; linker-CBM, after Thr429 of TatCel7A). The evolutionary history was inferred using the minimum evolution method [76] and bootstrap phylogeny testing with 2000 replicates. The evolutionary distances were computed using the Dayhoff matrix based method [77] and are in the units of the number of amino acid substitutions per site. The minimum evolution tree was searched using the close-neighbor-interchange (CNI) algorithm [78] at a search level of 1. The neighbor-joining algorithm [79] was used to generate the initial tree. All positions containing gaps and missing data were eliminated. There were a total of 349 positions in the final dataset.

Reverse conservation analysis (RCA)
A subset of 17 GH7 CBH protein sequences, including 11 sequences from Trichoderma spp. and six sequences from Fusarium spp. and C. rosea, was selected. The GH7 CBH catalytic domains were realigned by ClustalW using MEGA7 software [75], followed by indel elimination. This alignment was analyzed by RCA as described earlier [35]. In short, Rate4Site (Version 2.01) was used to calculate the degree of conservation (S score) for each amino acid position using the empirical Bayesian method [80,81]. A sliding window-average (n = 7) S score was plotted (W mean score) and significant peaks were defined by intensity (I) values of 1 [35].

Additional files
Additional file 1: Figure S1. SDS-PAGE analyses of T. atroviride culture filtrate and purified Trichoderma spp. Cel7A enzymes. Figure S2. Substrate dependence plots and Hanes-Wolff plots from enzyme kinetics experiments with TatCel7A, ThaCel7A and TreCel7A, using pNP-Lac as substrate and cellobiose as inhibitor. Additional information regarding the mathematical model for quasi-steady state kinetics of processive cellulose hydrolysis by GH7 cellobiohydrolases and the derivation of kinetic parameters by non-linear regression fitting to real-time progress curves of the initial stage of cellulose hydrolysis. Figure S3. A) Real-time progress curves. B) Derivative of the progress curves in A). Figure S4. A) Simplified reaction scheme for a processive cellulase. B) Illustration of the molecular steps involved in the reaction scheme. Figure S5. Non-linear regression fit to real-time progress curves. Figure S6. Bar diagram of kinetic parameters derived from initial hydrolysis of BMCC. Additional information regarding correlation of kinetic parameters derived by non-linear regression fit to initial hydrolysis data. Table S1. Parameter correlation matrix for TreCel7A. Figure S7. Kinetic parameter fit to simulated data with 2.5% random noise added, and to experimental data recorded for TreCel7A during initial hydrolysis of BMCC. Table S2. Comparison of kinetic parameters from the fit to simulated data with 2.5% random noise, and to experimental data recorded for TreCel7A during initial hydrolysis of BMCC. Figure S8. Sequence alignment of the GH7 CBH catalytic domains used for RCA analysis. Figure S9. Phylogenetic tree of GH7 catalytic domain protein sequences from Trichoderma spp. and Fusarium spp. Table S3. S scores from RCA analysis for residues of interest for TatCel7A, ThaCel7A and TreCel7A. Additional MD simulation results Figure S10. RMSD as a function of time for each 100-ns, ligand-bound MD simulation of TatCel7A, ThaCel7A and TreCel7A catalytic domains.
Additional file 2: Movie S1. Movie of TatCel7A_CD initial protein unfolding during 15-ns MD simulations at high temperature (475 K). The movie shows three individual MD runs side-by-side for the same protein, in two views. The top row shows the "front" of the enzyme, and the bottom row shows the "backside".
Additional file 3: Movie S2. Movie of TreCel7A_CD initial protein unfolding during 15-ns MD simulations at high temperature (475 K). The movie shows three individual MD runs side-by-side for the same protein, in two views. The top row shows the "front" of the enzyme, and the bottom row shows the "backside".
Additional file 4: Movie S3. Movie of ThaCel7A_CD initial protein unfolding during 15-ns MD simulations at high temperature (475 K). The movie shows three individual MD runs side-by-side for the same protein, in two views. The top row shows the "front" of the enzyme, and the bottom row shows the "backside".