Trichoderma reesei complete genome sequence, repeat-induced point mutation and partitioning of CAZyme gene clusters

Trichoderma reesei (Ascomycota, Pezizomycotina) QM6a is a model fungus for a broad spectrum of physiological phenomena, including plant cell wall degradation, industrial production of enzymes, light responses, conidiation, sexual development, polyketide biosynthesis and plant-fungal interactions. The genomes of QM6a and its high-enzyme producing mutants have been sequenced by second-generation-sequencing methods and are publicly available from the Joint Genome Institute (JGI). While these genome sequences have offered useful information for genomic and transcriptomic studies, their limitations and especially their short read lengths make them poorly suited for some particular biological problems, including assembly, genome-wide determination of chromosome architecture and genetic modification or engineering. We integrated Pacific Biosciences and Illumina sequencing platforms for the highest-quality genome assembly yet achieved, revealing seven telomere-to-telomere chromosomes (34,922,528 bp; 10877 genes) with 1630 newly-predicted genes and >1.5 Mb of new sequences. Most new sequences are located on AT-rich blocks, including 7 centromeres, 14 subtelomeres and 2329 interspersed AT-rich blocks. The seven QM6a centromeres separately consist of 24 conserved repeats and 37 putative centromere-encoded genes. These findings open up a new perspective for future centromere and chromosome architecture studies. Next, we demonstrate that sexual crossing readily induced cytosine-to-thymine point mutations on both tandem and unlinked duplicated sequences. We also show by bioinformatic analysis that Trichoderma reesei has evolved a robust repeat-induced point mutation (RIP) system to accumulate AT-rich sequences, with longer AT-rich blocks having more RIP mutations. The widespread distribution of AT-rich blocks correlates genome-wide partitions with gene clusters, explaining why clustering of genes has been reported to not influence gene expression in Trichoderma reesei. Compartmentation of ancestral gene clusters by AT-rich blocks might promote flexibilities that are evolutionarily advantageous in this fungus’ soil habitats and other natural environments. Our analyses, together with the complete genome sequence, provide a better blueprint for biotechnological and industrial applications.


Background
Trichoderma is a fungal genus in soils and many other natural environments.
Trichoderma reesei (syn. Hypocrea jecorina) is a widely used model organism for plant cell wall degradation and industrial enzyme production. The natural strain QM6a (ATCC13631) was first isolated from the Solomon Islands during the Second World War [1]. High enzyme producers (e.g., QM9414 and RUT-C30) were artificially generated from QM6a for industrial applications over the last 70 years [2][3][4][5][6].
The CBS999.97 wild-isolate strain is also an excellent fungal model for light responses since light variation greatly affects its sexual development and conidiation [7,9,10]. Constant light promotes conidiation and completely inhibits stromata formation, whereas total darkness causes a slowdown of the growth of stromata [10]. 5 The genomes of several Trichoderma species have been sequenced and are publicly available from the Joint Genome Institute of the US Department of Energy.
The QM6a-v2.0 draft genome (33.4 Mb) contains 87 scaffolds and 9129 predicted genes [11]. The RUT-C30-v1.0 draft genome (32.7 Mb) has 182 scaffolds and 9852 predicted genes [12]. The genomes of Trichoderma atroviride and Trichoderma virens were thought to be larger than QM6a-v2.0, with sizes of 36.1 and 38.8 Mb, respectively, versus 34.1 MB for QM6a, both encoding more than 2,000 additional predicted genes [4,13]. These genomes have been used to identify key genes involved in some important biological processes [4,[13][14][15], e.g., the transcriptional factors and transporters that control induction and expression of carbohydrate-active enzymes (CAZymes) and the plant cell wall degradation enzymes. Using QM6a as a reference, it has also been reported that QM9414 and RUT-C30 might carry multiple alternations, including rearrangements, point mutations, insertions and deletions [12,16]. Recently, a group of Trichoderma researchers have collectively annotated and compared ~30% genes in the JGI genomes of Trichoderma reesei, Trichoderma atroviride and Trichoderma virens [4].
To reveal gene order and dynamic gene expression at the chromosome level, a genome-wide chromosome conformation capture method (referred herein as "HiC") had been applied to close the gaps between the QM6a-v2.0 scaffolds. The HiC draft genome revealed seven super-scaffolds and four short contigs [17]. Druzhinina et al. then applied the QM6a-HiC draft to annotate 9151 (not 9194) predicted genes. A third of the putative CAZyme genes occurred in loose clusters that also contained a high number of genes encoding small secreted cysteine-rich proteins (SSCPs). Five CAZyme gene clusters are located close to chromosomal ends. These subtelomeric areas are also enriched in genes involved in conidiation, iron scavenging and 6 interactions with other fungi, such as secreted protease genes, amino acid transporter genes, gene clusters for polyketide synthases (PKS), non-ribosomal peptide synthase (NRPS) and PKS-NRPS fusion proteins [18]. The QM6a-HiC annotation (http://trichocode.com/index.php/t-reesei) [18] became publicly available in January, 2017. Because it does not provide any expectation values (E) for the BLAST sequence alignments, its reliability needs to be further confirmed.
Strictly speaking, the HiC draft genome is far from equivalent to a complete genome sequence. We noticed that both the QM6a-v2.0 and QM6a-HiC drafts lack several evolutionarily-conserved genes that are ubiquitiously expressed in almost all studied eukaryotic organisms, including the recA family protein Rad51 and the DNA repair protein Rad50. Therefore, we applied both second and third generation sequencing (SGS and TGS) platforms to resequence the QM6a genome. The Single Molecule Real Time (SMRT) sequencing method developed by Pacific BioSciences (PacBio) offers much longer reads of up to 60 kb [19]. After error correction with short and high-quality Illumina-MiSeq sequencing reads, the PacBio long reads were assembled into seven telomere-to-telomere chromosomes. Our high-quality genome sequence provides a large quantity of new information to facilitate functional and comparative studies of this industrially important workhorse fungus.

Resequencing the QM6a genome
The QM6a genome was resequenced using seven SMRT cells on the PacBio RSII platform. Following extraction of reads with Trichoderma reesei-only sequences, we recovered the longest raw reads (≥16 kb) with up to 80x coverage, totaling 3,397,762,180 bp. The hierarchical genome assembly process program [20] was used 7 to generate a preliminary PacBio draft with seven superscaffolds, a short unitig and 1.8 kb of contaminating DNA (Additional File 1: Table A1). This short unitig was completely identical in nucleotide sequence to the QM9414 mitochondrial genome (42,139bp; NC_003388.1) [21], indicating that the error rate for the preliminary PacBio draft was extremely low (<0.0024%).
For error correction, the Ilumina MiSeq 300 bp paired-end reads (6.8 Gb) were collected and trimmed. Reads (67.85%) with a quality score threshold (Q) greater than 30 were retained (Additional File 1: Table A2  The seven superscaffolds closely match (if not being completely identical to) the full-length chromosomes because all of their termini capture typical telomeric sequences (i.e., TTAGGG at 3′-termini and the reverse complement CCCTAA at 5′termini) [11] with up to 14 repeats (Additional File 1: Table A3). We categorized these telomere-to-telomere chromosomes with Roman numerals (ChI-ChVII), from largest to smallest. Genetically-defined linkage groups were designated alphabetically (A-G), and linkage group arms are designated L for the short (left) arm at 5′ termini and R for the long (right) arm at 3′ termini ( Figure 1). The complete QM6a genome sequences have been submitted to NCBI (accession number CP016232-CP016238).

The complete QM6a genome compared to two earlier draft genomes
By mapping the MiSeq reads to the QM6a-v2.0 draft, we found that ten scaffolds in 8 the QM6a-v2.0 draft were not genuine Trichoderma reesei sequences (Additional File   1: Tables A5). For this reason, 16 previously annotated genes (Additional File 1: Tables A6) in QM6a-v2.0 and QM6a-HiC might not be authentic QM6a genes. In addition, there were numerous sequencing and assembly errors in QM6a-v2.0. The most prominent assembly error was the first and longest scaffold, wherein the 5′ portion (~2.46 Mb) was mapped to ChV and the 3′ portion (~1.79 Mb) to ChIV (Additional File 1: Table A7).
Our complete QM6a genome sequence also covers all seven superscaffolds and the four short contigs of QM6a-HiC [17]. QM6a-HiC contains a large quantity of sequence and/or assembly errors, particularly those sequences (e.g., telomeric repeats) close to both termini of the seven superscaffolds. The four short HiC contigs are all located at chromosome regions with low guanine(G)-cytosine(C) contents. HiC0 is located close to tel6L (the left telomere of ChVI), HiC1 and HiC2 at cen3 (the centromere of ChIII), and HiC3 in an interspersed AT-rich block on the right arm of ChVI. There were at least 18 inversion errors in the HiC draft genome (Figure 1 and Additional File 1: Table A8). Eight of these inversion errors could account for the failure to connect these four short contigs to the corresponding superscaffolds during the HiC experiments [17].
The HiC experiments also resulted in incorrect assembly at the rDNA locus.
Using Southern hybridization, we confirmed that the right arm of ChVI harbors the large rDNA locus with nine tandem "head-to-tail" repeats. Each repeat contains an 18S-5.8S-26S rRNA gene cluster and a non-transcribed intergenic spacer (IGS) ( Figure 2 and Additional File 1: Table A9). This result approaches the theoretical limit for mapping results using Illumina Miseq short reads, i.e., 200-260x coverage at the rDNA locus versus 25-30x coverage along the entire chromosome ( Figure 3). It is 9 worth noting that there are 175-200 copies of the large rDNA tandem repeats in Neurospora crassa [22].
Thus, the complete QM6a genome sequence has uncovered many sequencing and assembly errors in QM6a-v2.0 and QM6a-HiC. We suggest that caution should be exercised in applying HiC or the chromosome conformation capture method for gapclosing draft genome sequences produced by SGS technology and for other genomic analyses involved in AT-rich and repetitive sequences.

QM6a compared to high enzyme producers
The high-quality QM6a genome sequence we provide here is a better scaffold to order and orient contigs of other Trichoderma draft genomes previously generated by SGS technologies. It has been reported that there might be five or eleven potential translocations in RUT-C30 [12,16]. A BLAST search revealed that there are only three promising translocations in RUT-C30: ChV to ChII (the first scaffold of RUT-C30), ChIII to ChI (the second scaffold of RUT-C30), and ChIII to ChIV (the fifth scaffold of RUT-C30). Theoretically, these three translocations in RUT-C30, together with the one translocation we identified in CBS999.97(MAT1-1, F/X), are sufficient to account for the large quantity of inviable SAN ascospores generated from sexually crossing RUT-C30 with CBS999.97(MAT1-1, F/X) [8,23]. There are also three short ectopic insertions in RUT-C30 including the second scaffold (34 bp), the fifth scaffold (231 bp) and the ninth scaffold (2,655 bp) (Additional File 1: Table A10). These three ectopic insertions are more likely due to sequence duplication or assembly errors. 10 We applied four different approaches to genome reannotation (Materials and Methods), including the use of all (28748) Trichoderma reesei proteins from National Center for Biotechnology Information (NCBI), QM6a-v2.0, RUT-C30-v1.0 and two publicly-available transcriptome datasets [24,25]. We annotated 1630 newlypredicted QM6a genes, including 70 tRNA genes and 23 5S-rDNA genes (Additional newly-predicted genes encode protein products that have homologs in Saccharomyces cerevisiae and Neurospora crassa, respectively. It is worth noting that we annotated several essential or biologically important genes, including six DNA repair genes (rad50, sae2/com1, rad51, rad57, srs2, rrm3 and pif1), an essential component of sister chromatin cohesion complex (smc1), a key autophagy gene (atg11), two cell division cycle genes (cdc4 and cdc15) and two mitochondrial genes (sod2 and tom7) (Additional File 1: Table A11 and Additional File 2: Tables B2 and B3). These evolutionarily conserved genes had never been annotated in QM6a-v2.0 or QM6a-HiC [4,11,18]. The complete genome sequence and set of genes we provide here can serve as a better guide for further experiments, especially for global approaches to evolution, biological functions and industrial applications.

Genome reannotation
We categorized all QM6a genes with a non-italicized uppercase letter, a number and a letter: Tr (for Trichoderma reesei); A, B to G (for chromosome I, II through VII); a number corresponding to the order of the transcripts (counting from the left telomere); and W or C to designate the Watson or Crick strand (the Watson strand is 5′→3′ left telomere to right telomere); for example, the 100th gene from the left 11 telomere of chromosome I is TrA0100C (Additional File 2).
Many straw-induced genes in QM6a [24] were shown to be differentially regulated in response to cellulose and sophorose in QM9414 [25] (Additional File 3: Table C1 and C2).
Among the 1535 new protein-encoding genes, there are 39 straw-induced (≥ 20-fold) genes in QM6a and 2 cellulose-induced genes (≥ 20-fold) in QM9414 (Additional File 3: Table C3). One hundred and forty four new QM6a genes are also significantly upregulated (5-20 fold) by straw (Additional File 3: Table C4). For example, TrC1345C, a new straw-induced gene, encodes the homolog of S. cerevisiae Cat8 zinc-cluster transcription factor. Cat8 is involved in gluconeogenesis, the glyoxylate cycle and ethanol utilization, and it is necessary for derepression of a variety of genes under non-fermentative growth conditions (e.g., diauxic shift and sporulation) [29]. Further research on this transcriptional factor might help to enhance cellulolytic enzyme production. We also identified nine new genes with TPM values in QM6a that are much higher than those in QM9414 (Additional File 3: Table C5).
These genes are likely mutated or even deleted in QM9414.

Wide distribution of AT-rich blocks along seven QM6a chromosomes
To establish the link between DNA sequence and chromatin architecture, the local GC contents along each chromosome were calculated using a 0.5 kb sliding window. We identified 2349 AT-rich chromosome blocks with GC contents ≥ 12% and ≥ 6% lower than the average GC content of the all predicted genes (56.5%) and the entire QM6a genome (51.1%), respectively ( Table 2).
The most prominent or longest AT-rich blocks in each chromosome are the centromeres, ranging from 162.5 kb (cen2) to 208.5 kb (cen6) (Figure 4 and Additional File 3: Table A2). The longest AT-rich blocks in each Neurospora crassa chromosome are also centromeres [30]. A BLAST search revealed that the seven QM6a centromeres collectively harbored 24 conserved sequences (≥90% identity; maximum length 8625 bp and minimum length 4847 bp) with a copy number per chromosome ranging from one to five (Additional File 4). These conserved sequences are centromere-specific and highly AT-rich, perhaps representing centromeric repeats.
The seven QM6a centromeres also separately encode 13 previously-annotated genes and 24 newly-predicted genes (Additional File 3: Table C6). Centromereencoded transcripts are known to be integral components of the genomes of mammals, higher plants and the fission yeast Schizosaccharomyces pombe [31-34]. Copy 13 numbers of these 37 centromere-encoded transcripts were relatively low; their expression (TPM > 0) could only be detected in QM9414, but not in QM6a (Additional File 3: Table C6). The QM9414 transcriptomic data from the Illumina HiSeq 2000 platform [25] apparently had better sequencing depth than the QM6a transcriptomic data from the SOLiD platform [24]. It will be of importance to investigate whether these 24 centromeric repeats and 37 centromere-encoded genes are involved in centromere integrity and chromosome segregation fidelity in Trichoderma reesei.
The AT-rich chromosomal blocks next to the 14 telomeres are subtelomeres, with the shortest being ~1 kb and the longest up to 87 kb (Additional File 1: Table   A3). Other than centromeres and subtelomeres, there are 2328 interspersed AT-rich blocks ( Table 2). On average, there are only five genes between two neighboring interspersed AT-rich blocks. The biological relevance of these interspersed AT-rich blocks remains to be elucidated (see below). It is worth noting that both the 5′ and 3′ flanking sequences of the rDNA locus contain a 2 kb AT-rich block ( Figure 3). We postulate that these two AT-rich blocks might be involved in regulating nucleolar organization, rRNA transcription, rDNA copy homeostasis and prevention of repeatinduced point mutation (RIP) (see below).

Comparative genomic analyses of chromosome architectures
Next, we compared the QM6a genome with 11 publicly-available and well-assembled fungal genomes for their AT-rich blocks. We highlight that all these fungal genomes contain ≤ 0.11% unresolved or unknown bases (Ns) (Additional File 1: Table A12).
Unresolved bases jeopardize the integrity of in silico genomic analysis.
All of these 12 fungal genomes have many short AT-rich blocks with lengths of 14 0.5-3 kb ( Table 2). In Saccharomyces cerevisiae, these short AT-rich blocks correlate with the convergent intergenic regions and the pericentromeres that associate with cohesin (an evolutionarily-conserved protein complex that functions to hold a pair of sister chromatids during mitosis and meiosis). The average distance between neighboring cohesin-binding sites along yeast chromosome arms is 10-15 kb, which is TADs is somewhat correlated [41][42][43][44][45]. Intriguingly, the average distance between neighboring AT-rich blocks along QM6a chromosomal arms is ~13 kb. We suggest that these AT-rich blocks might be functionally associated with chromosomal loading of cohesin in Trichoderma reesei.
We were able to categorize these 12 fungal genomes into three different groups according to: (1) the average AT content, (2) the genome-wide distribution of local Group II are three basidiomycetes, i.e., Ustilago maydis (521), Coprinopsis cinerea (Okayama 7#130) and Cryptococcus neoformans (JEC21). Their average AT contents are ~50%, similar to those of Group I fungal genomes ( Table 2). The Group II fungal genomes display relatively normal local AT content distribution ( Figure 5A, the lowest panel) due to the absence of longer AT-rich blocks ( Figure 5B).
Group III are three hemiascomycete yeasts, Saccharomyces cerevisiae (S288C), Schizosaccharomyces pombe (972h-) and Candida glabrata (CBS138). Their average AT contents (62-64%) are 12-14% higher than those of Group I and Group II fungi (Table 2). These three yeast genomes also display a relatively normal distribution of AT local content ( Figure 5A, the lowest panel) and have no or very few longer ATrich blocks ( Figure 5B).

Repeat-Induced Point mutation (RIP) is identified in Trichoderma reesei
The high number of long AT-rich blocks in the Group I fungal genomes might be correlated with RIP: a phenomena originally discovered in Neurospora crassa at a premeiotic stage during sexual development [46]. The process of RIP requires a specialized cytosine methyltransferase gene rid-1 (RIP-defective) and induces cytosine-to-thymine (C-to-T) point mutations in a homology-dependent manner [ 16 [55] and Aspergillus nidulans [56], but it has never been experimentally demonstrated in any Trichoderma species [18,57]. Intriguingly, the QM6a complete genome encodes almost all proteins known to be involved in RIP and DNA methylation in hpo (TrB1406W; HP1) (Additional File 2: Tables B1 and B2).
We carried out both bioinformatic and molecular genetic analyses to determine whether RIP could operate in Trichoderma reesei. The RIPCAL software tool was applied to compare differences in the extent of RIP mutations of different sequences by determining two widely-used RIP indices: TpA/ApT and [(CpA + TpG) / (ApC + GpT)] [58,59]. Higher values of TpA/ApT and lower values of [(CpA + TpG) / (ApC + GpT)] indicate stronger RIP responses [60,61]. We found that the hierarchy for RIP in QM6a is the mating type gene MAT1-2-1 < all predicted genes < the whole QM6a genome < the large rDNA tandem repeats (18S-5.8S-26S) < 5S rDNAs. It has been reported that the large rDNA tandem repeats and the 5S rDNAs in Neurospora crassa survived RIPs due to either nucleolar sequestration [62] or their smaller size [60], respectively. The RIP indices of the mating gene (MAT1-2-1 or MFa), all predicted genes and the whole genome suggest that the RIP respones in QM6a are as robust as those in Neurospora crassa (Additional File 1: Table A13). Thus, we suggest that Trichoderma reesei has evolved an RIP system similar to that of Neurospora crassa.
Our results also reveal that the longer the AT-rich blocks the greater the extent of RIP mutations in QM6a (Additional File 1: Table A13), consistent with RIP mutations accumulating in AT-rich sequences.
Next, we carried out sexual crossing tests between four parental (F0) strains, 17 including wild-type CBS999.97 [7], blr1Δ, env1Δ [9,10] and ku70Δ [63], and then tested their F1 progeny for mutations in the full-length hygromycin-resistant (hph) genes present in the selection marker construct used for deletion of the corresponding Trichoderma reesei genes [10]. Progeny were isolated using the hexadecad dissection technique [8]. No hph sequence is present in the wild-type CBS999.97 F0 strains, whereas env1Δ and ku70Δ each carries one copy of full-length hph. In contrast, the blr1Δ mutant contains two tandem head-to-tail hph sequences resulting in repeats; one is a full-length hph and the other an N-terminal truncated hph-ΔN ( Figure 6A). All of the hph and hph-ΔN alleles in the corresponding parental strains were confirmed first by genomic PCR and Sanger sequencing (see Additional File 5 for their nucleotide sequences) and then by Southern hybridization ( Figure 6B).
After sexual crossing (n = 10), the F1 progeny displayed numerous C-to-T point mutations in all cases where two similar sequences were present in one mating partner before crossing, i.e., in progeny of blr1Δ (hph and hph-ΔN), but not in progeny of ku70Δ or env1Δ that comprise only one copy of full-length hph in their genome ( Figure 6C and Additional File 1: Table A14).
To determine whether RIP could operate between two genetically unlinked hph alleles, we generated a ku70Δenv1Δ double mutant that had one deletion on ChII (env1Δ) and the other on ChIII (ku70Δ). Our data revealed that sexual crossing (n = 10) between ku70Δenv1Δ and a wild-type mating partner also resulted in C-to-T point mutations in progeny of these crosses ( Figure 6C and Additional File 5).
In Neurospora crassa, sequences mutated by RIP showed skewed dinucleotide frequencies because of the sequence preference of RIP (CpA > CpT > CpG > CpC) [64]. By comparing the nucleotide sequences of the hph alleles in all F1 progeny (Appendix A4), we found that Trichoderma reesei displayed a different sequence 18 preference of RIP, i.e., CpA ≈ CpG >> CpT > CpC (Additional File 1: Table A14).
We conclude that Trichoderma reesei, like Neurospora crassa, exhibits high homology pairing and RIP activities at a premeiotic stage before premeiotic DNA synthesis. Our results are thus of tremendous importance for industrial strain improvement.

Repetitive features
The completeness of the QM6a and other 11 high-quality fungal genomes also allowed us to accurately survey genome-wide repetitive features and their correlation to RIP using the RepeatMasker search program (http://www.repeatmasker.org/). We were able to identify almost all Ty elements along the 16 chromosomes of Saccharomyces cerevisiae [65,66] (Additional File 1: Table A15). Our results also confirm that the genome of Neurospora crassa accumulates fragmented and Schizosaccharomyces pombe [67], Cryptococcus neoformans [68] and Coprinopsis cinerea [69]. The Candida glabrata CBS138 genome contains very few (~0.4%) repetitive sequences (http://www.candidagenome.org/) (Additional File 1: Table A16).
The copy numbers of transposon sequences in QM6a we report here (Additional File 1: Table A16) are much lower than those reported by Kubicek  in all of that 16 yeast's chromosomes had been determined before [65,66].

Partitioning of gene clusters by AT-rich blocks
A hallmark of the QM6a-v2.0 draft genome is that a third of the 228 CAZyme genes are non-randomly distributed and form several CAZyme gene clusters. Several of the regions of high CAZyme gene density also contain genes encoding proteins involved in secondary metabolism. Accordingly, it has been proposed that gene clustering and/or coexpression might be evolutionarily advantageous for Trichoderma reesei in its competitive soil habitat or other natural environments [11,70]. Using the QM6a-HiC draft genome as a reference, 20 CAZyme gene clusters and 42 SSCP gene clusters were identified in the QM6a-HiC genome draft [18]. All these gene clusters consisted of only 3-6 CAZyme and/or SSCP genes but, surprisingly, gene clustering did not influence gene expression. 20 To gain a better insight, we reexamined all these 62 CAZyme and SSCP gene clusters in QM6a-HiC [18]. Due to sequencing and assembly errors, some gene clusters are overlapped or even duplicated. The original 62 gene clusters locate to 46 chromosomal blocks in the complete QM6a genome. The majority of these gene clusters also contain new QM6a genes we annotate in this study (Additional File 2: Table B4). One gene cluster even contains a counterfeit gene (QM6a-v2.0 gene number 71245) [18].  Table B4). The first example of this phenomenon is a nitrate assimilation gene cluster and an annotated CAZyme gene cluster [18] immediately adjacent to tel2R. These two gene clusters were divided by tel2R and five interspersed AT-rich blocks into five smaller compartments ( Figure   7A). The first (or rightmost) compartment consists of one hypothetical protein gene (TrB1976W) and three nitrate reductase genes (nit3, TrB1975W; nit2, TrB1974C;  Figure 7A). These two genes were highly induced by straw and then repressed by the addition of glucose (Additional File 1: Table A19A).
The second example is a CAZyme gene cluster on ChIV with 22 genes. It is divided by eight AT-rich blocks into seven smaller compartments (C1-C7). The four CAZyme genes are located in different compartments; a GH31 α -glucosidase (TrD1393W) in C1, cel3c (TrD1398C) in C3, and a GH28 polygalacturonase (TrD1411C) and pgx1 (TrD1412C) in C6 ( Figure 7B). These four CAZyme genes were differentially regulated by straw and the addition of glucose (Additional File 1: Table A19B).
Together, these results might explain why it was previously reported that gene clustering did not influence gene expression [18]. We propose that these smaller compartments are structurally and functionally similar to intrachromosomal loops or TADs.
It should be noted that occurrence of intergenic AT-rich blocks did not always result in differential gene expression. For example, a gene cluster with three contiguous CAZyme genes-the acetyl xylan estrease gene axe1 (TrE0669C; acetyl 22 xylan estrease), the cellulose-induced protein cip1 (TrE0670C) and the β -1,6-Nacetylglucosaminyl transferase gene egl4 (TrE0671C)-resides in the middle of ChV.
These three CAZyme genes are separated by four AT-rich blocks into three smaller compartments, but they were all highly induced by straw and then repressed by the addition of glucose (Additional File 1: Table A19D). In this case, each gene becomes a partitioned functional unit (see Discussion). Simultaneous expression of the axe1-cip1-egl4 triad might be independently controlled by other determinants, e.g., common transcription factors and/or similar chromosomal conformation.

Discussion
Trichoderma reesei QM6a and its derivatives have been widely used for nearly four decades to produce plant cell wall-degrading enzymes and heterologous recombinant proteins. In this study, we have obtained a high-quality complete genome sequence of QM6a. We readily uncovered many sequencing, assembly and gap-closing errors in earlier draft genomes. The seven telomere-to-telomere QM6a chromosomes can be used as better scaffolds for comparative genomic analyses, not only with industrial strains but also other Trichoderma reesei wild isolates (e.g., CBS999.97) and other species in the same genus (e.g., Trichoderma atroviride and Trichoderma virens). Our results also revealed much new genomic information never provided by earlier draft genomes, including 7 centromeres, 14 telomeres, 2328 AT-rich blocks, 1630 newlypredicted genes, 37 centromere-encoded genes and 24 centromeric repeats. Therefore, our complete QM6a genome sequence provides a comprehensive roadmap for further studies of this economically-important fungus, including industrial strain improvements and elucidation of the functional relationships between sequences, gene products and genome organization. 23 The central finding of this study is that Trichoderma reesei has evolved a robust RIP system. Firstly, the QM6a genome contains the lowest overall copy number of transposons among six studied filamentous ascomycetes (including Neurospora crassa). Secondly, as in Neurospora crassa, sexual crossing readily induced C-to-T point mutations on both tandem and unlinked duplicated sequences in Trichoderma reesei. Thirdly, almost all 2349 AT-rich blocks in QM6a were predicted by the RIPCAL software program to be affected by RIP. Considerable evidence suggests that AT-rich blocks establish a link between DNA sequence and chromatin architecture. In Intriguingly, our results reveal that the rDNA locus of QM6a is surrounded by two interspersed AT-rich blocks. It would be of interest to further investigate whether and how these two interspersed AT-rich blocks are involved in preventing rDNA from being affected by RIP as well as in regulating nucleolar organization, rRNA transcription and rDNA copy homeostasis.
From the results of this study, we postulate that RIP does not function solely as a genome defense mechanism to diminish the potentially deleterious effects caused by the spread of transposable elements. It may also have important roles in reshaping the Trichoderma reesei genome. We demonstrate that the widespread interspersed ATrich blocks lead to genome-wide partitioning of the gene clusters in QM6a. Our 24 findings can readily account for why gene clustering does not affect gene expression in Trichoderma reesei. Mechanistically, RIP-mediated C-to-T mutations presumably can transform duplicated sequences in a CAZyme gene cluster into interspersed ATrich blocks, thus dividing an ancestral gene cluster with multiple CAZyme or SSCP genes into multiple smaller compartments or TADs. Intriguingly, it has been reported previously that many pathogenic fungi (Leptosphaeria maculans, Magnaporthe oryzae, Fusarium spp.) comprise AT-rich blocks with RIP affected effector genes and transposable elements [72,73] and that RIP is a potential factor in Leptosphaeria maculans in creating the rapid sequence diversification (i.e., of the effector genes) needed for selection pressure [74] and concerted epigenetic regulation of their expression [75]. Partitioning of gene clusters by AT-rich blocks may also help to control simultaneous expression of the rDNA locus ( Figure 3) and of the three partitioned functional units in the axe1-cip1-egl4 triad (Additional file 1: Table A19D).

Further research will reveal how RIP provides evolutionary advantages to
Trichoderma reesei and other filamentous ascomycetes (Pezizomycotina) to survive in natural environments and pathogenic conditions.

Conclusion
The earlier genome drafts of QM6a and other T. reesei industrial strains have been useful in identifying key genes involved in several important biological processes.
However, due to numerous sequencing and assembly errors, they are not suitable for several other studies, e.g., genome-wide determination of gene order, chromosome architecture and expression dynamics as well as chromosome engineering for genetic modification(s). The complete QM6a genome sequence provides an unprecedented opportunity to overcome those obstacles associated with earlier draft genomes. To 25 avoid the limitations of working with incomplete datasets and the false leads that can come from trying to work with imperfect data, more caution should be exercised in utilizing the draft genome sequences solely determined by SGS and/or HiC for functional and comparative analyses.

Mapping of Illumina reads over three different genomic drafts
The two runs of Illumina MiSeq with 300-bp paired-end reads were combined into one. The forward and reverse data were 6.8 Gb. Reads were preprocessed with Trimmomatic V0.32 (http://www.usadellab.org/cms/?page=trimmomatic) to trim and remove reads (33.15%) that fell below a quality score threshold of 30 (Q30) and were shorter than 30bp.   Table S4. There were some differences between the PacBio and Illumina platforms, so the mapping results were used to extract a consensus sequence to adjust the bases to a final version of the chromosome sequences. We defined a threshold=2 to identify low coverage regions. For low coverage regions (i.e. threshold ≥ 2), sequences from the Illumina Miseq platform were used to construct the consensus sequence.
The seven telomere-to-telomere chromosomes were categorized with Roman numerals (ChI-ChVII), from the largest to the smallest. Artemis was used to determine the percentage of G+C in 500 bp non-overlapping windows. The centromere of each chromosome was their longest interspersed AT-rich block. The EMBOSS revseq software tool (v6.5.7; http://www.bioinformatics.nl/cgibin/emboss/revseq) was used to reverse and complement the nucleotide sequences of ChII, ChIV and ChVI, respectively, so that all seven chromosomes had a shorter (left) 29 arm at 5′ termini and a longer (right) arm at 3′ termini. The final sequences were first compared to the original version using BLAST searches (Additional File 1: Table A1) before being submitted to NCBI (accession numbers CP016232-CP016238).
reesei protein sequences from NCBI were used for ab initio gene predictions.
Finally, the BLASTP program was applied to compare our final predicted protein sequences to several publicly-available protein databases, including NCBI non- Tables B2).
Next, the bowtie-tophat program was used to align all QM6a (SOLiD) and replaced by Ns). To obtain high-confidence data, we first analyzed the genome sequences of Saccharomyces cerevisiae because the number and location of five different Ty elements in all its 16 yeast chromosomes had been reported previously [65,66]. When the preliminary RepeatMasker data were filtered with two parameters (length >= 140, Smith-Waterman local similarity scores >= 450), the final data (Additional File 1: Table A15) were quite consistent with the known results [65]. The same parameters were then applied to scan all other studied fungal genomes (Additional file 1: Table A16).

RIP and RIPCAL
The latest RICAL program (https://sourceforge.net/projects/ripcal/) was downloaded 34 [58,59]. An initial analysis was done to predict if the complete QM6a genome sequence had been mutated by RIP. All strains used in this study had been previously described [10]. The methods for sexual crossing, single ascospore isolation and preparation of genomic DNA have also been described before [8,10]. The full-length hgh cassettes in all F0 parental strains and representative F1 progeny were amplified by PCR and analyzed by Sanger sequencing technology. All the nucleotide sequences of primers and the hgh cassettes are listed in Additional file 1: Table A9 Table A1. Preliminary assembly results obtained by the Hierarchical Genome Assembly Process (HGAP 3.0). Table A2. Error corrections of the seven PacBio unitigs using the Illimina-MiSeq reads. Table A3. Characteristics and assembly of the seven QM6a chromosomes.   Table A11. Gene Ontology of some newly-predicted genes. Table A12.
Sequencing quality of 18 different fungal genomes. Table A13. RIP indices of various sequences in QM6a and Neurospora crassa. Table A14. Repeat-induced C-to-T mutations observed in the hph alleles in all F1 progeny. Table A15. Saccharomyces cerevisiae Ty elements by chromosome. Table A16. Transposable elements in 12 well-assembled fungal genomes. Table A17. Repetitive sequences in different chromosomal regions. Table A18. Repetitive sequences in seven QM6a chromosomes.