SorGSD: updating and expanding the sorghum genome science database with new contents and tools

Background As the fifth major cereal crop, originating from Africa, sorghum (Sorghum bicolor) has become a key C4 model organism for energy plant research. With the development of high-throughput detection technologies for various omics data, a large amount of multi-dimensional, multi-omics information has accumulated for sorghum. Integrating this information may accelerate genetic research and improve molecular breeding for sorghum agronomic traits. Results We updated the Sorghum Genome SNP Database (SorGSD) by adding new data and new features, and renamed it the Sorghum Genome Science Database (SorGSD). Whereas the original version of SorGSD contained SNPs from 48 sorghum accessions mapped to the reference genome BTx623 (v2.1), the new version has been expanded to 289 sorghum lines with both single nucleotide polymorphisms (SNPs) and small insertions/deletions (INDELs), which were aligned to the newly assembled and annotated sorghum genome BTx623 (v3.1). Moreover, phenotypic data and panicle pictures of critical accessions are provided in the new version. We implemented new analysis tools, including ID Conversion, Homologue Search and Genome Browser, and updated the general information related to sorghum research, such as online sorghum resources and literature references. In addition, we deployed a new database infrastructure and redesigned the user interface as one of the Genome Variation Map databases. The new version of SorGSD is freely accessible online at http://ngdc.cncb.ac.cn/sorgsd/. Conclusions SorGSD integrates large-scale genomic variation and phenotypic information with online tools for data mining, genome navigation and analysis. We hope that SorGSD will provide a valuable resource for sorghum researchers to find variations they are interested in and to generate customized high-throughput datasets for further analysis.
Supplementary Information The online version contains supplementary material available at 10.1186/s13068-021-02016-7.

become the preferred food crop all over the world in the future. Furthermore, sorghum is not only harvested for grain, but is also often used for syrup production, grazing and biomass production [2].
As a model organism that carries out C4 photosynthesis, sorghum was the second sequenced cereal crop after the C3 organism rice [3,4]. The comparatively small genome of sorghum makes it a potential genetic model for the design of bioenergy crops compared with the larger and more repetitive genomes of other major C4 crops, such as maize and sugarcane. With the improvement of the reference genome (BTx623) [4,5] and the development of sequencing technologies, studies on the domestication of sorghum and on the genetic mechanisms underlying distinct phenotypes have been greatly accelerated [2,6-17].
During the past decade, diverse web resources have been constructed to present numerous omics data, which benefits the sorghum research community (Table 1). Plant-specific genome databases such as Phytozome [18] and Gramene [19], as well as the most comprehensive Genome OnLine Database (GOLD) [20], are widely used as data sources and analysis platforms for sorghum research. In addition, plant secondary databases that include sorghum, such as PIGD [21], PlantTFDB [22], DNApod [23], PceRBase [24], PtRFdb [25] and GreenPhylDB [26], have important modules on sorghum resources. Finally, the sorghum specific secondary  [27], PGSB [28], SorghumFDB [29], Sorghum QTL Atlas [30], and Sorghum Genomics, are a cluster of websites dedicated to sorghum research. Among them, SorghumFDB is the most comprehensive sorghum-specific database; it contains extensive public genomic and functional annotation data, as well as useful analysis tools. Using published genome re-sequencing data of 48 sorghum accessions, we developed a sorghum SNP database (SorGSD) in 2016, providing the sorghum user community with abundant SNPs and other resources related to sorghum genetics and genomics [31].
Here, we announce and describe the second major release of the sorghum genome science database (SorGSD). The goal of the redesign is to construct a comprehensive database of sorghum genomic variations and phenotypes. Compared with the first version of SorGSD, which contains SNPs of 48 sorghum accessions, the second version provides a more extensive set of genomic variation data, covering both SNPs and small INDELs of 289 sorghum accessions, as well as characteristic phenotypic information and panicle pictures of critical sorghum lines. We also provide three useful tools in the new release: ID Conversion, Homologue Search and Genome Browser. The back-end database framework and the web interface were redesigned as a part of the Genome Variation Map at the National Genomics Data Center (NGDC) and China National Center for Bioinformation (CNCB). We hope that these data and tools will be beneficial for exploring genetic variation and for evolutionary studies of sorghum and other species. The new version of SorGSD is freely accessible at http://ngdc.cncb.ac.cn/sorgsd/.

New data contents
The new version of SorGSD was mainly built on the sorghum reference genome BTx623 (v3.1), which has improved assembly and gene annotations [5]. Currently, SorGSD contains 33,825,236 SNPs and 5,722,385 small INDELs identified from the re-sequencing data of 289 sorghum lines [6,32,33], including three accessions of Sorghum propinquum, 50 wild/weedy sorghums and 236 cultivated sorghums (Additional file 1: Table S1). After annotation and calculation, we obtained detailed information about the distribution of variations in different genomic regions, including genic, intergenic, and intronic regions (Table 2). We also collected about 70 kinds of phenotypic data for 183 accessions with plant ID (PI) from the U.S. National Plant Germplasm System (GRIN-Global), and panicle pictures of 174 critical accessions taken in our laboratory. In addition, we updated the introduction to the sorghum genome; the list of sorghum resource websites, including general information, genome and transcriptome databases, research institutions and sorghum producers around the world; and the critical references on sorghum genetics and genomics.

New features of the database
SorGSD is free and open to the public with comprehensive functions (Fig. 1; Additional file 2: Table S2). In this update, we placed the main page under the National Genomics Data Center of the China National Center for Bioinformation (CNCB-NGDC) (Fig. 1a, h) [34]. Links to each page are shown in the menu bar (Fig. 1b), and a simple welcome message is displayed under the menu bar (Fig. 1c). Four shortcuts to core functions and a citation prompt can be found on the home page (Fig. 1d, e). Our laboratory's major publications and the website browsing history can be accessed easily on the right side (Fig. 1f, g).
It is worth mentioning that we keep the original version up and running: users can browse it by clicking the "V1.0" button on the menu bar and switch back to the new version by clicking the "V2.0" button in the old version. We optimized the presentation interface to make it easier for users to search for variations. Phenotypic details of each accession can be searched directly. The browsing interface for critical references was redesigned for a better user experience. We also provide three new tools: ID Conversion, Homologue Search and Genome Browser. Online documentation is provided to help users become familiar with the database. More detailed information is given below.

Improved variation search function
Users can search for variations by variation type, genome position or gene ID. Results can be further filtered by consequence type and minor allele frequency (MAF). In our previous work, we found that the Dry gene encodes a plant-specific NAC transcription factor that carries a few loss-of-function mutations in sweet sorghum [33]. An inframe deletion in the conserved functional NAC domain could turn a pithy stem into a juicy stem, which is one reason for the origin of sweet sorghum. Here we take the Dry gene as an example and search for this inframe deletion (Chr06:50898132). First, open the "Variation Search" page and set the variation type to "INDELs"; second, type the v3.1 gene ID (Sobic.006g147400) into the "Gene ID" box; third, tick "inframe deletion" in "MODERATE" under "Consequence Type"; finally, click "Submit" to obtain the list of target small INDELs in the Dry region on the right-hand side of the page (Fig. 2a).
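The filter logic behind such a query can be sketched in a few lines of Python. This is only an illustration of combining the search fields, not the SorGSD implementation: the `Variant` record, the `search_variants` function and the toy records are hypothetical, and only the Dry gene ID and the position Chr06:50898132 are taken from the text (the MAF values and the other two records are invented).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variation record."""
    chrom: str
    pos: int
    var_type: str     # "SNP" or "INDEL"
    gene_id: str
    consequence: str  # e.g. "inframe_deletion"
    maf: float        # minor allele frequency


def search_variants(variants, var_type=None, gene_id=None,
                    consequence=None, min_maf=0.0):
    """Return the variants that satisfy every filter that is set."""
    hits = []
    for v in variants:
        if var_type is not None and v.var_type != var_type:
            continue
        if gene_id is not None and v.gene_id.lower() != gene_id.lower():
            continue
        if consequence is not None and v.consequence != consequence:
            continue
        if v.maf < min_maf:
            continue
        hits.append(v)
    return hits


# Toy data: positions other than 50898132 and all MAF values are made up.
RECORDS = [
    Variant("Chr06", 50898132, "INDEL", "Sobic.006g147400", "inframe_deletion", 0.20),
    Variant("Chr06", 50897000, "SNP", "Sobic.006g147400", "missense_variant", 0.05),
    Variant("Chr06", 50896500, "INDEL", "Sobic.006g147400", "frameshift_variant", 0.01),
]

hits = search_variants(RECORDS, var_type="INDEL",
                       gene_id="Sobic.006g147400",
                       consequence="inframe_deletion")
for v in hits:
    print(f"{v.chrom}:{v.pos} {v.var_type} {v.consequence} MAF={v.maf}")
```

Each filter is optional, which mirrors the web form: leaving a field empty simply skips that condition.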
The first entry in the list is the target small INDEL (Fig. 2a). Details of the variation can be obtained by clicking the variation ID. Users can browse the non-redundant and individual variations as text in three tables, together with an allele distribution diagram and the chromosome-based graphical Genome Browser interface (Fig. 2b). The text tables give variation details (e.g., chromosome location, reference allele and 3′ and 5′ flanking sequences), individual alleles and details of the gene annotated for the variation. The allele distribution diagram can be used to infer the evolutionary scenario of each variation during sorghum domestication and improvement. More importantly, the individual alleles of the target variation can be downloaded for subsequent analyses, such as phylogenetic tree construction and association analysis. Users can open the gene page by clicking the gene ID shown with a blue background in the "Gene Annotation" table. The gene details, gene annotation and all variations located in the gene, including unfiltered SNPs and small INDELs, are listed in three tables, respectively (Fig. 2c).
To retrieve all SNPs located within Dry instead, users can use the "Variation Search" page (Fig. 2a) as follows: (1) choose the variation type "SNP"; (2) choose the chromosome "Chr06"; (3) enter the physical location (Chr06:50896169.50898604) and submit. All SNPs within the Dry locus are then listed.
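A region query of this kind amounts to parsing the location string and keeping the SNPs whose positions fall inside the interval. The sketch below is a minimal illustration under our own assumptions: the helper names `parse_region` and `snps_in_region` are hypothetical, the toy SNP positions are invented, and because the separator accepted by the web form is not specified here, the parser accepts the form printed above as well as `..` and `-`.

```python
import re

# chromosome, start and end separated by ".", ".." or "-" (assumed formats)
REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)(?:\.\.|\.|-)(?P<end>\d+)$")


def parse_region(text):
    """Split a string such as 'Chr06:50896169.50898604' into (chrom, start, end)."""
    m = REGION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised region: {text!r}")
    start, end = int(m["start"]), int(m["end"])
    if start > end:
        raise ValueError("region start is after region end")
    return m["chrom"], start, end


def snps_in_region(snps, region):
    """Keep (chrom, pos) pairs that lie inside the region, ends inclusive."""
    chrom, start, end = parse_region(region)
    return [(c, p) for c, p in snps if c == chrom and start <= p <= end]


# Toy positions, invented for illustration only.
TOY_SNPS = [
    ("Chr06", 50896169),  # on the left boundary, kept
    ("Chr06", 50897500),  # inside, kept
    ("Chr06", 50898605),  # one base past the right boundary, dropped
    ("Chr05", 50897500),  # other chromosome, dropped
]

dry_snps = snps_in_region(TOY_SNPS, "Chr06:50896169.50898604")
print(dry_snps)
```

Treating both ends as inclusive is a design choice of this sketch; whether the database uses inclusive coordinates is not stated in the text.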

New phenotype search function
A user-friendly web interface is provided for users to browse and retrieve phenotypic information (Fig. 3). On this page, users can search for sample information using several keywords, including sample ID, plant ID, plant name, origin, taxonomy and usage. For example, entering "sweet sorghum" in the search box returns all accessions whose individual information contains this keyword (Fig. 3a). Clicking a sample's picture displays a high-resolution image showing the details of panicle and seed appearance. For example, sample "101" is an improved sweet sorghum from Zimbabwe. Clicking the "Sample ID: 101" tab opens a result page that lists the values of all agronomic traits (Fig. 3b). Users can also reach the phenotypic page of a sample from the variation detail page by clicking the "Sample ID" tab in the "Individual Alleles" table (Fig. 2b).

New online tools
SorGSD provides three online tools (ID Conversion, Homologue Search and Genome Browser) for users to analyze their data. ID Conversion converts sorghum gene IDs between the ID systems of v1.4, v2.1 and v3.1, as well as the IDs of the UniProt and PANTHER databases. When we type the v3.1 gene ID of the Dry gene (Sobic.006g147400) in the search box and press "Convert", the corresponding IDs in the other versions and systems are listed in the result table. Users can go directly to the corresponding UniProt and PANTHER pages through the hyperlinks.
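Conceptually, such a conversion is a lookup in a cross-reference table indexed by every known identifier. The following sketch illustrates the idea only; `build_index`, `convert_id` and the table layout are hypothetical, and every ID except the v3.1 ID of Dry is a placeholder string, not a real identifier.

```python
ID_SYSTEMS = ("v1.4", "v2.1", "v3.1", "UniProt", "PANTHER")

# One row per gene. Only the v3.1 ID comes from the text; the rest are placeholders.
ROWS = [
    {
        "v1.4": "<v1.4-id-placeholder>",
        "v2.1": "<v2.1-id-placeholder>",
        "v3.1": "Sobic.006g147400",
        "UniProt": "<uniprot-id-placeholder>",
        "PANTHER": "<panther-id-placeholder>",
    },
]


def build_index(rows):
    """Map every identifier (case-insensitively) to the row that contains it."""
    index = {}
    for row in rows:
        for system in ID_SYSTEMS:
            identifier = row.get(system)
            if identifier:
                index[identifier.lower()] = row
    return index


def convert_id(index, query):
    """Return the IDs of the queried gene in all other systems ({} if unknown)."""
    key = query.strip().lower()
    row = index.get(key)
    if row is None:
        return {}
    return {s: row[s] for s in ID_SYSTEMS
            if row.get(s) and row[s].lower() != key}


index = build_index(ROWS)
for system, identifier in convert_id(index, "Sobic.006g147400").items():
    print(f"{system}: {identifier}")
```

Indexing by all identifiers at once lets the same function convert in any direction, for example from a v2.1 ID back to v3.1.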
To better understand the evolution of sorghum genes, Homologue Search was built to identify homologous genes among sorghum, maize, rice and Arabidopsis. When we input the gene ID of the Dry gene (Sobic.006g147400) in the "Gene Name" box and click "Submit", the list of homologues in the other species is displayed. In addition, we provide a Genome Browser to visualize the locus of a variation in the genome. Users only need to type in a genome position (e.g., Dry gene, Chr06:50896169.50898604), and the corresponding transcript information of the gene and the positions of SNPs and INDELs in the relevant range appear on the results page. We also provide a link to the BLAST tool hosted at CNCB-NGDC for comparing nucleotide or protein sequences with the sorghum reference sequence database.

Revised resource page
The resource page is divided into three sections: "Genome", "Website" and "Reference". The "Genome" section introduces general information about the sorghum genome. The "Website" section gives users direct access to the homepages of the listed web resources. In "Reference", we updated the collection to 162 important sorghum publications and classified them into six broad categories. Clicking a category title in the directory on the left of the page lists all papers in that category on the right. Users can read the abstract or download the article from the links by clicking the "Abstract" button.

Conclusions and future directions
SorGSD is committed to providing a wide range of sorghum genome data, including genomic information, detailed phenotypic data, sorghum resources and analysis tools, for sorghum scientists and breeders. The interface of the new version of SorGSD is hosted under CNCB-NGDC and is an essential part of the Genome Variation Map (GVM), a data repository of genome variations in humans as well as cultivated plants and domesticated animals [35]. In this upgrade, we added whole-genome variation data (including SNPs and small INDELs) for 241 varieties, based on the latest sorghum reference annotation (version 3.1). The total numbers of accessions (289) and variations (39.5 million) are 6 times and 1.4 times those of the first version, respectively. We also added about 70 kinds of trait information for 183 accessions, which provides detailed reference data on each line for breeders. The ID Conversion, Homologue Search and Genome Browser tools provide visual, convenient and quick queries for scientists engaged in sorghum research. In addition, we redesigned the pages to optimize the user experience and make interaction friendlier. The simple and straightforward user guide allows users to become familiar with the overall design of the website and to use its various functions quickly.
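The totals quoted above can be checked directly from the counts reported in this article. The short calculation below uses only numbers given in the text; the 1.4-fold increase in variations cannot be reproduced here because the variation total of the first version is not stated in this article.

```python
snps = 33_825_236      # SNPs in the new version
indels = 5_722_385     # small INDELs in the new version
accessions_v2 = 289    # accessions in the new version
accessions_v1 = 48     # accessions in the first version

total_variations = snps + indels
accession_ratio = accessions_v2 / accessions_v1
newly_added = accessions_v2 - accessions_v1

print(f"total variations: {total_variations:,} (~{total_variations / 1e6:.1f} million)")
print(f"accession ratio:  {accession_ratio:.2f}")
print(f"newly added:      {newly_added}")
```

The sum is 39,547,621 variations, that is about 39.5 million; the accession ratio is about 6.02; and 241 accessions are new in this release.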
In the future, we will update SorGSD regularly and add variations from newly available re-sequenced sorghum accessions. As a next step, we anticipate integrating phenotypic, genomic variation, transcriptome, proteome and epigenomic data, as well as metabolomics and metabolic interaction networks, to build a comprehensive sorghum research and analysis database. At the same time, we welcome comments and suggestions, with the aim of making SorGSD a one-stop sorghum research platform with multi-faceted omics data and analysis tools.

Data resources
Currently, we have collected re-sequencing data with unique average depths of 4.02-48.55× coverage from three sets of sorghum germplasm comprising a total of 289 accessions of wild and cultivated sorghum. The most extensive set is a diverse panel of 241 sorghum lines, which we published to explore the origin of sweet sorghum through selection on the Dry gene [33]. The second dataset comprises 44 sorghum lines published by Jordan's lab in 2013 in a study that revealed untapped genetic potential in Africa's indigenous cereal crop sorghum [6]. The last dataset, also from our group, contains three accessions of cultivated sorghum [32]. The entire set of original sequence data can be obtained from the Genome Sequence Archive [36]. Phenotypic data cover the breed and agronomic-trait information collected from GRIN-Global (npgsweb.ars-grin.gov/). Finally, panicle pictures were taken when the sorghum plants reached maturity in the experimental fields of the Institute of Botany, Chinese Academy of Sciences (Beijing, China) in 2019.

Data processing
After trimming the adapter and filtering low-quality reads of the second [6] and third [32] datasets in the first dataset [33], the remaining clean reads were mapped to the reference genome BTx623 (v3.1) with BWA (v0.7.8) [37]. The mapping results were converted to BAM format, and duplicated and multi-aligned reads were removed with the SAMtools package (v1.3) [38]. GVCF files of these lines were generated by HaplotypeCaller in GATK (v3.1) [39]. All the GVCF files of the three datasets were then used to call SNPs and INDELs with GenotypeGVCFs in GATK (v3.1) [39]. In total, 33,825,236 SNPs and 5,722,385 small INDELs were identified across the 289 sorghum lines. Finally, we predicted and annotated the effects of variations using the VEP program (v84) [40]. We also calculated the MAF of each variant using vcftools (v0.1.13) [41].
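For readers unfamiliar with MAF, the quantity can be illustrated with a minimal Python sketch. This is not the vcftools implementation: the function `minor_allele_frequency` is hypothetical, it assumes diploid VCF-style genotype strings at a biallelic site, and it skips missing alleles (`.`).

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF at one biallelic site from genotype strings such as '0/1' or '1|1'.

    Missing alleles ('.') are ignored. Returns None if no allele was called
    and 0.0 if only one allele is observed.
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    if len(counts) > 2:
        raise ValueError("site is not biallelic")
    if len(counts) == 1:
        return 0.0
    return min(counts.values()) / total


# Five toy samples: allele 0 is seen 5 times and allele 1 is seen 3 times.
calls = ["0/0", "0/1", "1/1", "./.", "0/0"]
maf = minor_allele_frequency(calls)
print(maf)
```

In this toy example, 8 alleles are called, 3 of which are the rarer allele, so the MAF is 3/8 = 0.375. Filtering on this value, as offered on the "Variation Search" page, removes rare variants from a result list.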

Database design and implementation
SorGSD was designed based on the framework of the iDog database [42], which was implemented using Spring Boot (http://spring.io), a free and widely used Model-View-Controller (MVC) framework, and MyBatis (https://mybatis.org/mybatis-3/), a first-class persistence framework with support for custom SQL, stored procedures and advanced mappings. On the back end, metadata and reference data were stored in MySQL (https://www.mysql.com). Web user interfaces were developed using JSP, jQuery and Bootstrap. The Biodalliance genome browser (http://www.biodalliance.org/) was used for genome synteny visualization.