Search-About

The Search interface provides access to a range of genomic information, including gene annotations, CRISPR guide sequences, duplication types, homologous genes, synteny, and transcriptome data. All genes of the 81 Solanaceae species were annotated against five protein databases. Annotation coverage varied, ranging from 85.34% in Solanum pimpinellifolium to 99.98% in Solanum lycopersicum. Users can search annotations by gene ID or by functional identifiers, such as gene symbols (e.g., flowering locus) or accession numbers (e.g., Gene Ontology GO:0035448 and Pfam PF04874), for cross-species queries.

Analysis methods:

Part: Methods
Annotations: Genes of the 81 Solanaceae species were annotated against the Pfam database, the UniProt databases (Swiss-Prot and TrEMBL), the non-redundant protein sequence database (Nr), and the Gene Ontology (GO) database. GO annotations were additionally obtained with goatools. A Python script collated all annotation information into a table, in batches, for display in the SOIR database.
CRISPR: The CasFinder pipeline[1] was used to design Cas9 target sites. First, the repetitive sequences in each genome were screened using RepeatMasker. An index was then built for each genome with Bowtie. Next, the CasValue_v2.pl and CasFinder.pl scripts from the CasFinder pipeline were used to design candidate guide sequences. Finally, the candidates were filtered with in-house Perl scripts to obtain guide sequences specific to each gene.
Duplication type: The duplicate_gene_classifier program in MCScanX[2] was used to infer the duplication type of each gene.
Homologs: Orthologous, paralogous, and heterologous sequences were identified using OrthoFinder (v2.0)[3]. Similarity relationships between the protein sequences of all species were first determined from BLASTP similarity scores (E value < 1e-5).
Synteny: Based on the BLASTP results, the “-d” subroutine in WGDI[4] was used to construct homologous dot plots, and the “-icl” subroutine was used to detect collinear genes. The maximum gap allowed within a collinear fragment was set to 50. The grape and tomato genomes were then used as references to determine the collinearity between each species and grape or tomato. The proportional relationship was determined according to the duplication relationship of other species, and the collinear relationships between genes were visualized as circular plots using the “-ci” subroutine. Finally, the circular plots are rendered in the database using the D3 library.
RNA-seq: Transcriptome data were retrieved from public databases, including NCBI and NGDC (Comparison Table). Initial quality control was performed with fastp[5]. Sequencing adapters were removed using Trimmomatic (v0.36)[6]. The filtered reads were then aligned to the reference genome using HISAT2 (v2.2.1)[7]. Gene expression was quantified using the run-featurecounts.R script[8]. Finally, the expression level of each gene was calculated as FPKM and TPM, and all results were merged into a single file.
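
The FPKM and TPM values mentioned in the RNA-seq row can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of both measures; it is not the run-featurecounts.R script itself, and the function names, the input dictionaries, and the use of plain gene length in bases (rather than an effective length) are assumptions made for illustration only.

```python
# Minimal sketch of the standard FPKM and TPM formulas.
# Assumptions (not from the SOIR pipeline): inputs are dicts keyed by gene ID,
# counts are fragments assigned per gene, lengths are gene lengths in bases.

def fpkm(counts, lengths):
    """Fragments per kilobase of gene per million assigned fragments."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # count * 1e9 / (length * total) == count / (length / 1e3) / (total / 1e6)
    return {gene: counts[gene] * 1e9 / (lengths[gene] * total) for gene in counts}

def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    denom = sum(rates.values())
    if denom == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate * 1e6 / denom for gene, rate in rates.items()}

if __name__ == "__main__":
    # Hypothetical toy values, chosen only to show the calculation.
    counts = {"geneA": 100, "geneB": 300}
    lengths = {"geneA": 1000, "geneB": 3000}
    print(fpkm(counts, lengths))
    print(tpm(counts, lengths))
```

In this toy input both genes have the same read density per base, so they receive equal FPKM and equal TPM values. TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.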

Tutorials for each sub-module:

References: