SoIR-Search-About

The Search interface provides access to a broad range of genomic information, including gene annotations, CRISPR guide sequences, duplication types, homologous genes, synteny, and transcriptome data. All genes of the 81 Solanaceae species were annotated against five protein databases. Annotation coverage varied among species, ranging from 85.34% in Solanum pimpinellifolium to 99.98% in Solanum lycopersicum. Users can search annotations by gene ID or by functional identifiers, such as gene symbols (e.g., flowering locus) or accession numbers (e.g., Gene Ontology GO:0035448 or Pfam PF04874), to run cross-species queries.

Analysis methods:

Part	Methods
Annotations	Five protein databases were used to annotate the 81 Solanaceae species: the Pfam database, the two UniProt databases (Swiss-Prot and TrEMBL), the nonredundant protein sequence database (Nr), and the Gene Ontology database (GO). In addition, goatools was used to obtain GO annotations. A Python script was used to compile all annotation information into tables, in batches, for display in the SoIR database.
CRISPR	The CasFinder pipeline^[1] was used to design Cas9 target sites. First, the genome of each species was screened for repetitive sequences using RepeatMasker. An index was then built for each genome using Bowtie. Next, the CasValue_v2.pl and CasFinder.pl scripts from the CasFinder pipeline were used to design the guide sequences. Finally, the candidate sequences were filtered with in-house Perl scripts to obtain specific guide sequences for each gene.
Duplication type	The duplicate_gene_classifier program in MCScanX^[2] was used to infer the duplication type of each gene.
Homologs	Orthologous, paralogous, and heterologous sequences were identified using OrthoFinder (v2.0)^[3]. Similarity relationships between the protein sequences of all species were determined from BLASTP similarity scores (E value < 1e-5).
Synteny	Based on the BLASTP results, the “-d” subroutine in WGDI^[4] was used to construct homologous dot plots, and the “-icl” subroutine was used to detect collinear genes. The maximum gap allowed within a collinear fragment was set to 50. The grape and tomato genomes were then used as references to determine the collinearity between each species and grape or tomato. The proportional relationship was determined according to the duplication relationship of the other species, and the collinear relationships between genes were visualized as circle diagrams using the “-ci” subroutine. Finally, the circle diagrams were rendered in the database using the D3 library.
RNA-seq	Transcriptome data were retrieved from public databases, including NCBI and NGDC (Comparison Table). Initial quality control was performed using fastp^[5]. Sequencing adapters were removed using Trimmomatic (v0.36)^[6]. The filtered reads were then aligned to the reference genome using HISAT2 (v2.2.1)^[7]. Gene-level read counts were obtained using the run-featurecounts.R script^[8]. Finally, expression levels were normalized as FPKM and TPM, and all samples were merged into a single file summarizing the expression level of each gene.
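
The FPKM and TPM normalization mentioned in the RNA-seq row can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of both measures and is not taken from the SoIR pipeline or from run-featurecounts.R. The gene IDs and the function names `fpkm` and `tpm` are hypothetical, and gene length is assumed to be the total exon length in base pairs.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    counts:     dict mapping gene ID -> fragment count in one sample
    lengths_bp: dict mapping gene ID -> gene length in bp (assumed exon length)
    """
    total = sum(counts.values())
    if total == 0:
        # No mapped fragments: every gene has zero expression.
        return {gene: 0.0 for gene in counts}
    # count / (length in kb) / (total in millions) = count * 1e9 / (length * total)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {gene: count / lengths_bp[gene] for gene, count in counts.items()}
    rate_sum = sum(rates.values())
    if rate_sum == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate * 1e6 / rate_sum for gene, rate in rates.items()}


# Hypothetical counts and lengths for three genes in one sample.
example_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
example_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

example_fpkm = fpkm(example_counts, example_lengths)
example_tpm = tpm(example_counts, example_lengths)
```

Within a sample, TPM values always sum to one million, whereas the FPKM total depends on the length distribution of the expressed genes, so TPM is usually the easier of the two to compare across samples.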

Tutorials for each sub-module:

References:

1.Aach J, Mali P, Church G. CasFinder: Flexible algorithm for identifying specific Cas9 targets in genomes. bioRxiv 2014.

2.Wang YP, Tang HB, Debarry JD, Tan X, Li JP, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res 2012; 40: e49.

3.Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 2019; 20: 238.

4.Sun PC, Jiao BB, Yang YZ, Shan LX, Li T, et al. WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol Plant 2022; 15: 1841-1851.

5.Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018; 34: i884-i890.

6.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114-2120.

7.Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 2019; 37: 907-915.

8.Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014; 30: 923-930.