Documents

Data analysis

1. Prediction of protein-coding genes

2. RNA-seq in different tissues and RNA-seq data analysis

3. Multi-level Genomic annotation

4. Annotation of conserved non-coding RNA (ncRNA) genes.

5. Calling of heterozygous SNPs

6. Best Arabidopsis/ Rice Hit

7. Ortholog groups

8. Comparative Genome Search

9. Protein-protein interactions (PPIs)

1. Prediction of protein-coding genes

    

We built a 7-step pipeline to construct the gene model set.

1) The prediction program FgeneSH++, with gene model parameters trained on monocots, was used for ab initio gene prediction to build the preliminary gene models;

2) The coding sequences of each predicted gene model were aligned to both the Repbase TE library and the moso bamboo TE library created by RepeatModeler, using BLASTN with an E-value cutoff of 1e-5;

3) The Illumina RNA-seq reads from five vegetative and two reproductive tissues were mapped onto the coding sequences of the FgeneSH gene models with the aligner SMALT, with the minimum Smith-Waterman score (-m) set to 80 (for 2×120 bp reads) or 60 (for 2×100 bp reads), the maximum insert size (-i) set to 1,500, and the minimum insert size (-j) set to 20. Only uniquely matched reads were used to assist gene prediction. The links between unique matches and their corresponding gene models were collected by screening the SMALT cigar output with 2 thresholds: A) cigar:S and cigar:A items with a score of 50 or more were selected; B) for the selected cigar:A items, the paired-end insert size had to be at least 200 bp;

4) A total of 8,253 moso bamboo cDNAs that carry entire coding sequences and are not TE-derived were selected from the 10,608 putative full-length cDNAs and then mapped to the scaffolds with the mRNA/EST genome mapping program GMAP, with the parameters set to “-n 1 -f 2 -B 1 -A -t 4”;

5) The gene models were screened by integrating the outputs of steps 2), 3), and 4), using the following 4 thresholds.

A, gene models that code for TE elements, or that overlap TE elements over more than 10% of the gene coding region, were discarded first.

B, models aligned to the full-length cDNAs were preferentially collected. Their splice sites were manually adjusted according to the alignment.

C, models that were not detected by FgeneSH++ but were supported by the cDNAs were created, and their coding information was added to the gene model set.

D, candidate gene models without full-length cDNA evidence had to be supported by 2 different uniquely matched RNA-seq sequences, with at least 20% of their coding region covered by RNA-seq reads. A pair of PE reads was treated as a single RNA-seq sequence when counting the mapped transcriptome reads for each model.

6) Information on cDNA-supported UTR ends was attached to the gene model set.

7) The single-exon genes were manually checked by experts, and the genes with no hits to homologs of grass genes were also discarded.

8) For genes with multiple transcripts, the longest one was selected.
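
The screening in step 5) can be sketched as a small decision function. This is only an illustration of thresholds A, B, and D under stated assumptions, not the pipeline's actual code: the class and field names are hypothetical, "supported by 2 different RNA-seq sequences" is read as "at least 2", and threshold C (creating new models from cDNAs) is a model-building step rather than a filter, so it is not covered.

```python
from dataclasses import dataclass


@dataclass
class ModelEvidence:
    """Evidence gathered for one candidate gene model (hypothetical field names)."""
    encodes_te: bool                 # coding sequence matches a TE library (step 2)
    te_overlap_fraction: float       # fraction of the coding region overlapping TEs
    has_full_length_cdna: bool       # aligned to a full-length cDNA (step 4)
    unique_rnaseq_sequences: int     # uniquely matched RNA-seq sequences; a PE pair counts once
    rnaseq_coverage_fraction: float  # fraction of the coding region covered by RNA-seq reads


def keep_gene_model(ev: ModelEvidence) -> bool:
    """Return True if the model survives thresholds A, B, and D of step 5."""
    # Threshold A: discard TE-coding models and models with >10% TE overlap.
    if ev.encodes_te or ev.te_overlap_fraction > 0.10:
        return False
    # Threshold B: models supported by a full-length cDNA are kept.
    if ev.has_full_length_cdna:
        return True
    # Threshold D: otherwise require RNA-seq support
    # (assumed reading: at least 2 sequences and at least 20% coverage).
    return ev.unique_rnaseq_sequences >= 2 and ev.rnaseq_coverage_fraction >= 0.20
```

Checking threshold A before B reflects the statement that TE-related models were discarded first; a model with cDNA support but heavy TE overlap is therefore rejected in this sketch.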

2. RNA-seq in different tissues and RNA-seq data analysis

    

Five vegetative tissues (young leaves, rhizome, root, tip of the 20 cm high shoot, and tip of the 50 cm high shoot) were collected in spring in the Tianmu Mountain National Nature Reserve, Zhejiang Province, China, from the same individual used in genome sequencing. For transcriptome sequencing of floral tissues, we spent over two years searching 8 provinces of China for flowering moso bamboo, because flowering is very rare. Finally, in early summer of 2010, two reproductive tissues (panicles at the early stage and panicles at the flowering stage) were obtained in suburban Guilin (110º31'20.2”E, 25º10'42.7”N; 216 m elevation), Guangxi Province, southern China, more than 1,800 kilometers (1,100 miles) from our institute. Panicles collected from plants with no flowering or post-flowering spikelets were considered early-stage panicles, while those from plants in which at least 50% of the spikelets were flowering or post-flowering were considered flowering-stage panicles.

RNA-seq data analysis, including the identification of expressed genes and of expression profiles across tissues, was performed mainly with Bowtie, TopHat, and Cufflinks. In BambooGDB, a gene is considered expressed when its FPKM >= 1.
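
As a minimal sketch of the FPKM >= 1 rule, the function below lists, for each gene, the tissues in which it counts as expressed. The input layout (a dictionary of per-tissue FPKM values), the function name, and the gene identifiers in the usage note are assumptions made for illustration; they do not describe the BambooGDB data format.

```python
def expressed_tissues(fpkm_by_gene, threshold=1.0):
    """Map each gene to the tissues where its FPKM is at or above the threshold.

    fpkm_by_gene: {gene_id: {tissue_name: fpkm_value}} (hypothetical layout).
    Genes that reach the threshold in no tissue are left out of the result.
    """
    result = {}
    for gene_id, tissue_fpkm in fpkm_by_gene.items():
        tissues = [tissue for tissue, fpkm in tissue_fpkm.items() if fpkm >= threshold]
        if tissues:
            result[gene_id] = tissues
    return result
```

For example, calling it with `{"gene_A": {"leaf": 3.2, "root": 0.4}, "gene_B": {"leaf": 0.1, "root": 0.9}}` keeps only `gene_A`, with `leaf` as its expressed tissue.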

3. Multi-level Genomic annotation

    

1) Gene function motifs and domains were predicted with InterProScan 5 Release Candidate 6 (5RC6) against the available databases, including PRINTS, Pfam, Gene3D, and Panther; the reported fields include InterPro accession, InterPro description, Gene Ontology (GO) accession, and GO annotation.

2) The bamboo gene models were aligned to the sorghum, rice, and maize entries of the KEGG database (releases up to April 2011) by BLASTP with an E-value cutoff of 1e-10 to find the best hit for each gene. The similarity of each pathway is the ratio of the number of shared enzymatic steps to the total number of reference enzymatic steps.

3) COG categories were predicted by BLASTP against the NCBI COG database with an E-value cutoff of 1e-6.

4) Structural features were predicted with Batch CD-Search.

5) Molecular weight (Mw) and theoretical isoelectric point (pI) were computed with the pI/Mw tool in Swiss-Prot.
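
The pathway similarity defined in item 2) can be illustrated with a short function. This is a sketch under the assumption that each enzymatic step is represented by a unique identifier and that a step is "shared" when the bamboo set contains the same identifier; the function name and the identifiers used in the usage note are hypothetical.

```python
def pathway_similarity(bamboo_steps, reference_steps):
    """Ratio of shared enzymatic steps to all enzymatic steps of the reference pathway."""
    reference = set(reference_steps)
    if not reference:
        raise ValueError("the reference pathway has no enzymatic steps")
    shared = set(bamboo_steps) & reference
    return len(shared) / len(reference)
```

With a reference pathway of four steps `{"s1", "s2", "s3", "s4"}` and bamboo hits for `{"s1", "s2", "s5"}`, two of the four reference steps are shared, so the similarity is 0.5. Steps found only in bamboo (here `s5`) do not affect the ratio.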

4. Annotation of conserved non-coding RNA (ncRNA) genes.

    

1) Identification of transfer RNAs (tRNA)

The tRNAscan-SE program with default parameters was used to predict tRNA genes in the Arabidopsis, sorghum, maize, rice, Brachypodium, and bamboo genomes.

2) Identification of rRNA genes

The rRNA fragments were identified by aligning the rRNA template sequences (Rfam database, release 10.0) of Arabidopsis thaliana, Oryza sativa, Sorghum bicolor, and Zea mays using BLASTN with an E-value cutoff of 1e-10 and an identity cutoff of 95% or more.

3) Identification of other non-coding RNA genes

The miRNA and snRNA genes were predicted with the INFERNAL software against the Rfam database (release 9.1, 1,412 families). To speed up the search, a rough BLASTN filtering against the Rfam sequence database (E-value cutoff of 1e-10) was performed before running INFERNAL. For the miRNA prediction, the assemblies were aligned to the precursor sequences of Arabidopsis thaliana, Brachypodium distachyon, Oryza sativa, Sorghum bicolor, Saccharum officinarum, Triticum aestivum, Hordeum vulgare, and Zea mays, derived from the Rfam sequence database. The extended sequences, as in the miRNA prediction, were passed to INFERNAL with a cutoff score of 50 or more. Potential target sequences for the newly identified miRNAs were predicted using the psRNATarget program with default parameters.
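
The BLASTN cutoffs used for rRNA identification in step 2) could be applied to alignment results roughly as follows. The sketch assumes that the BLASTN output is in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the source does not state which output format was used, and the function name is hypothetical.

```python
def filter_rrna_hits(lines, max_evalue=1e-10, min_identity=95.0):
    """Keep tabular BLASTN hits with E-value <= max_evalue and identity >= min_identity.

    Returns a list of (query_id, subject_id, percent_identity, evalue) tuples.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        percent_identity = float(fields[2])  # column 3: percent identity
        evalue = float(fields[10])           # column 11: E-value
        if evalue <= max_evalue and percent_identity >= min_identity:
            hits.append((fields[0], fields[1], percent_identity, evalue))
    return hits
```

The same function covers the rough prefiltering before INFERNAL if `min_identity` is set to 0, since that step is described with an E-value cutoff only.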

5. Calling of heterozygous SNPs

    

To detect heterozygous sequence polymorphisms, all of the PE reads used (around 120× coverage) were first mapped to the assembled scaffolds with the aligner SMALT. The SNPs were then called with SSAHA_Pileup (version 0.8), and 6 thresholds were used to filter out unreliable SNPs:

1) SSAHA_Pileup SNP score >= 20;

2) Ratio of the two alleles between 3:17 and 17:3;

3) The highest sequencing depth at the SNP position <= 240;

4) The lowest sequencing depth for each allele >= 5;

5) The minimum distance between adjacent SNPs >= 10 bp;

6) Only one polymorphism detected at each SNP position.
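
The six thresholds above can be sketched as a post-filter over called SNPs. The record layout and all names below are hypothetical and do not reflect the SSAHA_Pileup output format. Two readings are assumed: "only one polymorphism" is taken to mean exactly two observed alleles, and the distance rule is applied after the per-site filters, dropping every SNP that lies closer than 10 bp to a neighbouring SNP on the same scaffold.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """One called SNP (hypothetical record layout)."""
    scaffold: str
    position: int
    score: int
    depth: int           # total sequencing depth at the position
    allele_depths: dict  # allele base -> number of reads supporting it


def passes_site_filters(snp):
    """Apply thresholds 1, 2, 3, 4, and 6 to a single SNP."""
    if snp.score < 20:                # 1) SNP score >= 20
        return False
    if len(snp.allele_depths) != 2:   # 6) only one polymorphism at the position
        return False
    if snp.depth > 240:               # 3) depth at the SNP position <= 240
        return False
    minor, major = sorted(snp.allele_depths.values())
    if minor < 5:                     # 4) each allele supported by >= 5 reads
        return False
    # 2) allele ratio between 3:17 and 17:3, i.e. minor allele >= 3/20 of both alleles
    return minor * 20 >= 3 * (minor + major)


def filter_snps(snps, min_distance=10):
    """Apply the per-site filters, then threshold 5 (distance between adjacent SNPs)."""
    candidates = sorted(
        (s for s in snps if passes_site_filters(s)),
        key=lambda s: (s.scaffold, s.position),
    )
    kept = []
    for i, snp in enumerate(candidates):
        neighbours = [candidates[j] for j in (i - 1, i + 1) if 0 <= j < len(candidates)]
        too_close = any(
            n.scaffold == snp.scaffold and abs(n.position - snp.position) < min_distance
            for n in neighbours
        )
        if not too_close:
            kept.append(snp)
    return kept
```

The allele ratio test uses integer arithmetic so that a SNP sitting exactly at 3:17 (for example 6 and 34 reads) passes without floating-point rounding issues.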

6. Best Arabidopsis/ Rice Hit

    

Best match of a rice non-TE related protein to the Arabidopsis thaliana proteome (TAIR Release 10).

Best match of a rice non-TE related protein to the MSU Rice Genome Annotation (Release 7).

7. Ortholog groups

    

Orthologs are homologs separated by speciation events. Detecting orthologs is becoming increasingly important with the rapid progress in genome sequencing. Ortholog groups were identified with OrthoMCL, a genome-scale algorithm for grouping orthologous protein sequences.

8. Comparative Genome Search

    

Several plant species related to bamboo (Arabidopsis thaliana, Brachypodium distachyon, Panicum virgatum, Oryza sativa, Sorghum bicolor, Setaria italica, Zea mays) were selected and compared based on their ortholog results from the OrthoMCL algorithm. A phylogenetic tree was produced based on the ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene.

9. Protein-protein interactions (PPIs)

    

Based on the protein interaction network analysis (PINA) platform, a PPI network was predicted using moso bamboo proteins. The final PPI network includes 2,202 proteins with 34,169 interactions.
