- About
- Species
- Tools
- How To
- Contact
- Site map
- S.purpuratus Quick Search
Table of Contents
Introduction
Due to the lack of long range linkage deriving good gene models from Lytechinus variegatus genome assemblies is difficult. Therefore the Evidential Gene programs were also used to predict mRNAs and proteins. These programs take as input mRNA assemblies from Trinity and Velvet/Oases, and derive consensus sequences. The instructions suggested using both. We thank Don Gilbert for his help with this work. This analysis is based on these RNA-seq data sets:
SRX766497 SRR1661401 2-cell stage embryos
SRX766496 SRR1661399 60-cell stage embryos
SRX766495 SRR1661397 early blastula stage embryos
SRX766494 SRR1661409 hatched blastula stage embryos
SRX766493 SRR1661406 thickened vegetal plate stage embryos
SRX766174 SRR1661395 mesenchyme blastula stage embryos
SRX766173 SRR1661363 early gastrula stage embryos
SRX766172 SRR1661113 mid gastrula stage embryos
SRX766171 SRR1661112 late gastrula stage embryos
SRX766170 SRR1661111 early pluteus stage embryos
SRX766169 SRR1661090 late pluteus stage embryos
Trinity processing
One complication which was quickly encountered was that the current Trinity version at the time was assembling RNA-seq with lower BUSCO scores than had been seen with previous versions. So an initial assessment was made on several of the preceding data sets using different versions of Trinity and then scoring the results with BUSCO. The commands used were like this example:
perl ~/src/trinityrnaseq-2.2.0/Trinity \
--seqType fq --trimmomatic --max_memory 450G \
--left SRR1661401_1.fastq \
--right SRR1661401_2.fastq \
--CPU 40 \
--full_cleanup --output 22_trinity_SRR1661401 \
/null >22_trinity_SRR1661401.log 2>&1
Version 2.5.0 was also run with the extra flag --no_normalize_reads because the default value of that parameter had changed at that version. The table for SRR1661090 is shown below, the pattern was similar for the other data sets. The BUSCO scores for SRR1661395 were substantially lower than for the others, so it was excluded. The results from Trinity 2.2.0 runs were used going forward.
BUSCO result columns are:
Complete BUSCOs (C)
Complete and single-copy BUSCOs (S)
Complete and duplicated BUSCOs (D)
Fragmented BUSCOs (F)
Missing BUSCOs (M)
Total BUSCO groups searched(n)
Trinity version | C: | S: | D: | F: | M: | n: |
2.2.0 | 90.4% | 47.8% | 42.6% | 8.6% | 1.0% | 978 |
2.3.2 | 84.6% | 44.2% | 40.4% | 14.4% | 1.0% | 978 |
2.4.0 | 83.3% | 32.5% | 50.8% | 15.4% | 1.3% | 978 |
2.5.0 | 84.6% | 29.8% | 54.8% | 14.5% | 0.9% | 978 |
2.5nnr | 83.1% | 33.8% | 49.3% | 15.4% | 1.5% | 978 |
2.6.6 | 86.6% | 27.2% | 59.4% | 12.2% | 1.2% | 978 |
2.8.3 | 98.3% | 17.1% | 81.2% | 0.9% | 0.8% | 978 |
2.6.6 and 2.8.3 values added later, as Trinity improved further. These were not used in evigene.
Velvet/Oases processing
Each of the data sets was processed using an expanded version of the following code snippet:
KMIN=21
KMAX=31
STEP=4
KMERGE=27
velveth ${SRR}_dir $KMIN,$KMAX,$STEP -shortPaired -separate -fastq ${SRR}_1.fastq ${SRR}_2.fastq
while [ $KMER -le $KMAX ]
do
THEDIR=${SRR}_dir_$KMER
velvetg $THEDIR -read_trkg yes -ins_length 100 -cov_cutoff auto -exp_cov auto
oases $THEDIR -ins_length 100
KMER=$(( $KMER + $STEP))
done
velveth ${SRR}_merged $KMERGE -long -fasta ${SRR}_dir_*/transcripts.fa
velvetg $THEDIR -read_trkg yes -conserveLong yes -cov_cutoff auto -exp_cov auto
oases $THEDIR -merge yes
Evigene processing
The results from all Trinity 2.2.0 and Evigene runs were put into two fasta files using commands like:
mkfifo MERGE_FIFO
cat MERGE_FIFO >>all_trin.fasta & trformat.pl -pre SRR1661401 \
-in $TRINPATH/22_trinity_SRR1661401.Trinity.fasta -out /tmp/MERGE_FIFO -format=trinity
cat MERGE_FIFO >>all_oa.fasta & trformat.pl -pre SRR1661401 \
-in $OAPATH/SRR1661401_dir_31/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet
cat MERGE_FIFO >>all_oa.fasta & trformat.pl -pre SRR1661401 \
-in $OAPATH/SRR1661401_dir_47/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet
cat MERGE_FIFO >>all_oa.fasta & trformat.pl -pre SRR1661401 \
-in $OAPATH/SRR1661401_dir_63/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet
then the two were merged
cat all_trin.fasta all_oa.fasta >all_trin_oa.fasta
and evigene run on both the merged file and the all_trin.fasta file with:
tr2aacds.pl \
-mrnaseq all_trin_oa.fasta -NCPU=40 -MAXMEM=100000 \
-logfile all_trin_oa.traacds_induced.log >/tmp/all_trin_oa.log 2>&1
tr2aacds.pl \
-mrnaseq all_trin.fasta -NCPU=40 -MAXMEM=100000 \
-logfile all_trin.traacds_induced.log >/tmp/all_trin.log 2>&1
The resulting "okaysets" were scored with BUSCO producing these results:
Data | C: | S: | D: | F: | M: | n: |
Trinity | 97.8% | 91.3% | 6.5% | 0.4% | 1.8% | 978 |
Trinity+Oases | 96.8% | 89.0% | 7.8% | 0.5% | 2.7% | 978 |
Since the Trinity data alone scored higher than when combined with the Velvet/Oases results, the former is the one available on this site.
The detailed Trinity BUSCO results were examined and those with the most "M" (missing) scores determined. The three found least often were:
OrthoDB | States | Uniprot Description |
EOG091G0GA7 | MMMMMMMMMM | EEF1A lysine methyltransferase 1 |
EOG091G0IMI | FMMMMMMMMM | Spo11/DNA topoisomerase VI subunit A |
EOG091G0OPI | CMMMMMMMMM | rRNA methyltransferase 3, mitochondrial |
Summary
This method 'best-transcript-set'-from-many-libraries was used. Trinity 2.8.3 was employed. The resulting Lv_common_mRNA and Lv_common_pep databases contain the best common representation of each gene as determined from Trinity 2.8.3 runs over many RNA-seq libraries. There are 27035 entries in each file, or very roughly one representation per gene. In brief the method attempts to find the most commonly used ends and to group together splicing variants for the same gene, so that a single representative may be selected from that group. This produces about 2.7X fewer sequences than Evigene, mostly by removing near duplicates. The transcriptome does not have an NCBI accession number. The cutoff parameter of 50% which was used was determined from this table of BUSCO (peptide) scores:
Cutoff | C: | S: | D: | F: | M: |
99 | 98.9% | 90.5% | 8.4% | 0.2% | 0.9% |
90 | 98.9% | 91.1% | 7.8% | 0.2% | 0.9% |
80 | 98.9% | 91.1% | 7.8% | 0.2% | 0.9% |
70 | 98.9% | 91.1% | 7.8% | 0.2% | 0.9% |
60 | 98.9% | 91.1% | 7.8% | 0.2% | 0.9% |
50 | 98.9% | 91.1% | 7.8% | 0.2% | 0.9% |
40 | 98.8% | 91.0% | 7.8% | 0.3% | 0.9% |
30 | 97.6% | 89.9% | 7.7% | 0.4% | 2.0% |
Introduction
The method used to determine repeat families is based on this web page:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_C...
The method was applied to the LvMSCB genome assembly.
Install software and reference data
Assign site specific values to these variables:
TOPDIR= #path to where software is installed.
WORKDIR= #path to directory where work is done
CPUS= #number of CPUs to use in parallel, was 48
Download and unpack patches and other changes. These will be referenced later via the path variable NEWFILES.
cd /tmp
wget find_repeats_pieces.tar.gz
gunzip -c find_repeats_pieces.tar.gz | tar -xf -
rm find_repeats_pieces.tar.gz
export NEWFILES=/tmp/find_repeats_pieces
Download and unpack MITE_Hunter
cd $TOPDIR/src
wget http://target.iplantcollaborative.org/mite_hunter/MITE%20Hunter-11-2011.zip
unzip 'MITE Hunter-11-2011.zip'
rm MITE\ Hunter-11-2011.zip
mv 'MITE Hunter' MITE_Hunter
Install dependencies for MITE_HUNTER and later processing (there may be others not listed here, especially perl modules)
Install some other needed data
cd $TOPDIR/etc
wget http://gtrnadb2009.ucsc.edu/download/tRNAs/eukaryotic-tRNAs.fa.gz
gunzip eukaryotic-tRNAs.fa.gz
wget http://www.hrt.msu.edu/uploads/535/78637/Tpases020812.gz
gunzip Tpases020812.gz
wget http://www.hrt.msu.edu/uploads/535/78637/Tpases020812DNA.gz
gunzip Tpases020812DNA.gz
complete installation of MITE_HUNTER
cd $TOPDIR/src/MITE_Hunter
#trailing "/" after -d option is needed or the paths merge with the script names
perl MITE_Hunter_Installer.pl \
-d $TOPDIR/src/MITE_Hunter/ \
-f $TOPDIR/bin/formatdb \
-b $TOPDIR/bin/blastall \
-m $TOPDIR/bin/mdust \
-M $TOPDIR/bin/muscle
Modify the MITE_HUNTER perl script to allow better/new parallel processing.
This forces it to use blastall on a single CPU in N processes, which speeds things
up in that section by roughly a factor of 2 versus running one blastall process on N threads.
Change step 6 so that it runs in parallel, resulting in a very large speed up for that portion.
mv $TOPDIR/src/MITE_Hunter_worker3.pl $TOPDIR/src/MITE_Hunter_worker3.pl.dist
mv $TOPDIR/src/MITE_Hunter_manager.pl $TOPDIR/src/MITE_Hunter_manager.pl.dist
cp $NEWFILES/MITE_Hunter_worker3.pl $NEWFILES/MITE_Hunter_manager.pl \
$TOPDIR/src/MITE_Hunter
cp $NEWFILES/many_blasts_1cpu.sh $TOPDIR/bin
Run MITE Hunter
mkdir -p $WORKDIR/MITE_Hunter_work
cd $WORKDIR/MITE_Hunter_work
ln -s ../LvMSCB.fasta Lv_genome
$TOPDIR/MITE_Hunter/MITE_Hunter_manager.pl \
-i Lv_genome -n 48 -c 48 \
-S 12345678 >MH.log 2>&1 &
The results are in $WORKDIR/MITE_Hunter_work/genome_Step8_singlet.fa with 322 sequences comprising 152189 bp, ranging in length from 45 bp to 1866 bp.
ProtExcluder1.2 processes blastx output but the format has changed from when it was written (2.4.0+) to the current time (2.9.0+). Two of the scripts in this package must be modified or the final phase of the pipeline fails - it will not remove known proteins (which are in multiple copies). This is not a huge error since the number of repeat sequences affected is small and unless the organism is very well characterized some of the protein calls employed at that step might themselves be incorrect. To fix this format discrepancy make these changes:
mv $TOPDIR/src/ProtExcluder1.2/blastformatProt.pl $TOPDIR/src/ProtExcluder1.2/blastformatProt.pl.dist
mv $TOPDIR/src/ProtExcluder1.2/matchtract.pl $TOPDIR/src/ProtExcluder1.2/matchtract.pl.dist
cp $NEWFILES/blastformatProt.pl $NEWFILES/matchtract.pl \
$TOPDIR/src/ProtExcluder1.2
Run the rest of the commands
cd $WORKDIR
mkdir -p $WORKDIR/Repeat_Library
cd $WORKDIR/Repeat_Library
cp $NEWFILES/commands.sh .
chmod 755 commands.sh
# commands.sh contains all the program run commands described in the
# source Repeat_Library_Construction-Advanced web page.
# Phase numbers in the script correspond to section numbers in that web page
./commands.sh >commands.log 2>&1 &
The results are in $WORKDIR/Repeat_Library/Lv_repeats.fasta with 3525 sequences comprising 1896774 bp, ranging in length from 33 bp to 8076 bp.
Back to the top of the page