Lv Other Assemblies

Table of Contents

  1. Evidential gene (evigene)
  2. Common set of Trinity Transcripts
  3. Repeat families

Evidential gene (evigene)


Due to the lack of long range linkage deriving good gene models from Lytechinus variegatus genome assemblies is difficult. Therefore the Evidential Gene programs were also used to predict mRNAs and proteins. These programs take as input mRNA assemblies from Trinity and Velvet/Oases, and derive consensus sequences. The instructions suggested using both. We thank Don Gilbert for his help with this work. This analysis is based on these RNA-seq data sets:

SRX766497 SRR1661401 2-cell stage embryos
SRX766496 SRR1661399 60-cell stage embryos
SRX766495 SRR1661397 early blastula stage embryos
SRX766494 SRR1661409 hatched blastula stage embryos
SRX766493 SRR1661406 thickened vegetal plate stage embryos
SRX766174 SRR1661395 mesenchyme blastula stage embryos
SRX766173 SRR1661363 early gastrula stage embryos
SRX766172 SRR1661113 mid gastrula stage embryos
SRX766171 SRR1661112 late gastrula stage embryos
SRX766170 SRR1661111 early pluteus stage embryos
SRX766169 SRR1661090 late pluteus stage embryos

Trinity processing

One complication which was quickly encountered was that the current Trinity version at the time was assembling RNA-seq with lower BUSCO scores than had been seen with previous versions. So an initial assessment was made on several of the preceding data sets using different versions of Trinity and then scoring the results with BUSCO. The commands used were like this example:

perl ~/src/trinityrnaseq-2.2.0/Trinity \
   --seqType fq --trimmomatic --max_memory 450G \
   --left SRR1661401_1.fastq \
   --right SRR1661401_2.fastq \
   --CPU 40 \
   --full_cleanup --output 22_trinity_SRR1661401 \
   /null >22_trinity_SRR1661401.log 2>&1

Version 2.5.0 was also run with the extra flag --no_normalize_reads because the default value of that parameter had changed at that version. The table for SRR1661090 is shown below, the pattern was similar for the other data sets. The BUSCO scores for SRR1661395 were substantially lower than for the others, so it was excluded. The results from Trinity 2.2.0 runs were used going forward.
BUSCO result columns are:

Complete BUSCOs (C)
Complete and single-copy BUSCOs (S)
Complete and duplicated BUSCOs (D)
Fragmented BUSCOs (F)
Missing BUSCOs (M)
Total BUSCO groups searched(n)

Trinity version C: S: D: F: M: n:
2.2.0 90.4% 47.8% 42.6% 8.6% 1.0% 978
2.3.2 84.6% 44.2% 40.4% 14.4% 1.0% 978
2.4.0 83.3% 32.5% 50.8% 15.4% 1.3% 978
2.5.0 84.6% 29.8% 54.8% 14.5% 0.9% 978
2.5nnr 83.1% 33.8% 49.3% 15.4% 1.5% 978
2.6.6 86.6% 27.2% 59.4% 12.2% 1.2% 978
2.8.3 98.3% 17.1% 81.2% 0.9% 0.8% 978

2.6.6 and 2.8.3 values added later, as Trinity improved further. These were not used in evigene.

Velvet/Oases processing

Each of the data sets was processed using an expanded version of the following code snippet:
velveth ${SRR}_dir $KMIN,$KMAX,$STEP -shortPaired -separate -fastq ${SRR}_1.fastq ${SRR}_2.fastq
while [ $KMER -le $KMAX ]
   velvetg $THEDIR -read_trkg yes -ins_length 100 -cov_cutoff auto -exp_cov auto
   oases $THEDIR -ins_length 100
   KMER=$(( $KMER + $STEP))
velveth ${SRR}_merged $KMERGE -long -fasta ${SRR}_dir_*/transcripts.fa
velvetg $THEDIR -read_trkg yes -conserveLong yes -cov_cutoff auto -exp_cov auto
oases $THEDIR -merge yes

Evigene processing

The results from all Trinity 2.2.0 and Evigene runs were put into two fasta files using commands like:

cat MERGE_FIFO >>all_trin.fasta & -pre SRR1661401 \
   -in $TRINPATH/22_trinity_SRR1661401.Trinity.fasta -out /tmp/MERGE_FIFO -format=trinity
cat MERGE_FIFO >>all_oa.fasta & -pre SRR1661401 \
   -in $OAPATH/SRR1661401_dir_31/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet
cat MERGE_FIFO >>all_oa.fasta & -pre SRR1661401 \
   -in $OAPATH/SRR1661401_dir_47/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet
cat MERGE_FIFO >>all_oa.fasta & -pre SRR1661401 \
   -in $OAPATH/SRR1661401_dir_63/transcripts.fa -out /tmp/MERGE_FIFO -format=velvet

then the two were merged

cat all_trin.fasta all_oa.fasta >all_trin_oa.fasta

and evigene run on both the merged file and the all_trin.fasta file with: \
   -mrnaseq all_trin_oa.fasta -NCPU=40 -MAXMEM=100000 \
   -logfile all_trin_oa.traacds_induced.log >/tmp/all_trin_oa.log 2>&1 \
   -mrnaseq all_trin.fasta -NCPU=40 -MAXMEM=100000 \
   -logfile all_trin.traacds_induced.log >/tmp/all_trin.log 2>&1

The resulting "okaysets" were scored with BUSCO producing these results:

Data C: S: D: F: M: n:
Trinity 97.8% 91.3% 6.5% 0.4% 1.8% 978
Trinity+Oases 96.8% 89.0% 7.8% 0.5% 2.7% 978

Since the Trinity data alone scored higher than when combined with the Velvet/Oases results, the former is the one available on this site.

The detailed Trinity BUSCO results were examined and those with the most "M" (missing) scores determined. The three found least often were:

OrthoDB States Uniprot Description
EOG091G0GA7 MMMMMMMMMM EEF1A lysine methyltransferase 1
EOG091G0IMI FMMMMMMMMM Spo11/DNA topoisomerase VI subunit A
EOG091G0OPI CMMMMMMMMM rRNA methyltransferase 3, mitochondrial

Back to the top of the page

Common set of Trinity Transcripts


This method 'best-transcript-set'-from-many-libraries was used. Trinity 2.8.3 was employed. The resulting Lv_common_mRNA and Lv_common_pep databases contain the best common representation of each gene as determined from Trinity 2.8.3 runs over many RNA-seq libraries. There are 27035 entries in each file, or very roughly one representation per gene. In brief the method attempts to find the most commonly used ends and to group together splicing variants for the same gene, so that a single representative may be selected from that group. This produces about 2.7X fewer sequences than Evigene, mostly by removing near duplicates. The transcriptome does not have an NCBI accession number. The cutoff parameter of 50% which was used was determined from this table of BUSCO (peptide) scores:

Cutoff C: S: D: F: M:
99 98.9% 90.5% 8.4% 0.2% 0.9%
90 98.9% 91.1% 7.8% 0.2% 0.9%
80 98.9% 91.1% 7.8% 0.2% 0.9%
70 98.9% 91.1% 7.8% 0.2% 0.9%
60 98.9% 91.1% 7.8% 0.2% 0.9%
50 98.9% 91.1% 7.8% 0.2% 0.9%
40 98.8% 91.0% 7.8% 0.3% 0.9%
30 97.6% 89.9% 7.7% 0.4% 2.0%

Back to the top of the page

Repeat families


The method used to determine repeat families is based on this web page:

The method was applied to the LvMSCB genome assembly.

Install software and reference data

Assign site specific values to these variables:
TOPDIR= #path to where software is installed.
WORKDIR= #path to directory where work is done
CPUS= #number of CPUs to use in parallel, was 48

Download and unpack patches and other changes. These will be referenced later via the path variable NEWFILES.
cd /tmp
wget find_repeats_pieces.tar.gz
gunzip -c find_repeats_pieces.tar.gz | tar -xf -
rm find_repeats_pieces.tar.gz
export NEWFILES=/tmp/find_repeats_pieces

Download and unpack MITE_Hunter
cd $TOPDIR/src
unzip 'MITE'
rm MITE\
mv 'MITE Hunter' MITE_Hunter

Install dependencies for MITE_HUNTER and later processing (there may be others not listed here, especially perl modules)

Install some other needed data

cd $TOPDIR/etc
gunzip eukaryotic-tRNAs.fa.gz
gunzip Tpases020812.gz
gunzip Tpases020812DNA.gz

complete installation of MITE_HUNTER

cd $TOPDIR/src/MITE_Hunter
#trailing "/" after -d option is needed or the paths merge with the script names
perl \
   -d $TOPDIR/src/MITE_Hunter/ \
   -f $TOPDIR/bin/formatdb \
   -b $TOPDIR/bin/blastall \
   -m $TOPDIR/bin/mdust \
   -M $TOPDIR/bin/muscle

Modify the MITE_HUNTER perl script to allow better/new parallel processing.
This forces it to use blastall on a single CPU in N processes, which speeds things
up in that section by roughly a factor of 2 versus running one blastall process on N threads.
Change step 6 so that it runs in parallel, resulting in a very large speed up for that portion.

mv $TOPDIR/src/ $TOPDIR/src/
mv $TOPDIR/src/ $TOPDIR/src/

Run MITE Hunter

mkdir -p $WORKDIR/MITE_Hunter_work
cd $WORKDIR/MITE_Hunter_work
ln -s ../LvMSCB.fasta Lv_genome
$TOPDIR/MITE_Hunter/ \
   -i Lv_genome -n 48 -c 48 \
   -S 12345678 >MH.log 2>&1 &

The results are in $WORKDIR/MITE_Hunter_work/genome_Step8_singlet.fa with 322 sequences comprising 152189 bp, ranging in length from 45 bp to 1866 bp.

ProtExcluder1.2 processes blastx output but the format has changed from when it was written (2.4.0+) to the current time (2.9.0+). Two of the scripts in this package must be modified or the final phase of the pipeline fails - it will not remove known proteins (which are in multiple copies). This is not a huge error since the number of repeat sequences affected is small and unless the organism is very well characterized some of the protein calls employed at that step might themselves be incorrect. To fix this format discrepancy make these changes:

mv $TOPDIR/src/ProtExcluder1.2/ $TOPDIR/src/ProtExcluder1.2/
mv $TOPDIR/src/ProtExcluder1.2/ $TOPDIR/src/ProtExcluder1.2/

Run the rest of the commands

mkdir -p $WORKDIR/Repeat_Library
cd $WORKDIR/Repeat_Library
chmod 755
# contains all the program run commands described in the
# source Repeat_Library_Construction-Advanced web page.
# Phase numbers in the script correspond to section numbers in that web page
./ >commands.log 2>&1 &

The results are in $WORKDIR/Repeat_Library/Lv_repeats.fasta with 3525 sequences comprising 1896774 bp, ranging in length from 33 bp to 8076 bp.
Back to the top of the page