RNA-seq read data processing

I’ve started to dip my toe in the pool of bioinformatics methods, using our recent data sets as an incentive to learn what I’m doing. From reading more about Salmon, it seems I should build a “decoy-aware” index, which the commands below accomplish.

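# Decoys are the genome sequence names: each FASTA header up to the first space, minus the ">".
grep "^>" <(gunzip -c Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
# Concatenate transcriptome, then genome: the decoy sequences must come after the transcripts.
cat Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz > gentrome.fa.gz
# Build the decoy-aware index with 12 threads.
# (--gencode only trims names at the first "|", so it should be a no-op for these Ensembl headers.)
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode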
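
Since the index build takes several minutes, it may be worth confirming beforehand that every name in decoys.txt really appears as a sequence header in gentrome.fa.gz. The snippet below is a minimal sketch of such a check; the function name missing_decoys is hypothetical (it is not part of Salmon), and it assumes the header name is the first whitespace-delimited field after ">".

```shell
# Hypothetical helper (not part of Salmon): print every decoy name that has
# no matching sequence header in the gzipped gentrome FASTA.
# Usage: missing_decoys decoys.txt gentrome.fa.gz
missing_decoys() {
    gunzip -c "$2" | awk -v decoys="$1" '
        # Record each header name: first field of a ">" line, without the ">".
        /^>/ { seen[substr($1, 2)] = 1 }
        END {
            # Walk the decoy list and report names that were never seen.
            while ((getline line < decoys) > 0)
                if (!(line in seen)) print line
        }'
}
```

Running missing_decoys decoys.txt gentrome.fa.gz should print nothing if the two files agree; any name it prints is a decoy with no corresponding sequence.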

Indexing pegs the CPU at 100% but does not fill the 8 GB of RAM on this machine, and it took about 7 min.

After this I will do another trial run with a few reads and compare the mapping rate from this decoy-aware index against the one from the raw transcriptome index I used yesterday (I renamed yesterday’s output quants_1).
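
For the comparison itself, Salmon records the mapping rate of each quant run as percent_mapped in aux_info/meta_info.json inside the output directory. The sketch below pulls that number out with sed so it works without extra tools; the function name mapping_rate is hypothetical, and the directory name quants_2 in the usage note is only a placeholder for wherever the new run is written.

```shell
# Hypothetical helper: print the percent_mapped value that Salmon stored for
# one quant run. Assumes the layout <quant_dir>/aux_info/meta_info.json.
# Usage: mapping_rate <quant_dir>
mapping_rate() {
    # Keep only the number that follows the "percent_mapped" key.
    sed -n 's/.*"percent_mapped"[[:space:]]*:[[:space:]]*\([0-9.eE+-]*\).*/\1/p' \
        "$1/aux_info/meta_info.json"
}
```

Calling mapping_rate quants_1 and then mapping_rate quants_2 would give the two percentages side by side.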