This exercise picks up from the last point in the introductory presentation (see above). It explores methods for automating repetitive tasks.
Our analysis goal in this exercise is to map paired end Illumina reads from an RNA sequencing run onto a de-novo assembled transcriptome. There are 12 samples and we need to align each of them. We will then run some summary stats on the aligned reads.
Change directory into the exercise_1
folder and then use ls
to display the files present
cd bcc2020cle/exercise_1
ls
Inputs
H_mac_transcripts_codingseq.fasta
Use the rstudio
interface to create a new text file inside the exercise_1
directory. Name your new file 01_download.sh
Copy the following code into your script. This code comes from the last few slides of the introductory presentation
base_url="https://s3-ap-southeast-2.amazonaws.com/bc3203/EM_10k/"
for sample_num in $(seq 1 2)
do
for r in R1 R2
do
echo "${base_url}EM${sample_num}_${r}.fastq.gz"
done
done
Now try running the script
bash 01_download.sh
The script emits four URLs corresponding to 2 samples.
Now try downloading the first URL using the wget
command
wget https://s3-ap-southeast-2.amazonaws.com/bc3203/EM_10k/EM1_R1.fastq.gz
If you need help you can view the final answer here
We will now align each of the pairs of reads to the transcriptome using bowtie2
. Before we can do this we first need to index the transcriptome using bowtie2-build
.
bowtie2-build H_mac_transcripts_codingseq.fasta H_mac_transcripts_codingseq
While you wait for this command to finish create a new script file. Call it 02_align.sh
and save it in the exercise_1
directory.
This file will contain commands for iterating over the samples and running the bowtie2
alignment program separately for each sample. Start by practising the command for a single sample. Try running bowtie2
with no arguments like this
bowtie2
You should see a whole lot of text printed. These are all the bowtie2
options. Scroll to the top to see basic usage which looks like this
Usage: bowtie2 [options]* -x
-1 -2
For paired-end reads we need to supply three arguments
-x
: The indexed reference. In this case H_mac_transcripts_codingseq
-1
: File containing forward reads-2
: File containing reverse readsFor example
bowtie2 -x H_mac_transcripts_codingseq -1 EM1_R1.fastq.gz -2 EM1_R2.fastq.gz
This prints output in sam
format directly to standard output. We could redirect this to a sam
file like so
bowtie2 -x H_mac_transcripts_codingseq -1 EM1_R1.fastq.gz -2 EM1_R2.fastq.gz > EM1.sam
Now that we have a method for aligning a single sample our goal is to write a script to automate the process for all 12 samples.
Start by pasting the following code into your 02_align.sh
file
for r1 in *R1.fastq.gz;do
echo $r1
done
Then run your file like this
bash 02_align.sh
This is step 1 of the process. We have a method for iterating over each of the forward read files. We still need to figure out how to do the following;
echo
)Note that we are using echo
here for debugging purposes. It allows us to build up the script gradually, verifying that we have each piece of information before we put it all into a working command.
Now try the following (just in your Terminal don’t enter it into the 02_align.sh
script)
r1=EM1_R1.fastq.gz
echo $r1
echo ${r1/R1/R2}
The second echo
command uses an example of parameter expansion which is a very powerful feature of bash. It is often use to manipulate text and is very useful for transforming filenames. The form we used above has the general form:
${parameter/pattern/string}
And will substitute string
for the first occurrence of pattern
.
Now try another form of parameter expansion
.
r1=EM1_R1.fastq.gz
echo ${r1%_R1.fastq.gz}
This time parameter expansion chopped off the trailing part of the $r1
variable to produce the sample name.
Now we have the pieces we need to implement 02_align.sh
Replace the code in 02_align.sh
with the following code
for r1 in *R1.fastq.gz;do
r2=${r1/R1/R2}
sample=${r1%_R1.fastq.gz}
bowtie2 -x H_mac_transcripts_codingseq -1 $r1 -2 $r2 > $sample.sam
done
And run it like this
bash 02_align.sh
sam
to bam
The sam
format is human readable but isn’t very efficient. For further processing it is often better to convert to bam
which is a more efficient binary format. One (of many) ways to do this is using the samtools view
command. For example to convert one file
samtools view -b -S EM1.sam > EM1.bam
We might want to do this for all the .sam
files. We will use this an example to introduce the parallel
tool which allows us to run operations in parallel in a simple way.
Create a new empty script file, call it 03_sam2bam.sh
and save it inside the example_1
directory.
Paste the following code into the file and save it
sam2bam(){
sample=${1%.sam}
samtools view -b -S $sample.sam > $sample.bam
}
export -f sam2bam
parallel -j 12 sam2bam ::: *.sam
Now try running the script
bash 03_sam2bam.sh
Type ls *.bam
to see all the new files it generated.
What is going on in this script?
There are two new concepts here
sam2bam()
parallel
command. This allows us to run sam2bam()
on 12 processes simultaneously.Let’s play with these in the Terminal to get familiar with them.
First the parallel command
parallel -j 4 echo ::: $(seq 1 12)
On the right of the :::
we have generated a sequence 1 through to 12. Each of these is passed to echo
but parallel is running them 4 at a time.
When running complex sequences of commands with parallel
it is often easier to capture these in a function. Try defining a function as follows;
myecho() { sleep 1; echo "my $1"; }
This waits one second and then prints it’s argument along with a my
prefix.
Now try running this with parallel
export -f myecho
parallel -j 4 myecho ::: $(seq 1 12)
sam
filesTry viewing the contents of one of the .sam
files with samtools view
samtools view -S EM1.sam
This looks complicated but it is just a tabular format. The columns are all described here. Each line in the file represents an alignment (a read and its position in the reference). A single read can have multiple alignments.
Our data is extremely sparse. We have just 200 reads for each sample and something like 22.5k transcript sequences in the reference. One nice aspect of this is that it allows us to see the paired nature of the reads easily. Since each pair comes from the same cDNA fragment there should usually be exactly 2 reads for each gene. This is because the data is very sparse we will rarely have independent reads from the same gene.
You can verify this as follows
awk
to extract the third column of the file which contains the gene name
samtools view -S EM1.sam | awk '{print $3}'
sort
and uniq
to count occurrences
samtools view -S EM1.sam | awk '{print $3}' | sort | uniq -c