In this exercise we will attempt to design probes for a random subset of variant loci in a genome
Change directory into the exercise_4 folder and then use ls to display the files present.
cd ~/bcc2020cle/exercise_4
ls
Inputs
vcf formatted file containing variant sites aten_example.vcfaten_filal_0.11.fastaOur strategy is as follows;
grep and awk to extract coordinates of all variants in the variant fileshuf to take a random subsample of 10000 of thesesamtools faidx to extract a small flanking region around each variantStart by taking a look at the vcf file. THe first few lines are header content
head aten_example.vcf
Using tail shows the actual variant entries
tail aten_example.vcf
All the header entries start with a # character. We can exclude them using grep as follows;
grep -v '^#' aten_example.vcf | head
These are vcf entries with lots of information but we only need the first and second columns. The first column denotes the chromosome (or genomic scaffold) and the second gives the genomic position of the variant. We can extract these easily with awk
grep -v '^#' aten_example.vcf | awk '{print $1,$2}'
Now let’s skip to the final step and explore how we can use this information to retrieve the genomic sequence flanking this SNP. With samtools faidx we can extract a specific region with the syntax
chrom:start-end
For example;
samtools faidx aten_final_0.11.fasta Sc0000000:2118-2138
If we want probes 50bp either side of a SNP we can modify our awk command to print these intervals
grep -v '^#' aten_example.vcf | awk '{printf("%s:%s-%s\n",$1,$2-50,$2+50)}'
These are sorted according to genomic position. If we want a random subset we can do
grep -v '^#' aten_example.vcf | awk '{printf("%s:%s-%s\n",$1,$2-50,$2+50)}' | shuf -n 1000
Finally we can pipe this directly to samtools faidx via xargs
grep -v '^#' aten_example.vcf | awk '{printf("%s:%s-%s\n",$1,$2-50,$2+50)}' | shuf -n 1000 | xargs samtools faidx aten_final_0.11.fasta
Note. This strategy illustrates some bash principles but it isn’t perfect. We can run off the ends of chromosomes. A more precise way to do this is using the
bedtoolsutility.