The goal of today’s lab is to learn about what types of information exists within a genome, the formats that data can take, how to visualize and obtain publicly available data, and how to use various data sources to identify relevant subsets of a genome. In order to carry out today’s assignment, you will be making extensive use of the UCSC Genome Browser for data visualization, filtering and downloading data, and analysis. The aim of this assignment is to look at the prevalence (and therefor the infered genomic tolerance for) SNPs within various genomic features and across varying SNP frequencies in the population.
Important Please commit and push as you finish each part (in the case of 2.1, commit and push as you do each task within the step). This will really help us gauge the where people are being challenged and better help you all and future students.
You will be provided with a set of bed files containing a set of common SNPs for human chromosome 1 (genome build hg38) partitioned by minor allele fequency (MAF). These SNPs were obtained from the UCSC Genome Browser from the dbSNP release 151 common SNPs. There is also a bed file dividing chromosome 1 into contiguous 20Kb bins.
You’ll start by downloading the SNP files from here. You will need to unpack the contents using the following command:
tar xzf chr1_snps.tar.gz
You should now see 6 files, chr1_snps_0.*.bed for values 1-5 and genome_chr1.bed.
Using the Table Browser tool in the genome browser, download the GenCode v46 known genes from the Genes and Gene Predictions
group. Make sure you restrict the region of interest to chromosome 1. You will also want to specify under the Output format
menu that you only want Selected fields from the primary and related tables
. Also, specify a file name to download the data to. Then press the Get output
button. You can now select the specific fields you want. For this assignment, you only want the chromosome, txStart (transcription start), and txEnd (transcription end) positions.
Repeat the above process, except this time select chromosome, exonStarts, and exonEnds. This will give you a bed file with all known coding exons.
You will also need to download a set of regulatory elements using the Table Browser. The procedure should be identical to step 1.2 except you will find these data in the Regulation
group under the track Encode cCREs
. Again, you only need the chromosome, chromStart, and chromEnd fields.
Unfortunately, because there are many isoforms of genes, many regions in the genome are represented multiple times in the gene and exon bed files that you just downloaded. In order to create a file without overlapping ranges, you will need to use bedtools merge
to eliminate the redundancy in each file. If you simply try to run bedtools on each file as is, you will get an error telling you that the file needs to be sorted by chromosome and coordinate. You can do this with bedtools sort
.
For the TAs’ sake, name your final feature files <feature>_chr1.bed
where <feature>
is each feature name (genes, exons, or cCREs) for a total of 3 files.
It would be nice to look at introns as well as exons. Because of the multiple isoform issue, many introns contain exons that are skipped in a particular isoform. To get around this, you will be using the merged gene and merged exon files that you created in step 1.4 to infer regions of the chromosome that are purely intronic by subtracting exonic intervals from the gene intervals using bedtools subtract
to create the intron_chr1 bed file.
For the TAs’ sake, name your intron feature file introns_chr1.bed
.
Finally, we want to see how SNP density changes in regions of chromosome 1 that are not covered by the other features of interest. To create an other
interval bed file, you can use the same approach as in step 1.5, but using the genome_chr1.bed
file as the target and subtracting the exon, intron, and cCRE files (this can be done in a single call of bedtools subtract
). You should now have 4 feature bed files for the downstream analysis, exons, introns, cCREs, and other.
For the TAs’ sake, name your intron feature file other_chr1.bed
.
Using the datafiles you created in part 1 along with the SNP bed files partitioned by MAF, you will be finding how many SNPs overlap each, finding the mean SNP/bp for each feature and then calculating the enrichment of each feature SNP density for each MAF level.
Write a bash script that loops through each MAF file and each feature bed file (this sounds a lot like a nested for
loop situation) and use bedtools coverage
to find how many SNPs fall within each set of features. You don’t actually need the set of SNPs that overlap each feature set, but instead are interested in the total count of SNPs. The resulting files should have all the information you need to find the SNP density enrichments.
You will need to sum two values, the number of SNPs and the total size of the feature. Double check what the output columns of bed coverage
are. In order to sum the values in a colum, remember that you can use the awk
command covered in lecture, subbing in the correct column number for $1
(i.e. column 3 would be $3
) and putting in your file name for <filename>
.
SUM=$(awk '{s+=$1}END{print s}' <filename>)
You will need to calculate the ratio of these sums to determine the SNP density. You can use bc
for this. Just remember to use the -l
flag to give you a decimal answer. You will also need the -e
flag before your math equation (remember to put it in quotation marks). Also, don’t forget that you want an enrichment value, not a ratio. How do you get an enrichment? Divide by the background (background = total # SNPs for given MAF / chromosome length).
Finally, you need to save the results in a file. I suggest creating the file by putting in an informative header, like:
echo -e "MAF\tFeature\tEnrichment" > snp_counts.txt
You should have 5 SNP files (different MAF levels) and 4 feature files, for a total of 20 combinations. Each line should have 3 elements, the MAF, the feature, and the enrichment. Please name this file snp_counts.txt
for ease of grading.
To help you, here is a pseudo code to scaffold the general approach:
Create the results file with a header
Loop through each possible MAF value
Use the MAF value to get the file name for the SNP MAF file
Find the SNP coverage of the whole chromosome
Sum SNPs from coverage
Sum total bases from coverage
Calculate the background
Loop through each feature name
Use the feature value to get the file name for the feature file
Find the SNP coverage of the current feature
Sum SNPs from coverage
Sum total bases from coverage
Calculate enrichment
Save result to results file
Once you have your SNP enrichment values for each MAF and feature, write an R script that loads them into a dataframe and plots them using ggplots2. The plot should have 4 lines, one for each feature with the MAF on the X-axis and the log2-transformed enrichment on the Y-axis. You should include a legend and label your axes.
Please name this plot snp_enrichments.pdf
.
Based on the plotted data from step 2.2, you should be able to draw some conclusions. Put answers in the file README.md
1. Which feature appears to be under the highest level of purifying (negative) selection? How can you tell? 2. Why do you think most features start out more enriched at lower MAFs and become more depleted as the MAF increases? 3. Does the relative levels of enrichment/depletion of features with respect to each other make sense? Explain.
Another thing that you can do with the genome browser is visualize your own data alongside the large number of publicly available tracks. You can also perform some analyses right in the browser.
With bedtools coverage
, create a bedgraph file with the number of SNPs in each interval of genome_chr1.bed
file. You will need to first concatenate all of the MAF SNP files into a single file prior to running the coverage tool. You will need to select only the relevant columns from the bedtools output using the command line tool cut
and selecting the chromosome, start position, end position, and count. In order for the genome browser to properly interpret your bedgraph, add the following line of text to the beginning of your bedgraph file:
track type=bedGraph name="20Kb binned SNPs" visibility="full" windowingFunction="mean"
Load your new track into the genome browser. Click the Manage custom tracks
button under the track display. Then use Choose File
to select your bedgraph file and finally click Submit
. You should now be able to go back to the browser and see your SNP coverage track.
Let’s compare it to another track, Cons 30 Primate
under the Comparative Genomics
section. Click on the track name and set each view to hide
except for Basewise_Conservation_(phyloP)
which should be set to full
. Then click submit
.
1. Can you see any relationship between these tracks?
Sometimes it is hard to tell whether a relationship is real or just perceived. To test how these tracks compare to each other, under the Tools
menu select Table Browser
. Select the conservation track you just turned on. Then select the region chr1:20000000-50000000
. Now click the button Correlate
and set the window to 20000 bases and click calculate
.
1. How do these data relate to each other? Is it what you expected comparing them by eye? 2. How would you explain this relationship between SNP density and conservation score?
Given the trends observed in the plot from step 2.2, let’s highlight these and get some significance values to go with them.
Add fitted lines using the geom_smooth
with the method lm
. Then run a linear regression, grouping the data by feature and save the report.
1. Are any of the linear models significant at alpha=0.05%? If so, which ones?
Part 1 (2.5 pts):
Total Points: 10