The GTEx gene expression dataset consists of samples taken from many human subjects across a variety of tissues. From this dataset we have identified a set of highly-expressed tissue-specific genes. In this exercise, you will be pulling the expressing data for those genes but only in their primary tissue of expression and investigating how much variability exists across the sample population.
Biological Learning Objectives
Computational Learning Objectives
for
loop~/qbXX-answers/python_dictionaries.ipynb
.# Use this dictionary for identifying genes of interest and their corresponding tissue of interest
gene_tissue = {
'ENSG00000042832.11': 'Thyroid',
'ENSG00000091704.9': 'Pancreas',
'ENSG00000118245.2': 'Testis',
'ENSG00000122304.10': 'Testis',
'ENSG00000122852.14': 'Lung',
'ENSG00000132693.12': 'Liver',
'ENSG00000134812.7': 'Stomach',
'ENSG00000135346.8': 'Pituitary',
'ENSG00000137392.9': 'Pancreas',
'ENSG00000142515.14': 'Prostate',
'ENSG00000142615.7': 'Pancreas',
'ENSG00000142789.19': 'Pancreas',
'ENSG00000158874.11': 'Liver',
'ENSG00000162438.11': 'Pancreas',
'ENSG00000164816.7': 'Small Intestine - Terminal Ileum',
'ENSG00000164822.4': 'Small Intestine - Terminal Ileum',
'ENSG00000168925.10': 'Pancreas',
'ENSG00000168928.12': 'Pancreas',
'ENSG00000170890.13': 'Pancreas',
'ENSG00000171195.10': 'Minor Salivary Gland',
'ENSG00000172179.11': 'Pituitary',
'ENSG00000175535.6': 'Pancreas',
'ENSG00000175646.3': 'Testis',
'ENSG00000182333.14': 'Stomach',
'ENSG00000187021.14': 'Pancreas',
'ENSG00000203784.2': 'Testis',
'ENSG00000204983.13': 'Pancreas',
'ENSG00000219073.7': 'Pancreas',
'ENSG00000229859.9': 'Stomach',
'ENSG00000240338.5': 'Pancreas',
'ENSG00000254647.6': 'Pancreas',
'ENSG00000256713.7': 'Stomach',
'ENSG00000259384.6': 'Pituitary',
}
Load sample metadata from “~/Data/GTEx/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt” and create a dictionary of sample metadata associating each sample ID with the tissue it was sampled from
\t
in the .split()
methodSAMPID
and SMTSD
, the sample ID and sample tissue, respectivelySAMPID
as the key and SMTSD
as the valuemetadata_fname = "/Users/cmdb/Data/GTEx/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"
# Create a dictionary to hold the metadata
# Step through each line of the metadata file
# Remove the newline character from the line you just read in and split it by the tab character
# Add the data to your metadata dictionary, using the "SAMPID" column value as the dictionary key and the "SMTSD" column value as the dictionary value
Load and process the headeer from the GTEx gene expression file “~/Data/GTEx/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz”
# Because this file is a gzip-compressed file, you will be using the `gzip` library that comes with python
import gzip
expression_fname = "/Users/cmdb/Data/GTEx/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz"
# Open the file using the gzip library
fs = gzip.open(expression_fname)
# The file starts with two lines of information before the header, so these are skipped using 2 calls of `.readline()` without keeping the returned data
_ = fs.readline()
_ = fs.readline()
# Gzipped text files are read in as byte strings, not regular strings, so you will see that `.decode()` is included after `.readline()` to convert the input into a string
header = fs.readline().decode().rstrip().split("\t")
# Make sure to close the file when you are done reading it
fs.close()
# Create a list to hold the tissue names
tissues = []
# Step through each value in the header
# Check if the value has an entry in the metadata dictionary, adding either the tissue corresponding to that entry or "missing" to your tissues list
Create a list of expression values for each gene in the gene_tissue
dictionary (given at the start of the exercise), keeping only expression values from the tissue of interest for that gene
log2
function from the built-in library math
# Import the log2 function from the math library to transform the expression counts
from math import log2
# Create a dictionary to hold each gene's expression data
gene_expression = {}
# Because this is a large file, for testing purposes you can use only a small portion of the file to test your code
line_counter = 0
# Step through each line of the file
for line in gzip.open(expression_fname):
############# REMOVE THIS AND RERUN ONCE YOUR CODE IS WORKING ####################
# For debugging, use only the first 500 lines of the file.
line_counter += 1
if line_counter == 500:
break
############# REMOVE THIS AND RERUN ONCE YOUR CODE IS WORKING ####################
# Convert the line into a string from a byte string, remove the newline character at the end, and split it by tab characters
fields = line.decode().rstrip().split("\t")
# Check if this line contains data for a gene we are interested in. If not, skip it
# Determine which tissue we care about for this gene
# Create a list in the gene expression dictionary to hold the gene's data
gene_expression[fields[0]] = []
# Look at each column position, one by one
for i in range(len(tissues)):
# For each column, see if it is from the tissue we care about (using the tissues list). If so, add that field to our gene expression list, first transforming it with log2(1 + float(field[i]))
For each gene, calculate the mean, standard deviation, and relative standard deviation and write them to a results file
mean
and stdev
functions from the built-in statistics
library. Both functions accept a list as their input# Load the functions to caculcate the mean and standard deviation from a list
from statistics import stdev, mean
# Open a file to write the results to
# Step through each gene and its expression value list
# Calculate the mean and standard deviation of expression
# Calculate the relative standard deviation (coefficient of variation, std / mean)
# Write the tissue namne, gene name, mean, standard deviation, and relative standard deviation to your results file
# Close your results file
Looking at your results, answer the following questions