For this assignment, you will be looking at how much tissue expression varies across individuals for each of the highly expressed and tissue specific genes. To do this, you will be using the gene-tissue pair results from the morning advanced excercise (which is provided) to pull individual sample expression values for each of these gene-tissue pairs. You will also need the complete GTEx expression data file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct
and the sample attribute file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt
.
For a description of GTEx, see the GTEx Portal.
Because the expression data file is so large, you should create a smaller file to test and debug your script on. You can do this using the command:
head -n 500 GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct > test_data.gct
Please submit your answers as a .py
script with comments explaining the logic of each code block, a .R
script for plotting the results, and a .pdf
plot (step #7). (Remember that comment lines start with #
and are ignored by Python and R). For questions regarding interpretation or discussion of your results, please include your answers as comments interspersed with your code.
gene_tissue.tsv
file.GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct
(you could use less
to look at the file), you will see that each column is labeled by a sampleID.GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt
, you will see that each sampleID corresponds to a particular subject and tissue. The columns that you are interested are SAMPID
(the sample ID) and SMTSD
(the specific tissue label).setdefault
, e.g. sample_dict.setdefault(tissue, [])
)fs = open(fname)
) and then you can use readline
to get one line at a time. If you wanted to skip 4 lines, you would call fs.readline()
4 times and not save the returned line. You can still read lines in using a for
loop after using readline
using the syntax for line in fs:
. The python will continue reading in lines where you left off with readline
.GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct
.less
to take a look at the first few lines of the file and determine which row has these names.for
loop to step through the tissue names followed by stepping through the sampleIDs associated with that tissue (the dictionary you created in step 2).if sample in sample_list:
.For each tissue type, see how many samples have expression data.
Which tissue types have the largest number of samples? The fewest?
geom_violin()
).dplyr::mutate(Tissue_Gene=paste0(Tissue, " ", GeneID))
)dplyr::mutate
for this step as well)coord_flip()
)Make sure to label your axes
Given the tissue specificity and high expression level of these genes, are you surprised by the results?
What tissue-specific differences do you see in expression variability? Speculate on why certain tissues show low variability while others show much higher expression variability.
in
to see if a value is in the step 1 dictionary’s keys just like a list.for
loop. The outer for
loop looks at each gene while the inned for
loop reads each expression value for that gene.corrcoef
to find the Pearson correlation, and because this function returns a correlation matrix, you will need to select item [0, 1]
).Look up the functions of the genes using GeneCards (you need to remove the period and number following it for gene IDs to look up the genes).
Do genes associated with a given tissue appear to be related in function? If so, what function?