For a description of GTEx, see the GTEx Portal.
Certain genes have high tissue specificity while some tissues are highly enriched in certain proteins. In this assignment, you will be find the intersection between these two, identifying genes with a ten-fold enrichment in a single tissue and that are the highest expressed gene in their enriched tissue. To do this, you will be using a summary dataset that has the median expression levels across samples for each gene and tissue, named GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct
.
Please submit your answers as a .py
script with comments explaining the logic of each code block and for questions regarding interpretation or discussion of your results, please include your answers as comments interspersed with your code or a separate text file (Remember that comment lines start with #
and are ignored by Python and R). For advanced exercises, please also include the .tsv
file with gene-tissue pairs.
open
function discussed in previous sections.less GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct
on the command line. Note that there are lines at the beginning that you want to skip.You can use the filestream method readline()
to read in one line at a time. This will allow you to read in header lines without saving the results. If you wanted to skip 4 lines, you would call fs.readline()
4 times and not save the returned line. You can still read lines in using a for
loop after using readline
using the syntax for line in fs:
. Python will continue reading in lines from where you left off with readline
.\t
(a tab), so make sure to specify that when using the split()
method.In the case of text names, numpy will determine what datatype is appropriate. For your expression data, you will need to specify the data type as float
using the argument dtype=float
.
Why do you need to tell numpy what type of data the expression data is but not for the other data lists?
for
loops.For the first 10 genes (rows, axis 0), use a nested for
loop to find the mean gene expression and print them. Think about which axis needs to be the outer for
loop and which the inner.
mean
and printing them.You will need to use the axis
argument and set it equal to the axis corresponding to tissues since you want to find the average across tissues.
Do they match the means you found with the nested for
loop approach?
axis
argument.The numpy functions for these statistics are median
and mean
.
What can you infer from the difference between these statistics?
numpy.log2
).Because there are zeros in the expression data, you will need to add a pseudo-count (a small value to each expression) to avoid taking the log of zero. In this case, use a pseudo-count of one.
Now how do the median and mean transformed expression values compare to each other? To the non-transformed values?
numpy.copy
.numpy.sort
). You will need to specify an axis to sort along. Given that numpy reads axis 0 as rows and axis 1 as columns, think about which axis you need to specify with the axis=
argument.diff_array
). You want to be able to identify genes that are highly tissue-specific.numpy.zeros_like
or numpy.zeros
and data_array.shape
).numpy.argmax
, which returns the index of the highest value.for
loop (for
loops are always slow in python, whereas numpy is extremely fast), you can use two list of indices, the first for the row and the second for the column:row_index = numpy.arange(data.shape[0])
col_index = numpy.argmax(...fill in yourself...)
diff_array
using the following syntax ((diff_array < 10)
).high_tissue
array is 2D. To reshape it, use the syntax ((diff_array < 10).reshape(-1, 1)
).high_tissue
array by this reshaped array. The reshaped array will be expanded to match the dimensions of the high_tissue
array. The result will be a 2D array with ones in positions where highly tissue-specific genes are the highest expressed for a given tissue.numpy.where
.Print out and save each gene-tissue pair. Remember that you have the gene names and tissue names saved and these can be indexed using the appropriate 1D array resulting from the numpy.where
command.
Do these genes make sense for the corresponding tissue? You can look up information about each gene at GeneCards (you need to remove the period and number following it for gene IDs to look up the genes as the number after the last period is a version number).