Using dictionaries to pull specific samples from GTEx data
For this assignment, you will be looking at how much tissue expression varies across individuals for each of the highly expressed and tissue specific genes. To do this, you will be using the gene-tissue pair results from the morning advanced excercise (which is provided) to pull individual sample expression values for each of these gene-tissue pairs. You will also need the complete GTEx expression data file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct and the sample attribute file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt.
For a description of GTEx, see the GTEx Portal.
Because the expression data file is so large, you should create a smaller file to test and debug your script on. You can do this using the command:
head -n 500 GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct > test_data.gct
Please submit your answers as a .py script with comments explaining the logic of each code block, a .R script for plotting the results, and a .pdf plot (step #7). (Remember that comment lines start with # and are ignored by Python and R). For questions regarding interpretation or discussion of your results, please include your answers as comments interspersed with your code.
1. The first step is to load the gene-tissue pairs from the gene_tissue.tsv file.
  - This file contains three columns, geneID, gene name, and tissue, separated by tabs.
- These data are genes that are highly expressed in a single tissue and the corresponding tissue.
- Because you will be checking this information later by the geneID, it makes the best sense to create a dictionary keyed by the geneID with the tissue as the value.
  - If you look at the complete gene expression file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct(you could uselessto look at the file), you will see that each column is labeled by a sampleID.
- Looking at the metadata file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt, you will see that each sampleID corresponds to a particular subject and tissue. The columns that you are interested areSAMPID(the sample ID) andSMTSD(the specific tissue label).
- In the following steps you will be looking things up by the tissue name so you will want to create a dictionary using the tissue as the key.
- For each tissue, you will have multiple sampleIDs so you will need the dictionary value to be a list that you append sample IDs to (so you don’t need to check if a tissue has already been added to the dictionary, you can use setdefault, e.g.sample_dict.setdefault(tissue, []))
- Make sure to look at the structure of the file to know if there are lines that need skipping. In order to skip a line, you need to first open the file (fs = open(fname)) and then you can usereadlineto get one line at a time. If you wanted to skip 4 lines, you would callfs.readline()4 times and not save the returned line. You can still read lines in using aforloop after usingreadlineusing the syntaxfor line in fs:. The python will continue reading in lines where you left off withreadline.
3. You now need to get the list of sampleIDs that are present in the gene expression file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.
  - To do this, you will need the column names (sampleIDs). You can use lessto take a look at the first few lines of the file and determine which row has these names.
- Load the sampleIDs (and only the sampleIDs) that correspond to the columns in the GTEx expression data file into a list. Pay attention to which row they are in and which column the sampleIDs start.
- You do not need to load the complete file to get this information, only up to the line containing the column header.
4. Because each sampleID corresponds to a specific tissue and you are only interested in specific tissues for specific genes, it will be useful to know which columns (i.e. which sampleIDs) are relevant for each tissue.
  - To get the relevant columns, the first step is to create a dictionary keyed by each sampleID (from step 3) with the column index as its value.
- Be aware, not every sampleID in the metadata file is in the expression data file.
5. Now you can create the your dictionary of each tissue and which columns in the gene expression file correspond to it.
  - To do this, you will need to create a new dictionary keyed by each tissue name with the value being a list of column indexes.
- Use a nested forloop to step through the tissue names followed by stepping through the sampleIDs associated with that tissue (the dictionary you created in step 2).
- For each sampleID, check if it is present in your sampleID-column index dictionary from step 4. If it is, add that column index to the correct tissue list in your new dicionary.
- Remember that you can check if an ID is in the sampleID list using a command like if sample in sample_list:.
- 
    For each tissue type, see how many samples have expression data. Which tissue types have the largest number of samples? The fewest? 
  - Your comments should demonstrate that you understand not only the purpose of a piece of code, buy also how it is accomplishing that purpose
  - For categories, create a combination of tissue names and gene IDs (dplyr::mutate(Tissue_Gene=paste0(Tissue, " ", GeneID)))
- You will need to log-transform your data with a psuedo-count of one (you can use dplyr::mutatefor this step as well)
- Switch the axes for the violin plot so the categories are on the y-axis (coord_flip())
- 
    Make sure to label your axes Given the tissue specificity and high expression level of these genes, are you surprised by the results?
 What tissue-specific differences do you see in expression variability? Speculate on why certain tissues show low variability while others show much higher expression variability.
 
Advanced exercises
  - To check if the gene is from the gene-tissue pair file, you can use the keyword into see if a value is in the step 1 dictionary’s keys just like a list.
- If the gene is in the gene-tissue pair set, determine which tissue that gene is associated with, get the column indices for that tissue, and pull out only the expression values for the corresponding tissue.
- Unlike lists, numpy arrays can use a list of indices to pull out multiple values at the same time and much faster. With this in mind, when you find a target gene you should convert its expression values into a numpy array and then you can use the index list that you made step 5 to extract the tissue-specific expression values in one step.
- You will potentially have different numbers of expression values for each gene, so saving your data in a numpy array doesn’t make sense. Instead, you can use a couple of different approaches. A list with the geneID, tissue, and expressions is one option. A dictionary keyed by the geneID and tissue name with the expression value array as the coresponding value is another.
  - 
    Look up the functions of the genes using GeneCards (you need to remove the period and number following it for gene IDs to look up the genes). Do genes associated with a given tissue appear to be related in function? If so, what function?