Unix and Python Scripts

Overview

Biological Learning Objectives

Explore genome annotation files in BED format and GTF format

Computational Learning Objectives

Use Unix commands to examine text files
Create Python scripts for text wrangling

Instructions

Document your answers in ~/qbXX-answers/unix-python-scripts

Upload your scripts to https://github.com after each exercise and do not wait until the end of the session

Place all of your Unix commands in a single file named unix-commands.sh along with the output as a comment e.g.

#!/bin/bash

tail ce11_genes.bed | head -n 2

# chrIII	13768540	13771741	NM_067444.8	515	-
# chrIII	13769876	13769953	NR_003432.1	9	-

Exercises

Explore ce11_genes.bed using Unix

Calculate each of the following statistics by constructing a single command using one or more (linked together with a |) of the following commands: cut, grep, sort, uniq, wc
- How many features (lines)?
- How many features per chr? e.g. chrI, chrII
- How many features per strand? e.g. +, -
Recalculate ce11_genes.bed scores using Python

Write a script that for each feature (line) recalculates the score (column 5) such that
- new_score = original_score * feature_size
- new_score is positive or negative based on the strand (column 6)
Print out all six columns in BED format
Explore GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt using Unix
- Which three SMTSDs (Tissue Site Detail) have the most samples?
- How many lines have “RNA”?
- How many lines do not have “RNA”?
Transform GTEx data using Python

Write a script that extracts expression values for the first gene (DDX11L1) which is stored on a single line spread across more than 17,000 columns and transposes the data so that the expression in each sample is stored on a separate line.

Open the expression data file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct
- Skip the first two lines using .readline()
- Read the next header line and save the fields after splitting
- Read the next data line and save the fields after splitting
- Create a dictionary by looping through the fields, using header[i] as the key to store data[i] as the value
Open the metadata file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt
- If the SAMPID in the first column is present in the dictionary, print out the SAMPID, expression, and SMTSD e.g.
```
 GTEX-ZZPU-2426-SM-5E44I 0       Artery - Tibial
 GTEX-ZZPU-2626-SM-5E45Y 0.01965 Muscle - Skeletal
 GTEX-ZZPU-2726-SM-5NQ8O 0.02522 Adipose - Subcutaneous
```
What are the first three tissues that have >0 expression?

Optional

Explore ~/Data/References/hg38/gencode.v46.basic.annotation.gtf using Unix
- How many entries are there for each feature type? Look at column 3 and be sure to skip any lines that begin with #
- How many lncRNA entries are on each chromosome?
Export gene features to BED format using Python

Write a script that takes gencode.v46.basic.annotation.gtf and
- Ignores a line if it startswith #
- Only prints output for gene features (column 3)
- Subtracts 1 from the start position to convert to zero-based coordinate system
- Correctly parses gene name from the attribute (column 9)
- Prints chr, start, stop, gene name

Grading

unix-commands.sh – 3 pt
recalculate-score.py – 2 pt
sample-by-gene.py – 3 pt
gtf2bed.py – 2 pt