Learning Objectives
Document each answer in a separate .py file stored in ~/qbb2024-answers/day2-homework.
Please git push after each exercise and do not wait until the end of the session.
grep.py – Warm up by implementing a basic grep program to practice indexing the sys.argv list, processing a text file line-by-line, and removing the newline character.
$ grep.py FIS1 gencode.v46.basic.annotation.gtf
chr7 HAVANA gene 101239458 101252316 . - . gene_id "ENSG0000021 ...
chr7 HAVANA transcript 101239472 101245081 . - . gene_id "ENSG0000021 ...
chr7 HAVANA exon 101244960 101245081 . - . gene_id "ENSG0000021 ...
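As a starting point, here is a minimal sketch of the three skills named above: indexing `sys.argv`, reading a file line by line, and removing the newline. The helper name `matching_lines` is hypothetical, and plain substring matching (rather than regular expressions) is an assumption.

```python
#!/usr/bin/env python3
import sys


def matching_lines(pattern, lines):
    """Yield each line, without its newline, that contains pattern as a substring."""
    for line in lines:
        line = line.rstrip("\n")  # remove the trailing newline character
        if pattern in line:
            yield line


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: grep.py PATTERN FILE", file=sys.stderr)
    else:
        pattern = sys.argv[1]   # sys.argv[0] is the script name itself
        filename = sys.argv[2]
        with open(filename) as f:  # iterating over f reads one line at a time
            for hit in matching_lines(pattern, f):
                print(hit)
```

Keeping the matching logic in a function separate from the file handling makes it easy to try on a short list of strings before running it on the full annotation file.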
gtf2bed.py – Create a program that converts genome annotation information from .gtf format to .bed format. Specifically, print out just the chromosome, start, stop, and gene_name, stripping off both the beginning gene_name " and ending ".
$ gtf2bed.py genes.gtf
chr1 11869 14409 DDX11L2
chr1 12010 13670 DDX11L1
chr1 14696 24886 WASH7P
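One possible way to pull out the four fields is sketched below. It assumes the usual GTF layout of nine tab-separated columns, with chromosome, start, and stop in columns 1, 4, and 5 and the attributes in column 9. The function name `gtf_line_to_bed` is hypothetical, and the start coordinate is printed unchanged because that is what the example output above shows.

```python
#!/usr/bin/env python3
import sys


def gtf_line_to_bed(line):
    """Return (chrom, start, stop, gene_name) for one GTF line.

    gene_name is None if the attribute column has no gene_name entry.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, stop = fields[0], fields[3], fields[4]
    prefix = 'gene_name "'
    gene_name = None
    for attribute in fields[8].split(";"):
        attribute = attribute.strip()
        if attribute.startswith(prefix):
            # strip the leading 'gene_name "' and the trailing '"'
            gene_name = attribute[len(prefix):].rstrip('"')
            break
    return chrom, start, stop, gene_name


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: gtf2bed.py FILE.gtf", file=sys.stderr)
    else:
        with open(sys.argv[1]) as f:
            for line in f:
                if line.startswith("#"):  # skip header/comment lines
                    continue
                chrom, start, stop, gene_name = gtf_line_to_bed(line)
                if gene_name is not None:
                    print(chrom, start, stop, gene_name, sep="\t")
```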
tally-fixed.py – Starting with tally.py, identify and fix the three bugs in this code. The output of this program should match the output of cut -f 1 | uniq -c, e.g.
$ grep -v "#" gencode.v46.basic.annotation.gtf | cut -f 1 | uniq -c
197188 chr1
152253 chr2
125674 chr3
A. Modify your grep.py program to accept an optional -v flag that inverts the match
grep transcript_id or grep -v transcript_id
B. Implement tail
f.read()
C. Create a program that, for each gene in the input file, prints out one line containing the gene_name followed by each transcript_name
FIS1 FIS1-201 FIS1-207 FIS1-203
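For the gene-to-transcript listing, a dictionary keyed by gene_name is one natural fit. The sketch below assumes a GTF input whose `transcript` feature lines carry both `gene_name` and `transcript_name` attributes; the helper names `get_attribute` and `transcripts_by_gene` are hypothetical.

```python
#!/usr/bin/env python3
import sys


def get_attribute(attributes, key):
    """Return the quoted value for key in a GTF attribute string, or None."""
    prefix = key + ' "'
    for attribute in attributes.split(";"):
        attribute = attribute.strip()
        if attribute.startswith(prefix):
            return attribute[len(prefix):].rstrip('"')
    return None


def transcripts_by_gene(lines):
    """Map each gene_name to the list of its transcript_names, in file order."""
    genes = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # assumption: only lines whose feature column is "transcript" are used
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        gene = get_attribute(fields[8], "gene_name")
        transcript = get_attribute(fields[8], "transcript_name")
        if gene is None or transcript is None:
            continue
        genes.setdefault(gene, []).append(transcript)
    return genes


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: transcripts.py FILE.gtf", file=sys.stderr)
    else:
        with open(sys.argv[1]) as f:
            for gene, transcripts in transcripts_by_gene(f).items():
                print(gene, *transcripts)  # one line per gene
```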
D. cut.py – Implement a basic cut program where the 1st command line argument specifies the fields to be output, separated by commas. Use .split() to separate this argument into a list that you can loop over to extract the specified fields from each line in the file (i.e. nested for loops). If you store the specified fields in a list, you can combine the entire list into a string using "\t".join(my_list).
Test this on `hg38-gene-metadata-feature.tsv` to avoid parsing header lines that begin with `#`.
```
$ cut.py 0,2,6 hg38-gene-metadata-feature.tsv
ensembl_gene_id chromosome_name gene_biotype
ENSG00000228037 1 lncRNA
ENSG00000142611 1 protein_coding
```
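The nested-loop structure described for cut.py could look roughly like the sketch below. It assumes tab-separated input and 0-based field numbers, as the `0,2,6` example suggests; the function name `cut_line` is hypothetical.

```python
#!/usr/bin/env python3
import sys


def cut_line(line, indices):
    """Return the requested 0-based columns of a tab-separated line, tab-joined."""
    columns = line.rstrip("\n").split("\t")
    selected = []
    for i in indices:  # inner loop: one pass per requested field
        selected.append(columns[i])
    return "\t".join(selected)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: cut.py FIELDS FILE", file=sys.stderr)
    else:
        # "0,2,6" -> ["0", "2", "6"] -> [0, 2, 6]
        indices = [int(field) for field in sys.argv[1].split(",")]
        with open(sys.argv[2]) as f:
            for line in f:  # outer loop: one pass per line of the file
                print(cut_line(line, indices))
```

Converting the field numbers to integers once, before the file is opened, avoids repeating that work for every line.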