Learning Objectives
Document each answer in a separate `.py` file stored in `~/qbb2024-answers/day2-homework`.
Please `git push` after each exercise; do not wait until the end of the session.
`grep.py` – Warm up by implementing a basic `grep` program to practice indexing the `sys.argv` list, processing a text file line-by-line, and removing the newline character.
$ grep.py FIS1 gencode.v46.basic.annotation.gtf
chr7 HAVANA gene 101239458 101252316 . - . gene_id "ENSG0000021 ...
chr7 HAVANA transcript 101239472 101245081 . - . gene_id "ENSG0000021 ...
chr7 HAVANA exon 101244960 101245081 . - . gene_id "ENSG0000021 ...
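One possible shape for this warm-up is sketched below. It assumes a plain substring match (no regular expressions) and exactly two arguments, the pattern followed by the file name; the helper name `grep_lines` is made up for illustration.

```python
#!/usr/bin/env python3
import sys


def grep_lines(pattern, lines):
    """Yield each line that contains pattern, with its trailing newline removed."""
    for line in lines:
        line = line.rstrip("\n")  # drop the newline so print() does not double it
        if pattern in line:
            yield line


# Only act as a command-line tool when both arguments are present.
if __name__ == "__main__" and len(sys.argv) == 3:
    pattern = sys.argv[1]   # e.g. FIS1
    filename = sys.argv[2]  # e.g. gencode.v46.basic.annotation.gtf
    with open(filename) as f:
        for match in grep_lines(pattern, f):
            print(match)
```

Keeping the matching logic in a function that accepts any iterable of lines makes it easy to try on a short list of strings before running it on the full annotation file.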
`gtf2bed.py` – Create a program that converts genome annotation information from `.gtf` format to `.bed` format. Specifically, print out just the chromosome, start, stop, and gene_name, stripping off both the leading `gene_name "` and the trailing `"`.
$ gtf2bed.py genes.gtf
chr1 11869 14409 DDX11L2
chr1 12010 13670 DDX11L1
chr1 14696 24886 WASH7P
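The conversion can be sketched as follows, assuming the usual GTF layout: nine tab-separated columns, with the ninth holding `;`-separated attributes such as `gene_name "..."`. Coordinates are copied unchanged, as in the example output above. The function name `gtf_line_to_bed` is hypothetical.

```python
#!/usr/bin/env python3
import sys


def gtf_line_to_bed(line):
    """Return 'chrom<TAB>start<TAB>stop<TAB>gene_name', or None if no gene_name is found."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete GTF record
    chrom, start, stop = fields[0], fields[3], fields[4]
    prefix = 'gene_name "'
    for attribute in fields[8].split(";"):
        attribute = attribute.strip()
        if attribute.startswith(prefix):
            # Strip the leading 'gene_name "' and the trailing '"'.
            gene_name = attribute[len(prefix):].rstrip('"')
            return "\t".join([chrom, start, stop, gene_name])
    return None


# Only act as a command-line tool when a file name is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header lines
            bed = gtf_line_to_bed(line)
            if bed is not None:
                print(bed)
```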
`tally-fixed.py` – Starting with `tally.py`, identify and fix the three bugs in this code. The output of this program should match the output of `cut -f 1 | uniq -c`, e.g.
$ grep -v "#" gencode.v46.basic.annotation.gtf | cut -f 1 | uniq -c
197188 chr1
152253 chr2
125674 chr3
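For reference while debugging, the behaviour of `uniq -c` (counting runs of consecutive identical values) can be sketched independently. This is not `tally.py` and does not reveal its bugs; it is a hypothetical `tally_runs` helper whose result can be compared against your fixed program.

```python
def tally_runs(values):
    """Return a list of (count, value) pairs, one per run of consecutive identical values."""
    runs = []
    current = None
    count = 0
    for value in values:
        if count > 0 and value == current:
            count += 1  # still inside the same run
        else:
            if count > 0:
                runs.append((count, current))  # close the previous run
            current = value
            count = 1
    if count > 0:
        runs.append((count, current))  # the final run is only closed after the loop
    return runs


# Made-up sample input: the first column of a few non-header lines.
for count, chrom in tally_runs(["chr1", "chr1", "chr2", "chr3", "chr3"]):
    print(count, chrom)
```

Like `uniq -c`, this counts adjacent repeats only, so a chromosome that reappears later in the file starts a new run.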
A. Modify your `grep.py` program to accept an optional `-v` flag that inverts the match, so that it can behave like either `grep transcript_id` or `grep -v transcript_id`.
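A minimal sketch of the optional flag, assuming `-v` may only appear as the first argument, ahead of the pattern and file name. The helper names `parse_args` and `grep_lines` are made up for illustration.

```python
#!/usr/bin/env python3
import sys


def parse_args(args):
    """Split the arguments (without the script name) into (invert, pattern, filename)."""
    args = list(args)
    invert = False
    if args and args[0] == "-v":
        invert = True
        args = args[1:]  # drop the flag so the pattern and file name line up
    pattern, filename = args
    return invert, pattern, filename


def grep_lines(pattern, lines, invert=False):
    """Yield matching lines, or non-matching lines when invert is True."""
    for line in lines:
        line = line.rstrip("\n")
        if (pattern in line) != invert:
            yield line


# Only act as a command-line tool when two or three arguments are given.
if __name__ == "__main__" and len(sys.argv) in (3, 4):
    invert, pattern, filename = parse_args(sys.argv[1:])
    with open(filename) as f:
        for match in grep_lines(pattern, f, invert):
            print(match)
```

Comparing the match result against `invert` with `!=` avoids duplicating the loop for the two cases.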
B. Implement `tail` (hint: `f.read()`).
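One way to approach `tail`, assuming the `f.read()` hint means reading the whole file into a single string and that the default is the last 10 lines, as in the Unix tool. The `n` parameter and the function name are assumptions, not part of the exercise text.

```python
#!/usr/bin/env python3
import sys


def tail(text, n=10):
    """Return the last n lines of text as a list of strings without newlines."""
    if n <= 0:
        return []  # lines[-0:] would return every line, so handle this case first
    lines = text.splitlines()
    return lines[-n:]


# Only act as a command-line tool when a file name is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as f:
        for line in tail(f.read()):
            print(line)
```

Reading the entire file is simple but uses memory proportional to the file size, which is worth keeping in mind for a large annotation file.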
C. Create a program that, for each gene in the input file, prints out one line containing the gene_name followed by each transcript_name, e.g.
FIS1 FIS1-201 FIS1-207 FIS1-203
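The grouping step can be sketched with a dictionary that maps each gene_name to a list of transcript_name values. This assumes the input is a GTF file whose `transcript` records carry both attributes; `get_attribute` and `transcripts_by_gene` are hypothetical helper names.

```python
#!/usr/bin/env python3
import sys


def get_attribute(attributes, key):
    """Return the unquoted value of key in a GTF attribute string, or None if absent."""
    for attribute in attributes.split(";"):
        attribute = attribute.strip()
        if attribute.startswith(key + " "):
            return attribute[len(key) + 1:].strip('"')
    return None


def transcripts_by_gene(lines):
    """Map each gene_name to its transcript_name values, in file order."""
    genes = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are of interest here
        gene = get_attribute(fields[8], "gene_name")
        transcript = get_attribute(fields[8], "transcript_name")
        if gene is None or transcript is None:
            continue
        genes.setdefault(gene, []).append(transcript)
    return genes


# Only act as a command-line tool when a file name is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as f:
        for gene, names in transcripts_by_gene(f).items():
            print(" ".join([gene] + names))
```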
D. `cut.py` – Implement a basic `cut` program where the 1st command line argument specifies the fields to be output, separated by commas. Use `.split()` to separate this argument into a list that you can loop over to extract the specified fields from each line in the file (i.e. nested `for` loops). If you store the specified fields in a list, you can combine the entire list into a string using `"\t".join(my_list)`.
Test this on `hg38-gene-metadata-feature.tsv` to avoid parsing header lines that begin with `#`.
```
$ cut.py 0,2,6 hg38-gene-metadata-feature.tsv
ensembl_gene_id chromosome_name gene_biotype
ENSG00000228037 1 lncRNA
ENSG00000142611 1 protein_coding
```
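The nested-loop approach described above can be sketched as follows. It assumes tab-separated input and zero-based field numbers, which is what the `0,2,6` example suggests; the function names are made up for illustration.

```python
#!/usr/bin/env python3
import sys


def parse_columns(spec):
    """Turn a string such as '0,2,6' into a list of integer field indexes."""
    return [int(column) for column in spec.split(",")]


def cut_line(line, columns):
    """Return the requested tab-separated fields of line, joined by tabs."""
    fields = line.rstrip("\n").split("\t")
    selected = []
    for column in columns:  # inner loop over the requested fields
        selected.append(fields[column])
    return "\t".join(selected)


# Only act as a command-line tool when both arguments are present.
if __name__ == "__main__" and len(sys.argv) == 3:
    columns = parse_columns(sys.argv[1])
    with open(sys.argv[2]) as f:
        for line in f:  # outer loop over the lines of the file
            print(cut_line(line, columns))
```

A line with fewer fields than requested raises `IndexError` in this sketch, which is one reason to test on a file without `#` header lines.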