QBB2024 - Day 2 - Lunch Exercises

Overview

Biological Learning Objectives

Summarize gene metadata retrieved from BioMart
Explore the GENCODE genome annotation

Computational Learning Objectives

Practice working at the command line
Explore text files using core Unix programs

Instructions

Complete the following exercises using a combination of cut, sort, uniq, and grep. Document each command you use, along with any short answers, in a single README.md file stored in ~/qbb2024-answers/day2-lunch. Use Markdown headings, lists, and code blocks to organize your answers into sections so that the content displays properly, for example:

## Answer 1

- `wc -l hg38-gene-metadata-feature.tsv`
- There are 61633 lines

Please git push after each exercise and do not wait until the end of the session.

Exercises

BioMart provides a way to obtain genome annotation information (referred to as Attributes) such as gene IDs, transcripts, phenotypes, GO terms, structures, orthologues, variants, and more. This information can be retrieved through the web-based tool as well as the Bioconductor biomaRt package.

Tally the number of each gene_biotype in hg38-gene-metadata-feature.tsv. How many protein_coding genes are there? Pick one biotype you would want to learn more about and explain why.
Which ensembl_gene_id in hg38-gene-metadata-go.tsv has the most go_ids? Create a new file that only contains rows corresponding to that gene_id, sorting the rows according to the name_1006 column. Describe what you think this gene does based on the GO terms.
```
 ENSG000XYZ  GO:0090425  acinar cell differentiation
 ENSG000XYZ  GO:0016323  basolateral plasma membrane
 ENSG000XYZ  GO:0045296  cadherin binding
```
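The tallying and filtering steps described above can be sketched on a toy table. This is only an illustration under stated assumptions: the file name toy-metadata.tsv and its IDs and values are made up, and the real BioMart files have a header line and different columns, so inspect them with head before adapting anything.

```shell
# Toy tab-separated table (hypothetical IDs and values, no header line)
printf 'g1\tprotein_coding\ng2\tlncRNA\ng3\tprotein_coding\ng4\tprotein_coding\n' > toy-metadata.tsv

# cut extracts column 2, sort groups identical values together,
# uniq -c counts each run, and the final sort puts the largest count first
tally=$(cut -f 2 toy-metadata.tsv | sort | uniq -c | sort -k 1,1nr)
echo "$tally"

# grep -w keeps rows containing the chosen value as a whole word;
# sort -k 1,1 then orders the kept rows by their first column
subset=$(grep -w protein_coding toy-metadata.tsv | sort -k 1,1)
echo "$subset"
```

The same cut, sort, and uniq -c pattern applies to any column once you know its position. If the file has a header line, it is counted like any other value.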

GENCODE works to annotate the human and mouse genomes using biological evidence such as long-read RNA-seq, Ribo-seq, and other targeted approaches. This gene set is used by many projects including Genotype-Tissue Expression (GTEx), The Cancer Genome Atlas (TCGA), and the Human Cell Atlas (HCA).

Complete the following exercises using the gene.gtf file that we created together:

grep -w gene gencode.v46.basic.annotation.gtf > gene.gtf
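To see why the -w flag matters in the command above, here is a minimal sketch on two toy lines. The file name toy.gtf and its contents are hypothetical and far simpler than real GENCODE records.

```shell
# Two toy GTF-like lines (hypothetical content): one gene feature, one transcript feature
printf 'chr1\tHAVANA\tgene\t100\t200\tgene_id "G1";\n' > toy.gtf
printf 'chr1\tHAVANA\ttranscript\t100\t200\tgene_id "G1";\n' >> toy.gtf

# Without -w, "gene" also matches inside gene_id, so both lines are counted
all_matches=$(grep -c gene toy.gtf)

# With -w, "gene" must stand alone as a word; the underscore in gene_id
# counts as a word character, so only the gene feature line is counted
word_matches=$(grep -c -w gene toy.gtf)

echo "plain: $all_matches, whole-word: $word_matches"
```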

Immunoglobulin (Ig) genes are present in over 200 copies throughout the human genome. How many IG genes (not pseudogenes) are present on each chromosome? You can use a dot (.) in a regular expression pattern to match any single character. How does this compare with the distribution of IG pseudogenes?
Why is grep pseudogene gene.gtf not an effective way to identify lines where the gene_type key-value pair is a pseudogene (hint: look for overlaps_pseudogene)? What would be a better pattern? Describe it in words if you are having trouble with the regular expression.
Convert the annotation from .gtf format to .bed format. Specifically, print out just the chromosome, start, stop, and gene_name. As cut splits lines into fields based on the tab character, first use sed to create a new file where spaces are replaced with tabs.
```
 sed "s/ /\t/g" gene.gtf > gene-tabs.gtf
```
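The space-to-tab conversion followed by cut can be sketched on a single toy line. The file names and values below are hypothetical, and tr is swapped in for sed as an assumption-driven alternative, since some non-GNU versions of sed may not interpret \t in the replacement.

```shell
# One toy GTF-like line (hypothetical values); real GENCODE lines carry more attributes
printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n' > toy-gene.gtf

# tr turns every space into a tab, mirroring the sed command above
tr ' ' '\t' < toy-gene.gtf > toy-gene-tabs.gtf

# After the conversion, each attribute key and value occupies its own field.
# In this toy line the gene_name value is field 12; the position in the real
# file depends on its attribute layout, so check with head before choosing.
bed=$(cut -f 1,4,5,12 toy-gene-tabs.gtf)
echo "$bed"
```

The extracted name still carries its quotes and trailing semicolon, which is expected from cut alone.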

Just for fun

A. Explore the GENCODE mouse annotation, noting similarities and differences with the human annotation