Biological Learning Objectives
Computational Learning Objectives
Complete the following exercises using a combination of `cut`, `sort`, `uniq`, and `grep`.
Document each command you use, along with any short answers, in a single `README.md` file stored in `~/qbb2024-answers/day2-lunch`.
Use Markdown headings, lists, and code blocks to organize your answers into sections and format the content for proper display, e.g.:
## Answer 1
- `wc -l hg38-gene-metadata-feature.tsv`
- There are 61633 lines
Please `git push` after each exercise; do not wait until the end of the session.
BioMart provides a way to obtain genome annotation information (referred to as Attributes) such as gene IDs, transcripts, phenotypes, GO terms, structures, orthologues, variants, and more. This information can be retrieved through the web-based tool as well as the Bioconductor biomaRt package.
Tally the number of each `gene_biotype` in `hg38-gene-metadata-feature.tsv`. How many `protein_coding` genes are there? Pick one biotype you would want to learn more about and explain why.
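As a minimal sketch of the tallying pattern, the commands below build a small toy file and count how often each value appears in one column. The toy file name, its two-column layout, and the `ENSG_A`-style IDs are assumptions made for illustration; in the real file, check which column holds `gene_biotype` with `head -n 1` before choosing the field number for `cut`.

```shell
# Toy stand-in for hg38-gene-metadata-feature.tsv (column layout is assumed)
workdir=$(mktemp -d)
{
  printf 'ensembl_gene_id\tgene_biotype\n'
  printf 'ENSG_A\tprotein_coding\n'
  printf 'ENSG_B\tlncRNA\n'
  printf 'ENSG_C\tprotein_coding\n'
} > "$workdir/toy-feature.tsv"

# Skip the header, keep the biotype column, group identical values,
# count each group, then rank the counts from largest to smallest
biotype_counts=$(tail -n +2 "$workdir/toy-feature.tsv" | cut -f 2 | sort | uniq -c | sort -k1,1nr)
echo "$biotype_counts"
```

`uniq -c` only collapses adjacent duplicates, which is why the values are sorted first.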
Which `ensembl_gene_id` in `hg38-gene-metadata-go.tsv` has the most `go_ids`? Create a new file that only contains rows corresponding to that `ensembl_gene_id`, sorting the rows according to the `name_1006` column. Describe what you think this gene does based on the GO terms.
ENSG000XYZ GO:0090425 acinar cell differentiation
ENSG000XYZ GO:0016323 basolateral plasma membrane
ENSG000XYZ GO:0045296 cadherin binding
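One possible approach, sketched on toy data: count the rows per gene ID, take the top ID, then pull out its rows and sort them by the third column. The toy file, the `ENSG_TOY1`/`ENSG_TOY2` IDs, the three-column layout, and the output file name are assumptions for illustration; `awk` is used here only to strip the count that `uniq -c` prepends.

```shell
# Toy stand-in for hg38-gene-metadata-go.tsv (IDs and layout are assumed)
workdir=$(mktemp -d)
go_file="$workdir/toy-go.tsv"
{
  printf 'ensembl_gene_id\tgo_id\tname_1006\n'
  printf 'ENSG_TOY1\tGO:0045296\tcadherin binding\n'
  printf 'ENSG_TOY2\tGO:0005515\tprotein binding\n'
  printf 'ENSG_TOY1\tGO:0090425\tacinar cell differentiation\n'
  printf 'ENSG_TOY1\tGO:0016323\tbasolateral plasma membrane\n'
} > "$go_file"

# Count rows per gene ID and keep the ID with the largest count
top_gene=$(tail -n +2 "$go_file" | cut -f 1 | sort | uniq -c | sort -k1,1nr | head -n 1 | awk '{print $2}')
echo "$top_gene"

# Keep only that gene's rows, sorted by the third (tab-separated) column
grep -w "$top_gene" "$go_file" | sort -t "$(printf '\t')" -k 3 > "$workdir/top-gene-go.tsv"
cat "$workdir/top-gene-go.tsv"
```

Passing a literal tab to `sort -t` keeps multi-word GO term names together as a single field.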
GENCODE works to annotate the human and mouse genomes using biological evidence such as long-read RNA-seq, Ribo-seq, and other targeted approaches. This gene set is used by many projects including Genotype-Tissue Expression (GTEx), The Cancer Genome Atlas (TCGA), and the Human Cell Atlas (HCA).
Complete the following exercises using the `gene.gtf` that we created together:
grep -w gene gencode.v46.basic.annotation.gtf > gene.gtf
Immunoglobulin (Ig) genes are present in over 200 copies throughout the human genome. How many IG genes (not pseudogenes) are present on each chromosome? You can use a dot (`.`) in a regular expression pattern to match any single character. How does this compare with the distribution of IG pseudogenes?
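The dot hint can be illustrated on a few toy GTF-style lines. The gene IDs, names, and coordinates below are made up, and the pattern `IG_._gene` assumes that the IG gene biotypes of interest have a single character between `IG_` and `_gene`; inspect the `gene_type` values in the real file to confirm that this assumption holds.

```shell
# Toy stand-in for gene.gtf (all IDs, names, and coordinates are invented)
workdir=$(mktemp -d)
gtf="$workdir/toy-gene.gtf"
{
  printf 'chr2\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "IG_V_gene"; gene_name "TOYV1";\n'
  printf 'chr2\tHAVANA\tgene\t300\t400\t.\t+\t.\tgene_id "G2"; gene_type "IG_V_pseudogene"; gene_name "TOYV2P";\n'
  printf 'chr2\tHAVANA\tgene\t500\t600\t.\t-\t.\tgene_id "G3"; gene_type "IG_C_gene"; gene_name "TOYC1";\n'
  printf 'chr14\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "G4"; gene_type "IG_J_gene"; gene_name "TOYJ1";\n'
  printf 'chr14\tHAVANA\tgene\t300\t400\t.\t+\t.\tgene_id "G5"; gene_type "protein_coding"; gene_name "TOYPC1";\n'
} > "$gtf"

# The dot matches exactly one character, so IG_V_gene matches but
# IG_V_pseudogene does not; then count the matching lines per chromosome
ig_counts=$(grep 'IG_._gene' "$gtf" | cut -f 1 | sort | uniq -c)
echo "$ig_counts"
```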
Why is `grep pseudogene gene.gtf` not an effective way to identify lines where the `gene_type` key-value pair is a pseudogene (hint: look for `overlaps_pseudogene`)? What would be a better pattern? Describe it in words if you are having trouble with the regular expression.
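A small sketch of the problem, using two invented lines: one is a lncRNA that merely carries an `overlaps_pseudogene` tag, the other has a pseudogene `gene_type`. The second pattern is only one possible answer, and it assumes the value is written as `gene_type "..."` with double quotes, as in the toy lines.

```shell
# Toy lines (invented IDs): only the second one is a pseudogene by gene_type
workdir=$(mktemp -d)
gtf="$workdir/toy-pseudo.gtf"
{
  printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "lncRNA"; gene_name "TOY1"; tag "overlaps_pseudogene";\n'
  printf 'chr1\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_type "processed_pseudogene"; gene_name "TOY2P";\n'
} > "$gtf"

# A bare search matches the word anywhere on the line, including the tag
loose_count=$(grep -c 'pseudogene' "$gtf")

# Anchoring to the gene_type key restricts the match to that value:
# [^"]* allows any prefix inside the quotes but cannot cross a closing quote
strict_count=$(grep -c 'gene_type "[^"]*pseudogene"' "$gtf")

echo "loose: $loose_count, strict: $strict_count"
```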
Convert the annotation from .gtf format to .bed format. Specifically, print out just the chromosome, start, stop, and gene_name. As `cut` splits lines into fields based on the tab character, first use `sed` to create a new file where spaces are replaced with tabs.
sed "s/ /\t/g" gene.gtf > gene-tabs.gtf
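Once spaces are tabs, each attribute key and value becomes its own field. The sketch below shows the idea on one invented line; it uses `tr` as a portable stand-in for the `sed` command, and the field number 14 assumes the attributes appear in the order `gene_id`, `gene_type`, `gene_name`, so confirm the position in the real file with `head` before relying on it.

```shell
# One toy GTF-style line (invented ID, name, and coordinates)
workdir=$(mktemp -d)
printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "TOY_ID.1"; gene_type "lncRNA"; gene_name "TOYGENE1"; level 2;\n' > "$workdir/toy-gene.gtf"

# Replace every space with a tab (tr is used here instead of sed)
tr ' ' '\t' < "$workdir/toy-gene.gtf" > "$workdir/toy-gene-tabs.gtf"

# Fields 1, 4, 5 are chromosome, start, stop; field 14 is the gene_name value
cut -f 1,4,5,14 "$workdir/toy-gene-tabs.gtf" > "$workdir/toy-gene.bed"
cat "$workdir/toy-gene.bed"
```

Note that the gene name keeps its surrounding quotes and trailing semicolon, since `cut` only selects fields and does not edit them.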
A. Explore the GENCODE mouse annotation, noting similarities and differences with the human annotation.