Deadline: Monday, October 20
Resubmission deadline: Friday, October 31
The goal of today’s lab is to use linear regression and related statistical methods to investigate the relationship between paternal age, maternal age, and the number of de novo mutations (DNMs) in a proband (offspring). Today’s assignment will build familiarity with manipulating tabular datasets containing mixed data types using the tidyverse in R. Specifically, you will import a table of de novo mutations and manipulate it to calculate the number of maternal and paternal DNMs per individual. You will then fit and interpret linear models with stats::lm (and optionally stats::glm for Poisson regression), and tidy results with broom.
Data are taken from Halldorsson, B. V. et al. (2019). Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science, 363(6425).
Read the abstract from the above paper to understand the context of the datasets you will be using. The data you need for this assignment are available from Dropbox at:
You may copy these into your submission directory (and add them to your .gitignore).
Before beginning the assignment, take a quick look at both files (e.g., with less -S in Unix) to confirm their structure.
Load the tidyverse and broom packages.
Load aau1043_dnm.csv into a tibble.
Create a per-proband summary with counts of maternally and paternally inherited DNMs. Ignore DNMs without a specified parent of origin.
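As a rough sketch of the per-proband summary, the example below uses a tiny made-up table in place of aau1043_dnm.csv. The column names Proband_id and Phase_combined and the labels "mother" and "father" are assumptions for illustration; check the header and values of the real file before reusing any of this.

```r
library(tidyverse)

# Toy stand-in for the DNM table (one row per mutation).
# Column names and parent labels are assumptions, not taken from the real file.
dnm <- tribble(
  ~Proband_id, ~Phase_combined,
  1,           "father",
  1,           "father",
  1,           "mother",
  1,           NA,
  2,           "mother",
  2,           "father"
)

dnm_summary <- dnm %>%
  filter(!is.na(Phase_combined)) %>%   # ignore DNMs with no parent of origin
  group_by(Proband_id) %>%
  summarize(
    maternal_dnm = sum(Phase_combined == "mother"),
    paternal_dnm = sum(Phase_combined == "father"),
    .groups = "drop"
  )

dnm_summary
```

If the real file marks an unknown parent of origin with an empty field, read_csv() will typically read it as NA, so the same filter should apply; confirm this on the actual data.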
Load aau1043_parental_age.csv.
Join the two tibbles by proband ID.
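One possible way to do the join is sketched below with two small made-up tibbles. The key column Proband_id and the age columns Father_age and Mother_age are assumed names. inner_join() keeps only probands present in both tables, which is one reasonable choice here; other join types would keep unmatched rows with NA values.

```r
library(tidyverse)

# Made-up per-proband counts and parental ages (column names are assumptions).
dnm_summary <- tibble(
  Proband_id   = c(1, 2, 3),
  maternal_dnm = c(10, 14, 9),
  paternal_dnm = c(40, 55, 38)
)
ages <- tibble(
  Proband_id = c(1, 2, 4),
  Father_age = c(30, 41, 25),
  Mother_age = c(28, 35, 24)
)

# Keep only probands that appear in both tables.
merged <- inner_join(dnm_summary, ages, by = "Proband_id")

merged
```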
Use your merged data frame for the following. All plots should be clearly labeled and easily interpretable.
1) Create a scatter plot of the count of maternal DNMs vs. maternal age → save as ex2_a.png
2) Create a scatter plot of the count of paternal DNMs vs. paternal age → save as ex2_b.png
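A minimal sketch of the first scatter plot, using invented values and assumed column names (Mother_age, maternal_dnm). The paternal plot would follow the same pattern with the paternal columns and ex2_b.png.

```r
library(tidyverse)

# Invented values for illustration only.
merged <- tibble(
  Mother_age   = c(22, 27, 31, 36, 40),
  maternal_dnm = c(8, 11, 12, 16, 19)
)

p <- ggplot(merged, aes(x = Mother_age, y = maternal_dnm)) +
  geom_point(alpha = 0.5) +   # partial transparency helps when points overlap
  labs(
    x = "Maternal age (years)",
    y = "Maternal DNMs per proband",
    title = "Maternal DNM count vs. maternal age"
  )

ggsave("ex2_a.png", plot = p, width = 6, height = 4, dpi = 300)
```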
Fit a simple linear regression model relating maternal age to the number of maternal de novo mutations.
In README.md, answer:
1. What is the “size” (i.e., slope) of this relationship? Interpret the slope in plain language. Does it match your plot?
2. Is the relationship significant? How do you know? Explain the p-value in plain but precise language.
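The fit might look something like the sketch below, which uses invented data and assumed column names. In the tidied output, the estimate on the age row is the slope (the expected change in DNM count per additional year of age), and the p.value on that row tests the null hypothesis that the slope is zero.

```r
library(tidyverse)
library(broom)

# Invented values for illustration only.
merged <- tibble(
  Mother_age   = c(20, 25, 30, 35, 40),
  maternal_dnm = c(8, 12, 12, 16, 17)
)

fit_maternal <- lm(maternal_dnm ~ Mother_age, data = merged)

# One row per term: estimate, std.error, statistic, p.value, and a 95% CI.
maternal_tidy <- tidy(fit_maternal, conf.int = TRUE)
maternal_tidy
```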
Repeat the step above but for paternal age vs. paternal DNMs.
Use the paternal regression model to predict the expected number of paternal DNMs for a father of age 50.5. You are welcome to do this manually or using a built-in function, but show your work in README.md.
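Both routes to the prediction are sketched here on invented data (the column names Father_age and paternal_dnm are assumptions). The manual version plugs the age into intercept + slope × age, and predict() should return the same number.

```r
library(tidyverse)

# Invented values for illustration only.
merged <- tibble(
  Father_age   = c(20, 25, 30, 35, 40),
  paternal_dnm = c(30, 38, 45, 51, 61)
)

fit_paternal <- lm(paternal_dnm ~ Father_age, data = merged)

# Built-in: newdata must use the same predictor name as the model formula.
pred_builtin <- predict(fit_paternal, newdata = tibble(Father_age = 50.5))

# Manual: intercept + slope * age.
b <- coef(fit_paternal)
pred_manual <- b[["(Intercept)"]] + b[["Father_age"]] * 50.5

c(builtin = unname(pred_builtin), manual = pred_manual)
```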
Plot the distributions of maternal and paternal DNM counts per proband on the same axes as semi-transparent histograms; save as ex2_c.png.
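One way to overlay the two histograms is to reshape the counts to long format and map the parent to the fill color. The sketch below uses invented counts and assumed column names; the binwidth is an arbitrary choice to adjust for the real data.

```r
library(tidyverse)

# Invented counts for illustration only.
merged <- tibble(
  Proband_id   = 1:6,
  maternal_dnm = c(9, 12, 10, 15, 11, 13),
  paternal_dnm = c(40, 52, 47, 60, 44, 55)
)

# Long format: one row per proband per parent.
dnm_long <- merged %>%
  pivot_longer(
    c(maternal_dnm, paternal_dnm),
    names_to = "parent",
    values_to = "n_dnm"
  )

p <- ggplot(dnm_long, aes(x = n_dnm, fill = parent)) +
  # position = "identity" overlays the bars instead of stacking them.
  geom_histogram(alpha = 0.5, position = "identity", binwidth = 5) +
  labs(x = "DNMs per proband", y = "Number of probands", fill = "Parent of origin")

ggsave("ex2_c.png", plot = p, width = 6, height = 4, dpi = 300)
```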
We have paired observations per proband (maternal vs. paternal). The paired t-test assumes that the within-pair differences are approximately normally distributed.
Apply a paired t-test in R using t.test(merged$maternal_dnm, merged$paternal_dnm, paired = TRUE).
In README.md, answer:
1. What is the “size” of this relationship (i.e., the average difference in counts of maternal and paternal DNMs)? Interpret the difference in plain language. Does it match your plot?
2. Is the relationship significant? How do you know? Explain the p-value in plain but precise language.
Note that the paired t-test is equivalent to using the difference between the maternal and paternal DNM counts per proband as the response variable and fitting a model with only an intercept term (indicated with 1 on the right side of the model formula). Fit this model using lm() and compare to the results of the paired t-test. How would you interpret the coefficient estimate for the intercept term?
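The equivalence can be checked directly, as in this sketch on invented counts (the column names are assumptions, and diff is a helper column created here). The intercept of the intercept-only model is the mean of the per-proband differences, and its t statistic and p-value should agree with those from the paired t-test.

```r
library(tidyverse)
library(broom)

# Invented counts for illustration only.
merged <- tibble(
  maternal_dnm = c(9, 12, 10, 15, 11, 13),
  paternal_dnm = c(40, 52, 47, 60, 44, 55)
)

# Paired t-test on maternal vs. paternal counts.
tt <- t.test(merged$maternal_dnm, merged$paternal_dnm, paired = TRUE)

# Intercept-only model on the per-proband difference.
merged <- merged %>% mutate(diff = maternal_dnm - paternal_dnm)
fit_diff <- lm(diff ~ 1, data = merged)

tidy(tt)        # estimate = mean difference
tidy(fit_diff)  # (Intercept) estimate = the same mean difference
```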
Choose a dataset from the bottom of the TidyTuesday README. Record which one you chose in README.md.
Generate figures and note any interesting patterns in README.md; save figures as ex4_<something>.png.
State a hypothesis, fit a linear model, evaluate fit, and report results in README.md.
README.md file with answers to questions.
Run the paired comparison (t.test and lm(diff ~ 1)) and interpret results (1.5 pts)
Total Points: 10