The goal of this exercise is to use linear regression and other statistical methods to investigate the relationship between paternal age, maternal age, and the number of de novo mutations (DNMs) in a proband (offspring). Today’s assignment will build familiarity with working with data frames. Specifically, you will import a table of de novo mutations and will manipulate it to calculate the number of maternal and paternal DNMs per individual, which you will then model with linear regression.
Read the abstract from the above paper to understand the context of the datasets you will be using. There are two files we’ll be using for this assignment:
1. information about the number and parental origin of each de novo mutation detected in an offspring individual (i.e. “proband”), stored in aau1043_dnm.csv
2. ages of the parents of each proband, stored in aau1043_parental_age.csv
Before beginning the assignment, you should examine the two files (with less -S, perhaps) to make sure you understand how they’re organized.
You’ll start by exploring the data in aau1043_dnm.csv. First, load this data into a tibble.
Use group_by() and summarize() to tabulate the number of paternally and maternally inherited DNMs in each proband. Note that the maternal versus paternal origin of each mutation is recorded in the column titled Phased_combined.
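As a rough illustration of the group_by() and summarize() pattern, here is a minimal sketch on a made-up toy tibble (not the real data). The column names follow the text above; the values "father" and "mother" and the output column names n_paternal_dnm and n_maternal_dnm are assumptions, so check them against the actual file.

```r
library(dplyr)
library(tibble)

# Made-up toy rows (NOT real data): one row per de novo mutation.
# Assumption: parental origin is coded as "father" / "mother", with NA when unphased.
toy_dnm <- tibble(
  Proband_id      = c(1, 1, 1, 2, 2, 3),
  Phased_combined = c("father", "father", "mother", "father", NA, "mother")
)

# One output row per proband, counting mutations of each parental origin.
# na.rm = TRUE keeps unphased mutations from turning the sums into NA.
dnm_counts <- toy_dnm %>%
  group_by(Proband_id) %>%
  summarize(
    n_paternal_dnm = sum(Phased_combined == "father", na.rm = TRUE),
    n_maternal_dnm = sum(Phased_combined == "mother", na.rm = TRUE)
  )

print(dnm_counts)
```

Summing a logical comparison inside summarize() counts the matching rows, which keeps both counts in a single row per proband without reshaping.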
Now, load the data from aau1043_parental_age.csv into a new dataframe.
You now have two dataframes with complementary information. It would be nice to have all of this in one dataframe. Use the left_join() function to combine your dataframe from step 2 with the dataframe you just created in step 3, based on the shared column Proband_id.
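The join can be sketched on toy data as follows. Only Proband_id comes from the text above; the columns Father_age and Mother_age, the count columns, and all values are hypothetical placeholders, so substitute the names you find in your own dataframes.

```r
library(dplyr)
library(tibble)

# Made-up per-proband DNM counts (hypothetical column names and values).
toy_counts <- tibble(
  Proband_id     = c(1, 2, 3),
  n_paternal_dnm = c(2L, 1L, 0L),
  n_maternal_dnm = c(1L, 0L, 1L)
)

# Made-up parental ages; note that proband 3 is missing and proband 4 is extra.
toy_ages <- tibble(
  Proband_id = c(1, 2, 4),
  Father_age = c(30, 41, 35),
  Mother_age = c(28, 39, 33)
)

# left_join() keeps every row of the left table (toy_counts).
# Proband 3 has no match in toy_ages, so its age columns become NA;
# proband 4 appears only in toy_ages and is dropped.
merged <- left_join(toy_counts, toy_ages, by = "Proband_id")

print(merged)
```

After joining real data, it is worth checking the row count and looking for NA ages, since unmatched probands would otherwise silently drop out of later model fits.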
Using the merged dataframe from the previous section, you will be exploring the relationships between different features of the data.
First, you’re interested in exploring if there’s a relationship between the number of DNMs and parental age. Use ggplot2 to plot the following. All plots should be clearly labelled and easily interpretable.
1. the count of maternal de novo mutations vs. maternal age
2. the count of paternal de novo mutations vs. paternal age
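One possible shape for the first plot is sketched below, using made-up numbers in place of the merged dataframe. The column names Mother_age and n_maternal_dnm are assumptions; the second plot would follow the same pattern with the paternal columns.

```r
library(ggplot2)

# Made-up toy values (NOT real data), standing in for the merged dataframe.
toy_merged <- data.frame(
  Mother_age     = c(22, 27, 31, 35, 40),
  n_maternal_dnm = c(8, 11, 12, 16, 19)
)

# Scatter plot: one point per proband, with labelled axes and a title.
p_maternal <- ggplot(toy_merged, aes(x = Mother_age, y = n_maternal_dnm)) +
  geom_point(alpha = 0.6) +
  labs(
    x     = "Maternal age (years)",
    y     = "Maternal DNMs per proband",
    title = "Maternal de novo mutations vs. maternal age"
  )

print(p_maternal)
```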
Now that you’ve visualized these relationships, you’re curious whether they’re statistically significant. Fit a linear regression model relating maternal age to the number of maternal DNMs using the lm() function.
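A minimal sketch of the lm() call on made-up data is shown here. The column names are hypothetical, and the fitted numbers say nothing about the real dataset; the point is only where the slope and its p-value appear in the output.

```r
# Made-up toy values (NOT real data); column names are assumptions.
toy_merged <- data.frame(
  Mother_age     = c(22, 27, 31, 35, 40, 44),
  n_maternal_dnm = c(8, 11, 12, 16, 19, 20)
)

# Formula syntax is response ~ predictor.
fit_maternal <- lm(n_maternal_dnm ~ Mother_age, data = toy_merged)

# The coefficient table has one row per term, with columns
# Estimate, Std. Error, t value, and Pr(>|t|).
coef_table <- coef(summary(fit_maternal))
print(coef_table)

# The slope: estimated change in DNM count per additional year of age.
slope <- coef_table["Mother_age", "Estimate"]
# The p-value for the null hypothesis that the slope is zero.
p_value <- coef_table["Mother_age", "Pr(>|t|)"]
```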
What is the “size” of this relationship? In your own words, what does this mean? Does this match what you observed in your plots in step 2.1?
Is this relationship significant? How do you know? In your own words, what does this mean?
As before, fit a linear regression model, but this time to test for an association between paternal age and paternally inherited de novo mutations.
What is the “size” of this relationship? In your own words, what does this mean? Does this match what you observed in your plots in step 2.1?
Is this relationship significant? How do you know? In your own words, what does this mean?
Using your results from step 2.3, predict the number of paternal DNMs for a proband with a father who was 50.5 years old at the proband’s time of birth. Record your answer and your work (i.e. how you got to that answer).
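Two equivalent ways of getting such a prediction are sketched below on made-up data: plugging the age into the fitted line by hand, and calling predict() with a newdata dataframe. The column names Father_age and n_paternal_dnm are assumptions, and the toy result is not the answer for the real data.

```r
# Made-up toy values (NOT real data); column names are assumptions.
toy_merged <- data.frame(
  Father_age     = c(25, 30, 35, 40, 45),
  n_paternal_dnm = c(40, 48, 55, 61, 70)
)

fit_paternal <- lm(n_paternal_dnm ~ Father_age, data = toy_merged)

# By hand: prediction = intercept + slope * age.
intercept <- coef(fit_paternal)[["(Intercept)"]]
slope     <- coef(fit_paternal)[["Father_age"]]
pred_by_hand <- intercept + slope * 50.5

# With predict(): the newdata column name must match the predictor in the formula.
pred_predict <- predict(fit_paternal, newdata = data.frame(Father_age = 50.5))

print(pred_by_hand)
print(pred_predict)
```

Showing the by-hand calculation alongside predict() is one way to record your work as well as your answer.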
Next, you’re curious whether the number of paternally inherited DNMs matches the number of maternally inherited DNMs. Plot the distribution of maternal DNMs per proband (as a histogram). In the same panel (i.e. the same set of axes) plot the distribution of paternal DNMs per proband. Make sure to make the histograms semi-transparent so you can see both distributions.
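One way to overlay two semi-transparent histograms is sketched here with made-up counts. The column names are hypothetical, and the binwidth is an arbitrary choice for the toy data that would need adjusting for the real counts.

```r
library(ggplot2)

# Made-up toy values (NOT real data); column names are assumptions.
toy_merged <- data.frame(
  n_maternal_dnm = c(8, 10, 11, 12, 12, 14, 16, 19),
  n_paternal_dnm = c(35, 40, 42, 48, 51, 55, 61, 70)
)

# Two histogram layers on the same axes. Mapping fill to a constant string
# inside aes() creates a legend entry for each layer; alpha makes both visible.
p_hist <- ggplot(toy_merged) +
  geom_histogram(aes(x = n_maternal_dnm, fill = "Maternal"),
                 binwidth = 5, alpha = 0.5) +
  geom_histogram(aes(x = n_paternal_dnm, fill = "Paternal"),
                 binwidth = 5, alpha = 0.5) +
  labs(
    x    = "DNMs per proband",
    y    = "Number of probands",
    fill = "Parental origin"
  )

print(p_hist)
```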
Now that you’ve visualized this relationship, you want to test whether there is a significant difference between the number of maternally vs. paternally inherited DNMs per proband. What would be an appropriate statistical model to test this relationship? Fit this model to the data.
After performing your test, answer the following questions:
What statistical test did you choose? Why?
Was your test result statistically significant? Interpret your result as it relates to the number of paternally and maternally inherited DNMs.
Note that standard linear regression assumes a continuous response variable. When we want to work with response variables that are “counts”, such as the number of de novo mutations, we should technically use an approach such as “Poisson regression” that is designed for count data. To fit a Poisson regression model, use the glm() function with the argument family = "poisson".
Re-fit the models above (steps 2 and 3 in Exercise 2) using Poisson regression.
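The glm() call mirrors the earlier lm() call, with the family argument added. Here is a minimal sketch on made-up counts; the column names are assumptions and the estimates are meaningless beyond illustrating the output.

```r
# Made-up toy counts (NOT real data); column names are assumptions.
toy_merged <- data.frame(
  Mother_age     = c(22, 27, 31, 35, 40, 44),
  n_maternal_dnm = c(8L, 11L, 12L, 16L, 19L, 20L)
)

# Same formula as the linear model, fit as a Poisson GLM.
fit_pois <- glm(n_maternal_dnm ~ Mother_age,
                data = toy_merged, family = "poisson")

# The Poisson family uses a log link by default, so the coefficients
# describe changes in log(expected count).
print(family(fit_pois))
print(coef(summary(fit_pois)))
```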
The interpretation of parameter estimates from Poisson regression differs from that of ordinary least squares, as the reported coefficients are on the log scale. Predictions can be converted back to the original response scale through exponentiation: exp(x).
Using the relevant Poisson regression model that you fit, predict the number of paternal de novo mutations for a proband with a father who was 40.2 years old at the proband’s time of birth. Record your answer and your work.
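The back-transformation can be sketched on made-up data as follows. With a log link, the model is log(expected count) = intercept + slope * age, so exponentiating the linear predictor gives the expected count. The column names are assumptions, and the toy result is not the answer for the real data.

```r
# Made-up toy counts (NOT real data); column names are assumptions.
toy_merged <- data.frame(
  Father_age     = c(25, 30, 35, 40, 45),
  n_paternal_dnm = c(40L, 48L, 55L, 61L, 70L)
)

fit_pois <- glm(n_paternal_dnm ~ Father_age,
                data = toy_merged, family = "poisson")

new_father <- data.frame(Father_age = 40.2)

# Prediction on the log (link) scale, then exponentiated by hand.
log_pred      <- predict(fit_pois, newdata = new_father, type = "link")
pred_from_exp <- exp(log_pred)

# type = "response" asks predict() to apply the inverse link itself.
pred_response <- predict(fit_pois, newdata = new_father, type = "response")

print(pred_from_exp)
print(pred_response)
```

Note that predict() on a glm object returns the link-scale value unless type = "response" is given, which is an easy detail to miss when recording a predicted count.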
Doing the analyses we tell you to do is surely fun, but isn’t it more fun to do your own analyses?
Select a new dataset from those listed at the bottom of this website. The corresponding data can generally be found as a .csv file in the tidytuesday/data/<year>/<date> subdirectory of the GitHub repository. Record which dataset you picked.
Generate figures to explore these data. What patterns do you notice? Record your observations and submit any figures you make.
Pose a hypothesis about the data that can be tested with a linear regression model.
Fit your model, evaluate the model fit, and test your hypothesis. Record your hypothesis and results.