Chapter 16 Non-linear regression models

16.1 LOESS regression

LOESS (“locally estimated scatterplot smoothing”; also known as LOWESS, “locally weighted scatterplot smoothing”) is a modeling technique that fits a curve (or surface) to a set of data using a large number of local linear regressions. Weighted regressions are fit at numerous points across the data range, using a weighting function that drops off as you move away from the center of the fitting region (hence the “local” aspect). LOESS combines the simplicity of least squares fitting with the flexibility of non-linear techniques, and doesn’t require the user to specify a functional form ahead of time in order to fit the model. It does, however, require relatively dense sampling in order to produce robust fits.

Formally, at each point \(x_i\) we estimate the regression coefficients \(\hat{\beta}_j(x_i)\) as the values that minimize: \[ \sum_{k=1}^n w_k(x_i)(y_k - \beta_0 - \beta_1 x_k - \ldots - \beta_d x_k^d)^2 \] where \(d\) is the degree of the polynomial (usually 1 or 2) and \(w_k\) is a weight function. The most common choice of weighting function is the “tri-cube” function, defined as:

\[\begin{align*} w_k(x_i) &= (1-|x_k - x_i|^3)^3, \mbox{for}\ |x_k - x_i| < 1 \\ &= 0, \mbox{for}\ |x_k - x_i| \geq 1 \end{align*}\] where \(|x_k - x_i|\) is the normalized distance (as determined by the span parameter of the LOESS model) of the observation \(x_k\) from the focal observation \(x_i\).

The primary parameter that a user must decide on when using LOESS is the size of the neighborhood function to apply (i.e. over what distance should the weight function drop to zero). This is referred to as the “span” in the R documentation, or as the parameter \(\alpha\) in many of the papers that discuss LOESS. The appropriate span can be determined by experimentation or, more rigorously, by cross-validation.
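
To make the local weighted fitting concrete, here is a minimal sketch of what happens at a single focal point, using simulated data. The tricube() and local.fit() functions below are our own illustrations, not part of base R; loess() carries out this same computation, far more efficiently, at many points across the data range:

# simulated noisy data
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# the tri-cube weight function
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# fit one weighted polynomial regression centered at x0 and return
# the predicted value there; span controls the neighborhood size
local.fit <- function(x0, x, y, span = 0.5, degree = 1) {
  k <- ceiling(span * length(x))   # number of points in the neighborhood
  d <- abs(x - x0)                 # distances from the focal point
  h <- sort(d)[k]                  # neighborhood radius
  w <- tricube(d / h)              # tri-cube weights on normalized distances
  fit <- lm(y ~ poly(x, degree), weights = w)
  predict(fit, newdata = data.frame(x = x0))
}

# the prediction at x = 5 from a single weighted local regression
local.fit(5, x, y)

Repeating this at a grid of focal points and connecting the predictions traces out the LOESS curve.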

We’ll illustrate fitting a LOESS model using data on Barack Obama’s approval ratings over the period from 2008 to 2011 (obama-polls-2008-2011.txt).

library(tidyverse)

polls <- read_delim('https://github.com/Bio723-class/example-datasets/raw/master/obama-polls-2008-2011.txt',
                    delim="\t", trim_ws=TRUE)
# note that we needed to use "trim_ws" above because there were 
# some lurking spaces in the fields of that tab delimited data file

head(polls)
## # A tibble: 6 × 6
##   Pollster         Dates      `N/Pop` Approve Disapprove Undecided
##   <chr>            <chr>      <chr>     <dbl>      <dbl> <chr>    
## 1 Rasmussen        9/17-19/11 1500 LV      46         52 -        
## 2 Rasmussen        9/14-16/11 1500 LV      45         55 -        
## 3 Gallup           9/13-15/11 1500 A       39         52 -        
## 4 CBS/Times        9/10-15/11 1452 A       43         50 7        
## 5 Marist/McClatchy 9/13-14/11 825 RV       39         52 9        
## 6 Rasmussen        9/11-13/11 1500 LV      45         54 -

Notice that the Dates column is not very tidy. Each “date” is actually a range of dates of the form Month/DayStart-DayEnd/Year (e.g. “9/17-19/11” is September 17–19, 2011). Even nastier, some dates are of the form Month/Day/Year (a single day; e.g. “9/1/09” is September 1, 2009) or MonthStart/DayStart-MonthEnd/DayEnd/Year (e.g. “2/26-3/1/11” is February 26, 2011 to March 1, 2011). Whoever formatted the data in this fashion must really hate tidy data! To deal with this nightmare we’re going to use the tidyr::extract() function, which employs regular expressions (regex) to parse this complicated field into its constituent parts. For more details on regular expressions see the R Regular Expression Cheat Sheet and R for Data Science.

polls <- 
  polls %>% 
  
  # first separate the left-most and right-most fields as month and year, respectively
  tidyr::extract("Dates", c("month", "day.range", "year"), regex="(\\d+)/(.+)/(\\d+$)", convert = TRUE) %>%
  
  # now deal with the complicated middle field. For simplicity's sake we're just
  # going to focus on extracting the start day
  tidyr::extract("day.range", c("day.start", "day.other"), regex = "(\\d+)(.+)", convert = TRUE) %>%
  
  # finally convert YY to 20YY
  mutate(year = 2000 + year) 
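
As a quick sanity check, we can test the first regex directly on the nastiest date format using stringr (part of the tidyverse). The captured groups for “2/26-3/1/11” are “2”, “26-3/1”, and “11” — i.e. the month, the day range (parsed further by the second extract() call), and the two-digit year:

library(stringr)
str_match("2/26-3/1/11", "(\\d+)/(.+)/(\\d+$)")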

head(polls)
## # A tibble: 6 × 9
##   Pollster         month day.start day.o…¹  year `N/Pop` Approve Disap…² Undec…³
##   <chr>            <int>     <int> <chr>   <dbl> <chr>     <dbl>   <dbl> <chr>  
## 1 Rasmussen            9        17 -19      2011 1500 LV      46      52 -      
## 2 Rasmussen            9        14 -16      2011 1500 LV      45      55 -      
## 3 Gallup               9        13 -15      2011 1500 A       39      52 -      
## 4 CBS/Times            9        10 -15      2011 1452 A       43      50 7      
## 5 Marist/McClatchy     9        13 -14      2011 825 RV       39      52 9      
## 6 Rasmussen            9        11 -13      2011 1500 LV      45      54 -      
## # … with abbreviated variable names ¹​day.other, ²​Disapprove, ³​Undecided

For the next steps we’ll need the lubridate library (install if needed):

library(lubridate)

polls <-
  polls %>%
  mutate(date = make_date(year = year, month=month, day = day.start))

head(polls)
## # A tibble: 6 × 10
##   Polls…¹ month day.s…² day.o…³  year `N/Pop` Approve Disap…⁴ Undec…⁵ date      
##   <chr>   <int>   <int> <chr>   <dbl> <chr>     <dbl>   <dbl> <chr>   <date>    
## 1 Rasmus…     9      17 -19      2011 1500 LV      46      52 -       2011-09-17
## 2 Rasmus…     9      14 -16      2011 1500 LV      45      55 -       2011-09-14
## 3 Gallup      9      13 -15      2011 1500 A       39      52 -       2011-09-13
## 4 CBS/Ti…     9      10 -15      2011 1452 A       43      50 7       2011-09-10
## 5 Marist…     9      13 -14      2011 825 RV       39      52 9       2011-09-13
## 6 Rasmus…     9      11 -13      2011 1500 LV      45      54 -       2011-09-11
## # … with abbreviated variable names ¹​Pollster, ²​day.start, ³​day.other,
## #   ⁴​Disapprove, ⁵​Undecided

Now let’s plot approval rating as a function of date:

polls.plot <-
  polls %>%
  ggplot(aes(x = date, y = Approve)) +
  geom_point(alpha=0.5, pch=1) + 
  labs(x = "Date", y = "Approval Rating",
       title = "Barack Obama's Approval Ratings, 2008-2011")

polls.plot

We can fit the LOESS as follows, and get back the predicted values using the predict() function. Note that loess() expects a numeric predictor, which is why we convert the dates with as.numeric():

loess.approval <- loess(Approve ~ as.numeric(date), data = polls)
loess.predicted.values <- predict(loess.approval)
head(loess.predicted.values)
## [1] 44.55653 44.59349 44.60572 44.64216 44.60572 44.63006
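
If we want to plot these explicit predictions ourselves (rather than relying on geom_smooth, shown below), one approach is to swap the predictions in for the observed values and add a line layer. This sketch assumes there are no missing values in Approve or date, since loess() silently drops incomplete rows:

# overlay the explicitly computed LOESS predictions on the scatter plot
polls.plot +
  geom_line(data = mutate(polls, Approve = loess.predicted.values),
            color = 'red')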

Usually we’ll want to visualize the LOESS regression, which we can conveniently do with ggplot2::geom_smooth without having to explicitly calculate the LOESS:

polls.plot +
  geom_smooth(color='red', method="loess", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Here’s the same data fit with a smaller span (the parameter that controls the “local neighborhood” size in LOESS):

polls.plot +
  geom_smooth(color='red', method="loess", se=FALSE, span=0.1)
## `geom_smooth()` using formula = 'y ~ x'

The high density of the polling justifies the smaller span, and the additional deviations apparent when the LOESS is fit with the smaller span likely reflect real world changes in approval, induced by a variety of political and other news events.

For example, we can zoom in on 2011:

polls.plot +
  geom_smooth(color='red', method="loess", se=FALSE, span=0.1) + 
  coord_cartesian(xlim=c(ymd(20110101), ymd(20110901)), ylim=c(35,65)) + 
  scale_x_date(date_breaks="1 month", date_labels="%B") + 
  labs(title="Barack Obama's Approval Ratings, Jan - Sep 2011")
## `geom_smooth()` using formula = 'y ~ x'

Increased approval ratings in January coincide with the approval of a tax deal and a speech to the nation following the shooting of congresswoman Gabby Giffords in Tucson, AZ (https://www.cnbc.com/id/41139968). The spike apparent in early May coincides with the death of Osama Bin Laden. You might take a look at major political events in other years to see if you can identify drivers behind other approval rating shifts.
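
As mentioned above, the span can also be chosen by cross-validation. Here is a rough sketch of how that might look; this is our own illustration (5 folds, squared-error loss), not a built-in feature of loess(). We use loess.control(surface = "direct") so the model can predict outside each training fold's range of dates:

# 5-fold cross-validation over a small grid of candidate spans
set.seed(20)
spans <- c(0.1, 0.25, 0.5, 0.75)
folds <- sample(rep(1:5, length.out = nrow(polls)))

cv.error <- sapply(spans, function(s) {
  fold.mse <- sapply(1:5, function(f) {
    train <- polls[folds != f, ]
    test  <- polls[folds == f, ]
    fit <- loess(Approve ~ as.numeric(date), data = train, span = s,
                 control = loess.control(surface = "direct"))
    pred <- predict(fit, newdata = data.frame(date = test$date))
    mean((test$Approve - pred)^2, na.rm = TRUE)
  })
  mean(fold.mse)
})

# the candidate span with the lowest estimated prediction error
spans[which.min(cv.error)]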

16.2 Logistic regression

Logistic regression is used when the dependent variable is discrete (often binary). The explanatory variables may be either continuous or discrete.

Examples:

  • Whether a gene is turned off (=0) or on (=1) as a function of levels of various proteins.
  • Whether an individual is healthy (=0) or diseased (=1) as a function of various risk factors.
  • Whether an individual died (=0) or survived (=1) some selective event as a function of behavior, morphology, etc.

We model the binary response variable, \(Y\), as a function of the predictor variables, \(X_1\), \(X_2\), etc., as:

\[ P(Y = 1|X_1,\ldots,X_p) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]

So we’re modeling the probability of the state of Y as a function of a linear combination of the predictor variables.

For logistic regression, \(f\) is the logistic function: \[ f(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}} \]

Therefore, the bivariate logistic regression is given by: \[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Note that \(\beta_0\) here is akin to the intercept in our standard linear regression.

16.2.1 A web app to explore the logistic regression equation

To help you develop an intuition for the logistic regression equation, I’ve developed a small web app that allows you to explore how the shape of the regression curve responds to changes in the regression coefficients \(\beta_0\) and \(\beta_1\). Open the app in another browser window and play with the sliders that control the coefficients \(B_0\) and \(B_1\). In the assignment associated with today’s class you’ll be asked to answer some specific questions based on this app.
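
If you’d rather explore in R directly, here’s a minimal sketch that plots the logistic curve for a few arbitrary combinations of coefficients:

library(tidyverse)

# P(Y = 1 | X) as a function of the coefficients
logistic <- function(x, b0, b1) 1 / (1 + exp(-(b0 + b1 * x)))

# evaluate the curve over a grid of x values for several coefficient choices
curves <-
  expand.grid(x = seq(-10, 10, length.out = 200),
              b0 = c(-2, 0, 2),
              b1 = c(0.5, 2)) %>%
  mutate(p = logistic(x, b0, b1),
         coefs = paste0("b0 = ", b0, ", b1 = ", b1))

ggplot(curves, aes(x = x, y = p, color = coefs)) +
  geom_line() +
  labs(x = "X", y = "P(Y = 1 | X)")

Notice that \(\beta_1\) controls the steepness (and direction) of the transition, while \(\beta_0\) shifts the curve along the X-axis.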

16.2.2 Titanic data set

titanic.csv contains information about passengers on the Titanic. Variables in this data set include information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not they survived the sinking of the ship (0 = died, 1 = survived).

library(tidyverse)
library(broom)
library(cowplot)
library(ggthemes)
titanic <- read_csv("http://bit.ly/bio304-titanic-data")
names(titanic)
##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"

16.2.3 Subsetting the data

We’ve all heard the phrase, “Women and children first”, so we might expect that the probability that a passenger survived the sinking of the Titanic is related to their sex and/or age. Let’s create separate data subsets for male and female passengers.

male <- filter(titanic, sex == "male")
female <- filter(titanic, sex == "female")

16.2.4 Visualizing survival as a function of age

Let’s create visualizations of survival as a function of age for the male and female passengers.

fcolor = "lightcoral"
mcolor = "lightsteelblue"

female.plot <- ggplot(female, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.05, color = fcolor) +
  labs(title = "Female Passengers")

male.plot <- ggplot(male, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.05, color = mcolor) + 
  labs(title = "Male Passengers")

plot_grid(female.plot, male.plot)

The jittered points with Y-axis values around one are passengers who survived; the points jittered around zero are those who died.

16.2.5 Fitting the logistic regression model

The function glm (generalized linear model) can be used to fit the logistic regression model (as well as other models). Setting the argument family = binomial gives us logistic regression. Note that when fitting the model the dependent variable needs to be numeric, so if the data is provided as Boolean (logical) TRUE/FALSE values, they should be converted to integers using as.numeric().
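
For example (purely illustrative, since survived is already coded 0/1 in the Titanic data):

# as.numeric() turns logical values into 0/1
as.numeric(c(TRUE, FALSE, TRUE))
## [1] 1 0 1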

First we fit the regression for the female passengers.

fit.female <- glm(survived ~ age, family = binomial, female)
tidy(fit.female)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   0.493    0.254        1.94 0.0523 
## 2 age           0.0225   0.00854      2.64 0.00834

The column “estimate” gives the coefficients of the model. The “(Intercept)” estimate corresponds to \(\beta_0\) in the logistic regression equation, and the “age” estimate corresponds to the coefficient \(\beta_1\).
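
To see what these estimates mean on the probability scale, we can plug them into the logistic equation by hand. For example, the predicted probability of survival for a 30-year-old female passenger (this computes the same thing as predict(fit.female, newdata = data.frame(age = 30), type = "response")):

b <- coef(fit.female)
1 / (1 + exp(-(b[1] + b[2] * 30)))  # approximately 0.76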

Now we repeat the same step for the male passengers.

fit.male <- glm(survived ~ age, family = binomial, male)
tidy(fit.male)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -0.661    0.225       -2.94 0.00329
## 2 age          -0.0238   0.00728     -3.27 0.00109

Notice that the female coefficients are both positive, while the male coefficients are negative. We’ll visualize what this means in terms of the model below.

16.2.6 Visualizing the logistic regression

To visualize the logistic regression fit, we first use the predict function to generate the model predictions about probability of survival as a function of age.

ages <- seq(0, 75, 1) # predict survival for ages 0 to 75

predicted.female <- predict(fit.female, 
                            newdata = data.frame(age = ages),
                            type = "response")

predicted.male <- predict(fit.male,
                          newdata = data.frame(age = ages),
                          type = "response")
                            

Having generated the predicted probabilities of survival we can then add these prediction lines to our previous plot using geom_line.

female.logistic.plot <- female.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.female),
            color = fcolor, size = 1)

male.logistic.plot <- male.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.male),
            color = mcolor, size = 1)

plot_grid(female.logistic.plot, male.logistic.plot)

We see that for the female passengers, the logistic regression predicts that the probability of survival increases with passenger age. In contrast, the model fit to the male passengers suggests that the probability of survival decreases with passenger age. For the male passengers, the data is consistent with “children first”; for female passengers this model doesn’t seem to hold. However, there are other factors to consider as we’ll see below.

16.2.7 Quick and easy visualization

Here’s an alternative “quick and easy” way to generate the plot above using the awesome power of ggplot. The downside of this approach is that we don’t generate the detailed information on the model, which is something you’d certainly want to have in any real analysis.

ggplot(titanic, aes(x=age, y=survived, color=sex)) + 
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",  method.args = list(family="binomial"))  + 
  labs(x = "Age", y = "P(Survival)") +
  facet_wrap(~ sex) +
  scale_color_manual(values = c(fcolor, mcolor))
## `geom_smooth()` using formula = 'y ~ x'

16.2.8 Impact of sex and passenger class on the models

In our previous analysis we considered the relationship between survival and age, conditioned (faceted) on passenger sex. In a complex data set like this one, it is often useful to condition on multiple variables simultaneously. Let’s extend our visualization to look at the regression faceted on both class and sex, using facet_grid:

ggplot(titanic, aes(x=age, y=survived, color=sex)) + 
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",  method.args = list(family="binomial"))  + 
  labs(x = "Age", y = "P(Survival)") +  
  facet_grid(pclass ~ sex) +
  scale_color_manual(values = c(fcolor, mcolor)) + 
  theme_few()
## `geom_smooth()` using formula = 'y ~ x'

Having conditioned on both sex and ticket class, our figure now reveals a much more complex relationship between age and survival. Almost all first class female passengers survived, regardless of age. For second class female passengers, the logistic regression suggests a very modest decrease in survival with increasing age. The negative relationship between age and survival is stronger still for third class females. Male passengers, on the other hand, show a negative relationship between age and survival regardless of class, but the models suggest that there are still class-specific differences in this relationship.

16.2.9 Fitting multiple models based on groupings using dplyr::do

In the figure above we used ggplot and facet_grid to visualize logistic regression of survival on age, conditioned on both sex and class. What if we wanted to calculate the terms of the logistic regressions for each combination of these two categorical variables? There are three passenger classes and two sexes, meaning we’d have to create six data subsets and fit the model six times if we used the same approach we used previously. Luckily, dplyr provides a powerful function called do() that allows us to carry out arbitrary computations on grouped data.

There are two ways to use do(). The first way is to give the expressions you evaluate in do() a name, in which case do() will store the results in a column. The second way to use do() is for the expression to return a data frame.

In this first example, the model fits are stored in the fits column. When using do() you can refer to the groupings using a period (.):

grouped.models <-
  titanic %>%
  group_by(sex, pclass) %>%
  do(fits = glm(survived ~ age, family = binomial, data = .))

grouped.models
## # A tibble: 6 × 3
##   sex    pclass fits  
##   <chr>   <dbl> <list>
## 1 female      1 <glm> 
## 2 female      2 <glm> 
## 3 female      3 <glm> 
## 4 male        1 <glm> 
## 5 male        2 <glm> 
## 6 male        3 <glm>

Notice that the “fits” column doesn’t explicitly print out the details of the model. The object returned by glm() can’t simply be represented as a text string (it’s a list), so we see a placeholder that tells us there is data here represented by a glm object. However, we can access the column with the fits just like any other variable:

# get the summary of the second logistic regression (Female, 2nd Class) 
tidy(grouped.models$fits[[2]])
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   3.49      0.904       3.86 0.000112
## 2 age          -0.0450    0.0255     -1.77 0.0773

Now we illustrate the second approach to using do(). When no name is provided, do() expects its expression to return a data frame. Here we use the broom::tidy() function to get the key results of each fitted model into a data frame:

titanic %>%
  group_by(sex, pclass) %>%
  do(tidy(glm(survived ~ age, family = binomial, data = .)))
## # A tibble: 12 × 7
##    sex    pclass term        estimate std.error statistic    p.value
##    <chr>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
##  1 female      1 (Intercept)  2.90       1.24       2.34  0.0193    
##  2 female      1 age          0.00960    0.0326     0.294 0.769     
##  3 female      2 (Intercept)  3.49       0.904      3.86  0.000112  
##  4 female      2 age         -0.0450     0.0255    -1.77  0.0773    
##  5 female      3 (Intercept)  0.289      0.341      0.846 0.398     
##  6 female      3 age         -0.0178     0.0136    -1.31  0.190     
##  7 male        1 (Intercept)  0.880      0.527      1.67  0.0951    
##  8 male        1 age         -0.0374     0.0128    -2.93  0.00335   
##  9 male        2 (Intercept)  1.04       0.599      1.74  0.0818    
## 10 male        2 age         -0.112      0.0249    -4.50  0.00000672
## 11 male        3 (Intercept) -0.761      0.342     -2.23  0.0259    
## 12 male        3 age         -0.0339     0.0134    -2.53  0.0113

Using this approach we get a nice data frame showing the logistic regression coefficients, and associated statistics (standard error, P-values, etc) for the regression of survival on age, for each combination of sex and class.
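
One caveat: do() has been superseded in recent releases of dplyr. If you’re using dplyr 1.0 or later, group_modify() achieves the same result; a sketch, assuming the same titanic data and libraries loaded above:

titanic %>%
  group_by(sex, pclass) %>%
  group_modify(~ tidy(glm(survived ~ age, family = binomial, data = .x)))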