Local Regression (LOESS)

Sometimes we have bivariate data that are not well represented by a linear function, even after transforming the variables. We might be able to see a relationship in a scatterplot, but we are unable to specify a parametric model that properly describes the relationship between outcome and predictor. This can be particularly common when the predictor is a time variable and the outcome is a time series. In these situations, one way to grasp and convey the relationship is with “local regression,” which fits a nonparametric curve to a scatterplot. Note: Local regression also works in multivariate contexts, but we'll focus on the bivariate form here for the sake of simplicity.

Let's create a simple bivariate relationship and a complex one to see how local regression works in both cases.

set.seed(100)
# 100 draws, with replacement, from the integers 1 through 50
x <- sample(1:50, 100, TRUE)
# a simple linear relationship (the 50 noise draws are recycled across the 100 x values)
y1 <- 2 * x + rnorm(50, 0, 10)
# a V-shaped conditional mean, 5 + 5 * |x - 25|, with noise whose spread also depends on x
y2 <- 5 + rnorm(100, 5 * abs(x - 25), abs(x - 10) + 10)

Fitting and visualizing local regressions

We can fit the local regression using the loess function, which takes a formula object as its argument, just like any other regression:

localfit <- loess(y1 ~ x)

We can look at the summary of the localfit object but, unlike with parametric regression methods, the summary won't tell us much.

summary(localfit)
## Call:
## loess(formula = y1 ~ x)
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 4.51 
## Residual Standard Error: 12 
## Trace of smoother matrix: 4.92 
## 
## Control settings:
##   normalize:  TRUE 
##   span       :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
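
The control settings show loess's defaults: a span of 0.75 (the share of the data used in each local fit) and locally quadratic fitting (degree = 2). If we wanted a rougher or smoother curve, we could adjust these arguments when fitting. A quick sketch (with hypothetical object names; these alternative fits are not used below):

# a smaller span uses less of the data in each local fit, giving a wigglier curve
localfit_rough <- loess(y1 ~ x, span = 0.25)
# a larger span with locally linear fitting gives a smoother curve
localfit_smooth <- loess(y1 ~ x, span = 1, degree = 1)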

Local regression doesn't produce coefficients, so there's no way to summarize the model in tabular form. Instead, we have to examine its predicted values and plot them.

We can calculate predicted values (and their standard errors) at each value of x from 1 to 50:

localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)

Because we requested standard errors, the result is a list containing the predicted values ($fit), their standard errors ($se.fit), the residual scale, and the effective degrees of freedom. The fit at x = 1 is NA because that value did not occur in our sample and, by default, loess does not extrapolate beyond the observed range of the predictor:

localp
## $fit
##      1      2      3      4      5      6      7      8      9     10 
##     NA  1.817  3.919  6.025  8.136 10.255 12.383 14.522 16.675 18.843 
##     11     12     13     14     15     16     17     18     19     20 
## 21.035 23.249 25.475 27.699 29.910 32.156 34.459 36.770 39.038 41.213 
##     21     22     23     24     25     26     27     28     29     30 
## 43.297 45.333 47.332 49.304 51.261 53.158 54.964 56.707 58.418 60.124 
##     31     32     33     34     35     36     37     38     39     40 
## 61.856 63.577 65.254 66.916 68.595 70.319 72.120 73.977 75.854 77.754 
##     41     42     43     44     45     46     47     48     49     50 
## 79.680 81.636 83.624 85.649 87.711 89.808 91.937 94.099 96.292 98.515 
## 
## $se.fit
##     1     2     3     4     5     6     7     8     9    10    11    12 
##    NA 5.600 4.910 4.287 3.736 3.261 2.867 2.557 2.331 2.180 2.092 2.048 
##    13    14    15    16    17    18    19    20    21    22    23    24 
## 2.037 2.046 2.066 2.075 2.083 2.123 2.190 2.239 2.230 2.214 2.249 2.327 
##    25    26    27    28    29    30    31    32    33    34    35    36 
## 2.372 2.329 2.252 2.209 2.230 2.289 2.323 2.295 2.249 2.228 2.245 2.278 
##    37    38    39    40    41    42    43    44    45    46    47    48 
## 2.280 2.248 2.208 2.168 2.137 2.128 2.158 2.244 2.409 2.668 3.024 3.475 
##    49    50 
## 4.014 4.634 
## 
## $residual.scale
## [1] 12.04
## 
## $df
## [1] 94.97
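
If we only need fitted values at the observed data points, rather than over a grid of new x values, we can also extract them (and the residuals) directly from the model object. A brief sketch using standard extractor functions:

fitted(localfit)     # fitted values at the observed x values
residuals(localfit)  # residuals from the local fit
head(predict(localfit))  # predict() with no new data also returns the fitted values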

To see the loess curve, we could simply plot the fitted values. We'll do something a little more interesting, though. We'll start by plotting our original data (in blue), then draw the standard errors as shaded polygons (using the polygon() function) for 1, 2, and 3 SEs, and finally overlay the fitted loess curve in white. The plot nicely shows the fit to the data and the increasing uncertainty about the conditional mean at the extremes of the predictor. We also see that these data are easily modeled by a linear regression, which we add to the plot for comparison.

plot(y1 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)), 
    col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit:
abline(lm(y1 ~ x), lwd = 2)

(Figure: scatterplot of y1 against x with the loess curve, 1-, 2-, and 3-SE bands, and the overlaid linear fit)

Loess works well in this linear situation, but in such cases we're better off fitting the linear model, which gives us directly interpretable coefficients. The major downside of local regression is that it can only be seen and understood as a graph.
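
To see how little the nonparametric fit adds here, we can compare the two sets of predictions directly. A quick sketch using the objects created above (lmp1 is a hypothetical name chosen to avoid clashing with the lmp object used later):

# linear-model predictions over the same grid of x values
lmp1 <- predict(lm(y1 ~ x), data.frame(x = 1:50))
# largest absolute gap between the loess and linear predictions
# (x = 1 is dropped because loess returned NA there)
max(abs(localp$fit - lmp1), na.rm = TRUE)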

We can repeat the above process for our second outcome, which lacks a clear linear relationship between predictor x and outcome y2:

localfit <- loess(y2 ~ x)
localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)
plot(y2 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)), 
    col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit and associated standard errors:
lmfit <- lm(y2 ~ x)
abline(lmfit, lwd = 2)
lmp <- predict(lmfit, data.frame(x = 1:50), se.fit = TRUE)
lines(1:50, lmp$fit - lmp$se.fit, lty = 2)
lines(1:50, lmp$fit + lmp$se.fit, lty = 2)

(Figure: scatterplot of y2 against x with the loess curve, 1-, 2-, and 3-SE bands, and the overlaid linear fit with its standard errors)

In contrast to y1, which was a simple linear function of x, these data are far messier. They are not well represented by a straight-line fit (as the overlaid linear fit makes clear). Instead, the local regression shows that y2 depends on the predictor in a distinctly nonlinear way. In these situations, the loess curve can help us understand the relationship between outcome and predictor and can guide the construction of a subsequent parametric model that approximates the data better than a straight line.
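
For example, the roughly V-shaped loess curve suggests trying a simple curvilinear specification, such as a quadratic in x. This is an illustrative sketch (quadfit is a hypothetical name), not part of the analysis above:

# a parametric follow-up model suggested by the shape of the loess curve
quadfit <- lm(y2 ~ x + I(x^2))
summary(quadfit)
# overlay the quadratic fit on the existing plot of y2 against x (dotted line)
lines(1:50, predict(quadfit, data.frame(x = 1:50)), lty = 3, lwd = 2)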