---
title: Appendix A -- HSF Examples
author: "John Fieberg, Johannes Signer, Brian Smith, Tal Avgar"
preprint: no
output:
html_document:
toc: true
toc_float: true
number_sections: false
csl: mee.cls
bibliography:
- book.bib
- sample.bib
biblio-style: apalike
editor_options:
chunk_output_type: console
---
```{r setup, include = FALSE}
library(knitr)
library(tidyverse)
library(amt)
library(broom)
knitr::opts_chunk$set(cache=TRUE, warning = FALSE, message = FALSE, error = TRUE)
options(digits = 3)
set.seed(1323)
colorize <- function(x, color="red") {
if (knitr::is_latex_output()) {
sprintf("\\textcolor{%s}{%s}", color, x)
} else if (knitr::is_html_output()) {
sprintf("%s", color,
x)
} else x
}
```
# Overview
In this appendix, we will walk the user through the process of fitting and interpreting habitat-selection functions using connections between logistic regression, Inhomogeneous Poisson Point-Process (IPP) Models, and weighted distribution theory.
### Inhomogeneous Poisson Point-Process Model
The IPP model provides a simple framework for modeling the density of points in space as a log-linear function of spatial predictors through a spatially-varying intensity function, $\lambda(s)$:
\begin{equation}
\log(\lambda(s)) =\beta_0 + X_1(s)\beta_1 + \ldots X_k(s)\beta_k
\end{equation}
where $s$ is a location in geographic space, and $X_1(s), \ldots, X_k(s)$ are spatial predictors associated with location $s$. The intercept, $\beta_0$, determines the log-density of points (within a small homogeneous area around $s$) when all $X_i(s)$ are 0, and the slopes, $\beta_1, \ldots, \beta_k$, describe the effect of spatial covariates on the log density of locations in space.
### Weighted Distribution Theory
We can think of the *habitat-selection* function, $w(X(s); \beta)$, as providing a set of weights that takes us from the distribution of available habitat, $a(s)$ to the distribution of used habitat, $u(s)$:
\begin{equation}
u(s) = \frac{w(X(s),\beta)a(s)}{\int_{g \in G}w(X(g),\beta)a(g) dg} \mbox{,}
\end{equation}
where the denominator integrates over a geographic area, $G$, that is assumed to be available to the animal and $g$ is a dummy variable for integration.
# Prepare Data
We will begin by preparing our data to allow us to estimate parameters in the habitat-selection function. This will involve loading our location data and covariates, sampling random available locations from an availability domain, and associating the habitat covariates with the points.
## Load Data
First, let's read the fisher location and spatial covariate data. These data are available as objects in the `amt` package (`amt_fisher` and `env_covar`), and we access them easily if the package is attached. The object `amt_fisher` is already of class `track_xyt` from the package `amt`. The other three objects are all of class `RasterLayer`, which can be created by reading most standard raster files with the function `raster::raster()`. For ease, we have already loaded them, but your own data will likely need to be read in using functions from other packages.
```{r, data_amt}
fisher <- amt_fisher
landuse <- amt_fisher_covar$landuse
elevation <- amt_fisher_covar$elevation
popden <- amt_fisher_covar$popden
```
In this example, we will work with data from a single individual, "Lupe", so let's subset the `fisher` data and sort the locations using the timestamp of the observations.
```{r subset}
dat <- fisher %>%
filter(name == "Lupe") %>%
arrange(t_)
```
The raw land-use data we loaded here has numerical codes, which we will group into different landcover types using a custom-written function, `reclass_landuse()`, below. We will apply this function when preparing the data.
```{r reclass fun}
reclass_landuse <- function(x) {
fct_collapse(factor(x),
forest = c("30","40","50","60", "70","80", "90","100"),
grass = c("120", "140"),
wet = c("160"))
}
```
Although regular sampling is not required when fitting HSFs, it is a good idea to first check the sampling rate. Your device may be programmed to collect data at regular time steps, but there will typically be some small amount of variation in the time between locations.
We can check the actual sampling rate using the `summarize_sampling_rate()` function in the `amt` package:
```{r sampling}
summarize_sampling_rate(dat)
```
We can see summary statistics for the difference between locations in minutes (see the final column for the unit).
## Generating Available Points
We will take advantage of `amt`'s [piped](https://magrittr.tidyverse.org/reference/pipe.html) (`%>%`) workflow here to prepare data for the HSF. This workflow will help make our code both readable and compact. Most of the functions from `amt` have sensible defaults for each argument to get you started. You can easily see these defaults by checking the help file for that function (e.g., `?random_points`) or change them as you see fit.
In this workflow, we begin with Lupe's location data (`dat`), generate available points (`random_points()`), and attach the environmental covariates to both the observed and available points (`extract_covariates()`). Finally, we'll use `dplyr::mutate()` to scale and center our covariates and add a weight to the background points.
```{r rsf_data}
rsf_dat <- dat %>%
random_points() %>%
extract_covariates(landuse) %>%
extract_covariates(elevation) %>%
extract_covariates(popden) %>%
mutate(elevation = scale(elevation)[, 1],
popden = scale(popden)[, 1],
landuseC = reclass_landuse(landuse),
forest = landuseC == "forest",
weight = ifelse(case_, 1, 1e3)
)
```
Here's what the resulting data set looks like now. Each row is a single point with all of its associated attributes. The indicator variable, `case_`, is equal to `TRUE` for the used locations and `FALSE` for random locations.
```{r check ssf_dat}
print(rsf_dat, n = 3, width = Inf)
```
## Availability Domain and Number of Available Points
```{r echo=FALSE}
set.seed(1323) # set seed of random number generator
```
Let's have a closer look at what is happening under the hood when applying the `random_points()` function. This function conveniently calculates a minimum convex polygon around Lupe's locations. It then samples 10 random locations for every used location within this polygon. These locations are meant to capture what is assumed to be available to the animal. Other methods exist for sampling available points. For example, it is possible to sample from within an outer counter of a kernel density estimate applied to Lupe's locations or to sample using a regular grid (see `vignette("rsf")` for more options in `amt`).
A common question is, "How many available points do I need to sample?" As discussed in the main text, the available points are used to estimate the integral in the denominator of the use-availability likelihood:
\begin{equation}
L(\beta; s_1, \ldots, s_n) = \prod_{i=1}^{y_s}\frac{w(X(s_i); \beta)}{\int_{s\in G}w(X(s); \beta) ds}
\end{equation}
where $w(X(s); \beta) = \exp(\beta_1X_1(s) + \ldots \beta_kX_k(s))$ is our habitat-selection function and $y_G$ is the number of used points in our data set.
Below, we evaluate how $\hat{\beta}$ changes as we increase the number of available locations from 1 available location per used location to 100 available locations per used location.
```{r fig_2}
# Explore sensitivity of HSF coefficients to the number of available points
n.frac <- c(1, 5, 10, 50, 100)
n.pts <- ceiling(nrow(dat) * n.frac)
n.rep <- 20
res1 <- tibble(
n.pts = rep(n.pts, n.rep),
frac = rep(n.frac, n.rep),
res = map(
n.pts, ~
dat %>% random_points(n = .x) %>%
extract_covariates(landuse) %>%
extract_covariates(elevation) %>%
extract_covariates(popden) %>%
mutate(landuseC = reclass_landuse(landuse),
elevation = scale(elevation),
popden = scale(popden),
w = ifelse(case_, 1, 5000)) %>%
glm(case_ ~ elevation + popden + landuseC,
weight = w, data = ., family = binomial()) %>%
tidy()))
res1 %>% unnest(cols = res) %>%
mutate(terHSF.Lupe1 = recode(term, "(Intercept)"="Intercept",
popden = "human population density",
landuseCgrass = "landuse == grass",
landuseCother = "landuse == other",
landuseCshrub = "landuse == shrub",
landuseCwet = "landuse == wet",
)) %>%
ggplot(aes(factor(frac), y = estimate)) +
geom_boxplot() + facet_wrap(~ terHSF.Lupe1, scale ="free") +
geom_jitter(alpha = 0.2) +
labs(x = "Number of available points (multiplier of no. of used locations)", y = "Estimate") +
theme_light()
```
```{r echo=FALSE}
ggsave("figures/Figure2.png", width = 19, units = "cm", height = 10,
dpi = 600)
```
We see that the intercept decreases as we increased the number of available points (as it is roughly proportional to the log difference between the numbers of used and available points), but the slope parameter estimates, on average, do not change much once we included at least 10 available points per used point. Thus, we conclude that, in this particular case, having 10 available points per used point is sufficient for interpreting the slope coefficients. Using more available points reduces Monte Carlo error, however, so we will proceed with 100 available points per used point.
Lets take a look at the proportion of used and available locations in each `landuseC` class using a data set created by combining our observed locations with 100 available points per used location.
```{r}
#' Use the largest sample size here for the rest of the paper
Lupe.dat <- dat %>%
random_points(n = max(n.pts)) %>%
extract_covariates(landuse) %>%
extract_covariates(elevation) %>%
extract_covariates(popden) %>%
mutate(landuseC = reclass_landuse(landuse),
elevation = scale(elevation),
popden = scale(popden))
Lupe.dat %>%
group_by(case_, landuseC) %>%
summarize(n = n()) %>%
mutate(prop = n / sum(n),
label = paste0(round(prop * 100, 1), "%")) %>%
ggplot(aes(landuseC, prop, fill = case_, group=case_,label = label)) +
geom_col(position = position_dodge2()) +
geom_text(size = 4, vjust = -0.25, position = position_dodge(width = 1)) +
labs(x = "Land use class", y = "Proportion", fill = "case_")+
scale_fill_brewer(palette = "Paired", name="case_",
breaks=c("FALSE", "TRUE"), labels=c("Available", "Used")) +
theme_light()
```
```{r echo=FALSE}
ggsave("figures/Figure1.png")
```
# Fitting and Interpetting Parameters in a Habitat-Selection Function (HSF)
As discussed in the main text, @Fithian2013 showed that assigning "infinite weights" to available points ensures that the logistic regression estimators of slope coefficients converge to the those of the IPP model. Therefore, when fitting logistic regression models to use-availability data, we suggest assigning a large weight (say 1000 or more) to each available location and a weight of 1 to all observed locations (larger values can be used to verify that results are robust to this choice). Below, we use the `ifelse` function in R to assign these weights. We then fit a weighted logistic regression model with the same habitat covariates introduced in the main text, human population density (`popden`), elevation (`elevation`), and land-use category (`landuseC`). Note that we used the same model structure when evaluating the number of available points needed to accurately estimate the integral in the denominator of the likelihood function (see **Availability Domain and Number of Available Points**).
```{r fit basic rsf}
Lupe.dat$w <- ifelse(Lupe.dat$case_, 1, 5000)
HSF.Lupe1 <- glm(case_ ~ elevation + popden + landuseC,
data = Lupe.dat, weight = w,
family = binomial(link = "logit"))
```
## Interpreting Habitat-Selection Parameters in Terms of Relative Intensities or Rates of Use
Note that we will use the terms log relative intensity and log relative selective strength interchangeably. In this section, we provide R code supporting the relative intensity calculations in the main text (note, in Supplementary Appendix B, we demonstrate the use of the `log_rss` function in the `amt` package for more complicated situations). We begin by exploring the coefficients in the fitted model using the `summary()` function:
```{r habitat summary}
summary(HSF.Lupe1)
```
We can exponentiate habitat-selection parameters to compare the relative intensity or rate of use of two locations that differ by 1 unit of the explanatory variable but are otherwise equivalent - i.e., they should be equally accessible and have identical values for all other explanatory variables.
### Quantitative Predictors
We begin by considering two locations, both in the same landcover class and with the same associated population density, but differing by 1 unit in `elevation` (since we have scaled this variable, a difference of 1 implies that the two observations differ by 1 SD in the original units of elevation):
- Location s1: elevation = 3, popden = 1.5, landC = "wet"
- Location s2: elevation = 2, popden = 1.5, landC = "wet"
The relative use of location $s_1$ versus location $s_2$ is given by:
\begin{equation}
\frac{u(s_1)}{u(s_2)} = \frac{\exp(3\beta_{elevation}+ 1.5\beta_{pop\_den}+0\beta_{grass}+1\beta_{wet})a(s_1)}{\exp(2\beta_{elevation}+ 1.5\beta_{pop\_den}+0\beta_{grass}+1\beta_{wet})a(s_2)} = \exp(\beta_{elevation})\frac{a(s_1)}{a(s_2)}
\end{equation}
Further, if we assume both locations are equally available, then $a(s_1)=a(s_2)$, and this ratio simplifies to $\exp(\beta_{elevation}) = \frac{\lambda(s_1)}{\lambda(s_2)}$. Thus, this ratio is also equivalent to the ratio of intensity functions for $s_1$ and $s_2$.
```{r}
exp(3 * coef(HSF.Lupe1)["elevation"] +
1.5 * coef(HSF.Lupe1)["popden"] + coef(HSF.Lupe1)["landuseCwet"]) /
exp(2 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"])
exp(coef(HSF.Lupe1)["elevation"])
```
We get the same ratio whenever comparing any two locations that differ by 1 unit of `elevation`, provided they are equally available and have the same values for `popden` and `landuseC`.
```{r}
exp(11 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"]) /
exp(10 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"])
exp(111 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"]) /
exp(110 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"])
exp(-4 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"]) /
exp(-5 * coef(HSF.Lupe1)["elevation"] + 1.5 * coef(HSF.Lupe1)["popden"] +
coef(HSF.Lupe1)["landuseCwet"])
```
### Categorical Predictors
Next, we compare two observations that differ in terms of their `landuseC` variable, but have the same values for `popden` and `elevation`. We again consider two locations, $s_1$ and $s_2$ assumed to be equally available.
- Location s1: elevation = 2, popden = 1.5, landuseC = "wet"
- Location s2: elevation = 2, popden = 1.5, landuseC = "forest"
Lupe's relative intensity or rate of use of location 1 relative to location 2 is given by:
\begin{equation}
\frac{u(s_1)}{u(s_2)} = \frac{\exp(2\beta_{elevation}+ 1.5\beta_{pop\_den}+0\beta_{grass}+1\beta_{wet})a(s_1)}{\exp(2\beta_{elevation}+ 1.5\beta_{pop\_den}+0\beta_{grass}+0\beta_{wet})a(s_2)} = \exp(\beta_{wet})\frac{a(s_1)}{a(s_2)}
\end{equation}
Again, if we assume the two locations are equally available, then $a(s_1)=a(s_2)$, and the ratio simplifies to $\exp(\beta_{wet})$.
```{r}
exp(coef(HSF.Lupe1)["landuseCwet"])
```
Thus, we conclude Lupe would be `r round(exp(coef(HSF.Lupe1)["landuseCwet"]),2)` as likely to use the location in `wet` than in `forest`, assuming these two locations are equally available.
What if we change the reference level from `forest` to `wet` and refit the logistic regression model?
```{r}
Lupe.dat <- mutate(Lupe.dat, landuseC1 = fct_relevel(landuseC, "wet"))
levels(Lupe.dat$landuseC)
levels(Lupe.dat$landuseC1)
HSF.Lupe2 <- glm(case_ ~ elevation + popden + landuseC1,
data = Lupe.dat, weight = w, family = binomial(link = "logit"))
summary(HSF.Lupe2)
```
We see that the coefficients for `elevation` and `popden` do not change, but the coefficients for the landcover classes do. As discussed in the main text, the coefficients for categorical predictors reflect use:availability ratios for each level of the predictor relative to the use:availability ratio for the reference class. In this case, the coefficient for `forest` is negative, despite Lupe using `forest` more than its availability (i.e., $u(s, s \in forest) > a(s, s \in forest)$) and Lupe spending more than 95% of her time in the forest! The coefficient for `forest` is negative because the use:availability ratio for `forest` is less than the use:availability ratio for the reference class, `wet`.
### Adjusting for Differences in Habitat Availability
#### Naive Adjustment Using Overall Habitat Availability
To compare Lupe's relative use of `forest` versus `wet` habitat, we could attempt to correct for differences in availability of these landcover classes within the MCP surrounding Lupe's locations. One simple approach is to multiply our ratio, calculated assuming equal availability, by the ratio of overall habitat availability for `wet` relative to `forest` habitats with Lupe's MCP.
```{r}
# Availability of forest and wet within Lupe's MCP
a.forest <- with(Lupe.dat[Lupe.dat$case_ == 0, ], sum(landuseC == "forest"))
a.wet <- with(Lupe.dat[Lupe.dat$case_ == 0, ], sum(landuseC == "wet"))
# Multiply by the ratio of availabilities
exp(coef(HSF.Lupe1)["landuseCwet"])*a.wet/a.forest
# Or, comparing wet to forest (instead of forest to wet)
1/(exp(coef(HSF.Lupe1)["landuseCwet"])*a.wet/a.forest)
```
This calculation suggests we are $(1/0.030 = 33)$ times more likely to find Lupe in `forest` than `wet` habitat. Unfortunately, with this calculation we had to naively assume that the availability distributions for `popden` and `elevation` were the same in both `wet` and `forest` cover classes. In reality, if Lupe decides to move from `forest` to `wet`, it is likely that she will experience a change in `elevation` and `popden` too (i.e., these factors will not be held constant).
#### Adjusting using Integrated Spatial Intensities
To compare Lupe's relative use of `forest` versus `wet` habitat, while also accounting for the effects other environmental characteristics that are associated with these habitat types, we can use integrated intensities -- i.e., we can integrate the spatial utilization distribution, $u(s)$, over all `forest` and `wet` habitats:
\begin{equation}
\frac{u(s, s \in forest)}{u(s, s \in wet)} = \frac{\int_{G} u(s)I(s \in forest)ds}{\int_{G}u(s)I(s \in wet)ds}
\end{equation}
where $I(s \in forest)$ and $I(s \in wet)$ are indicator functions equal to 1 when location $s$ is in `forest` or `wet`, respectively (and 0 otherwise). We can estimate this ratio using:
\begin{equation}
\frac{\hat{u}(s, s \in forest)}{\hat{u}(s, s \in wet)} = \frac{\sum_{i=1}^{n_a}\hat{w}(X(s_i); \hat{\beta})I(s_i \in forest)}{\sum_{i=1}^{n_a}\hat{w}(X(s_i); \hat{\beta}))I(s_i \in wet)}.
\end{equation}
where the sum is over the distribution of available locations.
```{r}
# Filter to include only available locations, then integrate over landuse categories
available.locs <- filter(Lupe.dat, case_ == 0)
available.locs$wx <- exp(predict(HSF.Lupe1, available.locs))
with(available.locs, sum(wx[landuseC == "forest"]) / sum(wx[landuseC == "wet"]))
# Now, compare to the observed ratio of Lupe's use of forest and and wet habitat
with(Lupe.dat[Lupe.dat$case_ == 1, ],
sum(landuseC == "forest") / sum(landuseC == "wet"))
```
We find here that the naive adjustment and integrated spatial intensities estimate are nearly identical but this will not always be the case as will be demonstrated below, by repeating these calculations using `wet` and `grass` categories.
```{r}
# No adjustment
exp(coef(HSF.Lupe1)["landuseCwet"]) / exp(coef(HSF.Lupe1)["landuseCgrass"])
# Naive adjustment
a.grass <- with(Lupe.dat[Lupe.dat$case_ == 0, ],
sum(landuseC == "grass"))
(exp(coef(HSF.Lupe1)["landuseCwet"]) /
exp(coef(HSF.Lupe1)["landuseCgrass"])) * a.wet / a.grass
# Integrated Spatial Intensities
# Filter to include only available locations, then integrate over landuse categories
with(available.locs, sum(wx[landuseC == "wet"]) / sum(wx[landuseC == "grass"]))
```
In this case, the estimate using integrated spatial intensities agrees best with data - the relative amount of time that Lupe was in `wet` versus `grass` habitat is shown below:
```{r}
# Now, compare to the observed ratio in the data set
with(Lupe.dat[Lupe.dat$case_==1, ], sum(landuseC == "wet") /
sum(landuseC == "grass"))
```
# Interactions
Lastly, we consider the interaction model described in the main text. We begin by plotting the distribution of elevation values at used and available locations across the different landcover classes.
```{r}
ggplot(Lupe.dat, aes(landuseC, elevation, fill=case_))+
geom_boxplot() +
scale_fill_brewer(palette = "Paired", name="case_",
breaks=c("FALSE", "TRUE"), labels=c("Available", "Used"))+
theme_light()
```
```{r echo=FALSE}
ggsave(file="figures/Figure3.png")
```
We then fit a model with an interaction between `elevation` and `landuseC`:
```{r}
HSF.Lupe3 <- glm(
case_ ~ elevation + popden + landuseC + elevation:landuseC,
data = Lupe.dat,
weight=w,
family = binomial)
summary(HSF.Lupe3)
```
We see that R created two new variables `elevation:landuseCgrass` equal to `elevation` when `landuseC` is `grass` and is 0 otherwise and `elevation:landuseCwet` equal to `elevation` when `landuseC` is `wet` and is 0 otherwise. The coefficients associated with these predictors quantify the change in slope (i.e., change in the effect of `elevation`) when the locations fall in `grass` or `wet` relative to the slope when the locations fall in `forest`. Using these results, we can easily derive that the relative intensity of use of two equally available locations that differ by 1 SD unit of `elevation` is equal to $\exp(0.313)=1.37$ when the two locations are in `forest`, $\exp(0.313+0.112)= 1.53$ when the locations are in `grass`, and $\exp(0.313-0.499)= 0.83$ when the locations are in `wet` habitat. Thus, Lupe selects for higher `elevation` when in `forest` or `grass`, but avoids higher elevations when in `wet`. Alternatively, we can consider how `elevation` changes Lupe's view of the different landcover categories, noting that $\beta_{grass} = -1.471 + 0.112$`elevation` and $\beta_{wet} = 0.183-0.499$`elevation`. Thus, we see that Lupe's strength of avoidance of `grass` (relative to `forest`) and selection for `wet` (relative to `forest`) both decline with `elevation`, and Lupe's inherent ranking of these 3 habitat types will change as `elevation` increases.
```{r echo=FALSE}
save(HSF.Lupe1, HSF.Lupe2, HSF.Lupe3, file="data/models.Rdata")
```
# Conclusion
We have highlighted how connecting habitat-selection functions to IPP models and weighted distribution theory helps with interpreting parameters in habitat-selection functions using simple examples involving a mix of categorical and quantitative variables.
# Session
```{r}
sessioninfo::session_info()
```
# References