#' ---
#' title: "Case Study III: Model Selection Uncertainty" 
#' output: html_document
#' ---

#' ## Objective
#'    
#' This example demonstrates how the bootstrap can be used to explore model uncertainty.  
#'  
#' #### Load R libraries	
#' 	
#+warning=FALSE, message=FALSE 	
library(knitr)	
library(rms)	# for validate function
library(MASS) # for stepAIC


#' ## Setting the seed of the random number generator	
#' 	
#' Use the **set.seed()** function in R to initialize the random number generator.  	
#' 	
set.seed(2041971)	

#' ## Modeling abundance of longnose dace	
#' 	
#' Read in the data:	
dace<- read.csv("data/longnosedace.csv")	

#' ### Predictors
#' 
#'  - acreage = area (in acres) drained by the stream
#'  - do2 = the dissolved oxygen (in mg/liter)
#'  - depth = the maximum depth (in cm) of the 75-meter segment of stream
#'  - no3 = nitrate concentration (mg/liter)
#'  - so4 = sulfate concentration (mg/liter)
#'  - temp = water temperature on the sampling date (in degrees C).
#'  

#' Fit a model using all 6 predictors, then use stepAIC to implement backwards
#' selection to choose a "best" model.
fullmod.lm<-lm(longnosedace~acreage+do2+maxdepth+no3+so4+temp,data=dace)	
stepAIC(fullmod.lm)


#' ##  Bootstrap validation	
#' 	
#' Validate will use the bootstrap to calculate "honest" measures of model fit. 
#'  We can also visualize "model uncertainty" in the "best model" by using bw=T 
#'  (which tells R to use backwards selection to choose the best model). The
#'  "*" below indicate, which variables are included in the "optimal model"
#'  for each bootstrap replicate.
#'  
#' After applying a backwards model selection algorithm, we end up with 
#' a model containing only acreage and no3. The $R^2$ of this model = 0.24, 
#' which describes the variance in longnosedace explained by these two predictors.
#'  If we were to apply this same model to a new data set, we would expect
#' the amount of variance that would be explained to be much lower. We can
#' obtain a more "honest" measure of the variance by: a) creating 2 bootstrap
#' data sets (one for model training and one for model testing); b) applying our
#' model selection algorithm to the training data set and calculating
#' the resulting $R^2$; c) use the same model to predict the response in the
#' bootstrap test data set and use these predictions to calculate
#' a second $R^2$;  d) calculate a measure of "optimism" by subtracting the 
#' average $R^2$ from part c from the average $R^2$ in part b; e) subtract
#' this estimate of optimism from the $R^2$ obtained from our original data set. 
#' The validate function will do this for us!
#' 
fullmod.ols<-ols(longnosedace~acreage+do2+maxdepth+no3+so4+temp,data=dace, x=T, y=T)	
validate(fullmod.ols, bw=T, B=100)

#' ## Conclusions
#' 
#' 1. We see that the different bootstrap samples result in different models 
#' being chosen as optimal. The number of predictor variables included ranges 
#' from 1 (in 6 models) to 6 (in 1 model). 
#' 
#' 2. We see that our original estimate of $R^2$ (0.24) is likely quite
#' optimistic (our estimate of optimism = 0.20).  Thus, we end up
#' with a corrected estimate of $R^2$ = 0.037 (quite depressing!).  
#'

#' ## Footer	
#' 	
#' 	
# Session Information:	
sessionInfo()