-------------------
GENERAL INFORMATION
-------------------


1. Title: R Code, Data, and Output Supporting:  Facilitating effective collaboration to prevent aquatic invasive species spread


2. Author Information:
 a) Amy Kinsley, Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota; University of Minnesota, Minnesota Aquatic Invasive Species Research Center (MAISRC). 0000-0002-0384-8731
 b) *Alex Bajcz, University of Minnesota, Minnesota Aquatic Invasive Species Research Center (MAISRC). 0000-0002-0384-8731
 c) Robert Haight, USDA Forest Service, Northern Research Station, St. Paul, Minnesota.
 d) Nicholas Phelps, University of Minnesota, Minnesota Aquatic Invasive Species Research Center (MAISRC); University of Minnesota, Department of Fisheries, Wildlife, and Conservation Biology. 0000-0003-3116-860X

*Corresponding author:
bajcz003@umn.edu


3. Description: This repository contains R code, raw and processed data, and associated outputs supporting the results reported in: Kinsley, A, Bajcz A, Haight R, and Phelps N. 2023. Facilitating effective collaboration to prevent aquatic invasive species spread. Biological Invasions [in press]. 
 
--------------------------
Version History
-------------------------- 
09/XX/2023 -- First upload to DRUM.

--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 


1. Licenses/restrictions placed on the data and code: 

Attribution-NonCommercial-ShareAlike 3.0 United States


2. Links to publications that cite or use the data: 
Forthcoming--publication is currently in press.


3. Links to other publicly accessible locations of the data: 
None.

4. Recommended citation for this archive: Bajcz A, Kinsley A, Haight R, Phelps N. 2023. R Code, Data, and Output Supporting:  Facilitating effective collaboration to prevent aquatic invasive species spread. Data Repository of the University of Minnesota. 

---------------------
FILE OVERVIEW
---------------------

There are more files provided in the archive than are summarized here. However, all those not specifically described in this section are referenced, described, or created in the files listed below. Please consult the files listed here for details.

1. statelevel_modelMSI.R, countylevel_model.R, and collablevel_model.R - These files run essentially the same optimization problem: How can inspection stations be placed optimally at lakes within various jurisdictions around the state of Minnesota so as to intercept as many boats at risk of carrying and spreading aquatic invasive species as possible? The three variations on this model reflected by these three files differ in how jurisdictional boundaries and authority work. The "MSI" tag on the first of these files indicates that the Minnesota Supercomputing Institute (MSI) was used to run that code successfully--it may not be runnable on a standard personal computer.

2. interpretting_lakes_selected_x_files.R - This file does the post-processing needed to convert the raw vectors of lakes selected in each optimization model run (the lakes_selected_county, lakes_selected_collabs_, and lakes_selected_state RDS files) into numbers of risky boats intercepted, as reflected in the model_results.png graphic. 

3. statelevel_modelMSI_makemod.R, processing_statewide_model.R, and making boats_reduce files.R - These scripts are supplementary R code files that help to generate inputs or outputs of the other files listed above and are explained further via annotations within the documents themselves.

4. statewide_modelMSI.sh and statelevel_modelMSI_makemod.sh - These are the bash batch script files used to run the state-level optimization model on MSI. 

5. countymod_budgets.csv and collabmod_budgets_collabsandcounties.csv - These two CSV files list, along the rows, the various counties in Minnesota receiving budgets in the optimization models with which to place inspection stations. Along the columns are several possible budget levels ranging from 10 up to 700 stations. In the cells are the numbers of stations allotted to each county. The former is used in the county-level model and the latter in the collaborations model, the only difference being that some counties have been essentially "combined" into one unit for the latter. These budget allocations were arrived at using a workflow outlined in another file, budget_determination_workflow.xlsx, towards the bottom in comments attached to specific cells.


6. lake_info.csv and Lakes_total_collabs.csv - These contain data that are raw inputs to the optimization models and are explained in greater depth in the DATA-SPECIFIC INFORMATION section below.

7. boats_adjSWF, boats_reduce, and boats_reduce_noswf - These are RDS data files that are inputs to the optimization models. They are explained in greater detail in the making boats_reduce files.R script. 

8. statewide_model_results - This folder contains per-budget-level vector RDS objects containing the specific lakes selected for inspection stations in the statewide model. These results are in an alternative format because each budget level was run in a separate batch job on MSI to use processing time efficiently; they must thus be "stitched back together" to take on a form similar to the outputs of the other two versions -- see the file processing_statewide_model.R for details.

9. Kaos' boater networks - These are products of an earlier study; they are the estimated numbers of boat trips occurring between every pair of lakes in the state of Minnesota yearly, as predicted by a machine learning model. These are used here to generate the boats.n.ij object that is a key input to the models. 
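
As a rough illustration of the "stitching" described for the statewide_model_results folder (item 8 above), the sketch below reads one RDS vector per budget level and combines them into a single list ordered by budget level. It is not the code in processing_statewide_model.R, which remains the authoritative version. The function name stitch_budget_results and the file naming scheme (budget level as the trailing digits of each file name, e.g., "lakes_selected_state_100.rds") are assumptions made for the example.

```r
# Hypothetical helper (not part of the archive): read one RDS vector per
# budget level from a folder and combine them into one named list.
stitch_budget_results <- function(results_dir, pattern = "\\.rds$") {
  files <- list.files(results_dir, pattern = pattern, full.names = TRUE,
                      ignore.case = TRUE)

  # Assumption: the budget level is the trailing run of digits in each file
  # name, preceded by a non-digit (e.g., "lakes_selected_state_100.rds").
  stems   <- tools::file_path_sans_ext(basename(files))
  budgets <- as.integer(sub("^.*[^0-9]([0-9]+)$", "\\1", stems))

  # One list element per budget level, each holding the lakes selected
  results <- lapply(files, readRDS)
  names(results) <- budgets

  # Order by numeric budget level rather than alphabetically by file name
  results[order(budgets)]
}

# Possible usage, if the folder's files follow the assumed naming scheme:
# statewide <- stitch_budget_results("statewide_model_results")
```

Ordering by the numeric budget level matters because file names sort alphabetically (so "100" would come before "20").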


The scripts should be run in the following order: 
1. making boats_reduce files.R
2. statelevel_modelMSI_makemod.R
3. statelevel_modelMSI.R
4. countylevel_model.R & collablevel_model.R & processing_statewide_model.R
5. interpretting lakes_selected_x files.R

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Methods: For complete methodological details, please refer to Kinsley et al. (2023), the publication cited in the Description above.

2. Instrument- or software-specific information needed to interpret the data:

Code was written in R (R Core Team 2023), and all packages used are noted towards the top of each of the included R script files. All code except that which was run on MSI was run on a machine with the following specifications: 

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] igraph_1.4.3 lattice_0.21-8 ROI.plugin.symphony_1.0-0 Rsymphony_0.1-33 slam_0.1-50
[6] ompr.roi_1.0.1 ompr_1.0.3 ROI_1.0-1 Matrix_1.5-4.1 runjags_2.2.1-7
[11] rjags_4-14 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[16] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 tidyverse_2.0.0
[21] ggplot2_3.4.2 stars_0.6-1 sf_1.0-13 abind_1.4-5 coda_0.19-4
[26] spOccupancy_0.6.1

loaded via a namespace (and not attached):
[1] splines_4.2.3 foreach_1.5.2 pillar_1.9.0 backports_1.4.1 glue_1.6.2 checkmate_2.2.0 minqa_1.2.5
[8] colorspace_2.1-0 listcomp_0.4.1 pkgconfig_2.0.3 scales_1.2.1 RANN_2.6.1 tzdb_0.4.0 lme4_1.1-33
[15] timechange_0.2.0 proxy_0.4-27 generics_0.1.3 pacman_0.5.1 withr_2.5.0 lazyeval_0.2.2 cli_3.6.1
[22] magrittr_2.0.3 fansi_1.0.4 doParallel_1.0.17 nlme_3.1-162 MASS_7.3-60 lwgeom_0.2-13 class_7.3-22
[29] tools_4.2.3 registry_0.5-1 data.table_1.14.8 hms_1.1.3 lifecycle_1.0.3 munsell_0.5.0 compiler_4.2.3
[36] e1071_1.7-13 rlang_1.1.1 classInt_0.4-9 units_0.8-2 grid_4.2.3 nloptr_2.0.3 iterators_1.0.14
[43] rstudioapi_0.14 boot_1.3-28.1 gtable_0.3.3 codetools_0.2-19 DBI_1.1.3 R6_2.5.1 fastmap_1.1.1
[50] utf8_1.2.3 KernSmooth_2.23-21 stringi_1.7.12 parallel_4.2.3 Rcpp_1.0.10 vctrs_0.6.2 tidyselect_1.2.0


Code run on an MSI VM used the following: 
R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /panfs/jay/groups/22/carr0603/bajcz003/.conda/envs/optmod-env/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] igraph_1.5.1              Matrix_1.6-1
 [3] lattice_0.21-8            tidyr_1.3.0
 [5] ROI.plugin.symphony_1.0-0 Rsymphony_0.1-33
 [7] slam_0.1-50               ompr.roi_1.0.1
 [9] ompr_1.0.3                ROI_1.0-1
[11] dplyr_1.1.2

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3       cli_3.6.1         rlang_1.1.1       purrr_1.0.2
 [5] generics_0.1.3    data.table_1.14.8 glue_1.6.2        backports_1.4.1
 [9] registry_0.5-1    fansi_1.0.4       grid_4.3.1        tibble_3.2.1
[13] fastmap_1.1.1     lifecycle_1.0.3   compiler_4.3.1    pacman_0.5.1
[17] pkgconfig_2.0.3   R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3
[21] pillar_1.9.0      listcomp_0.4.1    magrittr_2.0.3    checkmate_2.2.0
[25] lazyeval_0.2.2
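
To check whether a local R installation matches the attached package versions listed above for the MSI runs, something like the following sketch could be used. The helper compare_versions is hypothetical and is not part of the archive; the version numbers are copied from the session information above.

```r
# Package versions copied from the MSI session information listed above
msi_versions <- c(
  igraph = "1.5.1", Matrix = "1.6-1", lattice = "0.21-8", tidyr = "1.3.0",
  ROI.plugin.symphony = "1.0-0", Rsymphony = "0.1-33", slam = "0.1-50",
  ompr.roi = "1.0.1", ompr = "1.0.3", ROI = "1.0-1", dplyr = "1.1.2"
)

# Hypothetical helper (not part of the archive): for each package, report the
# installed version (NA if the package is not installed) and whether it
# matches the expected version.
compare_versions <- function(expected) {
  installed <- unname(vapply(names(expected), function(pkg) {
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1)))

  # package_version() normalizes "1.6-1" to "1.6.1" so strings are comparable
  wanted <- as.character(package_version(expected))

  data.frame(
    package   = names(expected),
    expected  = wanted,
    installed = installed,
    matches   = !is.na(installed) & installed == wanted,
    stringsAsFactors = FALSE
  )
}

compare_versions(msi_versions)
```

A mismatch does not necessarily mean the code will fail; it only flags where the local environment differs from the one documented here.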


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR:  "budget_determination_workflow.xlsx" 
-----------------------------------------

The following describes the formula contained in this document to calculate the total budget allocation used in countymod_budgets.csv and collabmod_budgets_collabsandcounties.csv. 

H91: Enter here the total budget (B) of inspection stations intended (e.g., 500 total inspection stations for the state).
I90: Column H above takes the target number of inspection stations and multiplies it by the proportion of the statewide budget allocated to each county to get the number of inspection stations each county would receive if stations were allocated fully proportionally. Column I then drops the fractional remainder from each of these values to get whole-number counts, and this cell holds the sum of those counts. Since the remainders will often be >0, inspection stations will have been underallocated to this point, and the difference will need to be made up.
I91: The difference that needs making up is stored here (e.g., 43 more stations need to be allotted). Column J holds just the remainders from column H. These are ranked from largest to smallest in column K, and 1 more inspection station is given to each of the X highest-ranked counties, where X is equal to the value in this cell. The values in column L are 1 if a county receives one more inspection station due to the rankings displayed in column K and 0 otherwise.
M90: In column M, the first allotment of inspection stations to each county (column I) is added to the extra allotment determined in column L. This cell holds the sum of all those total allotments, which should equal the value in cell H91.
M91: If H91 and M90 are equal, this cell should read TRUE.
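
The workflow above amounts to what is sometimes called a largest-remainder allocation. As a rough illustration only, it could be sketched in R as follows. The function name allocate_stations, its arguments, and the example counties and budget shares are hypothetical and do not appear in the archive. Ties among remainders are broken here by county order, which is an assumption, as the spreadsheet's tie-breaking rule is not described in this file.

```r
# Hypothetical helper (not part of the archive): allocate a total number of
# inspection stations across counties in proportion to each county's share
# of the statewide budget, mirroring columns H through M of the spreadsheet.
allocate_stations <- function(total_stations, budget_share) {
  stopifnot(abs(sum(budget_share) - 1) < 1e-8)

  proportional <- total_stations * budget_share     # column H
  base_count   <- floor(proportional)               # column I
  remainder    <- proportional - base_count         # column J
  shortfall    <- total_stations - sum(base_count)  # cell I91

  # Column K: rank remainders from largest (rank 1) to smallest.
  # Assumption: ties are broken by the order in which counties are listed.
  remainder_rank <- rank(-remainder, ties.method = "first")

  # Column L: 1 extra station for each of the `shortfall` top-ranked counties
  extra <- as.integer(remainder_rank <= shortfall)

  allotment <- base_count + extra                   # column M
  stopifnot(sum(allotment) == total_stations)       # the check in cell M91
  allotment
}

# Made-up example: 10 stations split among four hypothetical counties
shares <- c(A = 0.5, B = 0.25, C = 0.125, D = 0.125)
allocate_stations(10, shares)
```

In this made-up example, the proportional allotments are 5, 2.5, 1.25, and 1.25, so one station is left over after the remainders are dropped, and it goes to county B, which has the largest remainder.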


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR:  "Lakes_total_collabs.csv"
-----------------------------------------

Number of variables: 14

Number of cases/rows: 9696

Variable List: 
A. DOW = In Minnesota, every lake has a unique identifying number known as a DOW. That is what is contained in this column--the first two digits reflect the lake's home county, the middle four digits are a unique subidentifier, and the last two digits indicate subbasin information, if applicable. 
B. lake_name = Lake name as known to locals in the state or else as listed in public records. 
C. acre = Area of the lake in acres.
D. utm_x, utm_y = Geographical location data for the center of the lake polygon in Zone 15, UTM North format. Not used in analysis. 
E. county = The two digit county code for a lake's home county. 
F. county_name = The lake's home county. 
G. inspect = Irrelevant and not used in analysis but retained for continuity with other analyses using this file.
H. zm2019, ss2019, ew2019, sf2019 = As of 2019, the infestation statuses of each lake for four aquatic invasive species: zebra mussel, starry stonewort, Eurasian watermilfoil, and spiny water flea, respectively. 1 indicates a positive infestation. 
I. Total = A 1 indicates the lake is infested with at least one invasive species. Not used in analysis. 
J. collab_name = Indicates membership in a collaboration in the collaboration-level model and otherwise repeats the county_name column value if the county in question is not a part of a collaboration. 
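
As an illustration of the DOW structure described under variable A above, the sketch below splits DOW numbers into their county, subidentifier, and subbasin parts. The function name parse_dow, the output column names lake_id and subbasin, and the example DOW values are hypothetical. It also assumes each DOW has at most eight digits and that any leading zero lost when the column is read in as a number should be restored.

```r
# Hypothetical helper (not part of the archive): split DOW numbers into the
# three components described above.
parse_dow <- function(dow) {
  # Assumption: DOWs are 8 digits long; pad with leading zeros in case the
  # column was read in as numeric (e.g., 1000100 rather than "01000100").
  dow_chr <- sprintf("%08d", as.integer(dow))

  data.frame(
    DOW      = dow_chr,
    county   = substr(dow_chr, 1, 2),  # first two digits: home county
    lake_id  = substr(dow_chr, 3, 6),  # middle four digits: subidentifier
    subbasin = substr(dow_chr, 7, 8),  # last two digits: subbasin, if any
    stringsAsFactors = FALSE
  )
}

# Made-up DOW values for illustration only
parse_dow(c(1000100, 27013300))
```

Under these assumptions, the county component should agree with the two-digit code in the county column.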

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR:  "lake_info.csv"
-----------------------------------------

Number of variables: 10

Number of cases/rows: 9697

All information contained in this file is identical to that in the Lakes_total_collabs.csv file.


-----------------------------------------
REFERENCES
-----------------------------------------

1. R Core Team (2023). R: A language and environment for statistical computing. R
  Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

2. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A,
  Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson
  D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019).
  “Welcome to the tidyverse.” _Journal of Open Source Software_, *4*(43), 1686.
  doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

3. Rinker TW, Kurkiewicz D (2017). _pacman: Package Management for R_. Version 0.5.0,
  Buffalo, New York. <http://github.com/trinker/pacman>.

4. Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data
  Manipulation_. R package version 1.1.2, <https://CRAN.R-project.org/package=dplyr>.

5. Hornik K, Meyer D, Schwendinger F, Theussl S (2023). _ROI: R Optimization
  Infrastructure_. R package version 1.0-1, <https://CRAN.R-project.org/package=ROI>.

6. Schumacher D (2022). _ompr: Model and Solve Mixed Integer Linear Programs_. R package
  version 1.0.3, <https://CRAN.R-project.org/package=ompr>.

7. Schumacher D (2022). _ompr.roi: A Solver for 'ompr' that Uses the R Optimization
  Infrastructure ('ROI')_. R package version 1.0.1,
  <https://CRAN.R-project.org/package=ompr.roi>.

8. Hornik K, Meyer D, Buchta C (2022). _slam: Sparse Lightweight Arrays and Matrices_. R
  package version 0.1-50, <https://CRAN.R-project.org/package=slam>.

9. Harter R, Hornik K, Theussl S (2021). _Rsymphony: SYMPHONY in R_. R package version
  0.1-33, <https://CRAN.R-project.org/package=Rsymphony>.

10. Theussl S (2020). _ROI.plugin.symphony: 'SYMPHONY' Plug-in for the 'R' Optimization
  Interface_. R package version 1.0-0,
  <https://CRAN.R-project.org/package=ROI.plugin.symphony>.

11. Wickham H, Vaughan D, Girlich M (2023). _tidyr: Tidy Messy Data_. R package version
  1.3.0, <https://CRAN.R-project.org/package=tidyr>.

12. Sarkar D (2008). _Lattice: Multivariate Data Visualization with R_. Springer, New
  York. ISBN 978-0-387-75968-5.

13. Bates D, Maechler M, Jagan M (2023). _Matrix: Sparse and Dense Matrix Classes and
  Methods_. R package version 1.6-0, <https://CRAN.R-project.org/package=Matrix>.

14. Csardi G, Nepusz T (2006). “The igraph software package for complex network research.”
  _InterJournal_, *Complex Systems*, 1695. <https://igraph.org>.