1. Introduction

Welcome to the working with stream data in R workshop–we have three goals:

provide code and examples for downloading, loading, wrangling, and visualizing fish and various stream-based spatial data sets in R,
provide instructions to spatially link up those different types of data together with fish data for exploratory data analysis (EDA),
provide instructions for iterating figure making for EDA. That is, column-wise and row-wise functional programming (without using a for loop)

Case Study: West Fork of the Kickapoo

We will use the West Fork Kickapoo River as our case study. We will gather fish, watershed, and some environmental data associated with this watershed, link it all together using stream reaches, and try to answer a few common questions like: How did the drought/flood (i.e., precipitation) from 2018 affect trout populations in 2019?

1.1. Prerequisites

This is a more advanced workshop and assumes some previous experience working in R. Importantly, some knowledge of using R, RStudio, Rprojects, and RMarkdown files. Peter Euclide’s Workshop on Getting the Most out of RStudio in this year’s R-Expo is an excellent resource for this. Topics include importing and exporting data from Excel, creating scripts, markdown files, and projects, downloading R-packages, and accessing pre-made cheatsheets. Below, I’ll provide some very basic information to get you started.

1.1.1. R

To download R, go to CRAN, the comprehensive R archive network. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.

1.1.2. RStudio

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download.

1.1.3. The Tidyverse packages

Much of the codeing in this workshop embraces The Tidyerse. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally. You can install and load the Tidyverse package with this code:

install.packages("tidyverse")
library(tidyverse)  # for readr, dplyr, ggplot2, forcats, purrr, etc.

1.1.4. Other packages

We will use a few other packages for our work today. Once installed (by using install.packages() or the packages tab in your RStudio GUI), you can load them using the following code:

# General packages
library(janitor)        # for cleaning variable names
library(patchwork)      # for composing a plot of plots
library(sf)             # for working with spatial data
library(nngeo)          # for nearest neighbor spatial joins

# WI DNR packages
library(arcgis.rest)    # for wdnr.gis; needed to query an ArcGIS Rest API
library(wdnr.gis)       # for pulls from spatial layers from WDNR ArcGIS Rest API
library(wdnr.fmdb)      # for fish data from FMDB

# USGS package
library(dataRetrieval)  # for USGS data pulls

1.1.5. R Projects

“R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.” – Project Workflow chapter in r4ds.

When you make a new Rporject and associate with a file directory on your computer, it becomes your “home” or working directory. This is the working directory for this RProject on my local machine:

getwd()

## [1] "C:/Users/maitlb/Documents/R-tutorials/stream-data-in-R"

This is great because now we can easily find files. Whenever you refer to a file with a relative path it will look for it here. This will be important later when we load files from our local machine and save files to our local machine.

2. Data - Watersheds, streams, and fish

Data are generally stored:

as Excel or plain text files in a “data” folder on your computer, or,
in a relational database stored
1. locally on your computer (e.g., an MS Access database) or,
2. hosted on a web server (e.g., WI DNR’s FMDB).

Today we will work with both. We will load data that I have prepared for your directly from the “/data” folder in the project home directory, and pull and load data from online web servers.

“A plethora of spatial data from the Wisconsin DNR and other agencies exist on the internet and are readily available for download.”–Paul Frater. Check out Paul’s Workshop in this years R-Expo, arcpullr and wdnr.gis: Two new R packages for pulling spatial data into R for more details on whats out there and how to use it.

Today, we’ll start by loading watershed polygons and stream lines for our study area, then pull in fish surveys for the area and other associated watershed and environmental data.

2.1. Spatial Data - watershed boundaries and streams

Spatial data can be big data–as in huge file sizes. So it is good practice to store those data, as well as other data, on a web server that can be queried for specific data requests. The benefit is we can specify what area or stream we want data from instead of downloading ALL the data and then filtering it, which can take a long time. However, these data can be downloaded from the internet from the links below to your local machine and loaded into R. Instruction/code for loading shapefiles from a geodatabase on disk into R can be found in the project folder: code/load-24k-hydro-flowlines.R.

Wisconsin’s HUC10 watershed shapefile can be downloaded here: https://data-wi-dnr.opendata.arcgis.com/datasets/478fe58cd72e419cab24eff04ecfb839_5

The full 24K Hydro Geodatabase (with stream and lake shapefiles) can be downloaded here: https://data-wi-dnr.opendata.arcgis.com/datasets/0128cce2c06342218725f1069031a4fa

Next, I show you how to pull and load these data in R using an internal DNR R package that pulls data from an online API. We focus on watershed boundaries (i.e., HUCs) and stream lines from Wisconsin’s 24K Hydro Geodatabase. You can find details on HUCs (hydrologic unit codes) here; they are watersheds delineated by USGS using a nationwide system based on surface hydrologic features. I will now provide some background on Wisconsin 24K Hydro Data as it will be important down the line for linking together our various data sets.

WI DNR 24K Hydro

The Wisconsin Department of Natural Resources has developed a statewide 1:24,000 scale hydrography GIS database (24K Hydro) that represents all surface water. Because an objective of the 24K Hydro data model was to support flow modeling and tracing functionality, various “connectivity lines” (e.g. stream centerlines, extensions, wetland gap connectors, and flow potentials) were added to form a statewide linear surface water network, known as a “dendritic” network. Every segment in the flowline layer has an associated HYDROID.

As we will see, we can use HYDROIDs as a primary key to link our fish data (i.e., a geographic location of a fish survey or a SWIMS station) with other data types we associate with that particular stream segment (e.g., watershed attributes, stream flow and temperature, etc.). More on this in the 24K Hydro Attribution section.

Watershed polygons

The West Fork Kickapoo is a HUC10-level watershed comprised of a few HUC12 sub-watersheds. Because there is currently a bug in DNR’s ArcGIS REST API for the HUC 10 layers, DNR’s GIS package does not have working HUC10 layer, so we need to pull the HUC12 polygons that comprise the West Fork polygon (HUC10). So we need a list of the HUC12 codes within the West Fork HUC10 (code = 0707000602). To explore the available HUCs, use the lookup table provided in the package: wdnr.gis::watershed_lookup.

# get a vector of huc12 codes
huc12_codes <-
  wdnr.gis::watershed_lookup %>% 
  filter(str_detect(huc_codes, "0707000602")) %>%  
  pull(huc_codes)

Note: We use the pipe operator above ( %>% ). The point of the pipe is to help you write code in a way that is easier to read and understand. See r4ds for more details. Basically, it lets us ‘pipe’ the previous object into the next function.

Now we map() the wdnr.gis::get_watershed_layer() over each huc12 code and pull its polygon from DNR servers, and stack them together.

A map function is basically a for-loop. They apply a function to each element of a list or atomic vector and return an objective of the same length. The map_df() function binds the resulting objects together instead of you having to rbind() them after the fact. Below I do the map call, but also show you what the for loop would look like.

# do the mapping
huc12_polys <-
  huc12_codes %>%  
  purrr::map_df(~ get_watershed_layer(watershed_code = .x)) %>%
  janitor::clean_names()

# for-loop version -------------------------------------#
# # Initialize output
# huc12_polys <- list()
# 
# # For each HUC code, pull the watershed polygon
# for (i in seq_along(wfkick_huc12s)) {
#   
#     huc12_poly <- 
#       get_watershed_layer(
#       watershed_code = wfkick_huc12s[i]
#       )
#     
#     huc12_polys[[i]] <- huc12_poly
#     
#     }
# 
# # stack
# huc12_polys <- 
#   bind_rows(huc12_polys) %>% 
#   clean_names()
# 
# # clean up
# rm(huc12_poly)

Let’s take a look at the object:

# print it
huc12_polys

## Simple feature collection with 5 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -90.88876 ymin: 43.46545 xmax: -90.70062 ymax: 43.74711
## geographic CRS: WGS 84
##   objectid   huc12_code                           huc12_name huc12_type
## 1     3288 070700060201                          Seas Branch          S
## 2     3309 070700060202 Knapp Creek-West Fork Kickapoo River          S
## 3     3310 070700060203                        Bishop Branch          S
## 4     3323 070700060204                       Harrison Creek          S
## 5     3324 070700060205             West Fork Kickapoo River          S
##   huc12_hydro_mod_code downstream_huc12_code                          geoms
## 1                   NM          070700060201 MULTIPOLYGON (((-90.78143 4...
## 2                   TF          070700060202 MULTIPOLYGON (((-90.75744 4...
## 3                   NM          070700060203 MULTIPOLYGON (((-90.82486 4...
## 4                   TF          070700060204 MULTIPOLYGON (((-90.76483 4...
## 5                   TF          070700060205 MULTIPOLYGON (((-90.70121 4...

Note here is that this object is a simple spatial feature (sf) that is also a dataframe. The geometry column holds the coordinates, and the object itself has a geographic reference systems (WGS 84) and a bounding box.

Let’s now dissolve the HUC12s into a HUC10 poly for the West Fork Watershed. We can aggregate by creating a new column for each row that has the same value, then group by that aggregate id and summarize the spatial data.

# do the dissolve/aggregate
huc10_poly <- 
  huc12_polys %>% 
  mutate(huc10 = "West Fork Kickapoo") %>% 
  group_by(huc10) %>% 
  summarise()

You can see this new object only contains one aggregated polygon:

# print it
huc10_poly

## Simple feature collection with 1 feature and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -90.88876 ymin: 43.46545 xmax: -90.70062 ymax: 43.74711
## geographic CRS: WGS 84
## # A tibble: 1 x 2
##   huc10                                                                    geoms
##   <chr>                                                            <POLYGON [°]>
## 1 West Fork Kick~ ((-90.75622 43.47249, -90.75793 43.47303, -90.75861 43.47303,~

24k Hydro Flowlines (streams)

Next, we pull the 24k hydro flowlines (streams) by mapping the get_hydro_layer() function over our HUC12 codes and stacking them together. This is preferable to loading the entire flow line layer to your workspace, then filtering it by you watersheds. This can strain the hosting server and take a long time. As mentioned, I’ve provided instruction/code for loading shapefiles from a geodatabase on disk into R can be found in the project folder: code/load-24k-hydro-flowlines.R.

huc12_flines <-
  huc12_codes %>% 
  map_df(~ get_hydro_layer(
    watershed_code = .x, 
    layer_type = "flowlines")
    )%>% 
  clean_names()

Take a glimpse() (similar to str()) at the variables in the flowlines object:

huc12_flines %>% glimpse()

## Rows: 537
## Columns: 28
## $ objectid                <int> 607019, 608946, 610066, 610210, 610960, 612...
## $ sw_no                   <int> 36020338, 36019753, 36020223, 36020463, 360...
## $ river_sys_name          <chr> "Unnamed", "Unnamed", "Unnamed", "Seas Bran...
## $ row_name                <chr> "Unnamed", "Unnamed", "Unnamed", "Seas Bran...
## $ river_sys_wbic          <int> 1190400, 5029107, 1189900, 1189800, 1189800...
## $ stream_order            <int> 2, 1, 3, 2, 3, 2, 4, 3, 1, 1, 1, 1, 1, 4, 2...
## $ orig_hrz_src_yr         <int> 1983, 1985, 1983, 1983, 1983, 1985, 1983, 1...
## $ orig_hrz_coll_mthd_code <chr> "SCN001", "SCN001", "SCN001", "SCN001", "SC...
## $ orig_hrz_src_dnom_amt   <int> 24000, 24000, 24000, 24000, 24000, 24000, 2...
## $ last_update_date        <dbl> 1.366643e+12, 1.366643e+12, 1.366643e+12, 1...
## $ hydroid                 <int> 200049366, 200050364, 200049484, 200049141,...
## $ hydrocode               <int> 110, 110, 100, 110, 100, 110, 100, 100, 110...
## $ waterbody_hydroid       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ trace_use_flag          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ hydrotype               <int> 508, 508, 508, 508, 508, 508, 508, 508, 508...
## $ carto_use_flag          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ landlock_code           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ quad_line_flag          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ enabled                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ geom_change_datetime    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ flip_change_datetime    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ wbic_change_datetime    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ create_datetime         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ last_update_user_id     <chr> "ceelew", "ceelew", "ceelew", "ceelew", "ce...
## $ producer_code           <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ shape_len               <dbl> 501.0032, 609.5658, 200.7406, 1073.4777, 13...
## $ in_state_code           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ geoms                   <MULTILINESTRING [°]> MULTILINESTRING ((-90.84415...

Plot streams

Let’s plot our streams and watershed polygons together for a visual. We will be using the ggplot2 package for plotting today. For a background and introduction to the benefits of ggplot, see the Data Vizualizization chpater in r4ds or visit Holly Embke’s workshop at this year’s R-Expo: Making better graphs – publication ready plots in ggplot.

ggplot() +
  geom_sf(data = huc12_polys, fill = "white", color = "black") +
  geom_sf(data = huc12_flines %>% filter(stream_order > 1), color = "blue") +
  geom_sf(data = huc10_poly,color = "black", alpha = 0) +
  ggtitle("W. Fk. Kickapoo")

2.2. Fish data

Next, we can use the wbics (waterbody identifier code) from our flowlines to pull all survey data for those streams using wdnr.fmdb pull functions. Alternatively, you can load fish data of your own from a .csv or .xlsx file using the readr or readxls packages.

2.2.1. Surveys and Efforts

Load data

First, we will pull the all the surveys and efforts data for the West Fork Kickapoo from 1990-2020.

We won’t run this, but for WDNR staff, this is code I would use to pull from the database with my DNR credentials then save to disk for future use:

set_fmdb_credentials()

# set some arguments for the get_fmdb_data functions
huc12_wbics <- huc12_flines$river_sys_wbic 
years <- c(1990:2020)
waterbody_types <- c( "wadable_stream","non-wadable_stream","stream")
gear_types <- c("stream_shocker","backpack_shocker")

# pull surveys
df_surveys <- 
  get_fmdb_surveys(
    year = years, 
    wbic = huc12_wbics,
    waterbody_type = waterbody_types
    ) 

# pull efforts
df_efforts <- 
  get_fmdb_efforts(
    year = years, 
    wbic = huc12_wbics,
    waterbody_type = waterbody_types,
    gear = gear_types
  ) 

write_rds(df_surveys, "data/df_surveys.rds")
write_rds(df_efforts, "data/df_efforts.rds")

I’ve already done that for your and saved the data to .rds files in the data folder. Let’s load that data from file.

df_surveys <- read_rds("data/df_surveys.rds")
df_efforts <- read_rds("data/df_efforts.rds")

Clean data

Now we need to do some standard data cleaning before pulling the fishraw data.

# survey status type to keep
status_types = c(
  "data_entry_complete_and_proofed",
  "historical_data_complete_and_proofed",
  "historical_data_entry_complete",
  "historical_data_load_status_unknown"
)

# filter surveys
df_surveys <- 
  df_surveys %>% 
  # filter surveys by unique survey.seq.no's in efforts data
  filter(survey.seq.no %in% unique(df_efforts$survey.seq.no)) %>% 
  # keep good status types
  filter(survey.status %in% status_types)

# target species necessary for CPEs and assigning 0s
trout_targs <- 
  c(
    "all_species",
    "gamefish_species",
    "gamefish_panfish",
    "trout_spp",
    "brown_trout",
    "brook_trout",
    "rainbow_trout"
    )

# primary survey purposes to keep
prim_surv_purps <- 
  c("cpe","spe","dpe","mpe")


# filter efforts
df_efforts <- 
  df_efforts %>% 
  
  # filter efforts based on the filtered surveys
  filter(survey.seq.no %in% df_surveys$survey.seq.no) %>% 
  # filter out site 315s (test entries in database
  filter(site.seq.no != 315) %>% 
  # filter efforts by target species necessary for CPEs and assigning 0s
  filter(
    target.species %in% trout_targs | 
      secondary.target.species %in% trout_targs
    ) %>% 
  # filter by primary survey purpose
  filter(primary.survey.type %in% prim_surv_purps) %>% 

  # deal with multiple runs and NAs

  # Convert NA run numbers into 1
  replace_na(list(run.number = 1)) %>% 
  # Recode zeros into 1s
  mutate(run.number = if_else(run.number == "0", "1", run.number)) %>% 
  # filter 1st runs
  filter(run.number %in% c("1"))


# finally, filter the surveys by the filtered efforts data:
df_surveys <- 
  df_surveys %>% 
  filter(survey.seq.no %in% unique(df_efforts$survey.seq.no))

Plot survey sites

Okay, let’s plot our surveys sites and the spatial data to visualize. First, we have to make our dataframe spatial – i.e., convert it to a simple spatial feature (sf). Because we need this to match the CRS of 24K hydro, we also pass the EPSG number for WGS 84, Wisconsin’s preferred projection.

# make spatial
df_surveys_sites <- 
  df_surveys %>% 
  distinct(swims.station.id, .keep_all = TRUE) %>% 
  sf::st_as_sf(
    coords = c("longitude", "latitude"), 
    crs = 4326
    )

# plot it
df_surveys_sites %>% 
  ggplot() + 
  geom_sf(data = huc12_polys, fill = "white", color = "black") +
  geom_sf(data = huc12_flines %>% filter(stream_order > 1), color = "blue") +
  geom_sf(data = huc10_poly,color = "black", alpha = 0) +
  geom_sf(shape = 21, size = 2, color = "black", fill = "red", alpha = 0.5) + 
  ggtitle("Trout survey sites")

2.2.2. Fishraw data

Load data

We use the distinct efforts to pull the associated fishraw data. Again, this is the code to use with your FMDB credentials.

set_fmdb_credentials()

distinct_efforts <- 
  df_efforts %>% 
  distinct(visit.fish.seq.no) %>% 
  pull()
  
df_fishraw <- 
  get_fmdb_fishraw(visit_seq = distinct_efforts)

write_rds(df_fishraw, "data/df_fishraw.rds")

But as I’ve done that for you, we load from file.

df_fishraw <- read_rds("data/df_fishraw.rds")

Clean data

Next, we filter out stream trout, and do a few standard data cleaning tasks. Paul Frater (FMDB database guru and package developer) has made some handy functions for helping us out here. Two important ones are expand_counts() and estimate_lengths. We also break down trout into two sizes classes: 1) YOY and 2) Sub-adults and adults.

df_fishraw_cln <-
  df_fishraw %>% 
  # clean
  filter(species %in% c("brook_trout","brown_trout")) %>% 
  # get one row for each fish
  expand_counts(drop_count = FALSE) %>%  
  # estimate lengths for fish without lengths
  estimate_lengths() %>%   
  # get rid of anyother fish without lengths
  filter(!is.na(length)) %>%  
  # convert inches to metric
  mutate(length = length * 25.4) %>%  
  # assign length classes
  mutate(
    length_class = case_when(
      
      species == "brook_trout" & length <  100 ~ "YOY",
      species == "brook_trout" & length >= 100 ~ "Adult",

      species == "brown_trout" & length <  150  ~ "YOY",
      species == "brown_trout" & length >= 150  ~ "Adult"

      )
    )

Finally, we can calculate cpe’s for each species and size class using the calc_cpe() function in wdnr.fmdb. We also will join the results back to the surveys and efforts data “survey.seq.no” as a key.

# do the calculations and join back to surveys and efforts data
df_cpe <- 
  calc_cpe(
    df_surveys, 
    df_efforts, 
    df_fishraw_cln, 
    # these is groupings on which cpes are calculated
    grouping = 
      c(
        "county", 
        "waterbody.name", 
        "wbic", 
        "survey.year", 
        "survey.seq.no", 
        "visit.fish.seq.no", 
        "length_class"
        ),
    # must specify species or function pulls back in all other species data
    spp = c("brook_trout", "brown_trout"),  
    keep_zero_cpe = FALSE) %>%
  # then we link back to surveys and efforts for swims station id and coords
  left_join(
    df_surveys %>% 
      select(survey.seq.no, swims.station.id, latitude, longitude), 
    by = "survey.seq.no") %>% 
  # and finally link to efforts to get gear type of surveys
  left_join(df_efforts %>% select(survey.seq.no, gear), 
            by = "survey.seq.no") %>% 
  mutate(
    species = fct_recode(
      species, 
      "Brook Trout" = "brook_trout",
      "Brown Trout" = "brown_trout"
      ),
    gear_class = case_when(
      gear == "backpack_shocker" & length_class == "Adult" ~ "Backpack - Adult",
      gear == "backpack_shocker" & length_class == "YOY" ~ "Backpack - YOY",
      gear == "stream_shocker" & length_class == "Adult" ~ "Stream - Adult",
      gear == "stream_shocker" & length_class == "YOY" ~ "Stream - YOY")
    ) 

# print it
df_cpe %>% glimpse()

## Rows: 309
## Columns: 18
## $ county            <chr> "vernon", "vernon", "vernon", "vernon", "vernon",...
## $ waterbody.name    <chr> "birds_creek", "birds_creek", "birds_creek", "bis...
## $ wbic              <int> 1189100, 1189100, 1189100, 1188500, 1188500, 1188...
## $ survey.year       <int> 2019, 2019, 2019, 1993, 1993, 1993, 1993, 2003, 2...
## $ survey.seq.no     <int> 515090937, 515090937, 515090937, 78348, 78348, 78...
## $ visit.fish.seq.no <int> 766701, 766701, 766701, 489391, 489391, 489391, 4...
## $ species           <fct> Brook Trout, Brown Trout, Brown Trout, Brook Trou...
## $ length_class      <chr> "Adult", "Adult", "YOY", "Adult", "YOY", "Adult",...
## $ total_catch       <dbl> 5, 3, 27, 1, 1, 122, 36, 1, 237, 84, 141, 270, 19...
## $ total_effort      <dbl> 0.06213712, 0.06213712, 0.06213712, 0.12427424, 0...
## $ cpe               <dbl> 80.46720, 48.28032, 434.52287, 8.04672, 8.04672, ...
## $ cpe_type          <chr> "electrofishing", "electrofishing", "electrofishi...
## $ cpe_units         <chr> "N/miles_shocked", "N/miles_shocked", "N/miles_sh...
## $ swims.station.id  <chr> "10053007", "10053007", "10053007", "10010619", "...
## $ latitude          <dbl> 43.54026, 43.54026, 43.54026, 43.54369, 43.54369,...
## $ longitude         <dbl> -90.74513, -90.74513, -90.74513, -90.80701, -90.8...
## $ gear              <chr> "backpack_shocker", "backpack_shocker", "backpack...
## $ gear_class        <chr> "Backpack - Adult", "Backpack - Adult", "Backpack...

Plot trout CPEs

Let’s summarize our data for the whole West Fork of the Kickapoo by year, gear, and length class, then plot it.

# set for colors for the plot
pal <- RColorBrewer::brewer.pal(4, "RdBu")

p.trout.trends <- 
  # summarize data
  df_cpe %>% 
  group_by(survey.year, species, gear_class) %>% 
  summarise(
    mean = mean(cpe, na.rm = TRUE),
    sd = sd(cpe)/sqrt(length(cpe))
    ) %>% 
  
  # plot
  ggplot(aes(x = survey.year, y = mean, color = gear_class)) + 
  geom_point() +
  geom_line(size = 0.5) +
  # geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=0) +
  scale_x_continuous(breaks = seq(1990,2020,5), limits = c(1990, 2020)) + 
  scale_y_continuous(breaks = seq(0,2000,500), limits = c(0,2100)) +
  scale_color_manual("Gear - Class", values = pal[c(1,2,4,3)]) + 
  facet_wrap(vars(species), nrow = 2) + 
  labs(x = "Year", y = "Trout / mile") + 
  ggtitle("Trout trends - West Fork Kickapoo") + 
  theme_bw() 

# print it
p.trout.trends

2.3. Watershed atributes

It is often helpful to have information on watershed attributes such as stream gradient, reach length, and drainage area for a particular fish survey. Luckily, some fine folks at the WI DNR have done some heavy lifting to help up do just that.

2.3.1. 24K Hydro Attribution

In 2013, a DNR project worked to attribute stream features and small catchments (i.e., HUC16) in the 24K Hydro Geodatabase with a variety of geologic, land cover, and other base data. The immediate goal of this project was to provide fine-resolution stream attribute data to be used to model stream flows and fish distributions.

To do this, the team filtered the 24K flowlines for relevant “stream” line segment features. Those features were considered part of reaches, the basic unit that was attributed. For lakes, the lake polygon and all stream features flowing under them were assigned a single REACHID equivalent to the HYDROID of the lake polygon. All other streams were considered as separate features, so that each HYDROID is associated with a unique REACHID.

Thus, REACHIDs are unique identifiers for stream segments with catchment-level information available in WHD24k Value Added GDB that relate to HYDROIDs in the 24K Hydro GDB.

There are three main flavors of these data:

Stream Attributes - including lat/long of stream centroid, connectivity (distance to lakes), elevation, sinuosity, gradient (slope), wetted width, stream order, network (total stream length).
Natural Community Model - included output from the Natural Community Model. Some helful variables include Natural Community class, temperature class, modeled temperatures, and modeled seasonal and monthly flows.
Watershed attributes - these variables represent data aggregated to the “HUC16” scale. That is, the catchment draining a particular stream reach (again, identified by its REACHID).

2.3.2. Watershed data

Load data

Unfortunately, the Value Added dataset is not publicly facing. So below we just load the .rds files I previously saved and select out some commonly used metrics for our use. Detailed descriptions of these variables can be found here https://www.arcgis.com/home/item.html?id=835e08fdc8104c5dbd5d173939a3c8d9.

whd_stream_atribs <- 
  read_rds("data/24K_va_stream_atribs.rds") %>% 
  clean_names() %>% 
  as_tibble() %>% 
  select(reachid, hyd_cat, c_huc12, gradient, c_length, wet_width, stream_order)

whd_ncm <- 
  read_rds("data/24K_va_ncm.rds") %>% 
  clean_names() %>% 
  as_tibble() %>% 
  select(reachid, nat_comm, temp_class, temp_summer_cl_cc, temp_july_cl_cc,
         temp_max_cl_cc, ends_with("_c"), -starts_with(c("wy","wrt")))

whd_watwersheds <- 
  read_rds("data/24k_va_watersheds.rds") %>% 
  clean_names() %>% 
  as_tibble() %>% 
  select(reachid, w_area, w_perm, w_darcy, 
         contains(c("prcp", "_temp_")))

Clean data

As mentioned, these data are organized around REACHIDs (see 24K Hydro Attribution). In total, there are value added data for 162,620 stream reaches.

whd_stream_atribs %>% group_by(hyd_cat) %>% tally()

## # A tibble: 3 x 2
##   hyd_cat      n
##   <chr>    <int>
## 1 lake      9752
## 2 stream  162620
## 3 <NA>       491

So we filter out the stream reaches and join the tables together using reachid as the key.

df_24k_attributes <- 
  whd_stream_atribs %>% 
  filter(hyd_cat == "stream") %>% 
  left_join(whd_ncm, by = "reachid") %>% 
  left_join(whd_watwersheds, by = "reachid") %>% 
  mutate(across(c(reachid), as.character)) 


# clean up R environment
rm(whd_ncm); rm(whd_stream_atribs); rm(whd_watwersheds)

# print it
df_24k_attributes

## # A tibble: 162,620 x 34
##    reachid hyd_cat c_huc12 gradient c_length wet_width stream_order nat_comm
##    <chr>   <chr>     <dbl>    <dbl>    <dbl>     <dbl>        <int> <chr>   
##  1 200000~ stream  7.09e10   2.03     1430.        0.8            1 Coldwat~
##  2 200000~ stream  7.09e10   0.349      44.9       2.6            2 Coldwat~
##  3 200000~ stream  7.09e10   0.349      26.5       2.6            2 Coldwat~
##  4 200000~ stream  7.09e10   1.68     1532.        0.8            1 Coldwat~
##  5 200000~ stream  7.09e10   0.349    2100.        2.4            2 Coldwat~
##  6 200000~ stream  7.09e10   0.567     968.        1.8            1 Cool-Co~
##  7 200000~ stream  7.09e10   0.0884   1926.        7.9            4 Cool-Wa~
##  8 200000~ stream  7.09e10   0.917    3248.        1.3            1 Cool-Co~
##  9 200000~ stream  7.09e10   1.19       39.8       1.8            1 Cool-Co~
## 10 200000~ stream  7.09e10   0.640     917.        0.8            1 Cool-Co~
## # ... with 162,610 more rows, and 26 more variables: temp_class <chr>,
## #   temp_summer_cl_cc <dbl>, temp_july_cl_cc <dbl>, temp_max_cl_cc <dbl>,
## #   spr_e10_c <dbl>, spr_e50_c <dbl>, spr_e90_c <dbl>, sum_e10_c <dbl>,
## #   sum_e50_c <dbl>, sum_e90_c <dbl>, fal_e10_c <dbl>, fal_e50_c <dbl>,
## #   fal_e90_c <dbl>, apr_e10_c <dbl>, apr_e50_c <dbl>, apr_e90_c <dbl>,
## #   aug_e10_c <dbl>, aug_e50_c <dbl>, aug_e90_c <dbl>, w_area <dbl>,
## #   w_perm <dbl>, w_darcy <dbl>, w_prcp_ann <dbl>, w_temp_ann <dbl>,
## #   w_temp_gs <dbl>, w_temp_july <dbl>

3. Cross reference

Now we need to link up our fish data with our stream and watershed attribute data. Fish survey sites, in addition to having their own distinct IDs in FMDB, each have a distinct SWIMS station IDs associated with them. In turn, SWIMS stations IDs are linked to hydroids in the 24K Hydro lines, which are in turn linked with reachids in the 24k attributed data.

So, we need a cross-walk table between swims_stations_id, hydroid, and reachid .

3.1. Xwalk table

I have compiled that crosswalk for you. Let’s load it into R.

xwalk_ids <- 
  read_rds("data/xwalk_ids.rds") %>% 
  clean_names() %>% 
  filter(!is.na(reach_id))
xwalk_ids

## # A tibble: 26,459 x 3
##    station_id hydro_id  reach_id 
##    <chr>      <chr>     <chr>    
##  1 313038     200105272 200105272
##  2 203164     200054446 200054446
##  3 207        200026143 200026143
##  4 209        200211923 200211923
##  5 021        200000410 200000410
##  6 212        200005635 200005635
##  7 133072     200027010 200027010
##  8 213001     200164493 200164493
##  9 213003     200164493 200164493
## 10 213004     200164409 200164409
## # ... with 26,449 more rows

Now we can join our xwalk table to our survey sites table using the SWIMS station ID as our key.

df_surveys_sites_xref <- 
  df_surveys_sites %>% 
  left_join(xwalk_ids, by = c("swims.station.id" = "station_id"))
df_surveys_sites_xref %>% as_tibble()

## # A tibble: 78 x 39
##    county waterbody.name survey.year survey.seq.no   wbic station.name
##    <chr>  <chr>                <int>         <int>  <int> <chr>       
##  1 vernon west_fork_kic~        2000          2222 1.19e6 west fork k~
##  2 vernon west_fork_kic~        2000          2222 1.19e6 west fork k~
##  3 vernon west_fork_kic~        2000          2222 1.19e6 west fork k~
##  4 vernon west_fork_kic~        2000          2216 1.19e6 west fork k~
##  5 vernon west_fork_kic~        2008       5503234 1.19e6 west fork k~
##  6 vernon knapp_creek           2005         91147 1.19e6 knapp creek~
##  7 vernon knapp_creek           2005         91148 1.19e6 knapp creek~
##  8 vernon bishop_br             2011     226999122 1.19e6 bishop bran~
##  9 vernon bishop_br             2016     515082742 1.19e6 bishop bran~
## 10 vernon bishop_br             2019     515090728 1.19e6 bishop bran~
## # ... with 68 more rows, and 33 more variables: site.seq.no <int>,
## #   swims.station.id <chr>, latlong.method <chr>, latlong.datum <chr>,
## #   waterbody.type <chr>, survey.begin.date <date>, survey.end.date <date>,
## #   primary.survey.purpose <chr>, secondary.survey.purpose <chr>,
## #   survey.notes <chr>, survey.status <chr>, special.regs.type <chr>,
## #   macroinvertebrate.sample.id <chr>, data.entry.name <chr>,
## #   data.entry.date <dttm>, last.update.name <chr>, last.update.date <dttm>,
## #   s.effort.start.time.invalid <chr>, s.elec.amps.invalid <chr>,
## #   s.elec.current.missing <chr>, s.elec.dist.invalid <lgl>,
## #   s.elec.dist.units.invalid <lgl>, s.elec.duty.cycle.invalid <chr>,
## #   s.elec.pulse.rate.invalid <chr>, s.elec.volts.invalid <chr>,
## #   s.invalid.species <lgl>, s.primary.target.error <lgl>,
## #   s.time.invlaid <lgl>, srvy.too.long <chr>, s.cnt.outside.range <lgl>,
## #   hydro_id <chr>, reach_id <chr>, geometry <POINT [°]>

Let do a quick check to make sure we don’t have any duplicates:

# filter for dups and print
df_surveys_sites_xref %>% 
  group_by(swims.station.id) %>%  
  filter(n() > 1) %>% 
  select(swims.station.id, hydro_id, reach_id)

## Simple feature collection with 3 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -90.78317 ymin: 43.65616 xmax: -90.78317 ymax: 43.65616
## geographic CRS: WGS 84
## # A tibble: 3 x 4
## # Groups:   swims.station.id [1]
##   swims.station.id hydro_id  reach_id              geometry
##   <chr>            <chr>     <chr>              <POINT [°]>
## 1 10009122         200051620 200051620 (-90.78317 43.65616)
## 2 10009122         200051725 200051725 (-90.78317 43.65616)
## 3 10009122         200051588 200051588 (-90.78317 43.65616)

It looks like there is swims station with three associated hydro/reachids We will have to do some spatial analysis to deal with this.

3.2. Spatial join

We will do a spatial, nearest neighbor join between our survey site and the 24K hydro flowlines to get the correct hydroid and thus reachid. We first get the index of the closest 3 hydro lines to our point and thir distances (we want the closest line whose wbic matches out surveys wbic).

# isolate the station that we need to spatially cross reference
points <- 
  df_surveys_sites_xref %>% 
  group_by(swims.station.id) %>%  
  filter(n() > 1) %>% 
  distinct(.keep_all = TRUE)

# get nearest 3 lines to station and their distances
nn_trace <- st_nn(points, huc12_flines, k = 3, returnDist = TRUE)

# join stations to flow lines using the nearest neighbor, get distance
nn_join <- st_join(points, huc12_flines, join = st_nn, k = 3)

Then, we extract the indexes and, for each station, only keep the closest line where wbics match.

# extract the flow lines indexes and distances for each
nn_distances <-bind_cols(
  index = unlist(nn_trace$nn),
  distance = unlist(nn_trace$dist))

# clean it up and filter out correct data
nn_join_full <- 
  # combine the joined data with distances
  bind_cols(nn_join, nn_distances) %>% 
  # get rid of spatial ref
  st_drop_geometry() %>% 
  # keep a few columns for simplicity
  select(swims.station.id, wbic, river_sys_wbic, hydroid, distance) %>% 
  # covert column types
  mutate(across(where(is.integer), as.character))  %>% 
  # make a logical for matching wbics between surveys and flowlines
  mutate(wbic_match = river_sys_wbic == wbic) %>% 
  # for each station, only keep closest line where wbics match
  group_by(swims.station.id) %>%
  filter(wbic_match == TRUE & distance == min(distance))

nn_join_full

## # A tibble: 1 x 6
## # Groups:   swims.station.id [1]
##   swims.station.id wbic    river_sys_wbic hydroid   distance wbic_match
##   <chr>            <chr>   <chr>          <chr>        <dbl> <lgl>     
## 1 10009122         1187900 1187900        200051588    0.102 TRUE

Great, this looks like the correct hydroid. Let’s remove the other two incorrect rows from the joined data.

# list of the two bad hydroids
nn_join_remv <- 
  df_surveys_sites_xref %>% 
  group_by(swims.station.id) %>%  
  filter(n() > 1) %>% 
  select(swims.station.id, hydro_id, reach_id) %>% 
  filter(!hydro_id == nn_join_full %>% pull(hydroid))

# update the x-refed table by filtering out the bad hydros
df_surveys_sites_xref <- 
  df_surveys_sites_xref %>% 
  filter(!(swims.station.id == nn_join_remv$swims.station.id & 
           reach_id %in% nn_join_remv$hydro_id )) %>% 
  select(swims.station.id, hydro_id, reach_id, geometry)
df_surveys_sites_xref

## Simple feature collection with 76 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -90.84908 ymin: 43.47766 xmax: -90.74173 ymax: 43.69965
## geographic CRS: WGS 84
## First 10 features:
##    swims.station.id  hydro_id  reach_id                   geometry
## 1          10009122 200051588 200051588 POINT (-90.78317 43.65616)
## 2          10009125 200050605 200050605 POINT (-90.76825 43.63492)
## 3          10029589 200210448 200210448 POINT (-90.77533 43.57829)
## 4          10015435 200052775 200052775   POINT (-90.7786 43.6746)
## 5          10015436 200052775 200052775   POINT (-90.7786 43.6746)
## 6          10031590 200044854 200044854 POINT (-90.75238 43.52228)
## 7          10030768 200046313 200046313 POINT (-90.81609 43.54724)
## 8            633149 200046124 200046124 POINT (-90.80706 43.54474)
## 9          10028744 200049265 200049265 POINT (-90.82918 43.60635)
## 10         10029587 200049046 200049046 POINT (-90.78825 43.60199)

Excellent. Now we have a site reference table with associated hydroids and reachids for each of our 76 survey locations.

3.3. Link fish to watershed

Now we can link our fish surveys / fish data with watershed data like stream gradient, stream order, and watershed/ upstream drainage area. We do this simply by joining on our reachid key.

df_cpe_attributed <- 
  df_cpe %>% 
  as_tibble() %>% 
  left_join(df_surveys_sites_xref, by = "swims.station.id") %>% 
  left_join(df_24k_attributes, by = c("reach_id" = "reachid"))

df_cpe_attributed %>% glimpse()

## Rows: 309
## Columns: 54
## $ county            <chr> "vernon", "vernon", "vernon", "vernon", "vernon",...
## $ waterbody.name    <chr> "birds_creek", "birds_creek", "birds_creek", "bis...
## $ wbic              <int> 1189100, 1189100, 1189100, 1188500, 1188500, 1188...
## $ survey.year       <int> 2019, 2019, 2019, 1993, 1993, 1993, 1993, 2003, 2...
## $ survey.seq.no     <int> 515090937, 515090937, 515090937, 78348, 78348, 78...
## $ visit.fish.seq.no <int> 766701, 766701, 766701, 489391, 489391, 489391, 4...
## $ species           <fct> Brook Trout, Brown Trout, Brown Trout, Brook Trou...
## $ length_class      <chr> "Adult", "Adult", "YOY", "Adult", "YOY", "Adult",...
## $ total_catch       <dbl> 5, 3, 27, 1, 1, 122, 36, 1, 237, 84, 141, 270, 19...
## $ total_effort      <dbl> 0.06213712, 0.06213712, 0.06213712, 0.12427424, 0...
## $ cpe               <dbl> 80.46720, 48.28032, 434.52287, 8.04672, 8.04672, ...
## $ cpe_type          <chr> "electrofishing", "electrofishing", "electrofishi...
## $ cpe_units         <chr> "N/miles_shocked", "N/miles_shocked", "N/miles_sh...
## $ swims.station.id  <chr> "10053007", "10053007", "10053007", "10010619", "...
## $ latitude          <dbl> 43.54026, 43.54026, 43.54026, 43.54369, 43.54369,...
## $ longitude         <dbl> -90.74513, -90.74513, -90.74513, -90.80701, -90.8...
## $ gear              <chr> "backpack_shocker", "backpack_shocker", "backpack...
## $ gear_class        <chr> "Backpack - Adult", "Backpack - Adult", "Backpack...
## $ hydro_id          <chr> "200046102", "200046102", "200046102", "200046257...
## $ reach_id          <chr> "200046102", "200046102", "200046102", "200046257...
## $ geometry          <POINT [°]> POINT (-90.74513 43.54026), POINT (-90.7451...
## $ hyd_cat           <chr> "stream", "stream", "stream", "stream", "stream",...
## $ c_huc12           <dbl> 70700060205, 70700060205, 70700060205, 7070006020...
## $ gradient          <dbl> 2.36155325, 2.36155325, 2.36155325, 0.95243280, 0...
## $ c_length          <dbl> 3559.733, 3559.733, 3559.733, 619.151, 619.151, 6...
## $ wet_width         <dbl> 1.6, 1.6, 1.6, 5.1, 5.1, 5.1, 5.1, 5.8, 5.8, 5.8,...
## $ stream_order      <int> 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ nat_comm          <chr> "Coldwater", "Coldwater", "Coldwater", "Cool-Cold...
## $ temp_class        <chr> "Cold", "Cold", "Cold", "Cool-Cold", "Cool-Cold",...
## $ temp_summer_cl_cc <dbl> 15.84, 15.84, 15.84, 16.85, 16.85, 16.85, 16.85, ...
## $ temp_july_cl_cc   <dbl> 16.26, 16.26, 16.26, 17.34, 17.34, 17.34, 17.34, ...
## $ temp_max_cl_cc    <dbl> 18.53, 18.53, 18.53, 21.00, 21.00, 21.00, 21.00, ...
## $ spr_e10_c         <dbl> 0.74, 0.74, 0.74, 13.70, 13.70, 13.70, 13.70, 17....
## $ spr_e50_c         <dbl> 0.51, 0.51, 0.51, 7.12, 7.12, 7.12, 7.12, 9.44, 9...
## $ spr_e90_c         <dbl> 0.46, 0.46, 0.46, 4.56, 4.56, 4.56, 4.56, 6.41, 6...
## $ sum_e10_c         <dbl> 0.67, 0.67, 0.67, 9.28, 9.28, 9.28, 9.28, 12.50, ...
## $ sum_e50_c         <dbl> 0.44, 0.44, 0.44, 4.62, 4.62, 4.62, 4.62, 6.49, 6...
## $ sum_e90_c         <dbl> 0.44, 0.44, 0.44, 4.02, 4.02, 4.02, 4.02, 5.80, 5...
## $ fal_e10_c         <dbl> 0.60, 0.60, 0.60, 7.48, 7.48, 7.48, 7.48, 10.10, ...
## $ fal_e50_c         <dbl> 0.50, 0.50, 0.50, 4.86, 4.86, 4.86, 4.86, 6.82, 6...
## $ fal_e90_c         <dbl> 0.46, 0.46, 0.46, 4.08, 4.08, 4.08, 4.08, 5.89, 5...
## $ apr_e10_c         <dbl> 0.67, 0.67, 0.67, 11.40, 11.40, 11.40, 11.40, 14....
## $ apr_e50_c         <dbl> 0.42, 0.42, 0.42, 6.68, 6.68, 6.68, 6.68, 8.76, 8...
## $ apr_e90_c         <dbl> 0.42, 0.42, 0.42, 6.18, 6.18, 6.18, 6.18, 8.29, 8...
## $ aug_e10_c         <dbl> 0.68, 0.68, 0.68, 7.39, 7.39, 7.39, 7.39, 10.40, ...
## $ aug_e50_c         <dbl> 0.44, 0.44, 0.44, 4.44, 4.44, 4.44, 4.44, 6.33, 6...
## $ aug_e90_c         <dbl> 0.44, 0.44, 0.44, 3.84, 3.84, 3.84, 3.84, 5.57, 5...
## $ w_area            <dbl> 2.9429, 2.9429, 2.9429, 0.3910, 0.3910, 0.3910, 0...
## $ w_perm            <dbl> 190.899, 190.899, 190.899, 212.569, 212.569, 212....
## $ w_darcy           <dbl> -18.618, -18.618, -18.618, -27.250, -27.250, -27....
## $ w_prcp_ann        <dbl> 851.4, 851.4, 851.4, 847.4, 847.4, 847.4, 847.4, ...
## $ w_temp_ann        <dbl> 7.14, 7.14, 7.14, 6.94, 6.94, 6.94, 6.94, 6.94, 6...
## $ w_temp_gs         <dbl> 15.26, 15.26, 15.26, 15.06, 15.06, 15.06, 15.06, ...
## $ w_temp_july       <dbl> 21.35, 21.35, 21.35, 21.13, 21.13, 21.13, 21.13, ...

4. Exploratory Data Analysis (EDA)

Now we have watershed attributes (at the reach or HUC 16 scale), for each of our cpe estimates. With this, we can do some exploratory data analysis, or EDA.

4.1. Row-wise work

Often we want to do some work on groups in our data. These are often denoted by factor or categorical variables and group together different rows of a data table. For example, the c_huc12 variable groups our CPEs by sub-watershed. We’ll also use the knitr::kable function to make a formatted table output.

Summarize data

df <- 
  df_cpe_attributed %>% 
  mutate(across(c(c_huc12), as.character)) %>% 
  mutate(c_huc12 = str_c("0", c_huc12)) %>% 
  left_join(as_tibble(wdnr.gis::watershed_lookup), by = c("c_huc12" = "huc_codes")) %>% 
  group_by(huc_names, survey.year, species, gear_class) %>% 
  summarise(mean_cpe = mean(cpe, na.rm = TRUE), 
            sd = sd(cpe, na.rm = TRUE)/sqrt(length(cpe)),
            .groups = "drop")

# print a formated table; first 5 rows only
knitr::kable(
  df[1:5,],
  caption = "Summarized CPEs by HUC12 and year"
)

Summarized CPEs by HUC12 and year
huc_names	survey.year	species	gear_class	mean_cpe	sd
Bishop Branch	1993	Brook Trout	Stream - Adult	8.04672	NA
Bishop Branch	1993	Brook Trout	Stream - YOY	8.04672	NA
Bishop Branch	1993	Brown Trout	Stream - Adult	981.69981	NA
Bishop Branch	1993	Brown Trout	Stream - YOY	289.68191	NA
Bishop Branch	2002	Brook Trout	Backpack - Adult	491.04000	NA

For convenience we make a table of the names of our huc12s and we’ve made a column called fname that we’ll use later when saving each graph as a file.

Make output table

# output table with file names
huc_names  <- 
  df %>% 
  select(huc_names) %>% 
  unique() %>%
  mutate(fname = tolower(paste0("figures/huc12-", huc_names , "-cpe")), 
         fname = stringr::str_replace_all(fname, " ", "_"))

huc_names

## # A tibble: 5 x 2
##   huc_names                        fname                                        
##   <chr>                            <chr>                                        
## 1 Bishop Branch                    figures/huc12-bishop_branch-cpe              
## 2 Harrison Creek                   figures/huc12-harrison_creek-cpe             
## 3 Knapp Creek-West Fork Kickapoo ~ figures/huc12-knapp_creek-west_fork_kickapoo~
## 4 Seas Branch                      figures/huc12-seas_branch-cpe                
## 5 West Fork Kickapoo River         figures/huc12-west_fork_kickapoo_river-cpe

Make the plots: group_by() %>% nest() %>% mutate()

All that remains is to do this for every huc12. There are several ways we might do this, depending on whatever else we might have in mind for the plots. We could just write a for() loop that iterates over the names of the HUCs, makes a plot for each one, and saves it out to disk. Or we could use the power of tibbles (a special type of table where a elements can be lists, not just values) and the group_by() %>% nest() %>% mutate() workflow.

That is, we group the data by our huc12 codes and nest the data by those groups (see ?tidyr::nest() for more details on this awesome functionality). You can see that we now have one row in this tibble for each group, and the second column, data is a tibble with all the data for that group in it.

plots <- 
  df %>% 
  group_by(huc_names) %>% 
  nest()

plots

## # A tibble: 5 x 2
## # Groups:   huc_names [5]
##   huc_names                            data             
##   <chr>                                <list>           
## 1 Bishop Branch                        <tibble [54 x 5]>
## 2 Harrison Creek                       <tibble [13 x 5]>
## 3 Knapp Creek-West Fork Kickapoo River <tibble [47 x 5]>
## 4 Seas Branch                          <tibble [29 x 5]>
## 5 West Fork Kickapoo River             <tibble [21 x 5]>

Neat huh!

We use the mutate() funciton to create a new column, and then a map() relative to feed the name of each HUC12 to a ggplot() function and bundle the results up in a tibble. That is, we are doing row-wise work. Like this:

# make the plots
plots <- 
  plots %>% 
  mutate(
    plot = map2(
      data, huc_names, 
      ~ggplot(.x, aes(x = survey.year, y = mean_cpe, 
                      color = gear_class, group = gear_class)) + 
        geom_point() + 
        geom_line() +
        facet_wrap(vars(species), nrow = 2) + 
        scale_color_manual("Gear - Class", values = pal[c(1,2,4,3)]) +
        labs(x = "Survey Year", 
             y = "Trout / mile (± SE)", 
             color = "Length Class")+ 
        ggtitle(paste0(.y)) + 
        theme_bw()
    )) %>% 
  # join to the output table with filenames
  left_join(huc_names, by = "huc_names")

plots

## # A tibble: 5 x 4
## # Groups:   huc_names [5]
##   huc_names                 data         plot  fname                            
##   <chr>                     <list>       <lis> <chr>                            
## 1 Bishop Branch             <tibble [54~ <gg>  figures/huc12-bishop_branch-cpe  
## 2 Harrison Creek            <tibble [13~ <gg>  figures/huc12-harrison_creek-cpe 
## 3 Knapp Creek-West Fork Ki~ <tibble [47~ <gg>  figures/huc12-knapp_creek-west_f~
## 4 Seas Branch               <tibble [29~ <gg>  figures/huc12-seas_branch-cpe    
## 5 West Fork Kickapoo River  <tibble [21~ <gg>  figures/huc12-west_fork_kickapoo~

Neat! We took our little huc12 tibble from above and added a new list-column to it. Each ggplot row is a fully-composed plot, sitting there waiting for us to do something with it. You could of course do something equivalent in Base R with lapply().

Lets plot one:

plots$plot[[3]]

What we’ll do with it is save a png of each plot. We’ll use ggsave() for that. It will need to know the name of the file we’re creating and the object that contains the corresponding plot. To pass that information along, we could use map() again. Or, more quietly, we can use walk(), which is what you do when you just want to stroll down a list, feeding the list elements one at a time to a function in order to produce some side-effect (like saving a file) rather than returning some value or number that you want to do something else with.

To create a named file for each huc12 and have it actually contain the plot we need to provide two arguments: the file name and the plot itself. We assemble a valid file name using the fname column of plots. The plot is in the plot column. When we need to map two arguments to a function in this way, we use map2() or its counterpart walk2().

# walk the walk and plot the plots to file
walk2(paste0(plots$fname, ".png"), 
      plots$plot, 
      ggsave, 
      height = 5, width = 7)

The first argument creates the filename, for example, “figures/bishop_branch-cpe.png”. The second is the corresponding plot for that huc12 The function we feed those two bits of information to is ggsave, and we also pass along a height and width instruction. Those will be the same for every plot.

The end result is a figures/ folder with 5 png files in it.

4.2. Column-wise work

Now we shift to thinking about applying some work to different columns of data. Say, for instance, we have a many predictor variables for explain a response of interest–like trout CPEs. We can easily look at a plot for this:

df_cpe_attributed %>% 
  ggplot(aes(x = spr_e90_c, y = cpe)) + 
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "grey74") +
  facet_wrap(vars(species), nrow = 2) + 
  labs(x = "Spring Q90 stream flow (floods)", y = "Trout / mile)") + 
  ggtitle("Trout CPEs vs. Spring Q90 stream flow (floods)") + 
  theme_bw()

How about automating this for many predictors? The goal is to make scatterplots for cpe vs every explanatory variable we have.

Prep variables

The plan is to loop through the variables and make the desired plots. I’m going to use vectors of the variable names for this, one vector for the response variables and one for the explanatory variables.

names(df_cpe_attributed)

##  [1] "county"            "waterbody.name"    "wbic"             
##  [4] "survey.year"       "survey.seq.no"     "visit.fish.seq.no"
##  [7] "species"           "length_class"      "total_catch"      
## [10] "total_effort"      "cpe"               "cpe_type"         
## [13] "cpe_units"         "swims.station.id"  "latitude"         
## [16] "longitude"         "gear"              "gear_class"       
## [19] "hydro_id"          "reach_id"          "geometry"         
## [22] "hyd_cat"           "c_huc12"           "gradient"         
## [25] "c_length"          "wet_width"         "stream_order"     
## [28] "nat_comm"          "temp_class"        "temp_summer_cl_cc"
## [31] "temp_july_cl_cc"   "temp_max_cl_cc"    "spr_e10_c"        
## [34] "spr_e50_c"         "spr_e90_c"         "sum_e10_c"        
## [37] "sum_e50_c"         "sum_e90_c"         "fal_e10_c"        
## [40] "fal_e50_c"         "fal_e90_c"         "apr_e10_c"        
## [43] "apr_e50_c"         "apr_e90_c"         "aug_e10_c"        
## [46] "aug_e50_c"         "aug_e90_c"         "w_area"           
## [49] "w_perm"            "w_darcy"           "w_prcp_ann"       
## [52] "w_temp_ann"        "w_temp_gs"         "w_temp_july"

response = "cpe"
expl = names(df_cpe_attributed)[30:53]

Make plotting function

Now we make a plotting function to pass over our vectors. The function takes three arguments, dat = a dataframe with all the data, x = the predictor variable, and y = the cpe variable.

scatter_fun = function(dat, x, y) {
     ggplot(data = dat, aes(x = .data[[x]], y = .data[[y]]) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          facet_wrap(vars(species), nrow = 2) +
          labs(x = paste0(x), y = "Trout / mile)") + 
          ggtitle(paste0(x)) + 
          theme_bw()
}

Let’s test our plotting function on one predictor and our response:

scatter_fun(df_cpe_attributed, x = "spr_e90_c", y = "cpe")

Iterate

Now we use map to iterate over the columns, and print one of them.

plots <- map(expl, ~scatter_fun(df_cpe_attributed, .x, "cpe") )

plots[[2]]

Save to file

Finally, save all the files to one single PDF files using Base R’s plot save process.

pdf("figures/all_scatterplots.pdf")
plots
dev.off()

This results in a PDF in the figures folder with all scatterplots in it.

5. Data - environmenatal

5.1 DAYMET - precip

Load data and plot

I’ve saved some DAYMET precipitation data for you. This is total daily precipitation for each reachid in our watershed. Below we load the data, summarize it by year, and plot the data.

# load
df_daymet <- read_rds("data/daymet.rds")

# summarize and save plot to object
p.prcp <- 
  df_daymet %>% 
  group_by(year) %>% 
  summarise(prcp = mean(yr_cum)) %>% 
  ggplot(aes(x = year, y = prcp)) + 
  geom_point() + geom_line() + 
  labs(x = "Year", y = "Precipitaton (mm)") +
  ggtitle("Total average yearly precipitation - West Fork Kickapoo") + 
  theme_bw()

# print plot
p.prcp

Plot with fish data

Now we combine the plots to explore some possible correlations. We will use the patchwork package to compose a panel plot.

# load the package
library(patchwork)

# print the plots, one on top of the other. 
p.trout.trends / p.prcp + 
  # give the trout data more space than the prcp
  plot_layout(heights = c(3,1))

5.2 USGS flow data

USGS data comes from the National Water Information System (NWIS). The R-package “dataRetrieval” simplifies the process of finding and retrieving water from the USGS. See Laura DeCicco’s workshop in this R-Expo for a detailed walkthrough of this package, or this website for Laura’s online tutorial.

We will use it to find USGS gauge station in our HUC10 West Fork watershed, and then to pull summarized daily stream flow data.

Find available data

# load the packge
library(dataRetrieval)

# find available stations and data and tidy it, make it spatial
df_usgs_sites <- 
  # pull available data
  whatNWISdata(
    huc = "07070006",  #  HUC8 Kickapoo
    siteType = "ST",   # data for streams/rivers
    service = "dv",    # daily value data
    parameterCd = c(
      "00010",  # temperature data
      "00060"   # discharge (flow) data
    ),
    statCd = c(
      "00001",  # max
      "00002",  # min
      "00003"   # mean
    )  
  ) %>%
  # rename the data type codes
  mutate(
    parm_cd = case_when(
      parm_cd == "00010" ~ "Temperature", 
      parm_cd == "00060" ~ "Streamflow"), 
    stat_cd = case_when(
      stat_cd == "00001" ~ 'Max', 
      stat_cd == "00002" ~ 'Min',
      stat_cd == "00003" ~ 'Mean'),
    span = 
      lubridate::interval(begin_date, end_date) %>% 
      lubridate::time_length(unit = "year") %>% 
      round(digits = 0)
    ) %>% 
  # keep only stations with end dates after 1990
  filter(end_date >= "1990-01-01") %>% 
  # make spatial object
  st_as_sf(
    coords = c("dec_long_va", "dec_lat_va"), 
    crs = 4326
    ) %>% 
  # only keep those inside our huc10 boundary
  st_intersection(huc10_poly)


# print it
df_usgs_sites %>% 
  st_drop_geometry() %>% 
  select(site_no, parm_cd, stat_cd, begin_date, end_date, span) %>% 
  arrange(parm_cd, stat_cd, begin_date)

##    site_no    parm_cd stat_cd begin_date   end_date span
## 1 05408480 Streamflow    Mean 2010-09-30 2017-09-29    7
## 2 05408476 Streamflow    Mean 2010-10-01 2015-06-24    5

We can see there are only two with any recent long term data. This is expected–most USGS gages are found at the HUC8 scale. Let’s plot them to see where they are in the watershed:

df_usgs_sites %>% 
  ggplot() + 
  geom_sf(data = huc12_polys, fill = "white", color = "black") +
  geom_sf(data = huc12_flines %>% filter(stream_order > 1), color = "blue") +
  geom_sf(data = huc10_poly,color = "black", alpha = 0) +
  geom_sf(data = df_surveys_sites,  
          shape = 21, size = 2, color = "black", fill = "black", alpha = 0.25) + 
  # add usgs sites
  geom_sf(shape = 21, size = 2, color = "black", fill = "red", alpha = 0.75) + 
  ggtitle("USGS flow sites") + 
  theme_bw()

Download flow data

Let pull the data:

# list of site numbers
usgs_sites <-
  df_usgs_sites %>% 
  st_intersection(huc10_poly) %>%
  pull(site_no)

# parameter code for flow data
pCodes <- "00060"

# pull data
df_usgs_flow <- 
  readNWISdv(usgs_sites, pCodes) %>% 
  renameNWISColumns() %>% 
  clean_names() %>% 
  select(-ends_with("_cd")) %>% 
  as_tibble()

Clean data

And now plot the flow data over time:

df_usgs_flow %>% 
  ggplot(aes(x = date, y = flow, color = site_no)) +
  geom_line(size=1.5) + 
  viridis::scale_color_viridis(discrete = TRUE) + 
  labs(x = "Year", y = "Flow (cfs)")  +
  theme_bw()

So they only data that we can use is from site no 05408480. Let’s filter it out and look at that data.

df_usgs_flow <- 
  df_usgs_flow %>% 
  filter(site_no == "05408480")

df_usgs_sites %>% 
  filter(site_no == "05408480") %>% 
  ggplot() + 
  geom_sf(data = huc12_polys, fill = "white", color = "black") +
  geom_sf(data = huc12_flines %>% filter(stream_order > 1), color = "blue") +
  geom_sf(data = huc10_poly,color = "black", alpha = 0) +
  geom_sf(data = df_surveys_sites,  
          shape = 21, size = 2, color = "black", fill = "black", alpha = 0.25) + 
  # add usgs sites
  geom_sf(shape = 21, size = 2, color = "black", fill = "red", alpha = 0.75) + 
  labs(x = "Year", y = "Flow (cfs)")  +
  ggtitle("USGS flow sites") + 
  theme_bw()

df_usgs_flow %>% 
  ggplot(aes(x = date, y = flow, color = site_no)) +
  geom_line(size=1.5) + 
  scale_x_date(date_labels = "%Y", date_breaks = "1 year",) + 
  viridis::scale_color_viridis(discrete = TRUE) + 
  labs(x = "Year", y = "Flow (cfs)")  +
  theme_bw()

So, we have one site located fairly high in the watershed with a few years of data.

6. Practice problem

Can you figure out how to link this USGS flow data back to our fish data?

I bet you can :)

Working with Fish and Stream Data in R

R-Expo Workshop Wisconsin AFS 2021

Bryan M. Maitland (bryan.maitland@wisconsin.gov)

8 Feb 2021

1. Introduction

Case Study: West Fork of the Kickapoo

1.1. Prerequisites

1.1.1. R

1.1.2. RStudio

1.1.3. The Tidyverse packages

1.1.4. Other packages

1.1.5. R Projects

2. Data - Watersheds, streams, and fish

2.1. Spatial Data - watershed boundaries and streams

WI DNR 24K Hydro

Watershed polygons

24k Hydro Flowlines (streams)

Plot streams

2.2. Fish data

2.2.1. Surveys and Efforts

Load data

Clean data

Plot survey sites

2.2.2. Fishraw data

Load data

Clean data

Plot trout CPEs

2.3. Watershed atributes

2.3.1. 24K Hydro Attribution

2.3.2. Watershed data

Load data

Clean data

3. Cross reference

3.1. Xwalk table

3.2. Spatial join

3.3. Link fish to watershed

4. Exploratory Data Analysis (EDA)

4.1. Row-wise work

Summarize data

Make output table

Make the plots: group_by() %>% nest() %>% mutate()

4.2. Column-wise work

Prep variables

Make plotting function

Iterate

Save to file

5. Data - environmenatal

5.1 DAYMET - precip

Load data and plot

Plot with fish data

5.2 USGS flow data

Find available data

Download flow data

Clean data

6. Practice problem

R-Expo Workshop
Wisconsin AFS 2021