--- title: Introduction to streamDAG author: Ken Aho date: "`r format(Sys.time(), '%d %B, %Y')`" bibliography: MMEbib.bib output: bookdown::html_document2: base_format:rmarkdown::html_vignette: number_sections: false toc: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Introduction to streamDAG} %\VignetteEncoding{UTF-8} --- The *streamDAG* package provides indices and tools for analyzing directed acyclic graph (DAG) representations of intermittent stream networks. The DAG framework allows a wide range of analytical approaches for intermittent streams including classic measures from hydrology, ecology, and of course, graph theory. A focus of many *streamDAG* algorithms is the measurement of 1) "local" arc (stream segment) and nodal (inter-arc) characteristics, and 2) network-level complexity and connectivity. While many of these approaches are purely topological, a non-trivial number of DAG indices, particularly weighted approaches, will provide outcomes identical to existing hydrological (non-graph-theoretic) measures for streams. These include Integral Connectivity Scale Length (ICSL) and its variants [@western2001]. Perennial streams (and even non-stream networks) can be potentially analyzed with *streamDAG* algorithms. However, the major motivator for the package was the development of procedures that consider the spatio-temporal variability of intermittent stream networks. As a result most *streamDAG* algorithms assume that graphs are directed (from upstream to downstream). Thus, these functions may produce errors if directed graphs are used without checking function arguments. The *streamDAG* package maintainer is Ken Aho (ahoken@isu.edu). An introduction to the *streamDAG* package can be found in [@aho2023DAG]. The *streamDAG* package is built under the basic idiom of the *igraph* package [@csardi], and most of its functions utilize *igraph* basis algorithms. Newest (developmental) versions of *streamDAG* can be obtained from the public GitHub repository: https://github.com/moondog1969/streamDAG. Installation of *streamDAG* from GitHub requires installation of the package *devtools*. In particular, use: ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, comment = NA) ``` ```{r, echo = F, message = F, warning = F} library(igraph) library(knitr) ``` ```{r,warning = FALSE, message = FALSE, eval = F} # install.packages("devtools") library(devtools) install_github("moondog1969/streamDAG") ``` Stable versions of the package will also be housed on the Comprehensive R Archive Network [(CRAN)](https://cran.r-project.org/) beginning in September 2023. Following this inception (version 1.4-4), *streamDAG* can be installed directly from CRAN mirrors using: ```{r,eval = F} install.packages("streamDAG") ``` After installing *streamDAG*, the package can be loaded into R conventionally: ```{r, warning = FALSE} library(streamDAG) ``` # Introductory Examples of Usage ## Murphy Creek We begin with an in-depth demonstration of the *streamDAG* package using Murphy Creek, a very simple intermittent stream in the Reynolds Creek experimental watershed in southwestern Idaho (Fig \@ref(fig:mc)). From 6/3/2019 to 10/3/2019, stream presence data were acquired at 15 minute intervals from 25 Murphy Creek nodes, corresponding to 24 stream segment arcs. Bounding nodes were added to encompass the entire length of the network. This resulted in a final Murphy Creek network with 28 nodes and 27 arcs for analysis. ```{r mc, fig.cap = "The Reynolds Cr. experimental watershed in SW Idaho.", echo = F} knitr::include_graphics("rc.jpg", dpi = 150) ``` \subsubsection{Data Outlay} Purely topological analyses can be conducted in *streamDAG* using only an *igraph* codified stream network. Below is a codification of Murphy Creek based on nodes established by @warix2021. The code `IN_N --+ M1984` indicates that the stream flows from node `IN_N` to node `M1984`, and so on. ```{r} murphy_spring <- graph_from_literal(IN_N --+ M1984 --+ M1909, IN_S --+ M1993, M1993 --+ M1951 --+ M1909 --+ M1799 --+ M1719 --+ M1653 --+ M1572 --+ M1452, M1452 --+ M1377 --+ M1254 --+ M1166 --+ M1121 --+ M1036 --+ M918 --+ M823, M823 --+ M759 --+ M716 --+ M624 --+ M523 --+ M454 --+ M380 --+ M233 --+ M153, M153 --+ M91 --+ OUT) ``` This code is contained as an option in the function `streamDAGs` which also codifies other intermittent stream *igraph* objects. ```{r} streamDAGs("mur_full") ``` Much more flexibility in *streamDAG* functions is possible by defining stream spatial coordinates and graph weighting data, including stream lengths, nutrient loading, and information about stream segment presence (wet) or absence (dry). The *streamDAG* package contains additional Murphy Creek data, including nodal spatial coordinates (UTMs), stream arc (segment) lengths, and stream arc presence absence data. Instream lengths and distances can be obtained from a number of sources including ARC-GIS. Stream presence can be ascertained using a number of methods, including conductivity and temperature sensors. ```{r} data(mur_coords) # Node spatial coords data(mur_lengths) # Arc (stream segment) lengths data(mur_node_pres_abs) # Node presence / absence data with explicit datetimes data(mur_arc_pres_abs) # Arc (stream segment) simulated presence / absence data ``` Care should be taken to ensure that the order of relevant rows and columns and elements in ancillary data correspond to the order of nodes and arcs defined in the underlying graph, `G` with the functions `igraph::V` (which gives nodes) and `A` or `igraph::E` (which give arcs). Within ancillary datasets, different code identifiers can be used to designate arcs. For instance, for an arc $z = \vec{uv}$ where $u$ is the tail of arc $z$ and $v$ is the head of $z$, we could code: `u--+v` or `u-->v` or `u --> v` or `u->v`, or even `u|v`. The important thing is that the ordering is consistent with the arcs in the corresponding graph object. For instance, here are the first six arc names for the graph object `murphy_spring`. ```{r} head(A(murphy_spring)) ``` Note that these correspond to the identifiers for the first six stream lengths (in the first six rows) from `mur_lengths`. ```{r} head(mur_lengths) ``` Naming of nodes should be consistent with the node names in the corresponding graph object. For instance, here are the first six graph node names from `murphy_spring`. ```{r} head(V(murphy_spring)) ``` The naming (and order) correspond to the first six identifiers (column names in this case) for presence absence data from `mur_node_pres_abs`. ```{r} names(mur_node_pres_abs)[1:7][-1] # ignoring datestamp column 1 ``` ### Spatial Plots It is easy to depict a spatially explicit stream DAG using the *streamDAG* function `spatial.plot`. We can make a spatial plot by augmenting graph data with nodal spatial coordinates (Fig \@ref(fig:sp1)). ```{r sp1, fig.cap = "Spatially explicit graph of the completely wetted Murphy Cr. network, as it occurs in the spring."} x <- mur_coords[,2]; y <- mur_coords[,3] names = mur_coords[,1] spatial.plot(murphy_spring, x, y, names, cex.text = .7) ``` ARC-GIS shapefiles can also be used to generate spatial plots with the function `spatial.plot.sf`. Use of shapefiles requires use of the libraries libraries *ggplot2* and *sf*, and resulting graphs can be customized using *ggplot2* modifiers (Fig \@ref(fig:sp2)). Use of shapefiles will eliminate some of the easy to easy-to-use features in `spatial.plot` including directional arrows indicating flow and the automated deletion of arcs and nodes with presence / absence data (see Section 2.3). ```{r sp2, fig.cap = "Example of using a shapefile with `spatial.plot.sf`.", message = F, warning = F} library(ggplot2); library(sf); library(ggrepel) # Note that the directory "shape" also contains required ARC-GIS .shx,.cpg, and .prj files. mur_sf <- st_read(system.file("shape/Murphy_Creek.shp", package="streamDAG")) g1 <- spatial.plot.sf(x, y, names, shapefile = mur_sf) ## some ggplot customizations g1 + expand_limits(y = c(4788562,4789700)) + theme(plot.margin = margin(t = 0, r = 10, b = 0, l = 0)) + geom_text_repel(data = mur_coords, aes(x = x, y = y, label = Object.ID), colour = "black", size = 1.6, box.padding = unit(0.3, "lines"), point.padding = unit(0.25, "lines")) ``` ### Tracking Intermittency The activity of stream nodes and/or arcs (segments) can be easily tracked in stream graphs based on STIC or conductivity data using the *streamDAG* functions `delete.arcs.pa` and `delete.nodes.pa`. For instance, the dataset `mur_node_pres_abs` contains a subset of nodal presence \ absence data for Murphy Creek in 2019. Below we see rows for time series observations 650 to 655. ```{r} mur_node_pres_abs[650:655,] ``` Modifying `murphy_spring` based on the nodal observations at 8/9/2019 22:30 we have: ```{r} npa <- mur_node_pres_abs[650,][,-1] G1 <- delete.nodes.pa(murphy_spring, npa) ``` The resulting spatial plot is shown as Fig \@ref(fig:sp4). Note that nodes without water are now omitted from the graph. Arcs missing one or more bounding nodes are also omitted. ```{r sp4, fig.cap = "Plotting a modification of `murphy_spring` after application of `delete.nodes.pa`."} spatial.plot(G1, x, y, names, cex.text = .7) ``` There are several graphical approaches for distinguishing "dry" and "wet" stream locations. The simplest is to simply show "wet" nodes and arcs bounded by "wet" nodes as in Fig \@ref(fig:sp4). One can also show "dry" node locations by specifying `show.dry = TRUE` (Fig \@ref(fig:s4)). ```{r s4, fig.height = 5.5, fig.width = 6.5, fig.cap = "Dry nodes overlaid on `murphy_spring`. "} spatial.plot(G1, x, y, names, plot.dry = TRUE, cex.text = .7) ``` Finally, one can show "wet" nodes and associated arcs superimposed over the entire network, which includes, potentially, "dry" nodes and arcs. This approach requires generation of `spatial.plot` object representing the entire network, and specification of this object using the argument `cnw` i.e., complete network (Fig \@ref(fig:s42)). ```{r, s42, fig.height = 5.5, fig.width = 6.5, fig.cap = "Dry portions of the network underlying wet nodes and associated arcs."} spc <- spatial.plot(murphy_spring, x, y, names, plot = FALSE) spatial.plot(G1, x, y, names, plot.dry = TRUE, cex.text = .7, cnw = spc) ``` One can also modify graphs based on arc presence / absence data. The dataframe `mur_arc_pres_abs` contains simulated multivariate Bernoulli datasets for Murphy Cr. arcs based on 2019 nodal data. ```{r} head(mur_arc_pres_abs) # 1st 6 rows of data ``` Modifying `murphy_spring` arcs based on the 6th simulated multivariate Bernoulli dataset of arc presence / absence, we have: ```{r} G2 <- delete.arcs.pa(murphy_spring, mur_arc_pres_abs[6,]) ``` The resulting spatial plot is shown in Fig \@ref(fig:sp3). Note that all nodes are plotted, but plotted arcs are limited to those with recorded stream activity. ```{r sp3, fig.height = 5.5, fig.width = 6.5, fig.cap = "Plotting a modified version of `murphy_spring` after application of `delete.arcs.pa`."} spatial.plot(G2, x, y, names, cex.text = .7) ``` ### Unweighted (Purely Topological) Measures for Stream DAGs There are many measures useful for describing and distinguishing intermittent stream networks that are based solely on graph topological features (i.e., the presence or absence of nodes and adjoining arcs). These can be separated into local measures that describe the characteristics of individual stream nodes or arcs, and global measures that summarize the characteristics of an entire network, i.e., the entire graph. #### Local measures A number of local measures are included in the *streamDAG* function `local.summary`. The function only requires an *igraph* graph object. ```{r} local <- local.summary(murphy_spring) round(local, 2)[,1:9] ``` A graphical summary based only on measures with complete cases and standardized outcomes is shown in Fig \@ref(fig:local1). Nodes along the x-axis are sorted based on their order in the `murphy_spring` *igraph* object, which roughly corresponds to their order from sources to sink. In general, nodes increase in information and importance as distance to the sink decreases. Note, however, the "unusual" importance of M1909 due to its location at a confluence (Fig \@ref(fig:sp1)). ```{r local1, fig.cap = "Local graph-theoretic summaries for `murphy_spring`"} cc <- complete.cases(local) local.cc <- local[cc,] s.local <- t(scale(t(local.cc))) ss.local <- stack(data.frame(s.local)) ss.local$metrics <- rep(row.names(local.cc), 28) # theme used throughout vignette theme_facet <- function(){ theme(strip.background = element_blank(), strip.text.x = element_text(size = 10), axis.title.y = element_text(margin = margin(r = .2, unit = "in"), size = 11.5), axis.title.x = element_text(margin = margin(r = .2, unit = "in"), size = 11.5), panel.background = element_rect(fill = "white", colour = "black", linewidth = 0.5), legend.position="none", panel.grid.major = element_blank(), panel.spacing = unit(0.2, "lines"), strip.placement = "inside", axis.ticks = element_line(colour = "black", linewidth = .2), panel.grid.minor = element_blank()) } ggplot(ss.local, aes(x = ind, y = values)) + geom_bar(stat = "identity") + facet_wrap(~metrics) + theme_facet() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size=4.3)) + ylab("Standardized local measures") + xlab("\nNode") ``` ##### Horizontal visibility graphs A less frequently used, but potentially important tool for measuring nodal importance is the horizontal visibility graph [@luque2009]. Two nodes will be *visible* from each other if, when node data (e.g., degrees) are plotted as horizontal bars along the abscissa axis, and placed along the ordinate based on their location in the stream path, the bars can be connected with a horizontal line [@luque2009]. Note that the importance of M1909 in a visibility analysis based on indegree (Fig \@ref(fig:local2)). Weighted (see below) node visibilities can also be obtained with `multi.path.visibility`. ```{r local2, fig.cap = "Nodal visibilities for `murphy_spring` based on nodal indegree."} vis <- multi.path.visibility(murphy_spring, source = c("IN_N","IN_S"), sink = "OUT", autoprint = F) barplot(vis$visibility.summary, las = 2, cex.names = .6, ylab = "Visible nodes", legend.text = c("Downstream", "Upstream", "Both"), args.legend = list(x = "topright", title = "Direction")) ``` #### Global measures Global graph-theoretic measures allow consideration of a stream network in its entirety. Many popular global graph-theoretic measures can be called using the *streamDAG* function `global.summary`. These metrics have been designed expressly to quantify network connectivity, complexity, and, in the case of assortativity, degree trends. ```{r} gp <- print(global.summary(murphy_spring, sink = "OUT")) ``` It may be informative to track changes in global metrics (and local metrics) over time. Fig \@ref(fig:global1) shows a 100 point time series that spans the entire 2019 sampling season. As in Fig \@ref(fig:local1), metrics are standardized to have a mean of zero and a variance of one. Note higher scores for most metrics occur during the spring and a re-wet period during the fall, indicating higher network connectivity. ```{r global1, fig.cap = "Summaries for `murphy_spring` for a subset of global metrics."} subset <- mur_node_pres_abs[seq(1,1163, length = 100),] subset.nodate <- subset[,-1] # walk global.summary through node presence / absence data global <- matrix(ncol = 23, nrow = nrow(subset)) for(i in 1:nrow(subset)){ global[i,] <- global.summary(delete.nodes.pa(murphy_spring, subset.nodate[i,], na.response = "treat.as.1"), sink = "OUT") } scaled.global <- data.frame(scale(global)) names(scaled.global) <- rownames(gp) sub <- scaled.global[,-c(1,2,4,6,10,16,18,23)] stsg <- stack(sub) tim <- as.POSIXct(strptime(subset[,1], format = "%m/%d/%Y %H:%M")) stsg$time <- rep(tim,15) # standardize measures ggplot(stsg, aes(y = values, x = time)) + geom_line(linewidth = 0.75) + #geom_hline(aes(yintercept = 0), colour = gray(0.2)) + facet_wrap(~ind) + theme_facet() + ylab("Standardized global measures") + xlab("") ``` ### Weighted (Reality-Driven) DAG Approaches Purely topological measures may be useful in describing the importance of individual stream nodes along with network-level connectivity and complexity. However, they will be strongly affected by user-defined node designations and abstracted from many important characteristics of stream networks. To account for this, increased realism in stream DAGs can be achieved by adding information to nodes and/or arcs in the form of *weights*. In fact, weighted DAG measures will result in indices similar or identical to existing connectivity metrics from the hydrological literature, e.g., Integral Connectivity Scale Length, (ICSL; @western2001), Bernoulli stream length [@botter2020]. Weighting information particularly relevant to intermittent stream DAGs include flow rates, stream lengths, and arc or node probabilities of activity. In Fig \@ref(fig:weighted1) Murphy Cr. arcs are colored based on their average probabilities for persistence in 2019. As with non-weighted metrics, both local and global summaries are possible. ```{r weighted1, fig.cap = "Murphy Cr. arcs colored by their probabilities of surface water presence (brown = low probability, blue = high probability)."} prob <- apply(mur_arc_pres_abs, 2, mean) o2 <- order(prob) o3 <- order(o2) col <- hcl.colors(27, palette = "Vik", rev = T)[o3] spatial.plot(murphy_spring, x, y, names, arrow.col = col, arrow.lwd = 1.5, col = "lightblue", cex.text = .7, plot.bg = "white") ``` #### Local measures Conventional weighted measures of nodal importance include strength (weighted degree) and weighted alpha-centrality. Code for calculating these measures using stream length and stream probability as weights are shown below through use of the functions `igraph::strength` and `igraph::alpha.centrality` with respect to the completely wetted Murphy Cr. network (Fig \@ref(fig:sp1)). Summary plots are shown as Fig \@ref(fig:localw1) and \@ref(fig:localw11). ```{r localw1, fig.cap = "Node strength and alpha-centrality using stream segment length and stream segment probability of activity (separately) as weights."} G3 <- murphy_spring E(G3)$weight <- mur_lengths[,2] s1 <- strength(G3) a1 <- alpha.centrality(G3) E(G3)$weight <- prob s2 <- strength(G3) a2 <- alpha.centrality(G3) weighted.local <- cbind(s1, a1, s2, a2) s.weighted.local <- scale(weighted.local) colnames(s.weighted.local) <- c("Strength_length", "Alpha-centrality_length", "Strength_prob", "Alpha-centrality_prob") ssw <- stack(data.frame(s.weighted.local)) ssw$node <- rep(rownames(s.weighted.local), 4) # standardize outcomes ggplot(ssw, aes(x = node, y = values)) + geom_bar(stat = "identity") + facet_wrap(~ind) + theme_facet() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size=4.3)) + ylab("Standardized weighted local measures") + xlab("\nNode") ``` Weights address bias that may occur in designating nodes in stream networks. For instance path lengths can be made arbitrarily large by adding more nodes to paths. This effect, however, will be largely addressed if arcs are weighted by their actual field measured lengths. ```{r localw11, fig.cap= "In-path length summaries after weighting arcs by actual in-stream lengths."} library(asbio) G3 <- murphy_spring E(G3)$weight <- mur_lengths[,2] nodes <- attributes(V(G3))$names list.paths <- vector(mode='list', length = length(nodes)); names(list.paths) <- nodes for(i in 1:length(nodes)){ list.paths[[i]] <- spath.lengths(G3, node = nodes[i]) } mean <- as.matrix(unlist(lapply(list.paths, mean))) median <- as.matrix(unlist(lapply(list.paths, median))) var <- as.matrix(unlist(lapply(list.paths, var))) skew <- as.matrix(unlist(lapply(list.paths, skew))) kurt <- as.matrix(unlist(lapply(list.paths, kurt))) path.summary <- data.frame(Mean = mean, Median = median, Variance = var, Skew = skew, Kurtosis = kurt) cc <- complete.cases(path.summary) no.na <- path.summary[cc,]; scale.no.na <- data.frame(scale(no.na)) sna <- stack(data.frame(scale.no.na)) sna$node <- rep(row.names(scale.no.na), 5) ggplot(sna, aes(x = node, y = values)) + geom_bar(stat = "identity") + facet_wrap(~ind) + theme_facet() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size=4.3)) + ylab("Standardized weighted local measures") + xlab("\nNode") ``` Stream-focused measures that consider both arc probability and arc length include Bernoulli stream length (i.e., stream segment length multiplied by the probability of stream presence) and communication distance (i.e., stream segment length multiplied by the inverse probability of stream presence). Thus, while most local graph measures are defined with respect to graph nodes (despite the fact that some nodal metrics (e.g., strength and alpha centrality) have arc weights), Bernoulli length and communication distance are defined with respect to graph arcs. Note that in Fig \@ref(fig:localw2), Bernoulli stream length and communication distance are negatively correlated because of their basis on the probability of arc presence and inverse arc presence, respectively. Large communication distance at an arc implies a higher probability of a stream bottleneck at that location. ```{r localw2, fig.cap = "Bernoulli length and communication distance using stream segment length and stream segment probability of activity (collectively) as weights"} bsl <- bern.length(mur_lengths[,2], prob) # Bernoulli length bcd <- bern.length(mur_lengths[,2], 1/prob) # Comm dist. both <- cbind(bsl, bcd) scale.both <- scale(both) # standardize outcomes oldpar <- par(mar = c(7,4.5,1.5,1.5)) # allow full arc names to be seen barplot(t(scale.both), beside = T,las = 2, cex.names = .8, legend.text = c("Bernoulli length", "Commincation distance"), args.legend = list(x = "topright", cex = .9), ylab = "Standardized measures") par(oldpar) ``` #### Global measures Many existing network-level stream connectivity metrics can be viewed as weighted stream DAG measures. These include Integral Connectivity Scale Length, network-level average Bernoulli stream length and average network communication distance. Here we calculate network level average Bernoulli stream length and average network communication distance for Murphy Creek. Units, are the units of measured in-stream lengths; in this case, meters. ```{r} bern.length(mur_lengths[,2], prob, mode = "global") # Bernoulli length bern.length(mur_lengths[,2], 1/prob, mode = "global") # Comm dist. ``` Here is stream-length based ICSL (average in-stream distance of nodes), and the average Euclidean distance of nodes, for the completely wetted network, represented in `murphy_spring` (Fig \@ref(fig:sp1)). ```{r} # in-stream average nodal distance ICSL(murphy_spring, lengths = mur_lengths[,2]) # average nodal Euclidean distance ICSL(murphy_spring, coords = mur_coords[,2:3], names = mur_coords[,1]) ``` As with unweighted metrics, it may be informative to track weighted global (and local) metrics over time. Below we consider: ICSL, intact stream length to the node, and average alpha-centrality (with stream lengths as arc weights) for Murphy Creek graphs resulting from the stream node presence / absence time series data used earlier (Fig \@ref(fig:globalw1)). ```{r globalw1, fig.cap = "Global weighted network connectivity measures for Murphy Cr. over time."} icsl <- 1:nrow(subset) -> intact.to.sink -> a.cent -> harary # walk global.summary through node presence / absence data for(i in 1:nrow(subset)){ temp.graph <- delete.nodes.pa(murphy_spring, subset.nodate[i,], na.response = "treat.as.1") # replace direction symbol for igraph comparability namelv <- gsub(" -> ", "|", mur_lengths[,1]) a <- attributes(E(temp.graph))$vname w <- which(namelv %in% a) length.sub <- mur_lengths[,2][w] icsl[i] <- ICSL(temp.graph, lengths = length.sub) E(temp.graph)$weights <- length.sub intact.to.sink[i] <- size.intact.to.sink(temp.graph, "OUT") a.cent[i] <- mean(alpha.centrality(temp.graph), na.rm = T) harary[i] <- harary(temp.graph) } global <- cbind(icsl, intact.to.sink, a.cent, harary) # standardize measures scaled.global <- scale(global) oldpar <- par(mar = c(7,4.2,1.5,2)) # plot matplot(scaled.global, xaxt = "n", type = "l", col = hcl.colors(4, palette = "spectral"), ylab = "Standardized global measures", lty = 1:2) legend("topright", lty = 1:2, col = hcl.colors(4, palette = "spectral"), legend = c("ICSL", "intact stream length to sink", "alpha-centrality", "Harary"), cex = .8) axis(side = 1, at = c(1,21,41,61,81,100), labels = subset[,1][c(1,21,41,61,81,100)], las = 2, cex.axis = .7) mtext(side = 1, "Time", line = 6) par(oldpar) ``` The dataframe `mur_seasons_arc_pa` contains simulated arc presence/absence data for the spring, summer, and fall, represented as equal subdivisions of the sampling period. Specifically, the three time periods were: spring (6/3/2019 - 7/13/2019), and summer (7/13/2019 - 8/23/2019), and fall (8/23/2019 - 10/2/2019). ```{r} data(mur_seasons_arc_pa) ``` Fig \@ref(fig:globalw2) shows histograms of distributions of Bernoulli stream lengths in the spring, summer, and fall. Note that the fall rewet is not captured because of the coarse cutoffs used for seasons. ```{r globalw2, fig.cap = "Distributions of Bernoulli network lengths for the seasonal designations."} springL <- matrix(nrow = 100, ncol = 27) -> summerL -> fallL for(i in 1:100){ springL[i,] <- bern.length(mur_lengths[,2], mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Spring",][i,], "global") summerL[i,] <- bern.length(mur_lengths[,2], mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Summer",][i,], "global") fallL[i,] <- bern.length(mur_lengths[,2], mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Fall",][i,], "global") } xlim <- range(c(springL, summerL, fallL), na.rm = T) h <- hist(springL, plot = F) ylim <- range(h$counts) col <- rgb(c(0,0.5,1), c(0,1,0.5), c(1,0.5,0), c(0.4,0.4,0.4)) hist(springL, xlim = xlim, ylim = ylim, main = "", xlab = "Bernoull network length (m)", col = col[1], border = col[1]) oldpar <- par(new = TRUE) hist(summerL, xlim = xlim, ylim = ylim, axes = F, main = "", xlab = "", col = col[2], border = col[2]) oldpar <- par(new = TRUE) hist(fallL, xlim = xlim, ylim = ylim, axes = F, main = "", xlab = "", col = col[3], border = col[3]) legend("topleft", fill = col, legend = c("Spring", "Summer", "Fall"), bty = "n", cex = 1) par(oldpar) ``` Here are average network-level Bernoulli stream lengths and communication distances in the spring, summer and fall. Note the presence of infinitely large network-level communication distances in the fall and summer due to the presence of network blockages. ```{r} mean(springL) # mean spring network length mean(summerL) # mean summer network length mean(fallL) # mean fall network length # mean spring network communication distance bern.length(mur_lengths[,2], 1/colMeans(mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Spring",], na.rm = TRUE), "global") # mean summer network communication distance bern.length(mur_lengths[,2], 1/colMeans(mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Summer",], na.rm = TRUE), "global") # mean fall network communication distance bern.length(mur_lengths[,2], 1/colMeans(mur_seasons_arc_pa[,1:27][mur_seasons_arc_pa$Season == "Fall",], na.rm = TRUE), "global") ``` #### Bayesian Extensions Bayesian extensions are possible for Bernoulli length and communication distance by viewing the probabilities of stream presence at arcs as random variables. The underlying theory for these approaches is described in Aho et al. (in review). Briefly, given a beta-distribution prior (and a binomial likelihood), the posterior beta distribution for the probability of stream presence for the $k$th arc can have the form: \begin{equation} \theta_k \mid \boldsymbol{x}_k \sim BETA \left(w \cdot n \cdot \hat{p}_{k}+ \sum\boldsymbol{x}_k,w\cdot n\left(1-{\hat{p}}_{k}\right)+n-\sum\boldsymbol{x}_k\right) (\#eq:beta) \end{equation} where $w$ is the weight given to the prior relative to the current data, and $p_k$ is the mean of the prior beta distribution. The posterior distribution for the inverse probability of stream presence for the $k$th arc will follow an inverse beta distribution (see Aho et al. (in prep)) with the same parameters shown in Eq \@ref(eq:beta). Multiplying the $k$th posterior for the probability of stream presence and the $k$th posterior for the inverse probability of stream presence by the $k$th stream length will provide posteriors for Bernoulli stream length and communication distance, respectively. This process is facilitated by the *streamDAG* function `beta.posterior`. Assume that we wish to apply a naive Bayesian prior, $\theta_k \sim BETA(1,1)$, to the probability of stream segment activity at Murphy Cr., for all segments. The distribution $BETA(1,1)$ is equivalent to a continuous uniform distribution in 0,1, and will have the mean, $E(\theta_k)= 0.5$. Assume further that wish to give the priors 1/3 of the weight of observed binomial outcomes. As data we will use the first 10 rows from `mur_arc_pres_abs`. We have: ```{r} data <- mur_arc_pres_abs[1:10,] b <- beta.posterior(p.prior = 0.5, dat = data, length = mur_lengths[,2], w = 1/3) ``` The `beta.posterior` function returns a list with the following components: * `alpha`: The $\alpha$ shape parameters for the beta and inverse beta posteriors. * `beta`: The $\beta$ shape parameters for the beta and inverse beta posteriors. * `mean`: The means of the beta posteriors. * `var`: The variances of the beta posteriors. * `mean.inv`: The means of the inverse-beta posteriors. * `var.inv`: The variances of the inverse-beta posteriors. * `Com.dist`: If `length` is supplied, the mean communication distances of the network. * `Length`: If `length` is supplied, the mean stream length of the network. For instance, here are the resulting shape parameters for the beta posterior distributions for the probability of stream presence and the inverse beta posterior distributions for the probability of stream presence. ```{r} b$alpha b$beta ``` We can use this information to depict arc posteriors for the probability of stream presence Fig \@ref(fig:globalw3), and the inverse probability of stream presence Fig \@ref(fig:globalw4). Multiplying the former distributions by their respective stream lengths will give average Bernoulli stream lengths for the segments. Multiplying the latter distributions by their respective stream lengths will give average communication distances for the segments. ```{r globalw3, fig.cap = "Graphical summaries of posterior beta distributions for Murphy Creek stream segments from 06/01/2019 to 10/01/2019. The posteriors represent distributions of probabilities of stream activity. Arc distributions are colored by their mean values (darker distributions have smaller means). The posterior means are overlain on the distributions with dashed lines."} means <- b$alpha/(b$alpha + b$beta) col <- gray(means/max(means)) oldpar <- par(mfrow = c(6,5), oma = c(4,4.5, 0.1, 1), mar = c(0,0,1.2,0.6)) for(i in 1:27){ x <- seq(0.,1,by = .001) y <- dbeta(x, b$alpha[i], b$beta[i]) n <- length(x) plot(x, yaxt = ifelse(i %in% c(1,6,11,16,21,26), "s", "n"), xaxt = ifelse(i %in% 23:27, "s", "n"), type = "n", xlim = c(0,1), ylim = c(0,5.5), cex.axis = .8) polygon(c(x, x[n:1]), c(y, rep(0,n)), col = col[i], border = "grey") segments(means[i], 0, means[i], dbeta(means[i], b$alpha[i], b$beta[i]), lty = 2) mtext(side = 3, names(b$beta)[i], cex = .5) } #axis labels mtext(side = 2, outer = T, expression(paste(italic(f),"(",theta[italic(k)],"|",italic(x[k]),")")), line = 2.5) mtext(side = 1, outer = T, expression(paste(theta[italic(k)],"|",italic(x[k]))), line = 2.5) par(oldpar) ``` ```{r globalw4, fig.cap = "Graphical summaries of posterior inverse beta distributions for Murphy Creek stream segments from 06/01/2019 to 10/01/2019. The posteriors represent distributions of inverse probabilities of stream activity. Arc distributions are colored by their distributional mean values (darker distributions have smaller inverse probabilities). The posterior means are overlain on the distributions with dashed lines."} means <- (b$alpha + b$beta - 1)/(b$alpha - 1) col <- gray(means/max(means)) oldpar <- par(mfrow = c(6,5), oma = c(4,4.5, 0.1, 1), mar = c(0,0,1.2,0.6)) for(i in 1:27){ if(i %in% 1:10){lim <- c(0,1.1)} else { if(i %in% 11:15){lim <- c(0,2.3)} else {lim <- c(0,5)}} x <- seq(1,30,by = .01) y <- dinvbeta(x, b$alpha[i], b$beta[i]) n <- length(x) plot(x, yaxt = ifelse(i %in% c(1,6,11,16,21,26), "s", "n"), xaxt = ifelse(i %in% 23:27, "s", "n"), type = "n", xlim = c(1,15), ylim = lim, cex.axis = .8, log = "x") polygon(c(x, x[n:1]), c(y, rep(0,n)), col = col[i], border = "grey") segments(means[i], 0, means[i], dinvbeta(means[i], b$alpha[i], b$beta[i]), lty = 2) mtext(side = 3, names(b$beta)[i], cex = .5) } #axis labels mtext(side = 2, outer = T, expression(paste(italic(f),"(",theta[italic(k)],"|",italic(x[k]),")",""^{-1})), line = 2.5) mtext(side = 1, outer = T, expression(paste("(",theta[italic(k)],"|",italic(x[k]),")",""^{-1})), line = 2.5) par(oldpar) ``` ## Konza Prairie For comparison, we now briefly consider Konza Prairie, a more complex intermittent stream network in the northern Flint Hills region of Kansas, USA (39.11394$^\circ$N, 96.61153$^\circ$W) (Fig \@ref(fig:kc)). The network contains 42 nodes and 41 arcs with three major reaches and eight source nodes. Codification of the complete Konza Prairie (and the complete Murphy Creek network) are contained in the *streamDAG* function `streamDAGs`. ```{r kc, fig.cap = "The Kings Cr. watershed in central Kansas.", echo = F} include_graphics("kc.jpg", dpi = 100) ``` ```{r} kon_full <- streamDAGs("konza_full") ``` A `spatial.plot` of the full wetted network is shown in Fig \@ref(fig:k1). ```{r k1, fig.cap = "Spatially explicit graph of the completely wetted Konza Prairie network."} data(kon_coords) spatial.plot(kon_full, kon_coords[,3], kon_coords[,2], names = kon_coords[,1]) ``` When applying the definition of matrix multiplication to an adjacency matrix $\boldsymbol{A}$, the $i,j$ entry in $\boldsymbol{A}^k$ will give the number of paths in the graph from node $i$ to node $j$ of length $k$. The result of the computation of $\boldsymbol{A}^k$ (in paths of length $k$) is provided by `A.mult`. The actual matrix $\boldsymbol{A}^k$ can also be obtained from the function. ```{r} A.mult(kon_full, power = 6, text.summary = TRUE) ``` There are 26 paths of length six in the full Konza network. The complete Konza network has a Strahler stream order of three (Fig \@ref(fig:Kstr)) and a Shreve stream order of nine (Fig \@ref(fig:Kshr)). From either perspective the Murphy Cr. network has a stream order of two. ```{r Kstr, fig.cap = "Strahler numbers for the complete Konza network."} sok <- stream.order(G = kon_full, sink = "SFM01_1", method = "strahler") colk <- as.character(factor(sok, levels = as.character(1:3),labels = topo.colors(3, rev = TRUE))) spatial.plot(G = kon_full, x = kon_coords[,3], y = kon_coords[,2], names = kon_coords[,1], pt.bg = colk, cex = 1.5, cex.text = 0, plot.bg = "white") legend("bottomright", title = "Strahler\nStream order", legend = unique(sok), pch = 21, pt.cex = 1.5, pt.bg = unique(colk), ncol = 1, bty = "n", title.cex = 0.9) ``` ```{r Kshr, fig.cap = "Shreve numbers for the complete Konza network."} sok <- stream.order(G = kon_full, sink = "SFM01_1", method = "shreve") colk <- as.character(factor(sok, levels = as.character(unique(sok)), labels = topo.colors(6, rev = TRUE))) spatial.plot(G = kon_full, x = kon_coords[,3], y = kon_coords[,2], names = kon_coords[,1], pt.bg = colk, cex = 1.5, cex.text = 0, plot.bg = "white") legend("bottomright", title = "Shreve\nStream order", legend = unique(sok), pch = 21, pt.cex = 1.5, pt.bg = unique(colk), ncol = 2, bty = "n", title.cex = 0.9) ``` ### Unweighted Local Measures Increased nodal complexity of the complete Konza network, compared to Murphy Creek, is evident in the local metric summary in Fig \@ref(fig:k2). ```{r k2, fig.cap = "Local graph-theoretic summaries for Konza Prairie."} vis <- multi.path.visibility(kon_full, source = c("SFT01_1", "02M11_1","04W04_2", "04T01_1", "04M13_1","SFT02_1","01M06_1", "20M05_1"), sink = "SFM01_1", autoprint = F) cn <- colnames(vis$complete.matrix) local <- data.frame(local.summary(kon_full)) local.sub <- local[c(3,4,7,8,16,17),] scale.local.sub <- data.frame(t(scale(t(local.sub)))) st <- stack(scale.local.sub) st$vars <- rownames(scale.local.sub) m <- match(cn, V(kon_full)$name) scale.local.sub1<- data.frame(scale.local.sub[,m]) st <- stack(scale.local.sub1) st$vars <- rownames(scale.local.sub1) ggplot(st, aes(y = values, x = ind)) + geom_bar(stat="identity", fill = gray(.1), width = .75) + #geom_hline(aes(yintercept = 0), colour = gray(0.2)) + facet_wrap(~vars) + theme_facet() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size=4.3)) + ylab("Standardized local measures") + xlab("\nNode") ``` The importance of nodes at network convergence points is emphasized in a horizontal visibility graph summary (Fig \@ref(fig:k3)). ```{r k3, fig.cap = "Nodal visibilities for for Konza Prairie based on nodal indegree."} vis <- multi.path.visibility(kon_full, source = c("SFT01_1", "02M11_1","04W04_2", "04T01_1", "04M13_1", "SFT02_1","01M06_1", "20M05_1"), sink = "SFM01_1", autoprint = F) barplot(vis$visibility.summary, las = 2, cex.names = .6, ylab = "Visible nodes", legend.text = c("Downstream", "Upstream", "Both"), args.legend = list(x = "topleft", title = "Direction")) ``` ### Unweighted Global Measures In 2021 the Konza network changed rapidly and dramatically from 05/21/2021 (before spring snow melt) to 05/28/2021 (during spring snow melt) to 06/04/2021 (drying following snow melt) (Fig \@ref(fig:k4)). ```{r k4, fig.cap = "Spatial plot representations for Konza Prairie for (A) 05/21/2021 (B) 05/28/2021, and (C) 06/04/2021."} K0521 <- streamDAGs("KD0521"); K0528 <- streamDAGs("KD0528"); K0604 <- streamDAGs("KD0604") kx <- kon_coords[,3]; ky <- kon_coords[,2]; kn <- kon_coords[,1] oldpar <- par(mfrow = c(3,1), mar = c(0,0,1,0), oma = c(5,4,1,1)) spatial.plot(K0521, kx, ky, kn, xaxt = "n", cex.text = 0) legend("topright", bty = "n", legend = "A", cex = 2) spatial.plot(K0528, kx, ky, kn, xaxt = "n", cex.text = 0) legend("topright", bty = "n", legend = "B", cex = 2) spatial.plot(K0604, kx, ky, kn, cex.text = 0) legend("topright", bty = "n", legend = "C", cex = 2) mtext(side = 1, "Longitude", outer = T, line = 3.5); mtext(side = 2, "Latitude", outer = T, line = 3) par(oldpar) ``` This is reflected in the global graph summaries for those dates (Fig \@ref(fig:k5)). ```{r k5, fig.cap = "Global summaries for Konza Prairie for 05/21/2021, 05/28/2021, and 06/04/2021."} g0521 <- global.summary(K0521, sink = "SFM01_1") g0528 <- global.summary(K0528, sink = "SFM01_1") g0604 <- global.summary(K0604, sink = "SFM01_1") global <- cbind(g0521, g0528, g0604)[-3,] scaled.global <- scale(t(global)) ssg <- stack(data.frame(scaled.global)) ssg$time <- rep(c("5/21/2021","5/28/2021","6/04/2021"),22) ggplot(ssg, aes(y = values, x = time)) + geom_bar(stat="identity", fill = gray(.1), width = .75) + facet_wrap(~ind) + theme_facet() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ylab("Standardized global measures") + xlab("\nNode") ``` # Estimating Arc Stream Presence Outcomes Intermittent stream arc presence / absence data are generally not available because presence / absence data are obtained at particular points in the stream, e.g., nodes. Given a relatively even spatial distribution of nodes, one possibility is to estimate the probability of arc presence as the mean of the presence / absence values of the bounding nodes. Let $x_{k,i}$ be a possible outcome from the $k$th arc, $k = 1,2,\ldots,m$, with bounding nodes $u$ and $v$, for the $i$th time frame, $i=1,2,3,\ldots,n$. There are three possibilities: \[ x_{k,i} = \left\{ {\begin{array}{cc} 1, & \text{both $u$ and $v$ are active (wet)} \\ 0, & \text{both $u$ and $v$ are inactive (dry)} \\ 0.5, & \text{one of $u$ or $v$ is active} \\ \end{array} } \right . \] This conversion is facilitated by the *streamDAG* function `arc.pa.from.nodes` which provides arc activity probabilities (using the rule above) based on bounding node presence / absence values. For instance, below are the 404th and 405th nodal stream presence observations from Murphy Cr. ```{r} mur_node_pres_abs[404:405,] ``` Here we estimate arc probabilities from the nodal data. ```{r} arc.pa.from.nodes(murphy_spring, mur_node_pres_abs[404:405,][,-1]) ``` Here we estimate the marginal arc probabilities and arc correlation structures using the entire `mur_node_pres_abs` dataset. ```{r} conversion <- arc.pa.from.nodes(murphy_spring, mur_node_pres_abs[,-1]) marginal <- colMeans(conversion, na.rm = TRUE) corr <- cor(conversion, use = "pairwise.complete.obs") ``` Impossible correlations (given marginal probabilities) are adjusted with the *streamDAG* function `R.bounds` (see Aho et al. in prep). ```{r} corrected.corr <- R.bounds(marginal, corr) ``` Multivariate Bernoulli outcomes can now be simulated using functions from the package *mipfp* [@barth2018]. ```{r, eval = F} library(mipfp) p.joint.all <- ObtainMultBinaryDist(corr = corrected.corr, marg.probs = marginal, tol = 0.001, tol.margins = .001, iter = 100) out <- RMultBinary(n = 5, p.joint.all, target.values = NULL)$binary.sequences ``` Note that even for relatively small stream networks (e.g., Murphy Cr. with 28 nodes and 27 arcs), the generation of multivariate Bernoulli distributions using `mipfp::ObtainMultBinaryDist` and simulation of multivariate Bernoulli random outcomes using `mipfp::RMultBinary` is computationally cumbersome. Thus, to simplify computational procedures we recommend simulating outcomes only for arcs that demonstrate stream presence spatial dependence, e.g., arcs with outcomes that *are not* always 0 or 1 for an observational period.