
Showcasing Sampling Schemes with Sloths
safeHavens can be installed directly from GitHub; there are no plans to release the package to CRAN, as its user base is relatively niche compared to that of a typical R package.
# `devtools`, or a similar package, is required to install packages from GitHub
# install.packages('devtools')
# devtools::install_github('sagesteppe/safeHavens')

# lots of users have issues with devtools (on Windows, I believe), but `remotes` tends to work well.
# install.packages('remotes') # remotes is very similar and a good alternative for this use case.
remotes::install_github('sagesteppe/safeHavens') # it even uses the same function name!
Once installed, safeHavens can be loaded like any other package, whether it was installed from CRAN or GitHub. We will also load ggplot2 for its excellent ability to plot simple features, and dplyr for pushing some fields around, along with a few other spatial packages used throughout.
library(safeHavens)
library(ggplot2)
library(sf)
library(terra)
library(spData)
library(dplyr)
library(patchwork)
set.seed(23) # MJ ;-)
planar_proj <- '+proj=laea +lon_0=-421.171875 +lat_0=-16.8672134 +datum=WGS84 +units=m +no_defs'
Notes about safeHavens
This package helps germplasm curators communicate areas of interest to collection teams targeting new accessions. It provides seven different sampling approaches for curators to choose from for each individual taxon they hope to process. It also allows for easy integration into existing workflows and produces the required spatial data to share with collection teams.
Each of the approaches is based on standard practices in ecology, and reflects basic tenets of population genetics and Tobler's first law of geography. The methods have various trade-offs in terms of computational and environmental complexity. The table below presents the currently implemented sampling schemes and the user-facing function associated with each. In my mind the first four functions are really flavors of the same process, one whereby we try to partition the species range into geographic chunks of similar sizes. However, as is often the case, four things which seem very similar to me may give enormously different results in implementation. The fifth method, IBDBasedSample, is largely in a class of its own: rather than using the continuity of geographic space as its primary criterion, it focuses on the discontinuity of space, using distance matrices and clustering to determine which patches of range are closer to each other than to other patches. The impetus behind this method is of course Sewall Wright's Isolation by Distance (1943).
The EcoregionBasedSample approach may be the most commonly encountered method in North America; in various formats it drives two major germplasm banking projects in the Midwest and Southeastern United States, and, at a high level, shapes the way numerous native seed collection programs are structured in the West. This method uses environmental variation as an implicit guide to targeting populations for seed collections, i.e. the different ecoregions serve as a stratification agent. In broad strokes, the general thinking is that these regions represent continuous transitions in the environment faced by the species, and populations across these regions will be differently adapted to their environments. Given its relative popularity in implementation, this function has more arguments than its counterparts, which will be discussed below.
The final function, EnvironmentalBasedSample, is both the most computationally expensive and the most environmentally explicit. It relies on a Species Distribution Model, generated via a generalized linear model supported by this package, to cluster populations based on environmental variables related to their observed distributions and on the spatial configuration of, and distance between, them. On paper this draws together all aspects of the above functions; however, no formal testing of this approach has been carried out. It will be discussed in depth below.
Function | Description | Comp. | Envi. |
---|---|---|---|
GridBasedSample | Creates and merges n grids over area | L | L |
PointBasedSample | Creates points to make grids over area | L | L |
EqualAreaSample | Breaks area into similar size pieces | L | L |
OpportunisticSample | Uses PBS with existing records | L | L |
IBDBasedSample | Breaks species range into clusters | H | M |
EcoregionBasedSample | Uses existing ecoregions to sample | L | H |
EnvironmentalBasedSample | Uses correlations from SDM to sample | H | H |

Note in this table 'Comp.' and 'Envi.' refer to computational and environmental complexity respectively, and range from low (L) through medium (M) to high (H).
General notes about this vignette
This package is strongly focused on plants - no disrespect to animals, they just never occurred to me during development. To make it up to animal people, we will use a sloth, which I believe is Bradypus variegatus; it looks both maniacal and mischievous, and like the inspiration for many Jim Henson puppets. However, we will treat Bradypus more like a plant species when we discuss these sampling schemes - that is, we will assume we are focused on organisms which seldom move great distances during dispersal.
We can access some data on Bradypus from the dismo package - presumably short for 'Distribution Modelling' - which was created by Robert Hijmans and others. We will use a couple of other dismo functions in this vignette, and we will utilize some spatial data from it; it's a good package to be familiar with. If you want to read more about dismo and distribution modelling in general, there is an awesome bookdown resource written by Hijmans and Elith - aspects of this vignette are definitely based on it. We will use Species Distribution Models (SDMs) as inputs to many of the arguments in these functions - don't get caught up in the details of making them. These days you can get good SDM results out of essentially automated pipelines; we'll have the truncated process outlined here in a chunk, but I honestly argue you should ignore this section the first few times you read this document, and go back for the details (discussed in the link above) only if you decide to pursue the route of EnvironmentalBasedSample. I may be too optimistic in thinking that any one of these schemes will give adequate results for sample design, but some may be better suited to certain species and projects.
Future users will need to obtain their own occurrence records for use with this package. The R package rgbif is a fantastic way to do this; a well documented article linking to several vignettes for the package is here.
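As a quick illustration, fetching and converting records with rgbif might look something like the sketch below; the species name, record limit, and lack of any record cleaning are purely illustrative, and the call requires an internet connection.

# a minimal sketch of pulling occurrence records with rgbif; real use should
# include coordinate cleaning and duplicate removal
occ <- rgbif::occ_data(scientificName = 'Bradypus variegatus',
                       hasCoordinate = TRUE, limit = 500)
pts <- sf::st_as_sf(occ$data, coords = c('decimalLongitude', 'decimalLatitude'),
                    crs = 4326)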
Species Distribution Modelling (skip me your first few read throughs?!)
If you are interested in visualizing what a sampling scheme under safeHavens's most complex function, EnvironmentalBasedSample, requires, you can explore this section to understand the costs of the process. It goes on for a while, so I recommend checking this section last. In the meantime, if you still want to see the results from EnvironmentalBasedSample compared to the other functions, you can load the results from disk; they are distributed with the package.
x <- read.csv(file.path(system.file(package="dismo"), 'ex', 'bradypus.csv'))
x <- x[,c('lon', 'lat')]
x <- sf::st_as_sf(x, coords = c('lon', 'lat'), crs = 4326)
files <- list.files(
path = file.path(system.file(package="dismo"), 'ex'),
pattern = 'grd', full.names=TRUE )
predictors <- terra::rast(files) # import the independent variables
The goal of most SDMs is to create a model which most accurately predicts where a species will be located in environmental space, and hence geographic space. The goal of the models here is instead to understand the degree, and direction, to which various environmental features correlate with a species' observed range. Accordingly, they are not optimized for making the typical decisions associated with species distribution models, e.g. conservation area planning. Rather, the models supported in this package are solely focused on estimating the effects of various environmental parameters on the species distribution.
sdModel <- elasticSDM(
x = x, predictors = predictors, quantile_v = 0.025,
planar_proj = planar_proj)
We use the caret package to help out with our glmnet modelling; it's not strictly necessary, but it provides output which is very easy to explore and interact with. While much of caret's functionality has been improved on in tidymodels and associated packages by Max Kuhn (also the lead caret author), caret is pretty stable and will serve our purposes fine. What makes an elastic net model interesting is that it is able to bridge the worlds of lasso and ridge regression by blending them via the alpha parameter. Ridge has an alpha of 0, while lasso has an alpha of 1. Lasso regression performs automated variable selection and can actually drop (or 'shrink') variables from a model, while ridge regression keeps all variables and, if correlated features are present, gives some contribution to each of them. An elastic net blends this propensity to drop or retain variables. caret will test a range of alphas to accomplish this.
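The sketch below shows the kind of tuning grid caret evaluates for a glmnet model; the grid values, and the commented train() call with its hypothetical training_data and Presence column, are illustrative assumptions rather than the exact setup inside elasticSDM().

library(caret)

ctrl <- trainControl(method = 'cv', number = 5)
grid <- expand.grid(
  alpha  = seq(0, 1, by = 0.1),            # 0 = ridge, 1 = lasso
  lambda = 10^seq(-4, 0, length.out = 20)  # penalty strength
)

# fit <- train(Presence ~ ., data = training_data, method = 'glmnet',
#              family = 'binomial', trControl = ctrl, tuneGrid = grid)
# fit$bestTune  # the alpha / lambda pair with the best accuracy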
Note that by using this approach we don't exactly get a feel for which single variable is really correlated with an event, but we do at least get a sense of how a set of variables affects the range. Infer from this at your own risk! Inference from SDMs is something I do not recommend except in the most special of circumstances anyway.
sdModel$CVStructure
Here we can see how the elastic net decided on the top model. We used Accuracy rather than Kappa as the main criterion for model selection.
sdModel$ConfusionMatrix
Here we can see how our selected model does at predicting the state of the test data. Note these accuracy results are slightly higher than those from the CV folds. This is not a bug: the CV folds are tested on their own holdouts, while this is a brand new set of holdouts. The main reason the confusion matrix results are likely to be higher is spatial auto-correlation, which is not addressed in a typical 'random split' of test and train data. In order to minimize the effects of spatial auto-correlation on our model we use CAST under the hood, which allows for spatially informed cross validation. Consider the output from CVStructure to be a bit more realistic.
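To give a feel for what spatially informed cross validation looks like in general, here is a generic, self-contained sketch; the toy data frame and its 'block' grouping variable are made up for illustration and are not the exact folds elasticSDM builds internally.

library(CAST)
library(caret)

# toy data: 100 observations already assigned to 10 spatial blocks
toy   <- data.frame(block = rep(1:10, each = 10), y = rnorm(100), x1 = rnorm(100))
folds <- CreateSpacetimeFolds(toy, spacevar = 'block', k = 5)

# folds built this way keep whole blocks together, so a model is never
# evaluated on points sitting right next to its training data
ctrl <- trainControl(method = 'cv', index = folds$index, indexOut = folds$indexOut)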
terra::plot(sdModel$RasterPredictions)
SDMs produce surfaces which display the probability of suitable habitat across a landscape. Many people want a binary prediction surface from them, i.e. is it more or less probable that suitable habitat for this taxon exists in this particular location (grid cell)? Going from a continuous probability surface to a binary surface loses a lot of information, but in many use cases it is essential to reduce computational complexity. If you are interested in the caveats associated with this, Frank Harrell has written extensively on the topic. We will use binary surfaces for implementing our sampling procedures across a species range. The function PostProcessSDM is used for this purpose.
Historically, when assessing probability output, 0.5 was used as the threshold: probabilities beneath it were considered 'Not suitable', while probabilities above it were classified as 'Suitable'. This works well for many use cases, but I argue that thresholding is outside the domain of statistics and in the realm of practice. My motto for implementing models is a slight elaboration of George Box's aphorism: "All models are wrong, some are useful - how do you want to be wrong?". Our goal of sampling for germplasm conservation is to maximize the representation of allelic diversity across the range of a species. In order to do this, we need a good understanding of what the species' actual range is, hence I am happier to predict the species is present where it is not than to predict it is absent where it actually grows. Hence my preferred thresholding statistic is Sensitivity, over any metric which weighs falsely predicted presences. This argument is free to vary and supports any of the threshold values calculated by dismo::threshold; explore it to better understand its options.
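To see the menu of candidate thresholds dismo can compute, you can run something like the sketch below; the probability vectors are simulated placeholders rather than output from the model above.

library(dismo)

# simulated predicted probabilities at presence and (pseudo)absence points
pres_prob <- runif(50, 0.4, 1.0)
abs_prob  <- runif(50, 0.0, 0.6)

ev <- evaluate(p = pres_prob, a = abs_prob) # a dismo ModelEvaluation object
threshold(ev) # one candidate threshold per statistic, including 'sensitivity'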
Until now, you may have been wondering why the function which achieves this is named PostProcessSDM rather than ThresholdSDM; the reason for this perceived discontinuity is that the function performs another process in its second portion. Using all initial occurrence data, both the sets that went into our training and our test data for developing the statistical model, we create 'buffers' around these points to ensure that none of the known occurrence points are 'missing' from the output binary map. This is a double-edged sword, whereby we again address the notion of dispersal limitation and acknowledge that not all suitable habitat is occupied habitat.
What the PostProcessSDM function does is again create cross validation folds and select all of our training data. In each fold it then calculates the distance from each occurrence point to its nearest neighbor. We then summarize these distances and can describe their distribution as quantiles. A selected quantile then serves as a buffer distance. Areas with predicted suitable habitat outside of this buffer become 'cut off' (or masked) from the binary raster map, and areas within the buffer distance of known occurrences which are currently masked are reassigned as suitable. The theory behind this process is underdeveloped and nascent; it largely comes down to the gut of the analyst. With the Bradypus data set I use 0.25 as the quantile, which is saying "Neighbors are generally 100km apart; I am happy with the risk of saying that 25km around each occurrence is occupied suitable habitat". Increasing this value to say 1.0 will mean that no suitable habitat is removed, while decreasing it makes maps more conservative. The cost of increasing the distances greatly is that the sampling methods may then put grids in many areas without populations to collect from.
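Before running the function, here is a minimal sketch of the nearest-neighbour quantile idea itself; it is an illustration of the concept using the occurrence points loaded earlier, not the internal code of PostProcessSDM.

# nearest-neighbour distances between occurrences, summarised as a quantile
occ_planar <- sf::st_transform(x, planar_proj)
d <- sf::st_distance(occ_planar) |> units::drop_units() # pairwise distances, metres
diag(d) <- NA                                           # ignore self-distances
nn <- apply(d, 1, min, na.rm = TRUE)                    # each point's nearest neighbour
quantile(nn, probs = 0.25) / 1000                       # the 25% quantile, in kilometres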
threshold_rasts <- PostProcessSDM(
rast_cont = sdModel$RasterPredictions,
test = sdModel$TestData,
planar_proj = planar_proj,
thresh_metric = 'sensitivity', quant_amt = 0.5)
We can compare the results of applying this function side by side using its output.
terra::plot(threshold_rasts$FinalRasters)
glmnet is used for three main reasons: 1) it gives directional coefficients, so we get a feel for how each one-unit increase in an independent variable predicts the response - in my mind a big improvement over 'variable importance factors', where we only know that certain variables contributed more to the model than others; 2) it maintains some degree of automated variable selection, reducing the work an analyst needs to do, i.e. you can process many species without spending too much time on any single one; and 3) glmnet re-scales all variables before model generation, which I suppose could be implemented with other models as well. We will use the same re-scaling that glmnet does to transform our independent variables in the raster stack and then multiply them by their beta coefficients. In this way our raster stack becomes representative of our model, and we can then use these values as the basis for hierarchical clustering later on.
# CREATE A COPY OF THE RASTER PREDICTORS WHERE WE HAVE
# STANDARDIZED EACH VARIABLE - SO IT IS EQUIVALENT TO THE INPUT TO THE GLMNET
# FUNCTION, AND THEN MULTIPLIED IT BY ITS BETA COEFFICIENT FROM THE FITTED MODEL
# we will also write out the beta coefficients using writeSDMresults right after
# this.
rr <- RescaleRasters(
model = sdModel$Model,
predictors = sdModel$Predictors,
training_data = sdModel$TrainData,
pred_mat = sdModel$PredictMatrix)
terra::plot(rr$RescaledPredictors)
We can see that the variables are 'close' to being on the same scale; this will work in a clustering algorithm. If any of the layers are all the same color (maybe yellow?), that means they have no variance - that term was shrunk out of the model. It will be dealt with internally in a later function. The scales of the variables are not exact, because they are weighted by their coefficients in the model.
print(rr$BetaCoefficients)
We can also look at the coefficients for each variable. glmnet returns the 'untransformed' coefficients, i.e. on the same scale as the input rasters; we calculate the beta coefficients (BC) right afterwards.
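As a concrete illustration of what 'representative of the model' means for a single layer, the sketch below standardizes one predictor and weights it by a coefficient; the beta value is made up, and in practice the mean and standard deviation come from the training data rather than the whole raster.

# standardize one predictor and weight it by a (hypothetical) model coefficient
mu    <- terra::global(predictors[['bio1']], 'mean', na.rm = TRUE)[1, 1]
sigma <- terra::global(predictors[['bio1']], 'sd',   na.rm = TRUE)[1, 1]

beta_bio1 <- -0.8  # hypothetical coefficient, for illustration only
rescaled_bio1 <- (predictors[['bio1']] - mu) / sigma * beta_bio1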
safeHavens generates all kinds of objects as it runs through the functions elasticSDM, PostProcessSDM, and RescaleRasters. Given that one of these sampling schemes may be followed for quite some time, I think it is best practice to save many of these objects. Yes, they will take up some storage space, but storage is virtually free these days anyway. So why not write out all of the following items? Here I write them to a directory that exists for the project associated with the creation of this R package; just save them somewhere and delete them afterwards. These test files are tiny anyway.
bp <- '~/Documents/assoRted/StrategizingGermplasmCollections'
writeSDMresults(
cv_model = sdModel$CVStructure,
pcnm = sdModel$PCNM,
model = sdModel$Model,
cm = sdModel$ConfusionMatrix,
coef_tab = rr$BetaCoefficients,
f_rasts = threshold_rasts$FinalRasters,
thresh = threshold_rasts$Threshold,
file.path(bp, 'results', 'SDM'), 'Bradypus_test')
# we can see that the files were placed here using this.
list.files( file.path(bp, 'results', 'SDM'), recursive = TRUE )
And there you have it, all the steps to make a species distribution model - or rather to get the coefficients from a species distribution model! We will play around with it as an example data set to compare our buffered distance results to at the end.
Alternative to SDM, just buffer! (start reading here your first few times!)
Creating the SDMs is a whole process, and not all users may want to go down that road. Here we provide an approach which will give us results we can work with and is computationally cheap. It relies on one input, the same occurrence data as above, and a simple process: drawing a circle at a specific radius around each observation point! Here we are showing the second of two major packages which this package depends on, sf. sf relies on simple feature geometries and is now the workhorse of nearly all vector data analysis in R. It is very well documented and tested; if you are unfamiliar with it, you should consider it the dplyr of the spatial data ecosystem in R. In this vignette I use :: notation so you can get an idea of where functions are coming from, and you'll notice a lot of sf::st_*.
planar_proj <-
  '+proj=laea +lon_0=-421.171875 +lat_0=-16.8672134 +datum=WGS84 +units=m +no_defs'

x <- read.csv(file.path(system.file(package="dismo"), 'ex', 'bradypus.csv'))
x <- x[,c('lon', 'lat')]
x <- sf::st_as_sf(x, coords = c('lon', 'lat'), crs = 4326)

x_buff <- sf::st_transform(x, planar_proj) |> # planar metric coordinates, so the
  sf::st_buffer(125000) |>                    # buffer is 125,000 m, i.e. 125 km
  sf::st_as_sfc() |>
  sf::st_union()

plot(x_buff)
Voila! That's the whole process. Just play around with the distances until you get something which looks OK. How do I define 'OK'? A feeling in the gut that your boots are tingling. Note that while the buffered polygons may look a little nebulous now, we'll have them on maps with some more context in just a couple of steps.
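One quick way to 'play around' is to try a few radii and compare the area of the resulting range polygons; the radii below are arbitrary choices for illustration.

# compare the buffered range area at a few candidate radii (in metres)
dists <- c(50, 125, 250) * 1000
areas <- sapply(dists, function(d) {
  sf::st_transform(x, planar_proj) |>
    sf::st_buffer(d) |>
    sf::st_union() |>
    sf::st_area()
})
data.frame(radius_km = dists / 1000, area_km2 = round(as.numeric(areas) / 1e6))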
Just load the SDM!
The package also has an SDM prediction saved in its data; we can load that too for a couple of comparisons.
sdm <- terra::rast(file.path(system.file(package="safeHavens"), 'extdata', 'Bradypus_test.tif'))
terra::plot(sdm)
Now prep some data for visualizing the results
While not necessary, we are going to add some context to our maps which should help you interpret the results of the various functions from the package. We will use the spData package, which uses naturalearth data for its world data set and is suitable for creating effective maps at a variety of resolutions.
x_extra_buff <- sf::st_buffer(x_buff, 100000) |> # add a buffer to 'frame' the maps
sf::st_transform(4326)
americas <- spData::world
americas <- sf::st_crop(americas, sf::st_bbox(x_extra_buff)) |>
dplyr::select(name_long)
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
bb <- sf::st_bbox(x_extra_buff)
map <- ggplot() +
geom_sf(data = americas) +
theme(
legend.position = 'none',
panel.background = element_rect(fill = "aliceblue"),
panel.grid.minor.x = element_line(colour = "red", linetype = 3, linewidth = 0.5),
axis.ticks=element_blank(),
axis.text=element_blank(),
plot.background=element_rect(colour="steelblue"),
plot.margin=grid::unit(c(0,0,0,0),"cm"),
axis.ticks.length = unit(0, "pt"))+
coord_sf(xlim = c(bb[1], bb[3]), ylim = c(bb[2], bb[4]), expand = FALSE)
rm(x_extra_buff, americas)
Running the Various Sample Design Algorithms
Now that we have some data which can represent species ranges, we can run the various sampling approaches. The table in the introduction is reproduced here.
Function | Description | Comp. | Envi. |
---|---|---|---|
GridBasedSample | Creates and merges n grids over area | L | L |
PointBasedSample | Creates points to make pieces over area | L | L |
EqualAreaSample | Breaks area into similar size pieces | L | L |
OpportunisticSample | Uses PBS with existing records | L | L |
IBDBasedSample | Breaks species range into clusters | H | M |
EcoregionBasedSample | Uses existing ecoregions to sample | L | H |
EnvironmentalBasedSample | Uses correlations from SDM to sample | H | H |

Note in this table 'Comp.' and 'Envi.' refer to computational and environmental complexity respectively, and range from low (L) through medium (M) to high (H).
Grid Based and Point Based Sample
Ecologists love grids. All of us are taught to love grids for sampling. Grids are very useful for sampling contiguous things. Species ranges are often not contiguous; curators and analysts in our geographically grand ecosystems, e.g. the steppes, prairies, tundra, and taiga, might find these useful.
You can look at the output below to see why grids are not great for this type of problem.
The first step in grid sampling is determining a reasonable number of grids to try and draw as a starting point; if we want 20 collections we will need more than 20 grids, because several will be merged into larger ones. Using the aspect ratio of a simple bounding box around the area we will be analyzing, the function will determine a default number of grids ('Original') for testing. Using these defaults it will create a few other sets of grids as well, by adding or removing one or two grids per direction. Theoretically you could automate grid selection by comparing the number of grids and the minimization of variance. To be safe, I wouldn't consider configurations which generate fewer than 25 of these initial grids.
tgs <- TestGridSizes(x_buff)
print(tgs)
#> Name Grids Variance GridNOx GridNOy
#> 1 Smallest 34 599.9513 8 6
#> 2 Smaller 28 813.2043 7 5
#> 3 Original 24 862.5458 6 4
#> 4 Larger 18 681.7667 5 3
#> 5 Largest 13 1182.8457 4 2
plot(tgs$Grids, tgs$Variance, xlab = 'Grid Number', ylab = 'Variance',
main = 'Number of grids and areas overlapping species range')
text(tgs$Grids, tgs$Variance + 25, labels=tgs$Name, cex= 0.7)
abline(v=20, col="red")
abline(v=25, col="orange")
tgs <- tgs[tgs$Name=='Smaller',]
Essentially we need more than 20 grids, but realistically (albeit from limited informal testing) around 25 grids - depending on the complexity of the species range - tends to be an effective floor. In the table and plot above I opt for the 'Smaller' option, with 28 grids generated by prompting sf::st_make_grid with 7 grids in the X direction and 5 in the Y direction. You can kind of think about this like an elbow plot, but the samples are so few you won't get the characteristic shape.
grid_buff <- GridBasedSample(x_buff, planar_proj, gridDimensions = tgs)
gbs.p <- map +
geom_sf(data = grid_buff, aes(fill = factor(ID))) +
# geom_sf_label(data = grid_buff, aes(label = Assigned), alpha = 0.4) + # uncomment on your computer; doesn't work at vignette size
labs(title = 'Grids') +
coord_sf(expand = F)
gbs.p
With the grids, we drew the pre-specified number of grids across the species range and then merged them together as required to get the results. We will essentially do the inverse in this step: rather than drawing boundaries - i.e. grid cells - we will draw the centers. This allows the features to 'grow' a little more naturally. I also think these results work a little better on a fragmented range; there is still some odd clipping, where minor portions of a section of range are assigned to a different grid, but in general the fit is a little better.
pbs <- PointBasedSample(x_buff)
pbs.sf <- pbs$Geometry
pbs.p <- map +
geom_sf(data = pbs.sf, aes(fill = factor(ID))) +
# geom_sf_label(data = pbs.sf, aes(label = ID), alpha = 0.4) +
labs(title = 'Point') +
coord_sf(expand = F)
pbs.p
Equal Area Sample
Perhaps the simplest method offered in safeHavens is EqualAreaSample. It simply creates many points (pts defaults to 5000) within our target polygon and then subjects them to k-means clustering, where the number of groups is specified by n, our target number of collections. The points assigned to these groups have polygons drawn around them which 'take up' all of the map space, are intersected back to the species range, and the area of each polygon is then measured. This process is run a few times, defaulting to 100 reps, and the set of polygons created during these reps with the smallest variance in polygon size is returned from the function.
This differs from point based sampling in that, in the above instance, we start with a few regularly spaced points to grow from; here we take a step back and, by using many points, let the clusters grow themselves to similar sizes.
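To make that concrete, here is a conceptual sketch of the idea; it is an illustration, not the internals of EqualAreaSample, and it skips the repeated runs and the variance comparison.

# many random points, clustered with k-means, then dissolved Voronoi cells
pts <- sf::st_sample(x_buff, 2000) |> sf::st_as_sf()
pts$grp <- kmeans(sf::st_coordinates(pts), centers = 20)$cluster

vor <- sf::st_voronoi(sf::st_union(pts)) |>     # one cell per point
  sf::st_collection_extract('POLYGON') |>
  sf::st_as_sf()
vor$grp <- pts$grp[unlist(sf::st_intersects(vor, pts))] # cell inherits its point's cluster

pieces <- vor |>
  dplyr::group_by(grp) |>
  dplyr::summarise() |>          # dissolve cells by cluster
  sf::st_intersection(x_buff)    # clip back to the species range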
eas <- EqualAreaSample(x_buff, planar_projection = planar_proj)
#> Warning: did not converge in 10 iterations
eas.p <- map +
geom_sf(data = eas$Geometry, aes(fill = factor(ID))) +
# geom_sf_label(data = eas.sf, aes(label = ID), alpha = 0.4) +
labs(title = 'Equal Area') +
coord_sf(expand = F)
eas.p
To me the results look quite similar to the point based sample.
Opportunistic Sample
To be blunt, while there are many new players at the germplasm conservation table (and we are thrilled to have you here!), many existing collections have largely grown out of opportunity (which we are still very happy about). Many curators may be interested in how much they can embed their existing collections into a sampling framework. The function OpportunisticSample makes a few very minor modifications to the point based sample to try and make the most of an existing collection. It doesn't always work exceptionally well, especially when a couple of collections are very close to each other, but it may be a beneficial tool to have in the belt. As we have observed, the three previous sampling schemes end up with somewhat similar results, so we used PointBasedSample as the framework this function is embedded in - it was also the easiest function to work this into!
Essentially this takes the point based sampling approach, but forces the clusters to be based around the existing accessions. It attempts to 'center' the existing collections within clusters, but this can be nearly impossible for a variety of reasons.
exist_pts <- sf::st_sample(x_buff, size = 10) |>
sf::st_as_sf() |> # ^^ just randomly sampling 10 points in the species range
dplyr::rename(geometry = x)
os <- OpportunisticSample(polygon = x_buff, n = 20, collections = exist_pts)
os.p <- map +
geom_sf(data = os$Geometry, aes(fill = factor(ID))) +
# geom_sf_label(data = os.sf, aes(label = ID), alpha = 0.4) +
geom_sf(data = exist_pts, alpha = 0.4) +
labs(title = 'Opportunistic') +
coord_sf(expand = F)
os.p
As you can see here, the grids have been aligned around the points. This can lead to some funky clusters, but a bird in the hand is worth two in the bush.
Isolation by Distance Based Sample
Isolation by Distance and, to a much lesser extent, Tobler's first law of geography are the fundamental ideas driving this package. While the above sampling schemes are implicitly based around the former idea, the latter is the most essential for germplasm conservation. This function explicitly uses IBD to develop a sampling scheme, and does not obfuscate it with any other parameters.
# Note that for this process we need raster rather than vector data; to accomplish
# this we will 'rasterize' the vector using terra (this can also be done with
# 'fasterize'). Whenever we rasterize a product, we need to provide a template
# raster for our vector to inherit the cell size, coordinate system, etc. from.
files <- list.files(
  path = file.path(system.file(package="dismo"), 'ex'),
  pattern = 'grd', full.names = TRUE)
predictors <- terra::rast(files)

x_buff.sf <- sf::st_as_sf(x_buff) |>
  dplyr::mutate(Range = 1) |>
  sf::st_transform( terra::crs(predictors))

# here we specify the field/column with the variable we want as an attribute of our raster
v <- terra::rasterize(x_buff.sf, predictors, field = 'Range')

# now we run the function, requesting 20 areas to make accessions from
ibdbs <- IBDBasedSample(x = v, n = 20, fixedClusters = TRUE, template = predictors)
ibdbs.p <- map +
geom_sf(data = ibdbs, aes(fill = factor(ID))) +
# geom_sf_label(data = os.sf, aes(label = ID), alpha = 0.4) +
labs(title = 'IBD') +
coord_sf(expand = F)
ibdbs.p
rm(predictors, files, v, x_buff.sf, exist_pts, os)
Because these data were processed from a raster, the clusters have sharp edges representing raster cells. However, it should immediately be evident that the borders of the clusters look more natural than in the previous (and following) sampling schemes.
Ecoregion Based Sample
As mentioned, this is by far the most commonly implemented method for guiding native seed collection. However, I am not sure exactly how practitioners implement it, or whether the formats of application are consistent among practitioners! For these reasons a few different sets of options are supported for the user.
For general usage, two parameters are always required: x, the species range as an sf object, and ecoregions, the sf object containing the ecoregions of interest. The ecoregions file does not need to be subset to the range of x quite yet - the function will take care of that. Additional arguments include, as usual, n to specify how many accessions we are looking for in our collection. Two further arguments relate to whether we are using Omernik Level 4 ecoregion data or ecoregions (or biogeographic regions) from another source. These are OmernikEPA and ecoregion_col; if you are using the official EPA release of ecoregions then both of these are optional, however if you are not using the EPA product then both should be supplied - though only the ecoregion_col argument is strictly necessary. This column should contain unique names for the highest resolution ecoregion level you want to use from the data set; for many data sets, such as the 'neo_eco' example we use here, this may be the only field with ecoregion-level information!
neo_eco <- sf::st_read(
file.path(system.file(package="safeHavens"), 'extdata', 'NeoTropicsEcoregions.gpkg'),
quiet = TRUE) |>
dplyr::rename(geometry = geom)
head(neo_eco[,c(1, 3, 4, 6, 11)])
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -103.0432 ymin: -31.25308 xmax: -34.79344 ymax: 26.91751
#> Geodetic CRS: WGS 84
#> Provincias Region Dominio
#> 1 Araucaria Forest province Neotropical Parana
#> 2 Atacama province Neotropical <NA>
#> 3 Atlantic province Neotropical Parana
#> 4 Bahama province Neotropical <NA>
#> 5 Balsas Basin province Neotropical Mesoamerican
#> 6 Caatinga province Neotropical Chacoan
#> Subregion geometry
#> 1 Chacoan MULTIPOLYGON (((-53.58012 -...
#> 2 South American Transition Zone MULTIPOLYGON (((-69.42981 -...
#> 3 Chacoan MULTIPOLYGON (((-48.41217 -...
#> 4 Antillean MULTIPOLYGON (((-77.58593 2...
#> 5 Brazilian MULTIPOLYGON (((-97.37265 1...
#> 6 Chacoan MULTIPOLYGON (((-35.56652 -...
x_buff <- sf::st_transform(x_buff, sf::st_crs(neo_eco))
ebs.sf <- EcoregionBasedSample(x_buff, neo_eco, OmernikEPA = FALSE, ecoregion_col = 'Provincias')
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
# for plotting let's crop it to the other objects
ebs.sf <- st_crop(ebs.sf, bb)
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
ebs.p <- map +
geom_sf(data = ebs.sf, aes(fill = factor(n))) +
labs(title = 'Ecoregion') +
coord_sf(expand = F)
ebs.p
This output differs from the others we will see: here we have depicted the number of collections to be made per ecoregion. Because the number of ecoregions is greater than our requested sample size, the return object can only take on two values - no collections, or one collection.
Environmental Based Sample
The environmental based sample can only be conducted if you have the species distribution model data. Included in the package's data directory are all of the objects required to run this example for the species. We will load them here.
sdModel <- readRDS(
file.path(system.file(package="safeHavens"), 'extdata', 'sdModel.rds')
)
sdModel$Predictors <- terra::rast(
file.path(system.file(package="safeHavens"), 'extdata', 'Predictors.tif')
)
Once these data are loaded into R, we will scale the rasters (using RescaleRasters), which will serve as the surfaces to predict from (this is also done above!), and then run the algorithm (EnvironmentalBasedSample). However, before we run the algorithm we will need to create a directory (also called a 'folder') on our computers to save the results from EnvironmentalBasedSample. Whereas earlier in this vignette generating the species distribution model and saving the results were a two stage process (to create the SDM and associated products we used elasticSDM, PostProcessSDM, and RescaleRasters, before finally saving relevant data with writeSDMresults), this function produces the product and writes out the ancillary data simultaneously.
This approach was chosen because the function only writes out four objects: 1) the groups as vector data, 2) the groups as raster data, 3) the k-nearest neighbors (knn) model used to generate these clusters, and 4) the confusion matrix associated with testing the knn model.
rr <- RescaleRasters( # you may have already done this!
model = sdModel$Model,
predictors = sdModel$Predictors,
training_data = sdModel$TrainData,
pred_mat = sdModel$PredictMatrix)
# create a directory to hold the results from EBS real quick.
# we will default to placing it in your current working directory.
# If you are a data management freak don't worry too much about this.
# The code to remove the directory will be provided below.
getwd() # this is where the folder will be located IF YOU DON'T RUN the code below.
#> [1] "/home/runner/work/safeHavens/safeHavens/vignettes"
p <- file.path('~', 'Documents') # in my case I'll dump it in Documents real quick, this should work on
# Linux and Mac, but I don't think Windows?
# dir.create(file.path(p, 'safeHavens-Vignette')) # now create the directory.
ENVIbs <- EnvironmentalBasedSample(
pred_rescale = rr$RescaledPredictors,
write2disk = FALSE,
path = file.path(p, 'safeHavens-Vignette'), # we are not writing, but showing how to provide argument
taxon = 'Bradypus_test',
f_rasts = sdm, n = 20,
lyr = 'Supplemented',
fixedClusters = TRUE,
n_pts = 500,
planar_projection = planar_proj,
buffer_d = 3, prop_split = 0.8)
#> Joining with `by = join_by(x, y)`
#> Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
#> give correct results for longitude/latitude data
ENVIbs.p <- map +
geom_sf(data = ENVIbs, aes(fill = factor(ID))) +
#geom_sf_label(data = ENVIbs, aes(label = ID), alpha = 0.4) +
labs(title = 'Environmental') +
coord_sf(expand = FALSE)
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.
ENVIbs.p
The function EnvironmentalBasedSample can take any of the three binary rasters created by PostProcessSDM as arguments for the template. Here we showcase the different results from using each of them.
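A sketch of how that comparison could be generated is below; it is left commented out because each run is fairly slow, and it loops over whichever layers the saved raster actually contains rather than assuming their names.

# rerun EnvironmentalBasedSample once per binary layer in the saved SDM raster
# comparisons <- lapply(names(sdm), function(l)
#   EnvironmentalBasedSample(
#     pred_rescale = rr$RescaledPredictors, write2disk = FALSE,
#     path = file.path(p, 'safeHavens-Vignette'), taxon = 'Bradypus_test',
#     f_rasts = sdm, n = 20, lyr = l, fixedClusters = TRUE, n_pts = 500,
#     planar_projection = planar_proj, buffer_d = 3, prop_split = 0.8))
# names(comparisons) <- names(sdm)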
These plots showcase the difference in results depending on which of the three input rasters is utilized. As with all of the sampling schemes, results vary widely based on the spatial extent the functions are applied to. Using the SDM output which has undergone thresholding results in the largest classified area. At first glance the results may seem very different, but if you look at Central America they are largely consistent, as they are near the Andes; large differences do exist in the Amazon Basin, but even there some alignment between the systems is evident. Accordingly, the surface used for a species should match some evaluation criterion.
Using the threshold raster surface is a very good option if we do not want to 'miss' too many areas, whereas the clipped and supplemented options may be better suited for scenarios where we do not want to draw up clusters which lack any populations that can be collected from.
Comparison of different sampling schemes
So, we've got some maps for you to look at! They all look relatively similar to me when plotted one after another; let's plot them all simultaneously and see if that's still the case.
gbs.p + pbs.p + eas.p + os.p + ibdbs.p + ebs.p + ENVIbs.p +
plot_layout(ncol = 3)
Again, the top three figures appear quite similar, with the Opportunistic method only deviating slightly from them. In my mind isolation by distance (IBD) shows the biggest difference; it seems to have made the most sense of the naturally occurring patchiness of the species range. Ecoregion SEEMS…. Environmental also seems to partition the feature space quite well, notably drawing a couple of clusters in the Pacific lowlands and the Northern Andes.