Overview
We have already shown that beta-coefficients from species
distribution models can be used in hierarchical clustering to group
species populations into clusters that are more similar and better
aligned with their ecology and climate. While running
EnvironmentalBasedSample, we also focus on optimising the
species-occupied range using Moran eigenvector matrix surfaces (MEMs, or
PCNMs). In this alternative approach, we drop the PCNMs from the SDMs,
focus solely on environmental predictors, identify clusters in the
current environmental space, and then classify
future environmental (and geographic) space using these
cluster parameters. This gives us sets of areas with populations that
may be more pre-adapted to future conditions. In addition to
transferring the realised environmental clusters of populations forward,
we also identify novel climate spaces, using MESS (Elith 2010) and
hierarchical clustering. This results in n identified clusters, and we
can traverse the clustering branches to determine which existing
clusters are most similar to the novel clusters.
This is the sole workflow in safeHavens that seeks to
link current climates to future scenarios, directly suggest how the size
of clusters may change and where they will shift, and allow a germplasm
curator to identify relevant seed sources for predictive provenancing.
The approach employed here is adequate to guide current collections;
decisions on which sources to use at a particular restoration site are
relatively well defined by tools such as the Seed Lot Selection Tool in
North America.
Analysis
This workstream leans heavily on the
EnvironmentalBasedSample workflow and passes its final
objects off to PolygonBasedSample for completion. Please
review Species Distribution Models and
Getting Started before proceeding here. ### Data prep
This workflow will require the geodata package;
geodata is used to retrieve climate data from various
sources as raster data. In this vignette, we will use Worldclim. For most safeHavens
users worldwide, this may be your ideal source of climate data.
geodata is developed and maintained by Robert Hijman,
similar to terra, and dismo, which are doing a
ton of heavy lifting internally on our functions.
library(safeHavens)
library(sf) ## vector operations
library(terra) ## raster operations
library(geodata) ## environmental variables
library(dplyr) ## general data handling
library(tidyr) ## general data handling
library(ggplot2) ## plotting
library(patchwork) ## multiplots
set.seed(22)For this example, we will leverage GBIF data and shift our geographic focus to the Colorado Plateau in the Southwestern USA. The Colorado Plateau is predicted to experience some of the highest rates of climate change on the planet, and is already an area with considerable native seed development activities. For our focal species, we will use Helianthella microcephala for no other reason than that it exists in relatively homogenous portions of the Colorado Plateau - meaning we can download relatively coarse rasters for the vignettes, and that it has a stunning inflorescence.
cols = c('decimalLatitude', 'decimalLongitude', 'dateIdentified', 'species', 'acceptedScientificName', 'datasetName',
'coordinateUncertaintyInMeters', 'basisOfRecord', 'institutionCode', 'catalogNumber')
## download species data using scientificName, can use keys and lookup tables for automating many taxa.
hemi <- rgbif::occ_search(scientificName = "Helianthella microcephala")
hemi <- hemi[['data']][,cols] |>
drop_na(decimalLatitude, decimalLongitude) |> # any missing coords need dropped.
distinct(decimalLatitude, decimalLongitude, .keep_all = TRUE) |> # no dupes can be present
st_as_sf(coords = c( 'decimalLongitude', 'decimalLatitude'), crs = 4326, remove = F)
western_states <- spData::us_states |> ## for making a quick basemap.
dplyr::filter(NAME %in%
c('Utah', 'Arizona', 'Colorado', 'New Mexico', 'Wyoming', 'Nevada', 'Idaho', 'California')) |>
dplyr::select(NAME, geometry) |>
st_transform(4326)
bb <- st_bbox(
c(
xmin = -116,
xmax = -105,
ymax = 44,
ymin = 33.5),
crs = st_crs(4326)
)
western_states <- st_crop(western_states, bb)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
bmap <- ggplot() +
geom_sf(data = western_states) +
geom_sf(data = hemi) +
theme_minimal() +
coord_sf(
xlim = c(bb[['xmin']], bb[['xmax']]),
ylim = c(bb[['ymin']], bb[['ymax']])
) +
theme(
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)
bmap +
labs(title = 'Helianthella microcephala\noccurrence records')
We will download WorldClim data using geodata. The
results from worldclim_global should match what we loaded
from dismo in the various sloth examples. For our future scenario, we
will use the CMIP6 modelled
future climate data for the 2041-2060 time window. We will download the
coarse 2.5-minute resolution data, ca. ~70km at the equator. Do note
that the source also includes a ~1km at-equator resolution data set (30
seconds), but this would take too long for the vignette.
# Download WorldClim bioclim at ~10 km
bio_current <- worldclim_global(var="bioc", res=2.5)
Cached as: /tmp/RtmpRKmU90/climate/wc2.1_2.5m//wc2.1_2.5m_bio.zip
bio_future <- cmip6_world(
model = "CNRM-CM6-1", ## modelling method
ssp = "245", ## "Middle of the Road" scenario
time = "2041-2060", # time period
var = "bioc", # just use the bioclim variables
res = 2.5
)
Cached as: /tmp/RtmpRKmU90/climate/wc2.1_2.5m//wc2.1_2.5m_bioc_CNRM-CM6-1_ssp245_2041-2060.tif
# Crop to domain - use a large BB to accomodate range shift
# under future time points.
# but going too large will dilute the absence records
bbox <- ext(bb)
bio_current <- crop(bio_current, bbox)
bio_future <- crop(bio_future, bbox)safeHavens does not provide explicit functionality to
rename / align raster surface naming. However, the modelling process
requires that both current and future raster products have perfectly
matching raster names.
The chunk below shows how we will standardise the names by extracting the bioclim variable number at the end of the name and padding ‘bio_’ back to the front with a leading zero, so we have ‘bio_01’ instead of ‘bio_1’. A minor variation of this should work for most data sources, after customising the gsub to erase earlier portions of the file name.
simplify_names <- function(x){
paste0('bio_', sprintf("%02d", as.numeric(gsub('\\D+','', names(x)))))
}
names(bio_current) <- gsub('^.*5m_', '', names(bio_current))
names(bio_future) <- gsub('^.*2060_', '', names(bio_future))
names(bio_current) <- simplify_names(bio_current)
names(bio_future) <- simplify_names(bio_future)
# TRUE means all names match, and are in the same position.
all(names(bio_current) == names(bio_future))
[1] TRUE
## we will also drop some variables that are essentially identical in the study area
drops <- c(
'bio_05' , 'bio_02',
'bio_11', 'bio_12'
)
bio_current <- subset(bio_current, negate = TRUE, drops)
bio_future <- subset(bio_future, negate = TRUE, drops)If you are following the vignette locally, you will find that
plot(bio_current) and plot(bio_future) look
very similar. While the plots look very similar when showing one raster
stack after another, we can diff the two products to see where the
largest changes occur in geographic space.
pct_diff <- function(x, y){((x - y)/((x + y)/2)) * 100}
difference <- pct_diff(bio_current, bio_future)
plot(difference)
We see that variables exhibit inconsistencies in where they change and to what extent. This mismatch of conditions will almost certainly lead to novel climate conditions - unique combinations not yet known to this species. This will be handled using MESS surfaces and a second-stage clustering approach in the function.
Analytical workstream
The workflow for predictive provenance builds upon the
EnvironmentalBasedSample workflow, relying on many of the
same internal helpers and user-facing functions. Note that when using
elasticSDM, we need to specify pcnm = FALSE to
bypass fitting a model with PCNM surfaces; there is no analogue for
these in future scenarios, so they must be dropped.
Fit SDM and rescale raster surfaces
The starting point for this analysis is again fitting a quick SDM for our species.
## note we subset to just the geometries for the prediction records.
# this is because we feed in all columns to the elastic net model internally.
hemi <- select(hemi, geometry)
# as always verify our coordinate reference systems (crs) match.
st_crs(bio_current) == st_crs(hemi)
[1] TRUE
# and fit the elasticnet model
eSDM_model <- elasticSDM(
x = hemi,
predictors = bio_current,
planar_projection = 5070,
PCNM = FALSE ## set to FALSE for this workstream!!!!!
)
# Test with just 3 runs to see variability quickly
run1 <- elasticSDM(hemi, bio_current, 5070)
run2 <- elasticSDM(hemi, bio_current, 5070)
run3 <- elasticSDM(hemi, bio_current, 5070)
# Compare coefficients
coef1 <- as.matrix(coef(run1$Model))
coef2 <- as.matrix(coef(run2$Model))
coef3 <- as.matrix(coef(run3$Model))
# Quick comparison
all_vars <- unique(c(rownames(coef1), rownames(coef2), rownames(coef3)))
compare <- data.frame(
var = all_vars,
run1 = coef1[match(all_vars, rownames(coef1)), 1],
run2 = coef2[match(all_vars, rownames(coef2)), 1],
run3 = coef3[match(all_vars, rownames(coef3)), 1]
)
compare[is.na(compare)] <- 0
compare$cv <- apply(compare[,-1], 1, function(x) sd(x)/mean(x))
print(compare)
var run1 run2 run3 cv
1 (Intercept) 40.39976332 4.870494e+00 4.870494e+00 1.2273151
2 bio_15 -0.21638898 -4.935768e-03 -4.935768e-03 -1.6186992
3 bio_06 -1.79013711 0.000000e+00 0.000000e+00 -1.7320508
4 bio_14 -0.40912337 0.000000e+00 0.000000e+00 -1.7320508
5 bio_19 -0.05222452 0.000000e+00 0.000000e+00 -1.7320508
6 bio_16 0.07622092 0.000000e+00 0.000000e+00 1.7320508
7 bio_18 -0.02635047 0.000000e+00 0.000000e+00 -1.7320508
8 bio_07 -0.62896352 0.000000e+00 0.000000e+00 -1.7320508
9 bio_03 -0.57492789 -1.373735e-01 -1.373735e-01 -0.8919489
10 bio_09 0.03504658 0.000000e+00 0.000000e+00 1.7320508
11 bio_01 1.76209086 0.000000e+00 0.000000e+00 1.7320508
12 bio_04 -0.02115389 4.134668e-04 4.134668e-04 -1.8377448
13 bio_17 0.06551926 0.000000e+00 0.000000e+00 1.7320508
14 PCNM1 9.67291420 0.000000e+00 0.000000e+00 1.7320508
15 PCNM2 -0.17546042 -1.202533e+01 -1.202533e+01 -0.8472085
16 bio_13 0.00000000 0.000000e+00 0.000000e+00 NaN
17 bio_08 0.00000000 2.704826e-02 2.704826e-02 0.8660254
18 PCNM4 0.00000000 1.266518e+01 1.266518e+01 0.8660254
19 PCNM7 0.00000000 1.070483e+01 1.070483e+01 0.8660254
20 PCNM9 0.00000000 0.000000e+00 0.000000e+00 NaNWe will use the eSDM_model function for this workstream; however, it is essential that PCNM is set to FALSE. There is no way to forecast or predict PCNM for future scenarios, so we cannot fit models with it.
If a warning is emitted here, it is likely due to collinearity in the
bioclim variables. The function will internally dro- collinear
predictors to try and maintain eDist usage (just for this
step), but if this fails will fall back to the eRandom method
instead.
knitr::kable(eSDM_model$ConfusionMatrix$byClass[
c('Sensitivity', 'Specificity', 'Recall', 'Balanced Accuracy')])| x | |
|---|---|
| Sensitivity | 0.8000000 |
| Specificity | 0.8846154 |
| Recall | 0.8000000 |
| Balanced Accuracy | 0.8423077 |
plot(eSDM_model$RasterPredictions)
We can view some of the model results; they are OK, but would benefit
from using only a subset of the bioclim predictors to allow for
eDist sampling upstream. It is a bit too generous in
classifying areas as being possibly suitable habitat. While this model
would be unsuitable for applications, certain CRAN and GitHub checks
make it difficult to improve upon as an example.
Using the beta coefficients from the model, we will rescale both the current and future climate scenario rasters. This will happen internally in the next function, but it is good to see the difference between the current and future scenarios to have an idea of expectations for the final results.
bio_current_rs <- RescaleRasters(
model = eSDM_model$Model,
predictors = eSDM_model$Predictors,
training_data = eSDM_model$Train,
pred_mat = eSDM_model$PredictMatrix
)
plot(bio_current_rs$RescaledPredictors)
bio_future_rs <- rescaleFuture(
eSDM_model$Model,
bio_future,
eSDM_model$Predictors,
training_data = eSDM_model$Train,
pred_mat = eSDM_model$PredictMatrix
)
plot(bio_future_rs)
If any of the layers show as a single colour and 0, it means that glmnet shrunk them out of the model.
Similar to how we differenced the rasters at the two time points above, we can take the absolute difference between the relevant raster layers to see where the largest changes occur.
difference_rs <- pct_diff(bio_current_rs$RescaledPredictors, bio_future_rs)
plot(difference_rs)
The above plot shows us the areas with the largest changes between current and predicted conditions.
Cluster surfaces
Clusters can be identified using NbClust, which determines the
optimal n using the complete consensus method. To use
NbClust, rather than manually specifying n, the argument
fixedClusters needs to be set to FALSE. When using
PostProcessSDM, the same thresh metric should
be used as the one applied to PredictiveProvenance.
threshold_rasts <- PostProcessSDM(
rast_cont = eSDM_model$RasterPredictions,
test = eSDM_model$TestData,
train = eSDM_model$TrainData,
planar_proj = 5070,
thresh_metric =
'equal_sens_spec',
quant_amt = 0.25
)
plot(threshold_rasts$FinalRasters)
knitr::kable(threshold_rasts$Threshold$equal_sens_spec)| x |
|---|
| 0.4841719 |
bmap +
geom_sf(data =
sf::st_as_sf(
terra::as.polygons(
threshold_rasts$FinalRasters['Threshold'])
), fill = 'cornsilk'
) +
geom_sf(data = hemi) 
We will use the EnvironmentalBasedSample to determine a
suitable number of climate clusters for the current time period. When
using EBS for predictive provencing also set the coord_wt
parameter to a small value. This will remove the effect of the spatial
clustering. Similar to PCNM, we cannot know the distribution of actual
populations in the future.
ENVIbs <- EnvironmentalBasedSample(
pred_rescale = bio_current_rs$RescaledPredictors,
write2disk = FALSE, # we are not writing, but showing how to provide some arguments
f_rasts = threshold_rasts$FinalRasters,
coord_wt = 0.001,
fixedClusters = FALSE,
lyr = 'Threshold',
n_pts = 500,
planar_proj = "epsg:5070",
buffer_d = 3,
prop_split = 0.8,
min.nc = 5,
max.nc = 15
)

bmap +
geom_sf(data = ENVIbs$Geometry, aes(fill = factor(ID))) +
geom_sf(data = hemi) +
theme(legend.position = 'bottom') +
labs(fill = 'Cluster', title = 'Current')
We can apply the current clustering to the future scenario. This will
show us how and where the currently identified clusters would exist in
the future geographic space. ProjectClusters is really the
heart of this workflow; it requires the outputs from
elasticSDM, PostProcessSDM, and
EnvironmentalBasedSample.
future_clusts <- projectClusters(
eSDM_object = eSDM_model,
current_clusters = ENVIbs,
future_predictors = bio_future,
current_predictors = bio_current,
thresholds = threshold_rasts$Threshold,
planar_proj = "epsg:5070",
thresh_metric = 'equal_sens_spec',
n_sample_per_cluster = 20
)
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.

*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 6 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 2 proposed 8 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 1 proposed 18 as the best number of clusters
* 3 proposed 20 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
However, our classifications cannot accommodate new climate conditions. We can identify these areas outside the training conditions for our SDM using MESS. Note that in these areas, our SDM is extrapolating, so its performance cannot be evaluated. However, it seems these may be suitable habitats, and we can plan for that scenario.
plot(future_clusts$mess)
bmap +
geom_sf(data =
st_as_sf(terra::as.polygons(future_clusts$novel_mask)),
fill = 'red') +
labs(title = 'MESS regions')
Once we identify these areas, if they are sufficiently large, we can pass them to a new NbClust scenario, and it can cluster them anew.
current = bmap +
geom_sf(data = ENVIbs$Geometry, aes(fill = factor(ID))) +
labs(title = 'Current', fill = 'Cluster') +
theme(legend.position = "none")
future = bmap +
geom_sf(data = future_clusts$clusters_sf, aes(fill = factor(ID))) +
labs(title = '2041-2060', fill = 'Cluster')
current + future
However, we also need to determine which current areas are most similar to these novel climate clusters. We can take values from both classifcation scenarios, cluster them, and identify which new climate clusters share branches with existing clusters. This is the method we employ to identify the most similar existing climate groups.
knitr::kable(future_clusts$novel_similarity)| novel_cluster_id | nearest_existing_id | avg_silhouette_width |
|---|---|---|
| 6 | NA | 0.4976980 |
| 7 | 4 | 0.3029438 |
It is from these groups that we are most likely to collect germplasm relevant to the future scenarios.
We can also take a very quick look at how cluster sizes change between the scenarios.
future_clusts$changes
cluster_id current_area_km2 future_area_km2 area_change_pct centroid_shift_km
1 1 2371.401 261064.0458 10908.85347 659.7835
2 2 247188.684 24052.8597 -90.26943 238.2640
3 3 16975.136 13171.8144 -22.40525 472.3144
4 4 27177.277 40894.9478 50.47478 52.5251
5 5 20421.344 2302.5222 -88.72492 1063.9605
6 6 0.000 1988.8601 NA NA
7 7 0.000 778.4415 NA NAConclusion
All functions in safeHavens work to identify areas where
populations have the potential to maximise the coverage of neutral and
allelic (genetic) diversity across the species. Large-scale restoration
requires the availability of seed sources and funding opportunities to
use them. Developing germplasm materials from populations that seem
adapted to future climate conditions is essential for future
opportunities. While this function does not in any way seek to identify
the best available seed source for a restoration, ala SST, it does seek
to ensure that the best available option can be applied in a timely
manner as the occasion arises.
This is particularly important for areas that cannot plan restorations or develop their own seed sources using point-based analyses.
