Overview

We have already shown that the beta-coefficients from species distribution models can be used in hierarchical clustering to group a species populations into clusters that experience more similar - and relevant to their ecology - climate. While running EnvironmentalBasedSample we focus on also optimizing the species occupied range using moran eigenvector matrix surfaces (MEMs, or PCNM). In this alternative approach we drop the PCNMs from the SDMs and focus solely on environmental predictors, and work to identify clusters in current environmental space, and then classify future environmental (and geographic) space with these cluster parameters. This gives us sets of areas with populations that may be more pre-adapted to future conditions.

In addition to transferring the realized environmental clusters of populations forward, we also identify novel climate spaces, using MESS (Elith 2010), and hierarchical clustering. This results in us having n identified clusters, and we can traverse the clustering branches to determine which existing clusters are most similar to novel clusters.

This is the sole workflow in safeHavens that seeks to link current climates to future scenarios, and directly suggest how the size of clusters may change, where they will shift to, and allow a germplasm curator to identify relevant seed sources for predictive provenancing. The approach employed here is adequate to guide current collections, however decisions on what sources to use at a particular restoration site are relatively well identified by tools such as the Seed Lot Selection Tool in North America, and COSST in South America.

Analysis

This workstream leans heavily on the EnvironmentalBasedSample workflow, and passes it’s final objects off to PolygonBasedSample for completion. Please review Species Distribution Models and Getting Started before proceeding here.

Data prep

This workflow will require the geodata package; geodata is used for retrieving climate data from a variety of sources as raster data. In this vignette we will use Worldclim. For most safeHavens users around the world this may be your ideal source for climate data. geodata is developed and maintained by Robert Hijman, similar to terra, and dismo, which are doing a ton of heavy lifting internally on our functions.

library(safeHavens)
library(terra)
library(geodata)
library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)
set.seed(22)

For this example we will leverage data from GBIF, and shift our geographic focus to the Colorado Plateau in the Southwestern USA. The Colorado Plateau is predicted to experience some of the highest rates of climate change on the planet, and is already an area with considerably native seed development activities. For our focal species we will use Helianthella microcephala for no other reason than that it exists in relatively homogenous portions of the Colorado Plateau - meaning we can download relatively coarse rasters for the vignettes, and that it has a stunning inflorescence.

cols = c('decimalLatitude', 'decimalLongitude', 'dateIdentified', 'species', 'acceptedScientificName', 'datasetName', 
  'coordinateUncertaintyInMeters', 'basisOfRecord', 'institutionCode', 'catalogNumber')

## download species data using scientificName, can use keys and lookup tables for automating many taxa. 
hemi <- rgbif::occ_search(scientificName = "Helianthella microcephala")
hemi <- hemi[['data']][,cols]  |>
  drop_na(decimalLatitude, decimalLongitude) |> # any missing coords need dropped. 
  distinct(decimalLatitude, decimalLongitude, .keep_all = TRUE) |> # no dupes can be present
  st_as_sf(coords = c( 'decimalLongitude', 'decimalLatitude'), crs = 4326, remove = F) 

western_states <- spData::us_states |> ## for making a quick basemap. 
  dplyr::filter(NAME %in% 
    c('Utah', 'Arizona', 'Colorado', 'New Mexico', 'Wyoming', 'Nevada', 'Idaho', 'California')) |>
  dplyr::select(NAME, geometry) |>
  st_transform(4326)

bb <- st_bbox(
  c(
    xmin = -116, 
    xmax = -105, 
    ymax = 44, 
    ymin = 33.5),
    crs = st_crs(4326)
    )

western_states <- st_crop(western_states, bb)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries

bmap <- ggplot() + 
    geom_sf(data = western_states) + 
    geom_sf(data = hemi) +
    theme_minimal() +
    coord_sf(
        xlim = c(bb[['xmin']], bb[['xmax']]), 
        ylim = c(bb[['ymin']], bb[['ymax']])
        )  +
    theme(
      axis.title.x=element_blank(),
      axis.text.x=element_blank(),
      axis.ticks.x=element_blank()
      )

bmap +
    labs(title = 'Helianthella microcephala\noccurrence records')

We will download worldclim data using geodata the results from worldclim_global should be what we loaded from dismo in the various sloth examples.
For our future scenario we will use the CMIP6 modelled future climate data for the 2041-2060 time window. We will download the coarse 2.5 minute, ca ~70km at equator, resolution data. Do note that the source also includes a ~ 1km at equator resolution data set (30 seconds), but this would take too long for the vignette.

# Download WorldClim bioclim at ~10 km
bio_current <- worldclim_global(var="bioc", res=2.5)
bio_future <- cmip6_world(
  model = "CNRM-CM6-1", ## modelling method
  ssp   = "245", ## "Middle of the Road" scenario
  time  = "2041-2060", # time period
  var   = "bioc", # just use the bioclim variables
  res   = 2.5
)

# Crop to domain - use a large BB to accomodate range shift
# under future time points. 
# but going too large will dilute the absence records
bbox <- ext(bb)

bio_current <- crop(bio_current, bbox)
bio_future <- crop(bio_future, bbox)

safeHavens does not offer explicit functionality to rename / align the naming of raster surfaces. However, the modelling process requires that both current and future raster products have perfectly matching raster names.

The chunk below shows how we will standardize the names by extracting the bioclim variable number, at the end of the name, and pad ‘bio_’ back onto the front of it with a leading zero so we have: ‘bio_01’ instead of ‘bio_1’. A minor variation of this should work for most data sources, after customizing the gsub to erase earlier portions of the file name.

simplify_names <- function(x){
    paste0('bio_', sprintf("%02d", as.numeric(gsub('\\D+','', names(x)))))
}

names(bio_current) <- gsub('^.*5m_', '', names(bio_current))
names(bio_future) <- gsub('^.*2060_', '', names(bio_future))

names(bio_current) <- simplify_names(bio_current)
names(bio_future) <- simplify_names(bio_future)

# TRUE means all names match, and are in the same position. 
all(names(bio_current) == names(bio_future))
[1] TRUE

## we will also drop some variables that are essentially identical in the study area
drops <- c(
  'bio_05' , 'bio_02',
  'bio_11', 'bio_12'
  ) 

bio_current <- subset(bio_current, negate = TRUE, drops)
bio_future <- subset(bio_future, negate = TRUE, drops)

If you are following the vignette along locally and decided to plot(bio_current) and plot(bio_future) you would find that they look very similar. While the plots look very similar, when showing one raster stack after another, we can diff the two products to see where the largest changes in geographic space will occur.

pct_diff <- function(x, y){((x - y)/((x + y)/2)) * 100}
difference <- pct_diff(bio_current, bio_future)
plot(difference)

We see that variables have inconsistences in where they will change, and to what extent they will change. This mismatch of conditions will almost certainly lead to novel climate conditions - unique combinations not yet known to this species. This will be handled using MESS surfaces and a second stage clustering approach in the function.

Analytical workstream

The workflow for predictive provenance builds upon the EnvironmentalBasedSample workflow, relying on many of the same internal helpers, and user facing functions. Note that when using elasticSDM we need to specify the flag pcnm = FALSE to bypass fitting a model with PCNM surfaces - their is no analog for these under future scenarios so they must be dropped.

Fit SDM and rescale raster surfaces

The starting point for this analysis is again fitting a quick SDM for our species.

## note we subset to just the geometries for the prediction records. 
# this is because we feed in all columns to the elastic net model internally. 
hemi <- select(hemi, geometry) 

# as always verify our coordinate reference systems (crs) match. 
st_crs(bio_current) == st_crs(hemi) 
[1] TRUE

# and fit the elasticnet model 
eSDM_model <- elasticSDM(
    x = hemi, 
    predictors = bio_current, 
    planar_projection = 5070, 
    PCNM = FALSE ## set to FALSE for this workstream!!!!!
    )

We will use the eSDM_model function for this workstream, however is is essential that PCNM is set to FALSE. There is no way to forecast of predict PCNM to future scenarios, so we cannot fit models with it.

If a warning is emitted here it is likely due to collinearity in the bioclim variables. Rather than using eDist from sdm for generating background points we will use the eRandom method instead.

eSDM_model$ConfusionMatrix$byClass[
  c('Sensitivity', 'Specificity', 'Recall', 'Balanced Accuracy')]
      Sensitivity       Specificity            Recall Balanced Accuracy 
        0.9200000         0.7692308         0.9200000         0.8446154 
plot(eSDM_model$RasterPredictions)

We can view some of the results for the model, it is OK, but would benefit from using only a subset of the bioclim predictors to allow for eDist sampling upstream. It is a bit too generous with classifying areas as being possible suitable habitat. While this model would be unsuitable for applications, certain CRAN and Github checks make it difficult to improve upon as an example.

Using the beta-coefficients from the model we will rescale both the current and future climate scenarios rasters. This will happen internally in the next function, but it is good to see the different between the current and future scenarios to have an idea of expectations for the final results.

bio_current_rs <- RescaleRasters(
    model = eSDM_model$Model, 
    predictors = eSDM_model$Predictors,
    training_data = eSDM_model$Train,
    pred_mat = eSDM_model$PredictMatrix
    )

plot(bio_current_rs$RescaledPredictors)


bio_future_rs <- rescaleFuture(
  eSDM_model$Model, 
  bio_future, 
  eSDM_model$Predictors,
  training_data = eSDM_model$Train,
  pred_mat = eSDM_model$PredictMatrix
)

plot(bio_future_rs)

If any of the layers show as a single color, and 0, it means that glmnet shrunk them from the model.

Similar to how we differenced the rasters at the two time points above, we can take the absolute difference of the relevant raster layers to see where the largest changes will be.

difference_rs <- pct_diff(bio_current_rs$RescaledPredictors, bio_future_rs)
plot(difference_rs)

The above plot shows us the areas with the largest changes between current and predicted conditions.

Cluster surfaces

Clusters can be identified using NbClust, allowing it to determine the optimal n using the METHOD. To use NbClust, rather than manually specifying n, the argument fixedClusters needs to be set to FALSE. When using PostProcessSDM the same threshhold metric should be used that will be applied to PredictiveProvenance.

threshold_rasts <- PostProcessSDM(
  rast_cont = eSDM_model$RasterPredictions, 
  test = eSDM_model$TestData,
  train = eSDM_model$TrainData,
  planar_proj = 5070,
  thresh_metric =
     'equal_sens_spec', 
  quant_amt = 0.25
  )

plot(threshold_rasts$FinalRasters)

threshold_rasts$Threshold$equal_sens_spec
[1] 0.694638

bmap + 
  geom_sf(data = 
  sf::st_as_sf(
    terra::as.polygons(
      threshold_rasts$FinalRasters['Threshold'])
      ), fill = 'cornsilk'
    ) + 
    geom_sf(data = hemi)

We will make the EnvironmentalBasedSample to find a suitable number of climate clusters at the current time period. When using EBS for predictive provencing also set the coord_wt parameter to a small value. This will remove the effect of the spatial clustering. Similar to PCNM, we cannot know the distribution of actual populations in the future.

ENVIbs <- EnvironmentalBasedSample(
  pred_rescale = bio_current_rs$RescaledPredictors, 
  write2disk = FALSE, # we are not writing, but showing how to provide some arguments
  f_rasts = threshold_rasts$FinalRasters, 
  coord_wt = 0.001, 
  fixedClusters = FALSE,
  lyr = 'Threshold',
  n_pts = 500, 
  planar_proj = "epsg:5070",
  buffer_d = 3,
  prop_split = 0.8,
  min.nc = 5, 
  max.nc = 15
  )

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 1 proposed 5 as the best number of clusters 
* 6 proposed 6 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 2 proposed 8 as the best number of clusters 
* 1 proposed 11 as the best number of clusters 
* 3 proposed 12 as the best number of clusters 
* 1 proposed 14 as the best number of clusters 
* 7 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  15 
 
 
******************************************************************* 

bmap + 
  geom_sf(data = ENVIbs$Geometry, aes(fill = factor(ID))) + 
  geom_sf(data = hemi) + 
  theme(legend.position = 'bottom') + 
  labs(fill = 'Cluster', title = 'Current')

We can apply the current clustering to the future scenario. This will show us how and where the currently identified clusters would exist in future geographic space. ProjectClusters is really the whole heart of this workflow, it requires the outputs from elasticSDM, PostProcessSDM, and EnvironmentalBasedSample.

future_clusts <- projectClusters(
  eSDM_object = eSDM_model, 
  current_clusters = ENVIbs,
  future_predictors = bio_future,
  current_predictors = bio_current,
  thresholds = threshold_rasts$Threshold, 
  planar_proj = "epsg:5070", 
  thresh_metric = 'equal_sens_spec',
  n_sample_per_cluster = 20
)

However, our classifications cannot accomodate new climate conditions. We can identify these areas, which are outside of the training conditions for our SDM using MESS. Note that in these areas our SDM is extrapolating, so it’s performance cannot be evaluated. However, it seems that these will possibly be suitable habitat, and we can try to plan for that scenario.

plot(future_clusts$mess)

bmap +
  geom_sf(data = 
    st_as_sf(terra::as.polygons(future_clusts$novel_mask)), 
    fill = 'red') + 
  labs(title = 'MESS regions')
 [1m [22mCoordinate system already present.
 [36mℹ [39m Adding new coordinate system, which will replace the existing one.

Once we identify these areas, if they are sufficiently large we can pass them to a new NbClust scenario and it can cluster them anew.

current = bmap + 
  geom_sf(data = ENVIbs$Geometry, aes(fill = factor(ID))) +
  labs(title = 'Current', fill = 'Cluster') +
  theme(legend.position = "none")

future = bmap + 
  geom_sf(data = future_clusts$clusters_sf, aes(fill = factor(ID))) + 
  labs(title = '2041-2060', fill = 'Cluster') 

current + future

However, we also need to determine which current areas are most similar to these novel climate clusters. We can take values from both classifcation scenarios, and cluster them, and identify which new climate clusters share branches with existing clusters. This is the method we employ to identify the most similar existing climate groups.

future_clusts$novel_similarity
[1] novel_cluster_id     nearest_existing_id  avg_silhouette_width
<0 rows> (or 0-length row.names)

It is from these groups that we are most likely to collect germplasm relevant for the future scenarios.

We can also take a very quick look at how cluster sizes change between the scenarios

future_clusts$changes
   cluster_id current_area_km2 future_area_km2 area_change_pct
1           1        4043.3230      28300.9046       599.94172
2           2       13170.6534        269.7564       -97.95184
3           3        4998.7124       1407.8829       -71.83509
4           4        4435.7130        136.0061       -96.93384
5           5       12027.6067        872.3368       -92.74721
6           6       15081.4257        952.5868       -93.68371
7           7        2576.1257          0.0000      -100.00000
8           8       33353.1496       1413.2125       -95.76288
9           9        3958.0202       3040.5491       -23.18005
10         10        6312.3643        747.9952       -88.15032
11         11        1313.6535          0.0000      -100.00000
12         12         597.1152       4630.9965       675.56168
13         13       36782.2993       4143.2314       -88.73580
14         14       32653.6727       1344.0688       -95.88387
15         15           0.0000       3528.7292              NA
   centroid_shift_km
1           460.1072
2           290.3649
3           205.2148
4           251.8769
5           726.1748
6           254.4887
7                 NA
8           330.4310
9           494.6931
10           14.2699
11                NA
12          539.5743
13          298.5651
14          374.5908
15                NA

Conclusion

All functions in safeHavens work to identify areas where populations have the potential to maximize the coverage of neutral and allelic (genetic) diversity across the species. Large scale restoration requires the availability of seed sources as the need, and funding opportunities arise to use them. Developping germplasm materials from populations that seem adapted to future climate conditions is essential for future opportunities. While this function does not in any way seek to identify the best available seed source for a restoration, ala SST or COSST, it does seek to ensure that the best available option can be timely applied as the occasion arises.

This is of particular importance for areas that are not able to plan restorations and develop their own seed sources using point-based analyses.

Predictive Provenance