
K Medoids Based Sample Site Selection
Select a subset of sites that maximizes the spatial dispersion of sites using k-medoids clustering.
Source: R/KMedoidsBasedSample.R
This function operates on individual points, each representing a population, rather than drawing convex hulls or polygons around them to emulate a species range.
It is designed for rare species, where individual populations are relatively scarce (e.g. < 100) and have reasonably accurate location data.
It performs bootstrap re-sampling to better estimate the true extent of the species' range, as well as coordinate jittering to better address geo-location quality.
After running n_bootstrap of these simulations, it identifies the individual network of sites (co-location) that is most resilient to these perturbations and should therefore be less affected by data quality issues.
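As a rough illustration of the two perturbations described above, the sketch below builds one perturbed replicate of a site table. The helper `perturb_sites()` is hypothetical and is not part of the package; the uniform offset, the random bearing, and the meters-to-degrees conversion are all assumptions made for illustration, and the package's internal implementation may differ.

```r
# Hypothetical helper (not from the package): one bootstrap-style perturbation.
perturb_sites <- function(sites, dropout_prob = 0.1, min_jitter_dist = 10000) {
  # Required sites are always kept; others are dropped with probability dropout_prob.
  keep <- sites$required | runif(nrow(sites)) >= dropout_prob
  out <- sites[keep, , drop = FALSE]

  # Jitter only sites whose stated uncertainty exceeds the threshold.
  jit <- which(out$coord_uncertainty > min_jitter_dist)
  if (length(jit) > 0) {
    # Assumed scheme: random offset up to the stated uncertainty, random bearing.
    r <- runif(length(jit), 0, out$coord_uncertainty[jit])
    theta <- runif(length(jit), 0, 2 * pi)
    m_per_deg <- 111320 # approximate meters per degree of latitude
    lat0 <- out$lat[jit]
    out$lat[jit] <- lat0 + (r * sin(theta)) / m_per_deg
    out$lon[jit] <- out$lon[jit] +
      (r * cos(theta)) / (m_per_deg * cos(lat0 * pi / 180))
  }
  out
}

set.seed(1)
sites <- data.frame(
  site_id = 1:10,
  lat = runif(10, 25, 30),
  lon = runif(10, -125, -120),
  required = c(TRUE, rep(FALSE, 9)),
  coord_uncertainty = c(rep(0, 7), 20000, 50000, 80000) # meters
)
rep1 <- perturb_sites(sites)                   # some sites may be dropped
rep0 <- perturb_sites(sites, dropout_prob = 0) # jitter only
```

Repeating such a perturbation many times and re-solving on each replicate is what allows sites that are selected consistently to be separated from those that depend on a few uncertain records.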
As arguments it takes the known locations of populations, and it will solve for n priority collection sites. Along the way it also generates a priority ranking of all sites, indicating a naive possible order for prioritizing collections, although the ranking should never be a reason to pass up an opportunity to collect at a site.
A required input is a column indicating whether a site is required. Required sites (at least 1, and fewer than n_sites) serve as fixed parameters in the optimization, which greatly speeds up run time. They can represent existing collections, collections with a very strong chance of happening due to a funding agency mechanism, or otherwise a single population closest to the geographic center of the species. Notably, the solve will be 'around' these sites, so the solutions are not purely theoretical but are linked to a pragmatic element.
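To make the idea of solving 'around' required sites concrete, here is a minimal sketch that fixes the required sites as medoids and greedily adds further medoids so as to minimize the total distance from every site to its nearest medoid. The function `greedy_medoids()` is hypothetical and is a simplified stand-in; it is not the package's algorithm, which, judging by its arguments, uses local search with random restarts.

```r
# Hypothetical sketch: greedily add medoids around a fixed set of required sites.
greedy_medoids <- function(d, required, n) {
  selected <- which(required)
  stopifnot(length(selected) >= 1, length(selected) < n, n <= nrow(d))
  while (length(selected) < n) {
    candidates <- setdiff(seq_len(nrow(d)), selected)
    # Cost of a medoid set: total distance from each site to its nearest medoid.
    cost <- vapply(candidates, function(j) {
      sum(apply(d[, c(selected, j), drop = FALSE], 1, min))
    }, numeric(1))
    selected <- c(selected, candidates[which.min(cost)])
  }
  selected
}

# Toy example: seven sites on a line, with the site at x = 10 required.
x <- c(0, 1, 2, 10, 11, 12, 20)
d <- as.matrix(dist(x))
required <- x == 10
sel <- greedy_medoids(d, required, n = 3)
x[sel] # the required site first, then the sites added around it
```

Because the required sites never leave the solution, only the remaining slots need to be searched, which is one way to see why fixing them reduces run time.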
One can substitute a resistance or environmental distance matrix for the geographic distance matrix. Note, however, that the function will not internally recalculate such distances between the bootstrapped points. See the vignette for an example of creating a quick environmental distance matrix using a simple PCA of bioclim variables.
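A minimal sketch of one way to build such an environmental distance matrix is shown below. The bioclim values are simulated stand-ins (in practice they would be extracted from rasters at each site), and the choice of four variables and two principal components is an assumption for illustration; the vignette's approach may differ.

```r
set.seed(42)
n_sites <- 30

# Simulated stand-ins for bioclim values at each site (hypothetical data).
env <- data.frame(
  bio1  = rnorm(n_sites, 15, 3),    # annual mean temperature
  bio4  = rnorm(n_sites, 600, 80),  # temperature seasonality
  bio12 = rnorm(n_sites, 400, 120), # annual precipitation
  bio15 = rnorm(n_sites, 60, 10)    # precipitation seasonality
)

# Center and scale so that no variable dominates because of its units.
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Euclidean distance between sites in the space of the first two components.
env_dist <- as.matrix(dist(pca$x[, 1:2]))
dim(env_dist)
```

The resulting matrix would then take the place of the geographic matrix in the 'distances' element of input_data, with distance_type set to 'environmental'.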
Note that the input data require two boolean (TRUE/FALSE) columns, 'required' and 'certain', for the function to run. 'required' marks sites that have to be, or already have been, sampled for germplasm collections; by default no sites are required. 'certain' marks sites that the user is confident are of the taxon at hand; this defaults to FALSE for all sites, meaning all sites except 'required' sites are subject to being dropped in simulations.
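For instance, the two columns could be added to a site table as follows. The site values and the choice of which sites to flag are made up for illustration.

```r
# Hypothetical site table; coordinates are arbitrary.
sites <- data.frame(
  site_id = 1:5,
  lat = c(26.1, 27.4, 28.0, 28.9, 29.5),
  lon = c(-124.2, -123.8, -122.5, -121.9, -120.7)
)

# Start from explicit values for both columns, then flag individual sites.
sites$required <- FALSE
sites$certain <- TRUE
sites$required[sites$site_id == 3] <- TRUE # e.g. an existing collection
sites$certain[sites$site_id == 5] <- FALSE # e.g. a doubtful identification

str(sites[, c("required", "certain")])
```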
Usage
KMedoidsBasedSample(
  input_data,
  n = 5,
  n_bootstrap = 999,
  dropout_prob = 0.1,
  n_local_search_iter = 100,
  n_restarts = 3,
  verbose = TRUE,
  distance_type = "geographic",
  min_jitter_dist = 10000
)
Arguments
- input_data
A list with two elements: 'distances' (distance matrix) and 'sites' (data frame of site metadata).
- n
The number of sites which you want to select for priority collection. Note that the results will return a rank of prioritization for all sites in the data.
- n_bootstrap
Number of bootstrap replicates to perform.
- dropout_prob
Probability of dropping non-seed sites in each bootstrap replicate. Given how few sites there generally are, keep this under 0.2. Set to 0 to disable dropout.
- n_local_search_iter
Number of local search iterations per restart.
- n_restarts
Number of random restarts per bootstrap replicate.
- verbose
Whether to print progress information. Will print a message on run settings, and a progress bar for the bootstraps.
- distance_type
Character. Defaults to 'geographic', otherwise 'environmental'. If 'geographic' and coordinate uncertainty is greater than min_jitter_dist, then coordinate jittering will be performed.
- min_jitter_dist
Minimum coordinate uncertainty (in meters) to initiate jittering of site coordinates.
Examples
if (FALSE) { # \dontrun{
library(ggplot2)
### create sample data
n_sites <- 30 # number of known populations
df <- data.frame(
  site_id = seq_len(n_sites),
  lat = runif(n_sites, 25, 30), # play with these to see elongated results.
  lon = runif(n_sites, -125, -120),
  required = FALSE,
  coord_uncertainty = 0
)
# the function can accept a required point; here arbitrarily placed near the geographic center
dists2c <- greatCircleDistance(
  median(df$lat),
  median(df$lon),
  df$lat,
  df$lon
)
df[order(dists2c)[1], 'required'] <- TRUE
## we will simulate coordinate uncertainty on a number of sites.
uncertain_sites <- sample(
  setdiff(seq_len(n_sites), which(df$required)),
  size = min(6, n_sites - 3)
)
df$coord_uncertainty[uncertain_sites] <- runif(length(uncertain_sites), 5000, 100000) # meters
# the function can take up to two matrices. the first (required) is a geographic distance
# matrix. calculate this with the `greatCircleDistance` fn from the package for consistency.
# (it will be recalculated during simulations). `sf` gives results in slightly diff units.
dist_mat <- sapply(seq_len(nrow(df)), function(i) {
  greatCircleDistance(
    df$lat[i], df$lon[i],
    df$lat, df$lon
  )
})
# the input data is a list, the distance matrix, and the df of actual point locations.
head(df)
test_data <- list(distances = dist_mat, sites = df)
rm(dist_mat, df, n_sites, uncertain_sites, dists2c)
# small quick run
system.time(
  res <- KMedoidsBasedSample( ## reduce some parameters for faster run.
    input_data = test_data,
    n_bootstrap = 500,
    n_local_search_iter = 50,
    n_restarts = 2
  )
)
### first selected
ggplot(data = res$input_data,
  aes(
    x = lon,
    y = lat,
    shape = required,
    size = cooccur_strength,
    color = selected
  )
) +
  geom_point() +
  # ggrepel::geom_label_repel(aes(label = site_id), size = 4) +
  theme_minimal() +
  labs(title = 'Priority Selection Status of Sites')
### order of sampling priority ranking plot.
ggplot(data = res$input_data,
  aes(
    x = lon,
    y = lat,
    shape = required,
    size = -sample_rank,
    color = sample_rank
  )
) +
  geom_point() +
  # ggrepel::geom_label_repel(aes(label = sample_rank), size = 4) +
  theme_minimal()
} # }