
BarnebyLives! Running pipeline
2025-10-26
running_pipeline.Rmd
Purpose
The purpose of this script is to assist botanical collectors in accessioning their collections. It is the central component of this entire package. Here we will go over ensuring that values which will be printed onto herbarium labels and uploaded into databases are accurate.
It is our experience that many herbaria now request digital copies of collection information in their own format. This script aims to reduce the time that requires, and in doing so leave the collector more time for less onerous tasks, such as collection and study.
While in theory all of the tasks in this script could be automated, doing so would neglect the opinions of local taxonomic authorities. Several current and well-curated databases exist for updating collections and spell checking names (e.g. Kew's Plants of the World), but we feel many proposals and combinations require interpretation by those in the region attempting to understand the material at hand.
This script will perform the following tasks.
Geographic
1) Convert coordinates to a Decimal Degrees format, if needed (e.g. from DMS 72°17'55" -> 72.29861).
2) Retrieve appropriate political place names (i.e. State, County).
3) Retrieve accurate elevation values for the coordinates.
4) Retrieve Public Land Ownership (e.g. United States Forest Service, Bureau of Land Management).
5) Project coordinates from one system to another, if needed (e.g. from a UTM zone to WGS84).
6) Retrieve cadastral data from the PLSS for Township, Range, Section (TRS).
Temporal
1) Provide a ‘written’ date in a style of your choosing (e.g. 12th May 2021), for the purpose of labeling.
Taxonomic
1) Provide a ‘spell check’ for family.
2) Provide a ‘spell check’ for genus.
3) Provide a ‘spell check’ for epithet.
4) Provide a ‘spell check’ for authorities.
5) In addition, you may check for synonymy at each of these taxonomic levels, and accept or reject suggestions.
Directions
I preface this with the observation that a good number of collectors and curators now consider these data non-essential. Regardless, if you are inclined to set up a free Google Maps API account, these services are provided.
1) Write directions, from a town of notable size or repute, to the closest parking spot to the collection locality.
Note that these directions may not reflect the actual route traveled by the collectors, but rather what Google Maps thinks is optimal - at the time of the query. Caveat emptor.
Usage
Files required to use this script:
- PAD-US database, for retrieving land ownership information if needed (note that it does not include the identities of private owners).
- A static copy of the World Flora Online database.
# devtools::install_github('sagesteppe/BarnebyLives')
library(tidyverse)
library(BarnebyLives)
Load data
df <- uncleaned_collection_examples
Ensure Coordinates and Dates are formatted appropriately
Date parsing, e.g. converting dates into congruent museum formats (month in Roman numerals, or spelled out). This step assumes that you are recording your dates in the American format (month/day/year), rather than the internationally more common format (day/month/year). We will also parse out the day of month, month of year, and year, which are used by several possible downstream programs.
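As a point of reference (and not part of the pipeline itself), the month-first assumption can be illustrated with base R alone; the date string below is made up.
# a minimal sketch of the month/day/year assumption, using only base R;
# date_parser() below handles this, plus the label formatting, for you
d <- as.Date('5/12/2021', format = '%m/%d/%Y')  # read as 12 May, not 5 December
format(d, '%d %B %Y')                           # a 'written' style, e.g. "12 May 2021"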
data <- date_parser(df, coll_date = 'Date_digital', det_date = 'Determined_date')
dplyr::select(data, starts_with('Date')) |>
  head()
Conversion of Degrees Minutes Seconds (DMS) to Decimal Degrees (DD)
The DMS format is still used in a few applications, so we will convert it to the more generally utilized DD format. We will maintain both formats in the data set, so that others have them readily available for uses other than creating herbarium labels.
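For reference, the conversion itself is simple arithmetic: decimal degrees = degrees + minutes/60 + seconds/3600. A quick check of the example above:
# 72°17'55" expressed as decimal degrees
72 + 17/60 + 55/3600  # roughly 72.29861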
data <- dms2dd(data, dms = FALSE)
dplyr::select(data, starts_with(c('latitude', 'longitude'))) %>%
  utils::head()
Some spreadsheet software auto-increments numeric columns by default. We try to identify that issue here.
data <- autofill_checker(data)
dplyr::select(data, Long_AutoFill_Flag, Lat_AutoFill_Flag) |>
  utils::head()
For internal use and data acquisition we will make the data set explicitly spatial, rather than just holding coordinates in a column. This is also required to write out the data in a format readable by Google Earth or another Geographic Information System.
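A minimal sketch of that conversion with the sf package is shown below; the coordinate column names ('longitude_dd', 'latitude_dd') are assumptions and may differ from those in your data set.
# assumed column names; adjust to match your own data
data_sf <- sf::st_as_sf(
  data,
  coords = c('longitude_dd', 'latitude_dd'),
  crs = 4326,      # WGS84
  remove = FALSE   # keep the plain coordinate columns alongside the geometry
)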
Retrieve Political/Administrative Information
BarnebyLives! will retrieve various pieces of political and administrative information: the state and county of collection, and the township, range, and section (TRS) from the Public Land Survey System (PLSS). It will also determine whether a Federal or State agency manages or owns the land the material was collected from, and if so return its name. If either the United States Forest Service or the Bureau of Land Management manages the land, it will also return information on grazing allotments.
Retrieve State, County, Ownership, TRS, and allotment information
p <- '/media/steppe/hdd/Barneby_Lives-dev/geodata'
data <- political_grabber(data, y = 'Collection_number', path = p)
Geographic Information for Locality info
Many younger collectors use Site Names which are quite broad; we advocate for the use of locality names.
A locality name provides the relation between a named site and the actual collection area through a distance and azimuth. All of the sites which we use as prospective identifiers come from the Geographic Names Information System (GNIS); these names have undergone standardization and general cleaning to better reflect currently acceptable naming practices (i.e. misogynistic and racist names have largely, if not entirely, been removed). Using GNIS allows places to be unambiguously related to each other. While I hate saying that a collection site is XX km from a river at 121 degrees, because well - just think about it, that seems to be what people want.
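To make the distance-and-azimuth idea concrete, here is a minimal sketch using the geosphere package; the coordinates are made up, and site_writer() below performs the real lookup against GNIS for you.
# made-up coordinates: a GNIS place and a collection site, as c(longitude, latitude)
place <- c(-116.45, 43.65)
site  <- c(-116.38, 43.61)
geosphere::distGeo(place, site) / 1000  # distance in kilometres
geosphere::bearing(place, site)         # azimuth in degrees from north (negative values are west of north)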
Retrieve Nearest GNIS place name, and azimuth from it.
data <- site_writer(data, path = p)
Site characteristics
BarnebyLives acquires a range of characteristics which are useful for describing the physical environment: elevation (in both meters and feet), slope, aspect, surficial geology, and geomorphons.
Here are some details on the data sets we are using. All elevation data come from DEM90; as the name implies, this data set has a resolution of roughly 90 meters in both the X and Y dimensions. This product is then used to derive both slope (in degrees) and aspect (in degrees). Slope conveys the change in elevation between adjacent cells, while aspect describes the direction water would flow downslope; an aspect of East means that water would flow downslope to the east. Surficial geology data are acquired from the United States Geological Survey; they work very well for characterizing the major geology types in an area, although you may be on a minor inclusion within that material. Geomorphons are a bit more peculiar; the term can refer to several things, but here it refers to the ten standardized landform classes of Geomorpho90m. These are useful for spatial ecology and describe where on the landscape the population was broadly located.
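For those curious how the slope and aspect values arise, a minimal sketch with the terra package follows; the DEM file path is hypothetical, and physical_grabber() below does this work against the local DEM90 copy for you.
# hypothetical file path to a DEM tile
dem <- terra::rast('path/to/dem90_tile.tif')
terra::terrain(dem, v = c('slope', 'aspect'), unit = 'degrees')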
data <- physical_grabber(data, path = p)
Taxonomic
BarnebyLives performs multiple operations related to taxonomy and nomenclature. The function spell_check() will attempt to determine whether the supplied binomial scientific name and family are spelled correctly, and offer suggestions if they are not.
This function requires that the first few letters of the generic name and epithet are spelled correctly. It runs a spell check against nearly all published plant name spellings, and searches for the most similar spelling within these lists if an exact match is not found. It operates on both pieces of the binomial separately, and can work to identify the species within the context of its genus.
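Conceptually this amounts to approximate string matching against a reference list. A tiny, self-contained illustration with base R's agrep(); both the misspelling and the reference vector are made up and much shorter than the real lists.
# a made-up misspelling checked against a tiny, made-up reference list
genera <- c('Astragalus', 'Asclepias', 'Artemisia', 'Aster')
agrep('Astragulus', genera, max.distance = 2, value = TRUE)  # returns "Astragalus"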
p <- '/media/steppe/hdd/Barneby_Lives-dev/taxonomic_data'
data <- spell_check(data, path = p)
data <- spell_check(data = data, column = 'Full_name', path = p)
data <- family_spell_check(data, path = p)
A similar check determines whether the author who is cited for describing the species, or for transferring it to another genus, is written using the standard author abbreviations. This check operates in a similar manner to spell_check(). The list it consults is comprehensive, to within a few years, of all taxonomists who have published plant names.
data <- author_check(data, path = p)
The final process in the taxonomic step is to ensure that the spellings of any species noted in the Vegetation or Associates fields are accurate. These names will not be submitted to Kew, however, as those fields could quickly overwhelm the API.
data <- associates_spell_check(data, 'Associates', path = p) # we can run on both columns.
data <- associates_spell_check(data, 'Vegetation', path = p)
Remove the collected species from associated species lists. Many collectors include the focal collection when performing data entry.
data <- associate_dropper(data, 'Full_name', col = 'Associates')
data <- associate_dropper(data, 'Full_name', col = 'Vegetation')
After BarnebyLives verifies spelling, it submits names to Plants of the World Online to check for synonymy. BL will not automatically overwrite your submitted species - interpretation of the results is important, especially as spell_check may occasionally mismatch material.
Directions
Directions to a parking spot can be acquired from Google Maps; however, this assumes the location can be driven to in the first place. This may require interactive alterations by the user.
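The key needs to be available to Sys.getenv() below; one way to provide it is to set it for the current session, as sketched here (or place a line SoS_gkey=<your key> in your ~/.Renviron so it persists between sessions).
# placeholder value; replace with your own Google Maps API key,
# or store it in ~/.Renviron instead of setting it in the script
Sys.setenv(SoS_gkey = 'your-google-maps-api-key')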
SoS_gkey <- Sys.getenv("SoS_gkey")
directions <- directions_grabber(data, api_key = SoS_gkey)
Export collections
Write out the data for use in label generation, and as a master copy for exporting data for mass upload to a database.
## Export Collections
data <- sf::st_drop_geometry(data) |>
  as.data.frame() |>
  mutate(Coordinate_uncertainty = "+/- 5m")
## mac can be weird with this file.
readr::write_excel_csv(
  data,
  file.path('..', 'labels', 'cleaned_collection_data.csv')
)
We can also export collections as either (or both!) a ‘shapefile’ or KML for use in GIS or Google Earth.
geodata_writer(data, path = 'someplace',
filename = 'Herbarium_Collections_2023',
filetype = 'kml')
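As an alternative to the helper, roughly the same KML export can be done directly with sf; a minimal sketch, where 'collections_sf' is a hypothetical stand-in for a version of the data that still carries its geometry (i.e. from before sf::st_drop_geometry() above).
# 'collections_sf' is hypothetical: an sf object that still has its geometry column
sf::st_write(collections_sf, 'Herbarium_Collections_2023.kml', driver = 'KML')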