R Data Science: Spatial Data Science I

R for Data Science

Reed Benkendorf

November 21, 2024

Assigned Reading: Geocomputation with R Chapter 2.

What is a Geographic Information System?

What is a GIS

  • a Geographic Information System is a system for producing, managing, displaying, and analyzing geographic information.
  • GIS do not require computers, but are often synonymous with software
  • Spatial Data Science is an emergent field which utilizes data science approaches in a GIS, and is a natural extension of a GIS

A GIS (Anne Sexton)

A Brief History

  • 1854 Cholera outbreak in Soho London kills 616 people
  • Dr. John Snow used both maps and statistics to identify the source (water pumps) and stop the outbreak.
  • foundation of both Epidemiology & GIS.

The Broad Street Pump (John Snow/ National Geographic)
  • Took until the 1990s, when compute power become strongrer, for GIS to flourish.

What is Spatial Data Science?

GIS has always had big data, and distinct statistics…


  • Geographic insights, and geospatial analyses to big tabular data sets
  • Including spatial terms in statistical models


  • Advanced statistics ML, and exploratory processes to develop hypotheses
  • Feature engineering of spatial products

Why use R as a GIS and for Spatial Data Science ?

  • Work flow automation
  • Rich ecosystem (packages, functions, code-sharing, etc.)
  • Reproducible
  • Self documenting
  • Computationally efficient
  • Parallel processing/HPC interfaces

Why use RStudio as an IDE

  • the rich ecosystem of geospatial tools extends beyond R!
  • can use (and document) shell scripts used to download and pre-process data
  • can integrate with python
  • can even save and document code used in other GIS systems (e.g. GRASS)
  • tie together with rmarkdown! ;-)

Geodesy

‘Geodesy is the science of accurately measuring and understanding the Earth’s geometric shape, orientation in space, and gravity field.’ - NOAA

Earth is not a perfect sphere

  • Circumference at equator: 40,075 km (24,901 mi)
  • Circumference along meridians: 40,009 km (24,860 mi)
  • Certain areas depressed (e.g. Indian Ocean) others raised (e.g. Europe)

Blue Marble 2012 (Suomi NPP)

The Earth can be represented as an Ellipsoid

  • A simple 3d geometric shape
  • An ellipsoid is a slightly to greatly ovaliform shape
  • Modeling the Earth as an ellipsoid increases the accuracy of point locations

Ellipsoids

However, the Earth’s surface is not smooth - Geoid

  • Types of models to represent planet Earth
  • Include gravity, excluding winds and tides
  • Very accurate since GPS

Geoid Cross Section & IGCM Geoid

Geodetic Datums

- Reference frame established to represent locations within the frame.
- Historically these were locally focused and based on Geoids.
- either horizontal (X & Y) or vertical (Z) features.
- locations are then measured in relation to the control points.

  • Components:
    • reference ellipsoid or geoid
    • origin point (from which measurements run)
    • control points very strictly measured from the origin

Number One City Datum (Cosmo1976)

Control Points Number One City Datum

::::

Geodetic Coordinates

“The geodetic (or geographic) latitude is the angle between the equatorial plane and the normal (vertical) to the ellipsoid surface at the considered point.” - Proj

  • \(\phi\) = geodetic latitude (north/south)
  • \(\lambda\) =longitude (east/west)
  • h = ellipsoidal height
  • N = Normal (plane at right angle to the ellipsoid surface)

Angles on Ellipsoid and Geodetic Coordinates (Peter Mercator)

Angles on Ellipsoid and Geodetic Coordinates (Peter Mercator)

Coordinate Notation - Trigonometric

  • Trigonometric
    • Degrees Minutes Seconds (DMS), e.g. 42°03’27.7” N
      • Sexagesimal/base 60 (think: minutes in an hour)
    • Decimal Degrees (DD), e.g. 42.05759 N
      • Decimal Fractions of a Degree (portion of 360/2)
  • DMS; seldom used in digital formats
  • DD; almost exclusively used

Coordinate Notation - Conversion

\[ \text{Latitude of Tech} = 42°03'27.7" N \]

Convert the latitude of the tech building from DMS to DD

\[ \text{Decimal Degrees} = \text{Degrees } + \frac{\text{Minutes}}{60} + \frac{\text{Seconds}}{3600} \]

Workflow

\[ 42 + (\frac{03'}{60}) + (\frac{27.7"}{3600}) \]

\[ 42 + 0.05 + 0.007694 = 42.05759 \text{N Decimal Degrees} \]

Coordinate Notation - UTM

  • Universal Transverse Mercator
    • Divides the world into 60 zones
    • Flattens each zone
    • Measures distances in Meters

Common for field work, planar measurements of meters.

UTM Zones of the Conterminous USA (ESRI)

Coordinate Reference System (CRS)

  • System for specifying a location on Earth’s surface
  • Composed of:
    • a model of earths shape (Geoid or Ellipsoid)
    • geodetic datum
    • generally also a projection

Geographic CRS (A. Krystalli)

Problems with working on flat surfaces

Going from three to two dimensions does not work well. But we have worked around this. sort of

Orange Peel (N. Belz)

Geographic & Projected Coordinate System

  • Geographic: location on a three-dimensional model of Earths Surface
  • Projected: location on a two-dimensional model of Earths Surface

Major Map Projections I

  • Thousands of projections for maps. None are perfect.
  • There are three main types of projections

Type Examples Pros Cons
Equal Area Lambert Cylindrical No distortion of area near equator Distorts area near the Poles
Equal Distance Equi-rectangular Looks good in mapping applications Distorts both shape and directions
Conformal Mercator Boundaries are accurate Distorts area near the Poles
Compromise Sensitive to area and direction Sensitive to area and direction

Major Map Projections II

Map Projections (D. Strebe)

Geodesy Takeaways

- There are various models (geoids & ellipsoids) to represent the earths shape
- Different datums are used to represent different parts of the earth.
- You will almost always use WGS 84 (based on a ellipsoid – which is fit through a special model of earths gravitational fields a geoid) NAD83 (based on a ellipsoid) for a geographic coordinate system. These +/-1 m from each other across much of North America. More useful than a geoid.
- You will usually want to use a UTM grid based on WGS or State Planes based on NAD83 for projections.
- Different coordinates notation systems are used, focus on decimal degrees.

Geographic Data Models

  • Vector Data Model represents discrete features on the planet using geometries such as: points, lines, and polygons.
  • Raster Data Model represents (usually) continuous features on the plant using continuous surfaces, like a tile.

Vector data tends to only include features of interest, e.g. bodies of water; whereas a Raster will include, explicitly, the absence of features (e.g. both water and terrestrial areas).

Geographic Data Models

Vector and Raster Data Models

Vector Data in R

  • Vector Data Model represents discrete features on the planet using geometries such as: points, lines, and polygons.

simple features - standards

  • Open Geospatial Consortium ISO 19125-1:2004: adhered to in ESRI, and GDAL.
  • Features have geometries describing their location, and properties described by attributes.
  • Polygons are composed of points, connected by straight lines.
  • Lines composing polygons cannot intersect

simple features (sf) 1 - attributes

  • Attributes of the feature, theoretically devoid of spatial context.
  • Described in text, numbers, stored in a data frame type object.
  • Essentially the fields of a data frame
TAXON DBH HEIGHT
Robinia pseudoacacia 40 24
Quercus alba 32 21

sf 2 - coordinates form a point

  • all geometries are composed of points.
  • points only require two coordinates, X & Y.
    • Y Latitude (necessary), X Longitude (necessary)
    • Z Elevation/Depth (uncommon)
    • M Time or uncertainty of measurement (uncommon)

sf 3 - Points form a sf geometry (sfg’s)

  • SFG is the spatial topology associated with a feature

  • POINT - one-dimensional location

  • LINESTRING - two points connected by a string

  • POLYGON - Sequence of points connected by strings

sf 4 - combining geometries (sfg’s)

  • Two or more points, and sets of linestrings, and polygons can be the geometries of a single feature
  • Multi(POINT), multi(LINESTRING), multi(POLYGON)
  • GEOMETRYCOLLECTION - A mixed set of geometry types in the same geometry

sf 5 - geometries and spatial information form a collection (sfc)

  • A list column of S3 type containing attributes of the Geometry/Geometries

  • Coordinate Reference System (‘crs’)

  • Precision (‘precision’)

  • Bounding box (‘bbox’)

  • Number Empty (‘n_empty’)

sf 6 - recapped

TAXON DBH HEIGHT LONG LAT geometry
Robinia pseudoacacia 40 24 -87.67607 42.05663 POINT (-87.67607 42.05663)
Quercus alba 32 21 -87.67399 42.05715 POINT (-87.67399 42.05715)
Geometry set for 1 feature 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -87.67607 ymin: 42.05663 xmax: -87.67607 ymax: 42.05663
Geodetic CRS:  WGS 84
Precision:     2 
sfc_POINT of length 1; first list element:  'XY' num [1:2] -87.7 42.1

  • We now finally have both attributes and coordinates which are forming a point geometry.
  • This is a Simple Feature.

sp

  • Predecessor to the ‘sf’ R package
  • Some spatial statistics programs still only take sp objects as input
  • A good introduction to S4 objects; popular in new spatial packages too!

sp - History

  • Released in 2005

  • First package to hold all major vector types of geometries

  • Unified spatial class allowed for support in many spatial statistics packages

  • Improved mapping of spatial objects

  • Formed the centerpiece of spatial statistics in R for over a dozen years.

sp - Spatial Classes for Topology

  • SF holds different geometries in list columns, sp has many different types of objects/classes for holding geometries.

  • SpatialMultiPoints

  • SpatialMultiLines

  • SpatialMultiPolygons

  • SpatialMultiGrids

sp - Structures of the object

  • Less detailed than sf (‘n_missing’, ‘precision’ lacking)
Formal class 'SpatialPoints' [package "sp"] with 3 slots
  ..@ coords     : num [1:15, 1:2] -106.2 -97.6 -127.3 -122.4 -88.7 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:2] "xc" "yc"
  ..@ bbox       : num [1:2, 1:2] -127.3 4.7 -88.7 98.4
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "xc" "yc"
  .. .. ..$ : chr [1:2] "min" "max"
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
  .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs"
  .. .. ..$ comment: chr "GEOGCRS[\"WGS 84\",\n    ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n        MEMBER[\"World Geodetic Sys"| __truncated__

sp - Attributes - DataFrame

  • Attachment of attributes (e.g. data frame), to a Spatial topology
  • Each S*DF is accessed the same way:
    • SPDF@data[[‘col_name’]]
    • SPDF$col_name
str(SPDF@data)
'data.frame':   15 obs. of  2 variables:
 $ temperature: num  3.72 3.74 3.5 7.07 4.81 4.47 5.93 4.61 4.34 3.96 ...
 $ aspect     : int  7 32 33 41 25 30 27 13 24 12 ...

sp overview of a Spatial*DataFrame

  • Data frame is held in a different slot from the geometry and topology
  • Data frame columns accessed via object@data[[‘colname’]] indexing

sp - Recapped

  • If you need to use spatial statistics, you will come across these objects.
  • Don’t sweat them too much; you can always convert from and to sf to run certain analyses

Raster data in R

  • Rasters are used to represent (continuous) features on the planet using grids
  • Each raster is composed of grids
  • Raster data sets are often distributed in adjoining tiles

Each cell is a tile, one raster.

Raster Components

  • Bounding Coordinate(s)
  • Cell Resolution (Size of cell)
  • Dimensions (No of cells in rows and columns)
  • Coordinate Reference System (CRS)

Note that both cell sizes, and the resolution of the values in cells can, within reason, be converted to finer and coarser resolution. We will discuss these types of calculations next class.

Example Raster 1 Create Frame

Rasters tend to confuse people. We are going to create our own example here before we talk about them much more.

Example Raster 1 Background

  • Fairy shrimp (endangered) live in bodies of water which dry out in late Spring, and refill in early Fall (playas).
  • Determine suitable habitat for fairy shrimp
  • Satellite imagery at two time points

Suprise Valley by: John Glen

Example Raster 1

empty_raster <- raster(
  # rasters have 4 bounding edges
  # Here we define each 'corner' of the raster'
    xmn = 697129.7, 
    xmx = 811775.7,
    ymn = 4388466,
    ymx = 4502382,
    
  # Here we set the number of cells 118*118
    nrows = 118, 
    ncols = 118,

  # set the rasters Coordinate Reference System
    crs = "+proj=utm +zone=10 +datum=WGS84", 
  # we are using the WGS84 ellipsoid, with a UTM planar projection 
)

Example Raster 1

The raster currently looks like this, a frame in space with a specified origin, CRS, and cells, but lacking any content (values).

Example Raster 2 Set Values

Example Matrix showing values underlaying a raster layer.
0 1 1 0 1 0 0 1
1 0 1 0 0 1 1 0
1 0 0 1 1 0 1 0
0 1 1 0 1 0 0 1
1 0 1 0 0 1 1 0
1 0 0 1 1 0 1 0

  • The values of a raster are essentially a matrix.

Example Raster 3

The width of each raster cell is: 971.57627 meters
The height of each raster cell is: 965.38983 meters
This example_raster_dec contains: 13924 elements

Raster Package - in R!

  • Does not need to load all files into active memory at once, allows ambitious projects.
  • Many functions use generics from base R

Raster Layer

A single raster layer

Raster Stack

The main use of Raster Stacks is to hold layers of similar themes.

  1. Each layer is the same variable from a different time. e.g. mean temerpature by month (12 layers per stack)
  2. Each layer is a different independent variable (theme) in an analysis. e.g. yearly mean temperature, mean precipitation, etc.

Raster Brick

  • Multiple bands of imagery held in the same file
  • The sensors of a camera, e.g. Red, Green, and Blue
  • Used for performing image classification to produce data raster layers

Terra

  • Developed by the same team as the Raster Package
  • Functionally virtually identical to Raster, but calculations run more quickly ! :-)
  • Rasterbricks/layers no longer need specification, all objects are Spatrasts
  • Will supersede Raster, but see points 1 & 2.

Cartography

  • use sf!
  • You have scripts to do many types of mapping in your course resources
  • sf objects are ggplot2 compatible
  • the order of mapping operations is more important than code

read in spatial data

  • considerably integration of sf and tidvyerse
  • allows for plotting of spatial data for maps
  • maps should provide context of studies where results vary/informed by space
data("us_states", package = "spData") # lazy load
us_states <- st_as_sf(us_states) # convert from sp to sf 

north_carolina <- read_sf( # read in vector data from disk. 
  system.file("shape/nc.shp", package = "sf")
  )

plot borders of sf

  • Map an sf object using their own geom geom_sf
ggplot(north_carolina) + 
  geom_sf() + 
  theme_bw() # we will use this theme for class, play around with others, but more
  # miniminal themes tend to be the best. It's very easy to clutter maps. 

fill sf interiors

  • Fill the interior of polygons by a variable
ggplot(north_carolina) + 
  geom_sf(aes(fill = BIR79)) + 
  # fill the interior of polygons by a variable
  theme_bw()

colour sf borders I

  • Colour the borders of each polygon in an sf object
  • But why are you doing this?
ggplot(north_carolina) +
  geom_sf(
    aes(color = BIR79),  # color the borders of polygons by a variable
    lwd = 1 # just making the borders thicker.
    )  +
  scale_color_viridis_c(option = "plasma", trans = "sqrt") +
  # just using a very colourful color scheme so you can see it. 
  theme_bw()

colour sf borders II

  • Colour the border and remove any fill from the polygon interior
ggplot(north_carolina) +
  geom_sf(
    aes(color = BIR79),
    fill = NA  
    # fill to 'NA' leaves only the borders of  polygons
    ) + 
  scale_color_viridis_c(option = "plasma", trans = "sqrt") +
  theme_bw()

fill interior & remove borders

  • Colour the interior of polygons and remove the border
  • Can be useful to de-clutter maps
ggplot(north_carolina) +
  geom_sf(
    aes(fill = BIR79),
    color = NA  # set to 'NA' to 'remove' borders. 
    ) +
  scale_color_viridis_c(
    option = "plasma", 
    trans = "sqrt" # some data have a couple very large  values which can swamp the palettes
    ) +
  theme_bw()

fill interior and colour borders

  • fill the interior of polygons and colour the borders
ggplot(north_carolina) +
  geom_sf(
    aes(fill = BIR79, 
        color = FIPS)
    ) + # both fill and color
  guides(color = 'none') + # removed the categorical legend (is too big!)
  theme_bw()

do it all at once

  • all of this could also have been done as:
ggplot() + # leave empty
  geom_sf(data = north_carolina, #  need to specify `data = object`. 
          aes(fill = BIR79, 
              color = FIPS)
          ) + # both fill and color
  guides(color = 'none') + # removed the categorical legend (is too big!)
  theme_bw()

multiple sf data sets on one map

  • Use two data sets to create one map
ggplot(us_states) + # two data sets.
  geom_sf() + # will inherit 'us_states' from plot call 
  geom_sf( # now specify the second data set
    data = north_carolina, 
    fill = 'purple', 
    # set as constant color - don't map using aes()
    color = NA 
    # this removes borders for a cleaner map. 
    ) +
  theme_bw()

order of data matters

  • Be diligent about the order of data sets
ggplot(north_carolina) + # oh no! we drew over North Carolina!
  geom_sf(fill = 'purple') +
  geom_sf(data = us_states) +
  theme_bw()

build plots from the bottom up

  • build plots and maps from the ‘bottom’ up
ggplot() + # two data sets.
  geom_sf(data = us_states, fill = NA) +
  geom_sf(data = north_carolina, fill = 'purple') +
  theme_bw()

coord_sf

  • coord_sf is a helpful modifier to geom_sf
  • Can set a CRS for the map, modify extent, and change some rendering styles
ggplot() +
  geom_sf(data = us_states) +
  geom_sf(data = north_carolina) +
  coord_sf(
    crs = 4267, # we can convert each data set to the same CRS 
    datum = NA # we can remove the grid lines/graticules from plot
    )

crop map extent

  • clip the extent of a map using the bounding box of the data set of interest
bound <- st_bbox(north_carolina) 
# retrieve the bounding box from the sfc list column

ggplot() +
  geom_sf(data = us_states) +
  geom_sf(data = north_carolina, aes(fill = BIR79)) +
  coord_sf( # use the bbox to 'crop' the extent of the map
           xlim = c(bound[1], bound[3]), 
           ylim = c(bound[2], bound[4])
           ) +
  theme_bw()

Assignments

For next Lab: Install these packages (if you have not done so already):

install.packages("sf", "raster", "terra", "sp", "tmap", "leaflet", "ggmap")

optionally install these:

install.packages('snow','parallel')

Download the labs .R script from the course website.

Optional Assignments

If you are interested in how Drone/LiDAR data are collected please check out the ‘RMBL Spatial Data Science Webinar Series’:

  • https://github.com/ikb-rmbl/SpatialDataScienceWebinars2020
  • Collecting UAS Data (Video): https://youtu.be/Pq8btEZRCvM (1.25 hours)

Assignments

For next Lecture:
Assigned Reading: Chapter 3 of Spatial Data Science

Future Bonus SDS Office Hours
Wednesday Night at 5:00 - 6:00.

Notes on this Lab
My lecture notes are in the R script I used to generate all of the novel figures for this presentation. Likewise this presentation is an .HTML file and can be launched from your computer (it was rendered directly from R using the script).

Works Cited

Anna Krystalli Accessed 01.20.2022

Geocomputation with R Accessed 01.09.2022

raster Accessed 01.09.2022

NEON Accessed 01.19.2022

sf Accessed 01.10.2022

proj Accessed 01.20.2022.

R spatial Accessed 01.11.2022

Wikipedia Accessed 01.09.2022

NOAA Accessed 01.25.2022

Packages

[[1]]
Hijmans R (2024). _raster: Geographic Data Analysis and Modeling_. R
package version 3.6-30, <https://CRAN.R-project.org/package=raster>.

[[2]]
Pebesma E, Bivand R (2005). "Classes and methods for spatial data in
R." _R News_, *5*(2), 9-13. <https://CRAN.R-project.org/doc/Rnews/>.

Bivand R, Pebesma E, Gomez-Rubio V (2013). _Applied spatial data
analysis with R, Second edition_. Springer, NY.
<https://asdar-book.org/>.

[[3]]
Pebesma E, Bivand R (2023). _Spatial Data Science: With applications in
R_. Chapman and Hall/CRC. doi:10.1201/9780429459016
<https://doi.org/10.1201/9780429459016>, <https://r-spatial.org/book/>.

Pebesma E (2018). "Simple Features for R: Standardized Support for
Spatial Vector Data." _The R Journal_, *10*(1), 439-446.
doi:10.32614/RJ-2018-009 <https://doi.org/10.32614/RJ-2018-009>,
<https://doi.org/10.32614/RJ-2018-009>.

[[4]]
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

[[5]]
Hijmans R (2024). _terra: Spatial Data Analysis_. R package version
1.7-83, <https://CRAN.R-project.org/package=terra>.