A Practical Guide to Retrieving SSURGO Soil Data in R

Last updated on Feb 18, 2026 R, spatial, soil data, SSURGO

Working with agricultural data often requires accurate soil information. Instead of manually searching soil survey data from Web Soil Survey, we can access the SSURGO** (Soil Survey Geographic Database) through the soilDB package in R using spatial queries.

This guide shows how to:

Query the SSURGO database for a specific point
Retrieve soil attributes like texture, bulk density, soil series classification, soil depth, soil organic matter and water holding capacity
Write a reusable function
Apply it to a whole data set
Visualize results spatially

Required Packages

We’ll need the following R packages installed and loaded:

These libraries enable querying soil databases (soilDB), manipulating data (dplyr), and handling spatial formats (sf).

Step 1: Retrieve Soil Data for a Single Point

Start with a test location. We will use latitude and longitude to generate a spatial point in Well-Known Text (WKT) format and query the SDA (Soil Data Access) API for its MUKEY, a key that uniquely identifies a soil map unit.

# Coordinates for the example point (a point roughly 19 miles northwest of Fargo, ND)
lon <- -97.0
lat <- 47.0
my_point <- sprintf("POINT(%f %f)", lon, lat)

# Get MUKEY (Map Unit Key)
sql <- sprintf("SELECT mukey FROM SDA_Get_Mukey_from_intersection_with_WktWgs84('%s')", my_point)
mukey_result <- SDA_query(sql)

## single result set, returning a data.frame

mukey <- mukey_result$mukey[1]
print(mukey)

## [1] 2521602

This code:

Formats the point as WKT, well-known text, a standard format for representing geographic data
Sends a SQL query to the Soil Data Access API
Extracts the Map Unit Key (“mukey”) for the location

What does mukey[1] retrieve?

When we query the SSURGO database using spatial coordinates, the result may include more than one map unit (MUKEY), especially if the point falls on the border of two or more soil units. The result is returned as a data frame. To keep things simple, we just select the first MUKEY using:

mukey <- mukey_result$mukey[1]

Step 2: Retrieve Soil Properties by MUKEY

Once we have the mukey, we can extract soil attributes using a SQL query that joins multiple SSURGO tables.

sql_comp <- sprintf("
  SELECT c.mukey, c.compname, c.comppct_r, ch.hzdept_r, ch.hzdepb_r,
         ch.awc_r, ch.om_r, ch.dbthirdbar_r
  FROM component c
  INNER JOIN chorizon ch ON c.cokey = ch.cokey
  WHERE c.mukey = '%s'", mukey)

Understanding the SQL query:

SELECT variables

component table, ‘c': a table with general information about each soil component (name, percentage of map unit, and unique IDs). Variables from that table are appended onto this table name (e.g. ‘c.mukey’)
- compname: soil series name
- comppct_r: percentage this component contributes to the map unit
chorizon table, ‘ch': Contains detailed data about soil horizons (layers) within each component
- hzdept_r and hzdepb_r: top and bottom depths of each horizon
- awc_r: available water capacity (how much water the soil can store)
- om_r: organic matter content
- dbthirdbar_r: bulk density (mass per volume, dry weight)

FROM the component table ‘c’, do the following action
INNER JOIN: merge the component table to the horizon table using ‘cokey’, the component key. This allows us to pull horizon-level data for each component.
WHERE clause: Filters results to only the specified map unit key (mukey).

The result is a long data frame where each row represents a single horizon from a component within the selected map unit.

Now we summarize this data in R to make it easier to interpret:

soil_properties <- SDA_query(sql_comp)

## single result set, returning a data.frame

soil_summary <- soil_properties |>
  mutate(thickness = hzdepb_r - hzdept_r) |>
  group_by(mukey) |>
  summarise(
    soil_series    = first(compname),
    soil_depth_cm  = max(hzdepb_r, na.rm = TRUE),
    awc            = round(
      sum(awc_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(awc_r)], na.rm = TRUE),
      2),
    organic_matter = round(
      sum(om_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(om_r)], na.rm = TRUE),
      2),
    bulk_density   = round(
      sum(dbthirdbar_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(dbthirdbar_r)], na.rm = TRUE),
      2),
    .groups = "drop"
  )

print(soil_summary)

## # A tibble: 1 × 6
##     mukey soil_series soil_depth_cm   awc organic_matter bulk_density
##     <int> <chr>               <int> <dbl>          <dbl>        <dbl>
## 1 2521602 Fargo                 200  0.13           1.48         1.33

What this code does:

mutate(thickness = hzdepb_r - hzdept_r): Computes the thickness of each horizon in cm, which is used as the weight in subsequent calculations
group_by(mukey): Groups data by map unit (usually just one in this context)
soil_series = first(compname): Uses the first component name as a representative soil series (we could refine this by selecting the most dominant component based on ‘comppct_r’)
soil_depth_cm = max(hzdepb_r): Calculates total depth by finding the bottom of the deepest horizon
awc, organic_matter, bulk_density: Computed as thickness-weighted means, so that thicker horizons contribute more to the summary value than thin ones. The denominator includes only horizons with non-missing values for each variable, preventing NAs from biasing the weights
round(): Makes outputs cleaner (e.g., 1.47 instead of 1.472365)
.groups = "drop": Ensures the result is a flat data frame, not a grouped tibble

This collapses multiple horizon rows into one summary row. The thickness-weighted mean is a principled default, but the right aggregation method depends on our goal, for example, we might restrict to a fixed depth interval (e.g. 0–30 cm) or weight by component percentage across map units rather than by horizon thickness within a component.

Step 3: Automate Soil Data Retrieval with a Custom Function

Repeating the same queries for every location is tedious and error-prone. Instead, we’ll write a custom function that takes any point with a defined longitude and latitude and returns soil data for that location. This makes it trivial to process many locations at once.

get_soil_info_from_coords <- function(ID, lon, lat) {
  tryCatch({
    # Step 1: Convert coordinates to WKT and get MUKEY
    wkt <- sprintf("POINT(%f %f)", lon, lat)
    sql_mukey <- sprintf(
      "SELECT mukey FROM SDA_Get_Mukey_from_intersection_with_WktWgs84('%s')", wkt
    )
    mukey_result <- SDA_query(sql_mukey)
    if (nrow(mukey_result) == 0) return(NULL)
    
    mukey <- mukey_result$mukey[1]
    
    # Step 2: Query soil properties for this MUKEY
    sql_comp <- sprintf("
      SELECT c.mukey, c.compname, ch.chkey, ch.hzdept_r, ch.hzdepb_r,
             ch.awc_r, ch.om_r, ch.dbthirdbar_r, tg.texture
      FROM component c
      INNER JOIN chorizon ch ON c.cokey = ch.cokey
      LEFT JOIN chtexturegrp tg ON ch.chkey = tg.chkey
      WHERE c.mukey = '%s'", mukey)
    
    soil_props <- SDA_query(sql_comp)
    if (nrow(soil_props) == 0) return(NULL)
    
    # Step 3: Extract surface texture (shallowest horizon)
    surface_texture <- if ("texture" %in% names(soil_props)) {
      soil_props |>
        filter(hzdept_r == min(hzdept_r, na.rm = TRUE)) |>
        pull(texture) |>
        first()
    } else {
      NA_character_
    }
    
    # Step 4: Summarize all properties using thickness-weighted means
    soil_props |>
      mutate(thickness = hzdepb_r - hzdept_r) |>
      summarise(
        ID              = ID,
        soil_series     = first(compname),
        surface_texture = surface_texture,
        soil_depth_cm   = max(hzdepb_r, na.rm = TRUE),
        awc             = round(
          sum(awc_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(awc_r)], na.rm = TRUE), 2),
        organic_matter  = round(
          sum(om_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(om_r)], na.rm = TRUE), 2),
        bulk_density    = round(
          sum(dbthirdbar_r * thickness, na.rm = TRUE) / sum(thickness[!is.na(dbthirdbar_r)], na.rm = TRUE), 2)
      )
  }, error = function(e) {
    message(sprintf("Error processing observation %s: %s", ID, e$message))
    return(NULL)
  })
}

How this function works:

Step	Action
Inputs	Takes three inputs: a unique ID, and geographic coordinates (lon, lat)
WKT Conversion	Converts lat/lon to spatial format (“POINT(lon lat)") understood by the SSURGO API
MUKEY Query	Retrieves the map unit key for that location
Soil Properties Query	Fetches detailed horizon information from SSURGO’s component and horizon tables
Surface Texture	Extracts the surface horizon’s texture (the shallowest layer) if available
Summarization	Calculates thickness-weighted means for AWC, organic matter, and bulk density, plus total soil depth
Output	Returns a single-row data frame with all soil information for one location
Error Handling	If anything fails (no mukey found, failed query, etc.), it returns NULL and prints a helpful message

By encapsulating this workflow into a function, we can easily apply the same logic to many locations using a simple loop.

Step 4: Apply function to Real Metadata (Dataset of Locations)

Now assume we have a data set called “my_data: with 3 variables: “ID” (a unique string), “trial_latitude”, and “trial_longitude” (both numeric variables constrained to earthly longitudes and latitudes).

my_data <- tibble::tibble(
  ID              = c("T001", "T002", "T003", "T004", "T005"),
  trial_latitude  = c(47.12, 47.35, 47.55, 47.25, 47.45),
  trial_longitude = c(-97.45, -97.20, -97.30, -97.15, -97.40)
)
head(my_data)

## # A tibble: 5 × 3
##   ID    trial_latitude trial_longitude
##   <chr>          <dbl>           <dbl>
## 1 T001            47.1           -97.4
## 2 T002            47.4           -97.2
## 3 T003            47.6           -97.3
## 4 T004            47.2           -97.2
## 5 T005            47.4           -97.4

We can now apply the get_soil_info_from_coords() function to each row:

soil_results_list <- lapply(1:nrow(my_data), function(i) {
  row <- my_data[i, ]
  get_soil_info_from_coords(
    ID  = row$ID,
    lon = row$trial_longitude,
    lat = row$trial_latitude
  )
})

## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame
## single result set, returning a data.frame

# Combine all the rows of results into a single data frame
soil_data <- do.call(rbind, soil_results_list)

This loop processes each location and returns a list of results. We then stack them into one data frame using do.call(rbind, ...).

Step 5: Merge Soil Queries with Metadata

Now that we’ve gathered soil metrics for each location, merge them back with our original dataset:

my_data_with_soil <- my_data |>
  left_join(soil_data, by = "ID")

Now our metadata includes matched soil metrics like texture, available water content, organic matter, and bulk density for each geolocation.

Step 6: Practice with Simulated Soil Data

Just want to try the visualizations quickly, here’s a sample dataset with realistic values based on SSURGO outputs.

soil_data <- tibble::tibble(
  ID              = c("T001", "T002", "T003", "T004", "T005"),
  trial_latitude  = c(47.12, 47.35, 47.55, 47.25, 47.45),
  trial_longitude = c(-97.45, -97.20, -97.30, -97.15, -97.40),
  soil_series     = c("Barnes", "Fargo", "Bearden", "Barnes", "Fargo"),
  surface_texture = c("silty clay loam", "clay loam", "loam", "silt loam", "clay loam"),
  soil_depth_cm   = c(150, 180, 120, 160, 170),
  awc             = c(0.15, 0.17, 0.13, 0.16, 0.18),
  organic_matter  = c(3.2, 4.0, 2.8, 3.5, 3.8),
  bulk_density    = c(1.3, 1.2, 1.4, 1.3, 1.25)
)

Step 7: Creating Summary Visualizations

These visualizations help explore how soil properties vary across trials and between soil series.

Bar Chart of Total Observations per Soil Series

ggplot(soil_data, aes(x = soil_series)) +
  geom_bar(fill = "#a47764", width = 0.7) +
  labs(title = "Number of Trials per Soil Series",
       x = "Soil Series", y = "Trial Count") +
  theme_minimal(base_size = 15)

Point Plot: Available Water Capacity by Observation

ggplot(soil_data, aes(x = ID, y = awc)) +
  geom_point(size = 5, fill = "#b1d1b2", shape = 23) +
  labs(title = "AWC by Trial ID", x = "Trial ID", y = "AWC (cm/cm)") +
  theme_minimal()

Boxplot: Soil Depth by Soil Series

ggplot(soil_data, aes(x = soil_series, y = soil_depth_cm)) +
  geom_boxplot(fill = "#e6d9d4") +
  labs(title = "Soil Depth by Soil Series",
       x = "Soil Series", y = "Depth (cm)") +
  theme_minimal(base_size = 15)

Boxplot: Organic Matter by Soil Series

ggplot(soil_data, aes(x = soil_series, y = organic_matter)) +
  geom_boxplot(fill = "#cdb4a9") +
  labs(title = "Organic Matter by Soil Series",
       x = "Soil Series", y = "Organic Matter (%)") +
  theme_minimal(base_size = 15)

Bar Chart: Surface Texture Distribution

soil_data |>
  count(surface_texture) |>
  ggplot(aes(x = reorder(surface_texture, -n), y = n)) +
  geom_col(fill = "#ef5919", width = 0.7) +
  labs(title = "Surface Texture Distribution",
       x = "Surface Texture", y = "Number of Trials") +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

Step 8: Mapping Soil Variables Spatially

Geographic visualization reveals spatial patterns in soil properties. The following maps show where different soil conditions exist across our study area.

Map: Locations by Soil Series

For this example, we’ll use the built-in maps package to draw a basic background map. You can customize the region argument to fit your study area.

base_map <- map_data("state", region = c("north dakota"))

ggplot() +
  geom_polygon(data = base_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "gray") +
  geom_point(data = soil_data,
             aes(x = trial_longitude, y = trial_latitude, color = soil_series),
             size = 4, alpha = 0.9) +
  coord_fixed(1.3) +
  scale_color_brewer(palette = "Accent") +
  labs(title = "Sample Locations by Soil Series",
       x = "Longitude", y = "Latitude", color = "Soil Series") +
  theme_minimal(base_size = 15)

This map shows the geographic distribution of our trials and helps identify any spatial clusters of particular soil types.

Map: Soil Depth Across Locations

ggplot() +
  geom_polygon(data = base_map,
               aes(x = long, y = lat, group = group),
               fill = "white", color = "gray") +
  geom_point(data = soil_data,
             aes(x = trial_longitude, y = trial_latitude, color = soil_depth_cm),
             size = 4, alpha = 0.8) +
  coord_fixed(1.3) +
  scale_color_gradient(low = "orange", high = "darkblue") +
  labs(title = "Soil Depth at Trial Locations",
       x = "Longitude", y = "Latitude", color = "Depth (cm)") +
  theme_minimal(base_size = 15)

This continuous color gradient makes it easy to spot regions with shallower or deeper soils.

Final Thoughts

By wrapping the query logic in a function, we save time and reduce errors when extracting data for many locations. This method is especially useful for agronomic field trials, ecological sampling, and geospatial modeling projects.