Making Biodiversity Data Work Together

13 Nov

A landscape full of data, but not built to connect

There’s no shortage of biodiversity data. The problem is that it doesn’t come together. We have observation records, telemetry tracks, acoustic recordings, species interaction networks and eDNA samples. Each tells part of the story, but they were created separately, with different species names, timeframes, metadata formats and licensing rules. They all describe the same natural world, but not in ways that easily fit together.

You can watch the full NTC Now session with David Jirout from Illuminum and Luke Emblem from Verdant Metrics to hear more.

Why the data doesn’t line up

Observation records come from thousands of providers. These include global platforms, regional birding groups, museum archives using legacy naming systems, and long-tail spreadsheets from field teams and local clubs. Some datasets use modern taxonomies; others use names that haven’t been revised in decades. Spatial metadata may be precise, approximate or missing. Temporal structure varies wildly: daily logs, annual surveys, opportunistic sightings or a single “collected in 1998.”

Species interactions are mostly embedded in scientific papers. Large-scale extraction systems like GloBI help surface those relationships, but the output mirrors publication patterns. Some taxa dominate, synonyms cause conflicts, and certain interaction types are underrepresented.

Telemetry adds movement and behaviour, but its structure is inconsistent. Some datasets track animals hourly; others use seasonal windows. Sensor accuracy varies. Battery depletion creates gaps. Metadata isn’t standard across projects. Access depends on institutional permissions and licensing agreements more than scientific need.

Acoustic data brings continuous coverage, but its weaknesses are specific: thin training libraries in many regions; local dialect variations; overlapping calls; equipment and environmental noise sitting inside biological frequency bands; humidity-driven distortion; wind effects; reflective surfaces that alter signatures; and sensor directionality. False positives and ambiguous detections aren’t rare; they’re expected.

eDNA is powerful but far from uniform. Primer choices influence what appears. Pipelines differ between labs. Threshold decisions change species lists. Seasonality changes shedding rates. Even with good protocols, two labs can process the same sample and produce different outputs.

No single system is wrong. They simply evolved separately.

Bringing coherence to scattered datasets

Illuminum’s work on Gaia.eco tackles this fragmentation directly. It reconciles taxonomies, standardises metadata, aligns spatial formats and links observation records with species-interaction information extracted from literature. Those relationship layers serve as a plausibility filter. If a species record appears in a place where none of its associated species have ever been found, the system marks it. If the presence makes sense ecologically, it stays.
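The plausibility filter described above can be sketched in a few lines. This is a minimal illustration, not Gaia.eco's actual implementation: the `Record` type, the `INTERACTIONS` table, the site names and the `flag_implausible` function are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    species: str
    site: str


# Hypothetical interaction layer: species -> species it is associated with.
# Illustrative values only, not taken from Gaia.eco.
INTERACTIONS = {
    "Phengaris arion": {"Myrmica sabuleti"},
}


def flag_implausible(records, interactions):
    """Return records whose known associates were never recorded at the same site."""
    # Build the set of species ever recorded at each site.
    species_by_site = {}
    for r in records:
        species_by_site.setdefault(r.site, set()).add(r.species)

    flagged = []
    for r in records:
        associates = interactions.get(r.species)
        if not associates:
            # No interaction knowledge for this species: nothing to check against.
            continue
        if not (associates & species_by_site[r.site]):
            flagged.append(r)
    return flagged


records = [
    Record("Phengaris arion", "meadow"),
    Record("Myrmica sabuleti", "meadow"),
    Record("Phengaris arion", "car_park"),
]
for r in flag_implausible(records, INTERACTIONS):
    print("review:", r.species, "at", r.site)
```

In this sketch a record with no known associates is never flagged, because absence of interaction data says nothing about plausibility. A real system would presumably weigh distance, time and data density rather than rely on a strict same-site match.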

In regions with thin datasets, Gaia.eco uses interaction networks and related-species patterns to make careful inferences, but still displays confidence levels clearly so users can interpret results with appropriate caution.

Temporal misalignment across modalities is also handled intentionally. Instead of flattening data to a single timeline, Gaia.eco preserves each dataset’s natural resolution. Sporadic observations, weekly telemetry, hourly acoustics and seasonal eDNA are interpreted with their differences fully visible.
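One way to picture "preserving natural resolution" is to store every piece of evidence as a time interval instead of a single timestamp, and to query by overlap. The sketch below works under that assumption; the `Evidence` type, the sample values and the `overlapping` function are hypothetical and do not describe how Gaia.eco stores its data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    modality: str
    species: str
    start: datetime  # inclusive
    end: datetime    # exclusive


# Each modality keeps its own resolution: an hour, a season, a whole year.
EVIDENCE = [
    Evidence("acoustic", "Rana temporaria", datetime(2024, 5, 3, 2), datetime(2024, 5, 3, 3)),
    Evidence("edna", "Rana temporaria", datetime(2024, 3, 1), datetime(2024, 6, 1)),
    # A museum-style record known only as "collected in 1998".
    Evidence("observation", "Rana temporaria", datetime(1998, 1, 1), datetime(1999, 1, 1)),
]


def overlapping(evidence, start, end):
    """Return evidence whose interval overlaps the half-open window [start, end)."""
    return [e for e in evidence if e.start < end and start < e.end]


# Query one week in May 2024 without resampling any dataset.
for e in overlapping(EVIDENCE, datetime(2024, 5, 1), datetime(2024, 5, 8)):
    print(e.modality, e.end - e.start)
```

Because the interval width travels with each result, a user can see that the acoustic hit is precise to the hour while the eDNA hit only says "sometime that season".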

Telemetry is integrated where licensing allows, but this remains one of the harder streams to align. Even when access is granted, the granularity and metadata gaps force careful interpretation.

This isn’t one big dataset. It’s a system that respects complexity and makes it understandable.

Using sound when the landscape is noisy

BioBox approaches biodiversity from a different angle. Its devices record continuous sound, day and night, across weather and seasons, revealing patterns that manual surveys never capture. Migration timing, dawn chorus shifts, peaks in insect activity, nocturnal behaviour, storm-driven changes: these show up immediately in acoustic data.

But raw detection isn’t simple. Classifiers misread species when their training library is incomplete, when wind masks calls, when humidity shifts frequencies, when insects overlap with birds, when buildings reflect sound or when devices are oriented in ways that distort incoming signals.

Verification comes from comparing detections with species histories and ecological relationships. A detected species is kept if its known distribution and ecological dependencies match local records. If not, it becomes a candidate for review. When two or more modalities agree (acoustics plus observations, for example, or acoustics plus interaction networks), confidence rises. When they clash, the mismatch helps locate noise, gaps or anomalies.

This forms a practical triangulation loop, not a theoretical integration framework.
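As a rough illustration of that loop, the sketch below checks each acoustic detection against the other modalities available for a site and sorts it into "corroborated" or "review". The function name, the modality labels and the species sets are assumptions for the example, not BioBox's classifier or verification logic.

```python
def triangulate(acoustic_detections, supporting):
    """Label each acoustic detection by how many other modalities agree.

    supporting maps a modality name to the set of species it reports locally.
    """
    result = {}
    for species in sorted(acoustic_detections):
        agreeing = sorted(m for m, found in supporting.items() if species in found)
        status = "corroborated" if agreeing else "review"
        result[species] = (status, agreeing)
    return result


# Hypothetical inputs for a single site.
acoustic = {"Rana temporaria", "Hyla arborea"}
supporting = {
    "observations": {"Rana temporaria"},
    "interaction_network": {"Rana temporaria"},
}

for species, (status, agreeing) in triangulate(acoustic, supporting).items():
    print(species, status, agreeing)
```

The length of the agreeing list is a crude stand-in for confidence: more independent modalities, more trust. Detections that land in "review" are not discarded; they are the mismatches that point to noise, gaps or genuine anomalies.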

Why multiple modalities create a clearer picture

Each type of biodiversity data contributes something different:

Observations anchor species presence
Telemetry explains movement and spatial use
Acoustics reveal daily and seasonal activity
Interaction networks show ecological dependencies
eDNA provides an independent detection channel

The advantage appears in how they intersect. Agreement strengthens the signal. Disagreement highlights errors, sampling gaps, seasonal effects or classifier issues. Temporal resolution differences are kept intact so that each modality can be interpreted on its own terms.

No single stream tells the full story, but together they form something that is both richer and more stable.

What data users actually need

Land-use planners, restoration teams and local governments rarely need to inspect raw observation tables or acoustic probability arrays. They need outputs that tell them:

which species are present
how they move
how active they are
what ecological relationships matter locally

The real obstacles tend to be practical: inconsistent file standards, missing metadata, unclear licensing, incompatible formats and datasets that can’t be loaded into the systems planners use. A multimodal layer that reconciles taxonomies, metadata and spatial structure removes most of that friction.

Where this work is heading

The field is shifting away from the idea of a single, unified biodiversity platform. A more realistic direction is a federated one: specialised systems that exchange enough structure to reinforce each other. Illuminum is expanding long-tail data sources, refining reconciliation pipelines and deepening ecological interaction layers. BioBox is working on strengthening training libraries, improving classifier stability and refining how temporal clusters support verification.

The goal isn’t to merge datasets but to let different systems collaborate.

Joining the parts without flattening them

The biodiversity community already has the data it needs. The challenge is making those datasets usable together without erasing their differences. Observations, telemetry, acoustics, interactions and eDNA capture different angles of the same ecosystems. When aligned thoughtfully, they create a more grounded, reliable understanding of biodiversity than any single dataset can achieve.

The strength lies in connecting the parts, not simplifying them.

For anyone who wants to dig deeper into the underlying data work, Gaia.eco is a good place to see how these layers come together.

Megha Chadha