Sensor Integration

From SCADA Historian to Digital Twin: Building a Reliable Data Pipeline

Lars Bergstrom
Data pipeline architecture diagram from SCADA historian to digital twin simulation

The question we hear in early conversations with process plants is usually some version of: "We already have all our data in the historian — how hard is it to hook that up to a digital twin?" The honest answer: connecting the historian is a few hours of configuration work. Building a pipeline that actually delivers clean, reliable, correctly-timed data to a live simulation model is a 2–4 week engineering effort on a first deployment, and most of that time is spent on problems that aren't obvious until you start working with the data.

The Four Problems You Will Hit

1. Timestamp Drift and Time Zone Ambiguity

Process historians log data with timestamps. Simple in principle. In practice, a historian deployment that has been running for years may have timestamps from multiple sources: DCS controllers that were set to local time, field devices that transmit in UTC, an intermediate data aggregator layer that's on a different NTP server, and a historian server clock that has drifted. When you pull tags and compare timestamps, you may find that tag A's readings are consistently 3 seconds behind tag B's readings even though they're measuring the same event from different sensors on the same vessel.

For a digital twin's state estimator, this matters. If the inlet temperature reading arrives 15 seconds late relative to the flow rate reading it is paired with, the heat balance at each update cycle is computed from misaligned data. For process variables that change slowly (ambient temperature, steam header pressure), a 15-second misalignment is irrelevant. For variables in fast-responding loops (feed flow rate, control valve positions), it is large enough to distort the balance.

The fix: establish a single canonical timestamp reference for the pipeline. Every data point that enters the twin's ingestion layer must have its timestamp normalized to UTC and verified against a reference. Where timestamp-source consistency can't be guaranteed, build a jitter buffer — a short holding window that waits for lagging tags before computing the state update, at the cost of a few seconds of additional latency.
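A minimal sketch of the normalize-then-buffer idea, under stated assumptions: the class name `JitterBuffer`, the 5-second hold, the 15-second freshness limit, and the fixed per-source UTC offset are all illustrative choices, not part of any historian API. A fixed offset ignores daylight saving, so a real deployment would likely need a proper time zone database for local-time sources.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


def to_utc(ts: datetime, source_utc_offset_hours: float = 0.0) -> datetime:
    """Return ts in UTC. Naive timestamps are assumed to carry the source's fixed offset."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone(timedelta(hours=source_utc_offset_hours)))
    return ts.astimezone(timezone.utc)


@dataclass
class JitterBuffer:
    """Holds the latest sample per tag and releases a snapshot once all tags are in."""
    required_tags: frozenset
    hold: timedelta = timedelta(seconds=5)       # how long to wait for lagging tags
    max_age: timedelta = timedelta(seconds=15)   # older samples do not count as fresh
    latest: dict = field(default_factory=dict)   # tag -> (utc timestamp, value)

    def push(self, tag: str, ts_utc: datetime, value: float) -> None:
        prev = self.latest.get(tag)
        if prev is None or ts_utc >= prev[0]:    # ignore out-of-order older samples
            self.latest[tag] = (ts_utc, value)

    def snapshot(self, cycle_ts: datetime, now: datetime) -> Optional[tuple]:
        """Return (values, missing_tags), or None while still inside the hold window."""
        fresh = {
            tag: value
            for tag, (ts, value) in self.latest.items()
            if tag in self.required_tags and ts >= cycle_ts - self.max_age
        }
        missing = self.required_tags - set(fresh)
        if missing and now < cycle_ts + self.hold:
            return None                          # keep waiting for lagging tags
        return fresh, missing
```

When the hold window expires with tags still missing, the caller receives the partial snapshot together with the list of missing tags and can decide whether to skip the update or run it with reduced confidence.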

2. Historian Compression and Data Gaps

Process historians use compression to manage storage. The most common approach is exception-based recording (EBR): a new data point is only stored when the value changes by more than a configured deadband. For a temperature tag configured with a 0.5°C deadband, no new record is written until the temperature moves 0.5°C from the last stored value — which could mean minutes or hours between stored records when the process is stable.

For a digital twin that runs an update cycle every 15 seconds, you need a value for every tag at every update step. The historian's query interface will typically interpolate between stored records to produce a value at a specific timestamp — this is a "stepped" interpolation (hold-last-value) or "linear" interpolation depending on the tag configuration. For most process variables, linear interpolation between compressed records is appropriate. For discrete state variables (like valve open/closed status), you want hold-last-value.
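To make the difference between the two interpolation modes concrete, here is a small sketch that evaluates a compressed series at an arbitrary update time. The function name `value_at` and the `"linear"`/`"step"` labels are illustrative, not a historian query API; timestamps are plain seconds for brevity.

```python
from bisect import bisect_right


def value_at(times, values, t, method="linear"):
    """Evaluate a compressed series at time t.

    times must be sorted ascending. Returns None if t precedes the first record.
    After the last record there is nothing to interpolate toward, so the last
    stored value is held in both modes.
    """
    i = bisect_right(times, t) - 1          # index of the last record at or before t
    if i < 0:
        return None
    if method == "step" or i == len(times) - 1:
        return values[i]                    # hold-last-value
    t0, t1 = times[i], times[i + 1]
    return values[i] + (values[i + 1] - values[i]) * (t - t0) / (t1 - t0)


# Two stored records 60 s apart, sampled on a 15 s update grid.
times, temps = [0.0, 60.0], [100.0, 101.0]
grid = [value_at(times, temps, t) for t in (0.0, 15.0, 30.0, 45.0, 60.0)]
```

Note that in live operation the "next" record does not exist yet, so linear interpolation only applies when replaying history; at the live edge both modes reduce to holding the last stored value.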

More problematic are data gaps — periods where no data is recorded because an instrument was offline, a communication link failed, or the historian server was restarted. These gaps can be minutes to hours long. When a gap occurs, the twin's pipeline needs to detect it (not just interpolate smoothly through missing data) and respond appropriately: either freeze the twin's last known state, flag "low data confidence" on the prediction output, or, for short gaps, apply a physics-based imputation (the model itself predicts what the missing tag value "should" be based on other constraints).
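One way to sketch the detect-then-respond logic is shown below. It assumes the historian is configured to write at least one record per tag within some maximum interval even when the value is stable; without such a setting, a long quiet period under exception-based recording is indistinguishable from a gap by timing alone. The thresholds and the ordering of the responses (impute, then flag, then freeze) are illustrative assumptions, not values from a specific deployment.

```python
from enum import Enum


class GapAction(Enum):
    NONE = "none"                      # data is current; nothing to do
    IMPUTE = "impute"                  # short gap: physics-based imputation
    LOW_CONFIDENCE = "low_confidence"  # medium gap: keep running, flag the outputs
    FREEZE = "freeze"                  # long gap: hold the last known state


def find_gaps(times, max_interval_s):
    """Return (start, end) pairs where consecutive records are too far apart."""
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_interval_s]


def classify_gap(gap_s, max_interval_s, impute_limit_s=120.0, freeze_limit_s=900.0):
    """Pick a response based on gap length (all limits are hypothetical defaults)."""
    if gap_s <= max_interval_s:
        return GapAction.NONE
    if gap_s <= impute_limit_s:
        return GapAction.IMPUTE
    if gap_s <= freeze_limit_s:
        return GapAction.LOW_CONFIDENCE
    return GapAction.FREEZE


# Records every 30 s, then a 29-minute hole.
gaps = find_gaps([0, 30, 60, 1800, 1830], max_interval_s=60)
actions = [classify_gap(end - start, max_interval_s=60) for start, end in gaps]
```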

3. Unit Conversion Mismatches

This is the most embarrassing category of bug to find late in a project, and it is extremely common. A DCS tag named "FT-201 Feed Flow" stores values in m³/h because that's what the primary element measures. The historian stores raw engineering units from the DCS. The simulation model's mass balance equation expects flow in kg/s. The pipeline must convert: multiply by the process fluid density in kg/m³ (which may be temperature-dependent) and divide by 3600. If the density is wrong (for example, a design-basis value is used after the actual fluid density has shifted with composition), the mass balance carries a proportional error. If the conversion factor is hard-coded to the wrong power of ten, the mass balance is off by an order of magnitude or more.
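The conversion itself is one line; the value of writing it as a function is that the density becomes an explicit, checkable input. The sketch below is illustrative: the function name and the plausibility band on density are assumptions, and the band would need to be set per fluid.

```python
def volumetric_to_mass_flow(q_m3_per_h, density_kg_per_m3,
                            density_band=(500.0, 1500.0)):
    """Convert volumetric flow in m3/h to mass flow in kg/s.

    density_band is a hypothetical sanity range for a liquid stream; a density
    outside it is more likely a unit or configuration error than a real value.
    """
    low, high = density_band
    if not low <= density_kg_per_m3 <= high:
        raise ValueError(f"implausible density: {density_kg_per_m3} kg/m3")
    return q_m3_per_h * density_kg_per_m3 / 3600.0


# 36 m3/h of a fluid at 1000 kg/m3 is 10 kg/s.
feed_kg_per_s = volumetric_to_mass_flow(36.0, 1000.0)
```

A density accidentally entered in g/cm³ (1.0 instead of 1000) is rejected by the band check instead of silently producing a mass flow that is a thousand times too small.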

Less obvious are temperature tags stored in °F when the plant standard is SI (common in plants with a mix of US-origin and European instruments), and pressure tags in bar(g) versus bar(a). Gauge and absolute pressure differ by atmospheric pressure, nominally 1.01325 bar, an offset that matters significantly for vapor-liquid equilibrium calculations in low-pressure systems. The tag mapping exercise, which creates the explicit mapping from historian tag name to simulation model variable, including the unit conversion, needs to be reviewed by both an instrumentation engineer (who knows what the tags actually represent) and a process simulation engineer (who knows what units the model requires).
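One defensive pattern, sketched here under assumptions, is to route every conversion through an explicit table of reviewed unit pairs so that an unmapped pair fails loudly instead of passing through unconverted. The unit strings and the `CONVERSIONS` table are illustrative, and the gauge-to-absolute entry uses the standard atmosphere rather than the site's actual barometric pressure.

```python
STANDARD_ATMOSPHERE_BAR = 1.01325  # nominal; local atmospheric pressure differs

# Only conversions that have been reviewed are listed here (illustrative set).
CONVERSIONS = {
    ("degF", "degC"): lambda t: (t - 32.0) * 5.0 / 9.0,
    ("degC", "K"): lambda t: t + 273.15,
    ("bar(g)", "bar(a)"): lambda p: p + STANDARD_ATMOSPHERE_BAR,
}


def convert(value, from_unit, to_unit):
    """Convert value between units, refusing any pair that is not in the table."""
    if from_unit == to_unit:
        return value
    try:
        fn = CONVERSIONS[(from_unit, to_unit)]
    except KeyError:
        raise ValueError(
            f"no reviewed conversion from {from_unit!r} to {to_unit!r}"
        ) from None
    return fn(value)


boiling_c = convert(212.0, "degF", "degC")       # 100.0
header_bara = convert(0.5, "bar(g)", "bar(a)")   # about 1.51 bar(a)
```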

4. Bad-Quality Values Propagating into the Model

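Every process historian has a quality or status field on data records. In OSIsoft PI, bad-status values are drawn from the System digital state set: values like "No Data," "Shutdown," "I/O Timeout," or instrument-specific codes. In other historians, it's an integer quality flag whose encoding varies by product: some use 0 = good and non-zero = some problem, while OPC-style quality codes use 192 for good. Confirm the convention for your historian before writing the filter.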

A data pipeline that doesn't filter on quality will eventually feed a "frozen" or "substituted" value into the simulation model because a sensor was replaced, a transmitter went offline, or a DCS point was put in manual with a constant value stored. A frozen sensor value looks like a steady-state measurement to the model's state estimator, which will happily reconcile all other tags against it, producing a distorted picture of the process state. The result is a twin that gives plausible-looking but subtly wrong predictions for as long as the bad-quality value persists.
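A rough sketch of the two checks this implies: reject samples whose quality code is not on an allow-list, and separately flag a signal that has stopped moving. The default of 0 = good follows the convention mentioned above and must be adjusted per historian; the window length and noise floor are hypothetical tuning values. The flatline check is only meaningful on raw (uncompressed) samples, since deadband compression legitimately suppresses small changes.

```python
def is_good(quality, good_codes=frozenset({0})):
    """True if the quality code is on the allow-list (0 = good is an assumption)."""
    return quality in good_codes


def filter_good(samples, good_codes=frozenset({0})):
    """Keep only (timestamp, value, quality) samples with acceptable quality."""
    return [s for s in samples if is_good(s[2], good_codes)]


def is_flatlined(raw_values, window, noise_floor):
    """True if the last `window` raw values vary by less than the sensor noise floor.

    A live analog instrument normally shows some noise; a spread below the
    noise floor over a full window suggests a frozen or substituted value.
    """
    recent = raw_values[-window:]
    return len(recent) == window and (max(recent) - min(recent)) < noise_floor


samples = [(0, 50.1, 0), (15, 50.3, 0), (30, 0.0, 8), (45, 50.2, 0)]
good = filter_good(samples)   # drops the sample at t=30 with quality code 8
```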

The Tag Mapping Document: Your Pipeline's Foundation

Every serious historian-to-twin integration project needs a tag mapping document — a structured table that records, for each data point the twin consumes:

  • Historian tag name (exactly as stored)
  • Instrument description (what it physically measures)
  • Historian engineering units
  • Simulation variable name
  • Simulation expected units
  • Conversion formula
  • Quality filter rules (what quality codes to reject)
  • Deadband configuration and interpolation method
  • Owner (who to contact when this tag has issues)

This document is maintained as a living configuration artifact. When instruments are replaced, when the DCS is upgraded, or when tag names change in a controls system revision, the tag mapping document and the pipeline configuration must be updated together. The twin's reliability over a multi-year deployment depends directly on the quality of tag mapping maintenance.
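If the mapping is kept as machine-readable configuration rather than a spreadsheet, each row might look something like the record below. This is a sketch only: the class name, field names, and every value in the example entry (the simulation variable name, the 850 kg/m³ density, the owner) are hypothetical placeholders that mirror the list above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TagMapping:
    historian_tag: str                  # exactly as stored in the historian
    description: str                    # what the instrument physically measures
    historian_units: str
    sim_variable: str
    sim_units: str
    convert: Callable[[float], float]   # historian units -> simulation units
    reject_quality: frozenset           # quality codes or states to reject
    deadband: float                     # in historian units
    interpolation: str                  # "linear" or "step"
    owner: str                          # who to contact when this tag has issues

    def __post_init__(self):
        if self.interpolation not in ("linear", "step"):
            raise ValueError(f"unknown interpolation: {self.interpolation!r}")


# Hypothetical entry for the feed flow example; density is a placeholder value.
FT_201 = TagMapping(
    historian_tag="FT-201",
    description="Feed Flow",
    historian_units="m3/h",
    sim_variable="feed_mass_flow",
    sim_units="kg/s",
    convert=lambda q: q * 850.0 / 3600.0,
    reject_quality=frozenset({"No Data", "Shutdown", "I/O Timeout"}),
    deadband=0.5,
    interpolation="linear",
    owner="instrumentation-team",
)
```

Keeping the conversion next to the units it connects means a reviewer can check all three in one place, and a validation step at load time catches typos before they reach the model.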

Backfill and Catchup Logic

One additional requirement for production deployments: what happens when the twin's data pipeline is interrupted — scheduled maintenance, an edge agent restart, a network blip — and then recovers? The twin needs to reconcile its model state against the historian data from the gap period before resuming live predictions.

The backfill procedure runs historical data through the simulation in compressed time (running the ODE integration faster than real-time, using the stored historian data as inputs) to advance the model state from its last known good state to the current time. For most process plants, a 1-hour gap can be replayed at 10x real time in about 6 minutes of compute time. The twin then resumes live prediction with a correctly reconciled model state rather than starting from a stale or default initial condition.
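The arithmetic and the replay loop can be sketched as follows. The `step` and `inputs_at` callables stand in for the twin's integrator and a historian query; both are hypothetical interfaces, as is the 15-second step. One detail worth noting is that live data keeps accumulating while the replay runs, so full catch-up takes slightly longer than the replay of the original gap.

```python
def replay_seconds(gap_s, speedup):
    """Wall-clock time to replay a gap of gap_s seconds at the given speedup."""
    return gap_s / speedup


def catchup_seconds(gap_s, speedup):
    """Wall-clock time until the model reaches live time.

    Model time advances `speedup` seconds per wall second while live time
    advances 1, so the gap closes at (speedup - 1) seconds per wall second.
    """
    if speedup <= 1:
        raise ValueError("speedup must exceed 1 to ever catch up")
    return gap_s / (speedup - 1.0)


def backfill(state, step, inputs_at, t_start, t_end, dt=15.0):
    """Advance the model from t_start to t_end using stored inputs, without waiting."""
    t = t_start
    while t < t_end:
        h = min(dt, t_end - t)              # shorten the final step if needed
        state = step(state, inputs_at(t), h)
        t += h
    return state


# A 1-hour gap at 10x: 360 s to replay the gap, 400 s to be fully caught up.
replay = replay_seconds(3600.0, 10.0)
catchup = catchup_seconds(3600.0, 10.0)
```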

Building the pipeline right isn't glamorous work — it's the plumbing behind the twin's predictions. But a twin whose predictions are unreliable because its data pipeline delivers bad data at the wrong times is worse than no twin at all. The pipe has to be clean before the model can be trusted.