Unified Manufacturing Correlation + Streaming ML

AEGIS · FORESIGHT

Fuse high-frequency PLC telemetry with live MES work orders — every bolt torqued, every firmware flashed, every VIN correlated in real time. Then predict the defect before the car leaves the station.

Rust Edge Gateway Go Microservices NATS JetStream ClickHouse OLAP ONNX Inference React Dashboard GraphQL · REST · SSE · gRPC
10 ms
PLC sample interval
5
Concurrent enrichers
500
Row ClickHouse batches
0
Downtime model hot-swap
02 / 08
Problem Statement

The ISA-95 pyramid is a data silo

Traditional factory software passes data sequentially up a rigid hierarchy. By the time a torque anomaly reaches an enterprise dashboard, it has lost its link to the specific VIN being built.

Level 5 · Enterprise / ERP — days of latency
Level 4 · MES · Work Orders · Firmware — minutes
Level 3 · SCADA / DCS ◄── the silo wall
Level 1 · PLC / Sensors — raw truth @ 10 ms

🔒 Siloed Context

Edge telemetry and MES state live in separate systems. Root-cause analysis becomes a forensic nightmare.

⏱ Late Detection

Quality issues surface after the vehicle leaves the station — expensive rework and recall risk.

✅ Aegis Collapses the Stack

Edge and MES publish to a shared event stream. Correlation joins them in memory — enriched telemetry in milliseconds, not minutes.

03 / 08
Core Idea

Stream–table join + streaming inference

Every PLC reading is a point in time. Every MES update is a slow-moving fact. Aegis performs an in-memory join before data touches the database.

BEFORE — raw edge telemetry

{
  "station_id": "5",
  "torque":     43.21,
  "timestamp":  1714234567890
}

AFTER — enriched in ClickHouse

{
  "station_id": "5",
  "torque":     43.21,
  "timestamp":  1714234567890,
  "vin":        "1HGBH41JXMN100005",
  "firmware":   "v2.1.4"
}

🧠 Foresight — defect alert before the station clears

When anomaly score exceeds threshold, a DefectAlert publishes to NATS and streams to the dashboard via SSE:

{
  "station_id": "5", "anomaly_score": 0.97,
  "trigger_torque_nm": 61.8, "timestamp": 1714234567890
}
04 / 08
Architecture

Event-driven correlation core

Edge, MES, correlation, inference, and API layers communicate exclusively through NATS JetStream — independent durable consumers, no coupling.

Edge (Rust) ──► aegis.telemetry.raw ──┬──► Correlation Worker ──► ClickHouse │ MES (Go) ──────► aegis.mes.state ────┤ (RWMutex station cache) │ └──► Inference Worker ──► aegis.defect.alerts ──► SSE Dashboard API Gateway (:8081) ├── REST /auth/token · /webhooks/mes ├── GraphQL /graphql → ClickHouse queries ├── SSE /api/v1/alerts/stream └── gRPC :9090 AnomalyScorer.ScoreTelemetryBatch

🦀 Edge Gateway

Deterministic PLC mock at up to 1 kHz. Publishes torque samples to JetStream.

🐹 Correlation Worker

5 enricher goroutines · stream–table join · 500-row batch writer to ClickHouse.

🧠 Inference Worker

Hot-swap ONNX scorer under RWMutex. Fail-open — never stops the line.

05 / 08
Data Flow

Seven steps from PLC to alert

End-to-end path from raw sensor sample to live dashboard alert — all under a second in dev, sub-second at scale.

Edge Publish
20 samples/sec → raw topic
MES State
VIN + firmware → cache
Enrich
5 goroutines join
ClickHouse
250 ms flush
Score
Foresight inference
Alert
NATS → SSE
Dashboard
React live UI

Query that becomes trivial

"Show every torque reading > 50 Nm for VINs built with firmware v2.1.4 that later reported a steering defect."

One ClickHouse SELECT — because enrichment happened at ingest time.

Why NATS JetStream?

Durable at-least-once delivery with low ops overhead. Independent consumers for correlation and inference advance at their own pace — no coupling.

06 / 08
Technology

Polyglot stack, single event bus

LayerTechnologyRole
EdgeRust · Tokio · async-natsPLC mock → JetStream, graceful SIGINT
BrokerNATS JetStreamAEGIS stream on aegis.> · 24h retention
MESGo · pgx/v5 · PostgreSQL 16Work order CRUD + state publisher
CoreGo · sync.RWMutex · ClickHouse 24Stream–table join · OLAP storage
MLGo inference + Python trainingONNX hot-swap · z-score in dev
APIGo · GraphQL · SSE · gRPCPolyglot gateway on :8081 / :9090
UIReact 18 · TypeScript 5 · ViteMES snapshot · live SSE alerts

GraphQL for UI

Supervisors need varying detail — no over-fetching on telemetry queries.

REST for webhooks

Legacy Level-4 MES systems ingest state via JSON-over-HTTP.

SSE for alerts

One-direction server→browser. No WebSocket complexity for defect streams.

07 / 08
Foresight MLOps

Online scoring + offline training

Inference runs continuously on live telemetry. A nightly Python job retrains from ClickHouse labels and exports ONNX to S3 — zero-downtime hot-swap every 5 minutes.

Model Registry (S3 / MLflow) │ poll every 5 min ▼ Inference Worker ┌─────────────────────────────────┐ │ scorer.Swap(newConfig) │ ← write lock (µs) │ sync.RWMutex hot-swap │ │ scorer.Score(torqueReadings) │ ← read lock (concurrent) └─────────────────────────────────┘ │ score > 0.95 ▼ aegis.defect.alerts → SSE → Dashboard

Fail-open by design

The ML model is experimental. A scoring bug must never stop the assembly line. NATS ack happens before scoring — errors are logged, production continues.

gRPC for batch scoring

Protobuf binary encoding for 1,000-metric batches — smaller and faster than JSON arrays over HTTP.

08 / 08
Summary

Built to answer the question a quality engineer
should never struggle to ask

"What was happening to VIN 1HGBH41JXMN100042 — physically and in software — at 09:22:47?
And did Foresight see it coming?"

github.com/vgandhi1/aegis make infra-up && make web Docker Compose included
View Repository →