Portfolio case study · 2020–2021

CVML
Computer Vision · Machine Learning
Manufacturing Quality Inspection System

End-to-end automated visual quality inspection pipeline — from edge image capture on Raspberry Pi, through MQTT transport, to cloud ML inference on AWS EC2, returning PASS/FAIL decisions to MES and Ignition SCADA in under 4 seconds.

Raspberry Pi 4 Mosquitto MQTT AWS EC2 ResNet50 RabbitMQ MES Integration Ignition SCADA AWS S3
< 4s
End-to-End Latency
Capture → MES
94.5%
Classification Accuracy
v1.1 Post-Patch
−28%
Defect Escape Rate
Reduction
6
Inspection Stations
Phase 1 Deployment

Project Flow — Inspection Path

Visual path of an inspection from sensor trigger to MES record and SCADA update. Total latency <4 seconds.

📷
Step 1
Sensor → RPi
🖼
Step 2
Capture & Encode
📡
Step 3
MQTT Publish
Step 4
Mosquitto → EC2
🧠
Step 5
ResNet50 Inference
🐇
Step 6
RabbitMQ
📋
Step 7
MES Write
📺
Step 8
SCADA Update
Trigger → Preprocess → Transport → Cloud inference → Result routing → MES record → Operator HMI · <4s

CVML Quality Inspection

Project scope: Deployment of an automated visual inspection system across six production lines to replace manual quality checks. End-to-end latency: <4 seconds from capture to MES record.

1. Core Technical Architecture

Five-layer integrated OT/IT pipeline: Edge — Raspberry Pi 4 + HQ Camera captures and preprocesses 224×224 images via OpenCV. Transport — Paho-MQTT to Mosquitto broker (TLS 1.2). Cloud — AWS EC2 runs a ResNet50 transfer-learning model for PASS/FAIL classification. Bridge — RabbitMQ (AMQP) routes results back to plant systems. Execution — MES record entry and Ignition SCADA HMI updates in under 4 seconds.

2. Performance & Operational Impact
  • Speed: Inspection cycle 8–12 min → <4 sec
  • Accuracy: 94.5% classification (v1.1)
  • Quality: 28% reduction in defect escape rate
  • Efficiency: 50% fewer QC personnel per shift
3. Critical Lessons & Evolution
  • Resilience: Move from single-point-of-failure (on-premise brokers) to AWS IoT Core.
  • Scalability: Replace persistent EC2 with serverless SageMaker GPU endpoints to remove 1.5s inference bottleneck.
  • Integrity: Automated drift monitoring and MLOps pipelines to prevent silent accuracy degradation.
  • Security: Per-device X.509 identities and formal DMZ segmentation.

End-to-End Architecture — Edge to Action

Single integrated pipeline from sensor trigger to MES write. Messaging brokers (Mosquitto + RabbitMQ) were co-hosted on the Ignition Server — the OT/IT hub and primary single point of failure.

Layer 1 · Edge
Raspberry Pi 4
+ HQ Camera
  • Python 3.8 + OpenCV 4.x
  • GPIO trigger on sensor
  • Resize 224×224 px
  • Gaussian blur + EQ
  • Base64 encode + metadata
  • SQLite offline buffer
Paho-MQTT
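The edge steps above can be sketched as a payload builder. This is a minimal stdlib-only illustration: the OpenCV steps (capture → resize to 224×224 → Gaussian blur → histogram equalization → JPEG encode) are summarized in a comment, and the topic name and metadata field names (`station_id`, `sha256`) are assumptions, not the production schema.

```python
import base64
import hashlib
import json
import time

def build_inspection_payload(jpeg_bytes: bytes, station_id: str) -> str:
    """Wrap a preprocessed 224x224 JPEG into the MQTT message body.

    In production the bytes come from the OpenCV pipeline; here any
    bytes stand in for the encoded image.
    """
    return json.dumps({
        "station_id": station_id,                          # hypothetical field
        "captured_at": time.time(),                        # epoch seconds
        "sha256": hashlib.sha256(jpeg_bytes).hexdigest(),  # integrity check
        "image_b64": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

# A paho-mqtt client would then publish this, roughly:
#   client.tls_set()  # TLS 1.2 to Mosquitto on port 8883
#   client.publish("cvml/station3/inspection", payload, qos=1)
```

The SQLite offline buffer (added in v1.1) would persist this same JSON string locally when the publish fails, then replay on reconnect.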
Layer 2 · Transport
Mosquitto MQTT
Broker
  • Eclipse Mosquitto 1.6.x
  • Hosted on Ignition Server
  • MQTT 3.1.1 / TLS 1.2
  • Username + cert auth
  • Bridge plugin → EC2
  • Port 8883 (plant only)
MQTT 3.1.1
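The plant-to-cloud hop used Mosquitto's built-in bridge. A minimal sketch of the relevant mosquitto.conf directives — the connection name, topic prefix, hostname, and file paths are placeholders, not the production values:

```
# Local listener for the 6 RPi stations (plant network only)
listener 8883
password_file /etc/mosquitto/passwd

# Bridge plant broker -> AWS EC2 (illustrative values)
connection cvml-ec2-bridge
address ec2-example.amazonaws.com:8883
topic cvml/# out 1
bridge_cafile /etc/mosquitto/certs/ca.crt
bridge_insecure false
```

The `topic cvml/# out 1` line forwards all inspection topics outbound at QoS 1; nothing is bridged inbound, matching the one-way image flow described above.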
Layer 3 · Cloud
AWS EC2
Inference Engine
  • t3.large Ubuntu 20.04
  • Paho-MQTT subscriber
  • ResNet50 (transfer learning)
  • ~85K labeled training images (train/val/test split)
  • PASS/FAIL + confidence
  • AMQP publish → RabbitMQ
AMQP 0-9-1
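The PASS/FAIL decision from the 5-class output can be sketched in plain Python — the class order and the confidence threshold here are illustrative assumptions; the real model ran under TensorFlow/Keras on the EC2 instance.

```python
# Illustrative class order: index 0 = pass, 1-4 = defect classes (assumed).
CLASSES = ["pass", "scratch", "dim", "foreign", "other"]

def decide(probs, pass_threshold=0.90):
    """Map a 5-class probability vector to a PASS/FAIL result message.

    FAIL if the top class is any defect, or if 'pass' wins without
    enough confidence (the threshold is a hypothetical policy value).
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    is_pass = CLASSES[top] == "pass" and probs[top] >= pass_threshold
    return {
        "result": "PASS" if is_pass else "FAIL",
        "top_class": CLASSES[top],
        "confidence": round(probs[top], 4),
    }
```

Treating a low-confidence "pass" as FAIL biases the system toward operator review rather than defect escape, which matches the FAIL-recall gate used later in the training pipeline.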
Layer 4 · Bridge
RabbitMQ
AMQP Broker
  • RabbitMQ 3.8.x
  • Hosted on Ignition Server
  • Direct exchange + routing
  • Durable queues
  • Fan-out: MES + SCADA
  • Consumer ACK guarantee
AMQP → REST/OPC-UA
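The direct-exchange fan-out to MES and SCADA can be sketched as a routing table. Queue, exchange, and routing-key names here are illustrative; in production this is a pika client against RabbitMQ 3.8.x, with the extra NCR binding reflecting the auto-creation-on-FAIL behavior described in Layer 5.

```python
# Illustrative bindings: both consumers bound to both routing keys, so a
# direct exchange still fans each result out to MES and SCADA.
BINDINGS = {
    "inspection.pass": ["mes_writes", "scada_updates"],
    "inspection.fail": ["mes_writes", "scada_updates", "ncr_queue"],
}

def route(result):
    """Return (routing_key, target_queues) for a PASS/FAIL result."""
    key = "inspection.pass" if result == "PASS" else "inspection.fail"
    return key, BINDINGS[key]

# The publish side with pika would be roughly:
#   channel.basic_publish(exchange="cvml.results", routing_key=key,
#                         body=json.dumps(msg),
#                         properties=pika.BasicProperties(delivery_mode=2))
```

`delivery_mode=2` marks messages persistent, which together with the durable queues and consumer ACKs listed above is what kept results from being lost across broker restarts.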
Layer 5 · Plant
MES +
Ignition SCADA
  • Python MES adapter
  • Stored procedure write
  • NCR auto-creation on FAIL
  • OEE quality factor update
  • Ignition MQTT Engine
  • SCADA SPC + alarms
OPC-UA + REST
Ignition Server hosted both Mosquitto and RabbitMQ — one integration boundary and the primary single point of failure.
RPi Capture → MQTT Publish → Mosquitto MQTT Bridge → EC2 Subscriber → model.predict() → AMQP Publish → RabbitMQ → MES Write + SCADA Update

AWS S3 — Storage

AWS S3 acts as the central repository for ~85K training images, versioned model artifacts (.h5), and live inference archives for quality auditing.

End-to-End Latency — Where the 4 Seconds Go

Measured at Station 3, 2021. Total observed: 3.2–4.0 seconds from capture trigger to MES record write.

# | Stage | Duration | Note
01 | Sensor → GPIO Trigger | 50–100 ms | Photoelectric sensor → RPi GPIO HIGH → camera shutter
02 | Image Capture + OpenCV Preprocess | 250–400 ms | Full-res capture + resize + blur + equalization + base64 encode
03 | MQTT Publish → Mosquitto | 100–250 ms | Plant Wi-Fi variability — main jitter source
04 | Mosquitto Bridge → EC2 Subscribe | 150–300 ms | Mosquitto bridge plugin + Internet round-trip to EC2
05 | ML Inference (ResNet50 CPU) | 1.0–1.5 sec | Biggest consumer — CPU-only EC2, no GPU. SageMaker GPU endpoint → ~200 ms.
06 | AMQP Publish → RabbitMQ | 80–150 ms | EC2 AMQP client → RabbitMQ on Ignition Server
07 | MES Adapter Consume + Write | 400–800 ms | Biggest variability — MES DB stored procedure write time
08 | SCADA / Ignition Update | 30–80 ms | OPC-UA tag update + operator HMI refresh (step 8 in decision flow)
Key Finding: Inference (step 5, CPU-only EC2) and MES write (step 7) account for ~65% of total latency. GPU-enabled SageMaker endpoint would cut step 5 from 1.0–1.5s to ~200ms. Async MES write with retry queue would eliminate step 7 blocking entirely.
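The ~65% figure for steps 5 and 7 checks out arithmetically. A quick sketch summing the stage ranges (values copied from the table above, in milliseconds):

```python
# (min_ms, max_ms) per stage, from the latency table
STAGES = {
    "01 trigger": (50, 100),
    "02 capture+preprocess": (250, 400),
    "03 mqtt publish": (100, 250),
    "04 bridge->ec2": (150, 300),
    "05 inference": (1000, 1500),
    "06 amqp publish": (80, 150),
    "07 mes write": (400, 800),
    "08 scada update": (30, 80),
}

total_min = sum(lo for lo, _ in STAGES.values())   # 2060 ms
total_max = sum(hi for _, hi in STAGES.values())   # 3580 ms
hot = ("05 inference", "07 mes write")
hot_share_min = sum(STAGES[s][0] for s in hot) / total_min  # ~0.68
hot_share_max = sum(STAGES[s][1] for s in hot) / total_max  # ~0.64
```

The summed stage ranges (2.06–3.58 s) sit slightly below the observed 3.2–4.0 s; the remainder is queueing and OS overhead not attributed to any single stage.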

Results & Business Impact

Measured impact from Phase 1 deployment across 6 inspection stations during the first year of operation.

<4s
Inspection Latency
vs 8–12 min manual
94.5%
Model Accuracy
v1.1 post lighting patch
−28%
Defect Escape Rate
4.0% → 2.9%
−50%
QC FTE at stations
4 → 1–2 per shift
Metric | Before CVML | After CVML (2020–21) | Change
Inspection cycle time | 8–12 min (manual) | 3–4 seconds | ~98% faster
Defect escape rate | ~4.0% of parts | ~2.9% of parts | −28%
QC FTE per shift | 4 FTE at stations | 1–2 FTE (review only) | −50% FTE
Inspection data in MES | Paper log, manual entry | Digital, real-time | 100% digitised
False positive rate | ~13% (subjective) | ~5.5% (ML + review) | −58%
Line stops (quality) | Avg. 3× per shift | Avg. 1.5× per shift | −50%

Known Limitations — Legacy Architecture

An honest assessment of the 2020–21 design decisions and their operational consequences. These limitations directly motivated the next-generation architecture.

Single Point of Failure — Ignition Server
Both Mosquitto and RabbitMQ were hosted on the Ignition Server. Any planned or unplanned server downtime halted the entire inspection pipeline — image transport, result routing, and SCADA visibility all stopped simultaneously. No redundancy or failover.
No OT/IT Security Segmentation
Mosquitto bridged directly to a public AWS EC2 IP with no DMZ layer. Plant network had a direct outbound path to the Internet. Shared username/password auth across all RPi stations — no per-device certificate management.
EC2 Inference — Not Managed or Scalable
ML model ran as a Python Flask/subscriber process on a single EC2 instance. No auto-scaling, no model versioning, no health monitoring. A process crash required manual SSH restart. Model updates meant manual EC2 redeployment and process restart.
No Offline Edge Buffering (v1.0)
Initial deployment had no local buffering on Raspberry Pi. A Wi-Fi dropout caused complete inspection data loss for that window. SQLite offline buffer was added as a patch in v1.1 — not part of the original design specification.
Model Drift Not Monitored
No automated mechanism to detect when model accuracy degraded due to new defect types, lighting changes, or supplier material variation. Drift was only caught through manual quality audits — sometimes weeks after accuracy had already degraded significantly.
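A drift monitor need not be elaborate: a rolling-window check of the live FAIL rate against the training-time baseline would have flagged the supplier-material incident (which later triggered v1.2) far earlier than a manual audit. A minimal sketch — the baseline, tolerance, and window size are illustrative assumptions:

```python
from collections import deque

class FailRateDriftMonitor:
    """Alert when the rolling FAIL rate drifts from the baseline.

    Illustrative parameters: in practice the baseline and tolerance
    would be derived from validation data, not hardcoded.
    """

    def __init__(self, baseline=0.04, tolerance=0.03, window=500):
        self.baseline = baseline
        self.tolerance = tolerance
        self.results = deque(maxlen=window)

    def record(self, result):
        """Record one inspection result; return True if drift alert fires."""
        self.results.append(1 if result == "FAIL" else 0)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.results) / len(self.results)
        return abs(rate - self.baseline) > self.tolerance
```

A FAIL-rate shift is only a proxy for accuracy drift (it also fires on genuine quality excursions), but either case deserves a look — which is exactly the alert the legacy system lacked.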
MES Integration Tightly Coupled
MES adapter was a Python script with direct stored procedure writes. Any MES schema change or upgrade broke the adapter. No retry logic — failed writes in v1.0 were silently discarded. No dead-letter queue or error tracking.
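The silent-discard failure mode is cheap to fix even within the legacy design: bounded retries plus a dead-letter store. A minimal sketch, assuming a hypothetical `write_mes` callable that raises on failure — names and the retry count are illustrative:

```python
def write_with_retry(write_mes, record, dead_letter, max_attempts=3):
    """Attempt the MES stored-procedure write; dead-letter on exhaustion.

    write_mes: callable(record) that raises on failure (hypothetical).
    Returns True on success, False if the record was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            write_mes(record)
            return True
        except Exception as exc:
            last_error = str(exc)  # keep the final error for triage
    dead_letter.append({"record": record, "error": last_error})
    return False
```

Even a plain list standing in for a dead-letter queue beats discarding the write: the record survives for replay after the MES schema change or outage is resolved.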
Lessons Learned
  • SPOF: Ignition Server hosted both brokers — no failover. Next: AWS IoT Core + dedicated DMZ broker.
  • Model drift: No automated detection; drift found only via manual audits. Next: MLOps pipelines and drift monitoring.
  • EC2 inference: 1.5s CPU bottleneck, manual restart on crash. Next: SageMaker GPU endpoints, auto-scaling.
  • Security: Shared credentials, no DMZ. Next: Per-device X.509, formal OT/IT segmentation.

What I Would Do Differently

Next-generation architecture addressing every 2020–21 limitation. Each change maps directly to a known gap in the legacy system.

Legacy (2020–21) | Next-Generation
Mosquitto on Ignition Server | AWS IoT Core
EC2 persistent MQTT process | IoT Core + Lambda trigger
EC2 Flask + ResNet50 (CPU) | SageMaker Real-Time Endpoint
Base64 image in MQTT payload | Image → S3, URL in MQTT
RabbitMQ on Ignition Server | RabbitMQ in dedicated DMZ
MES adapter (Python script) | MES microservice + retry queue
Manual model retraining | SageMaker Pipelines CI/CD
No S3 lifecycle policy | S3 + Kinesis Firehose data lake
No OT/IT segmentation | DMZ + IoT Core + VPN gateway
⚡ Next-Gen Stack
Edge

Raspberry Pi 4 · OpenCV
S3 image upload → MQTT URL message

Transport

AWS IoT Core · X.509 mTLS
Device Shadow · Rules Engine → Lambda

Cloud ML

SageMaker Endpoint (GPU)
Model Monitor + Pipelines
Kinesis Firehose → S3 data lake

IT/OT Bridge

RabbitMQ (DMZ dedicated)
AMQP → REST / OPC-UA

Plant Systems

MES microservice + retry queue
Ignition SCADA + SPC
Power BI DirectQuery

Technology Stack — 2020–2021

Five-layer architecture and the components in each layer. Data flows left to right: Edge → Transport → Cloud → Bridge → Plant.

1 · Edge
Raspberry Pi 4 (4GB)
RPi HQ Camera (12MP)
Python 3.8 + OpenCV 4.x
Paho-MQTT 1.5.x
2 · Transport
Eclipse Mosquitto 1.6.x
TLS 1.2 · Port 8883
Bridge → EC2
3 · Cloud
AWS EC2 t3.large
TensorFlow 2.x / Keras
ResNet50 (.h5)
AWS S3
4 · Bridge
RabbitMQ 3.8.x
AMQP 0-9-1
Fan-out MES + SCADA
5 · Plant
Ignition 8.0.x SCADA
SAP ME / Sepasoft
MES adapter · OPC-UA
Category | Component | Role · Location
Edge Hardware | Raspberry Pi 4 Model B (4GB) | Image capture + GPIO I/O at each station · Plant floor
Camera | RPi HQ Camera Module (12MP) | Inspection image capture mounted on fixture · Plant floor
Edge Software | Python 3.8 + OpenCV 4.x | Image preprocessing pipeline + GPIO control · Raspberry Pi local
MQTT Client | Paho-MQTT 1.5.x | Publish images to Mosquitto broker · Raspberry Pi local
MQTT Broker | Eclipse Mosquitto 1.6.x | Receive RPi images, bridge to AWS EC2 · Ignition Server (on-premise)
Cloud Platform | AWS EC2 t3.large (Ubuntu 20.04) | MQTT subscriber + ML inference engine · AWS Cloud — single instance
ML Framework | TensorFlow 2.x / Keras | ResNet50 transfer learning + inference · AWS EC2
ML Model | ResNet50 CNN (.h5) | Fine-tuned on ~85K labeled images (pass + defect classes) · AWS EC2 + S3
Storage | AWS S3 | Training data, model artifacts, inference images · AWS Cloud
Result Broker | RabbitMQ 3.8.x (AMQP) | Route JSON result → MES + SCADA · Ignition Server (on-premise)
SCADA Platform | Ignition 8.0.x (Inductive Automation) | Operator dashboards + OPC-UA gateway · Ignition Server (on-premise)
MES | SAP ME / Ignition Sepasoft | Quality records, NCR creation, OEE quality · Plant data center

Training Pipeline & Model Versioning

How the ResNet50 model was built, evaluated, versioned, and deployed to EC2 — from raw labeled images to live inference in 2020–2021.

Training Pipeline
  1. Labeling — Label Studio, ~85K images, 5 classes (pass + scratch/dim/foreign/other). JSON → S3.
  2. Prep & augmentation — S3 load, train/val/test 70/15/15. Rotation, brightness, noise, flip. class_weight for imbalance.
  3. ResNet50 transfer learning — Frozen base, custom head (GAP → Dense(256) → Dropout → Dense(5)). Train head, then fine-tune last 30 layers.
  4. Evaluation gate — Accuracy ≥ 93%, FAIL recall ≥ 91%. Confusion matrix sign-off.
  5. S3 artifact — model_v{N}.h5 + metadata JSON to s3://.../model-artifacts/v{N}/.
  6. EC2 deploy — SSH, boto3 pull from S3, restart inference process. model_version in every MES record.
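Step 2's class_weight handling can be sketched in plain Python. The class counts below are illustrative, not the real distribution; in the actual pipeline the resulting dict feeds Keras `model.fit(..., class_weight=...)`.

```python
def balanced_class_weights(counts):
    """Inverse-frequency class weights, normalized so a perfectly
    balanced dataset yields weight 1.0 for every class (the same
    heuristic as sklearn's class_weight='balanced').
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Illustrative counts for the 5 classes (not the real distribution):
weights = balanced_class_weights(
    {"pass": 60000, "scratch": 10000, "dim": 8000, "foreign": 4000, "other": 3000}
)
```

Rare defect classes get proportionally larger weights, so the loss penalizes missing a scarce "foreign" defect far more than misreading an abundant "pass" — the rebalancing called out in the v1.1 change notes.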
Model Version History
Version Date Accuracy Key Change
v1.0 Jun 2020 88.3% Baseline deployment. 5-class ResNet50. Accuracy below target — lighting variability identified.
v1.0.1 Aug 2020 91.2% Preprocessing fix: histogram equalization added to edge pipeline. Retrained on expanded dataset (+8K images).
v1.1 Dec 2020 94.5% GPIO-controlled LED ring light added. Retrained with consistent illumination images. Class weights rebalanced. Production stable.
v1.2 (draft) Q3 2021 93.8% Triggered by supplier material change — new surface finish caused drift. Expedited retraining. Deployed after audit sign-off.
Version Deployment Procedure (Legacy)
  1. Quality engineer signs off the evaluation report (accuracy ≥ 93%, FAIL recall ≥ 91%)
  2. Model artifact uploaded to S3: model_v{N}.h5 + metadata JSON
  3. SSH into EC2 → download model from S3 via boto3 script → replace local model file
  4. Restart Python inference process → verify startup log confirms correct version loaded
  5. Run 20-part test batch → confirm MES inspection records show expected results
  6. model_version string written to every MES inspection record from this point
⚠ No blue/green deployment or rollback automation in legacy. A failed deployment required manual re-download of previous .h5 and process restart — no automated rollback.
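Even without blue/green automation, the manual rollback gets less error-prone with a helper that resolves "latest" and "previous" from the artifact listing. A sketch — the `model_v{N}.h5` pattern comes from the procedure above (dotted versions like v1.0.1 would need a richer pattern), and the boto3 listing call is left as a comment:

```python
import re

def pick_versions(artifact_keys):
    """Return (latest, previous) model keys from an S3-style listing.

    Keys are expected to end in model_v{N}.h5 as in the deployment
    procedure; anything else is ignored.
    """
    versioned = []
    for key in artifact_keys:
        m = re.search(r"model_v(\d+)\.h5$", key)
        if m:
            versioned.append((int(m.group(1)), key))
    versioned.sort()
    latest = versioned[-1][1] if versioned else None
    previous = versioned[-2][1] if len(versioned) > 1 else None
    return latest, previous

# In practice the keys would come from:
#   s3.list_objects_v2(Bucket=..., Prefix="model-artifacts/")
```

A failed deployment then becomes "deploy `previous`" rather than hunting S3 by hand over SSH.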

Security Layer — Legacy & Known Gaps

What security controls were in place in 2020–21, and where the architecture fell short against modern OT/IT security standards. Understanding the gaps is the foundation for the next-generation redesign.

🔐
Transport Security
MQTT + AMQP layer
  • TLS 1.2 on Mosquitto port 8883 — all RPi → Mosquitto traffic encrypted
  • TLS on Mosquitto bridge to AWS EC2 — Internet hop encrypted
  • Username/password auth on Mosquitto — shared credentials across all 6 RPi stations
  • RabbitMQ AMQP behind plant firewall — no external exposure
  • TLS on EC2 → RabbitMQ AMQP connection
⚠ Gap: shared credentials, no per-device identity. Any compromised RPi exposed all stations.
AWS Cloud Security
EC2 + S3 layer
  • EC2 instance in default VPC — public subnet with security group rules
  • Security group: inbound MQTT port 1883/8883 from Mosquitto bridge IP only
  • IAM role attached to EC2 instance — S3 read/write permissions scoped to cvml bucket
  • S3 bucket policy: deny public access, EC2 role only
  • SSH access restricted to engineering VPN IP range only
⚠ Gap: EC2 in public subnet, no VPC private subnet isolation, no WAF. IAM role scoped but overly broad (full S3 bucket access).
🏭
Plant OT Network Security
Firewall + segmentation
  • Plant firewall allows outbound port 8883 only from Ignition Server IP
  • RPi devices on dedicated plant Wi-Fi SSID — isolated from corporate network
  • Ignition Server on OT VLAN — no direct route to corporate IT network
  • MES on separate VLAN — adapter accesses via stored procedure only
  • Physical access to RPi units controlled by plant floor access badge
⚠ Gap: no DMZ between Mosquitto and EC2. Direct bridge from OT network to public cloud IP. No network traffic monitoring or IDS on OT VLAN.
Security Gap Summary — Legacy vs Next-Generation
Control | Legacy 2020–21 | Next-Gen Target
Device Identity | Shared username/password across all RPi units | Per-device X.509 certificates via AWS IoT Core CA
MQTT Broker | Self-managed Mosquitto on Ignition Server | AWS IoT Core — managed, auditable, scalable
IT/OT Boundary | Direct bridge from OT → public EC2 IP, no DMZ | DMZ with dedicated RabbitMQ, VPN gateway, no inbound OT ports
VPC Isolation | EC2 in public subnet, default VPC | Private subnet, NAT gateway, VPC endpoints for S3/SageMaker
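The device-identity gap is concrete at the TLS layer: the legacy clients spoke TLS 1.2 with server authentication only, identifying themselves by a shared password, while the next-gen target is mutual TLS with a per-device certificate. A stdlib sketch of the client-side difference — certificate paths are placeholders:

```python
import ssl

def legacy_context():
    """TLS 1.2 with server auth only; device identity is username/password."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx

def next_gen_context(cert_path, key_path):
    """Mutual TLS: the broker also verifies a per-device X.509 certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(cert_path, key_path)  # per-device identity
    return ctx

# paho-mqtt accepts either context via client.tls_set_context(ctx).
```

With mTLS, a compromised station exposes only its own revocable certificate rather than the credentials of all six — the core of the per-device X.509 target above.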