OTA Firmware Updates Across 1,000-Node Mesh Networks

Rolling a firmware update across a 1,000-node mesh without taking down active sensors requires careful blob segmentation, acknowledgment windows, and rollback triggers. This is how MeshOS handles it.

Meshkindle Engineering Jul 15, 2025 5 min read

The Problem That Gets Ignored Until It Doesn't

Every wireless mesh deployment eventually needs a firmware update. Security vulnerabilities get patched, protocol stack bugs get fixed, new features get added for new use cases. In a consumer device context, OTA (over-the-air) firmware delivery is routine — smartphones update overnight, and a failed update bricking one device is an inconvenience. In a facility with 1,000 mesh nodes managing HVAC setpoints in a hospital or lighting in a critical operations center, a failed OTA that takes down 200 nodes during business hours is not an inconvenience. It is an incident.

Most mesh OTA implementations underestimate the operational constraints. A firmware rollout that works cleanly on a 50-node test bench at the office will encounter problems at 1,000 nodes in a real building: RF congestion from simultaneous firmware block transmissions, nodes that miss blocks and need retransmission, mixed firmware versions coexisting for hours as the update propagates, and the occasional node that silently accepts the image, reboots, and fails to start because the image was corrupted at a single block.

Bluetooth Mesh DFU: The Protocol Mechanics

The Bluetooth SIG's Mesh Device Firmware Update (DFU) Model specification defines the protocol for distributing firmware images across a BT Mesh network. The architecture has three roles: the Target node (each node being updated, which runs the Firmware Update Server), the Distributor (typically the gateway, which holds the full image and runs the Firmware Distribution Server), and the Initiator (typically MeshOS, which runs the Firmware Distribution Client and starts the distribution).

Distribution uses the BLOB Transfer model to break the firmware image into fixed-size blocks (only the final block may be shorter), and each block into smaller chunks. The gateway multicasts chunk messages to a target group address; nodes receive chunks and assemble them into blocks. Once the chunks of a block have been sent, each node reports what it received in a Block Status response. The gateway tracks which nodes have confirmed each block and retransmits the chunks that nodes reported missing. This confirmed multicast with selective retransmission is what keeps the DFU process tractable even when dozens of nodes need retransmission because of RF packet loss.
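The block and chunk bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the MeshOS or Bluetooth SIG implementation: the block and chunk sizes, the `BlockTracker` class, and its method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # hypothetical block size in bytes
CHUNK_SIZE = 256   # hypothetical chunk size in bytes


def split_blocks(image: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the image into fixed-size blocks; only the last may be shorter."""
    return [image[i:i + block_size] for i in range(0, len(image), block_size)]


def split_chunks(block: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut one block into chunks small enough for a single mesh message."""
    return [block[i:i + chunk_size] for i in range(0, len(block), chunk_size)]


@dataclass
class BlockTracker:
    """Per-block record of which chunks each target node still lacks."""
    chunk_count: int
    missing: dict[str, set[int]] = field(default_factory=dict)

    def add_node(self, node_id: str) -> None:
        # Until a node reports, assume it has received nothing.
        self.missing[node_id] = set(range(self.chunk_count))

    def on_block_status(self, node_id: str, missing_chunks: set[int]) -> None:
        """Record the gaps a node reported in its status response."""
        self.missing[node_id] = set(missing_chunks)

    def chunks_to_resend(self) -> set[int]:
        """Union of all gaps: each missing chunk is multicast once for everyone."""
        resend: set[int] = set()
        for gaps in self.missing.values():
            resend |= gaps
        return resend

    def complete(self) -> bool:
        """True once every tracked node has reported no gaps."""
        return all(not gaps for gaps in self.missing.values())
```

Taking the union of the reported gaps is the point of the design: a chunk missed by several nodes is retransmitted once to the group rather than once per node.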

For Thread networks, the firmware update mechanism is different: Thread nodes receive firmware images via CoAP block transfer (RFC 7959) directly from the Border Router. The MeshOS OTA pipeline supports both paths — BT Mesh DFU for BT Mesh nodes and CoAP block transfer for Thread nodes — through a unified campaign interface that lets you target mixed-protocol deployments in a single firmware rollout campaign.
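As a rough illustration of what a unified campaign means in practice, the sketch below groups target nodes by the delivery path their protocol requires. The node record layout, the transport labels, and `plan_campaign` are hypothetical names chosen for this example, not the MeshOS campaign API.

```python
from collections import defaultdict

# Hypothetical transport labels; the real campaign interface may differ.
TRANSPORT_FOR_PROTOCOL = {
    "bt_mesh": "bt_mesh_dfu",
    "thread": "coap_block_transfer",
}


def plan_campaign(nodes: list[dict]) -> dict[str, list[str]]:
    """Group target node IDs by the delivery path their protocol requires."""
    plan: dict[str, list[str]] = defaultdict(list)
    for node in nodes:
        transport = TRANSPORT_FOR_PROTOCOL[node["protocol"]]
        plan[transport].append(node["id"])
    return dict(plan)
```

An unknown protocol raises `KeyError` here, which is deliberate in a sketch like this: a campaign should fail at planning time rather than silently skip nodes it cannot reach.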

Dual-Bank Flash and the Rollback Guarantee

Every MK-NODE device ships with dual-bank flash partitioning: a running firmware bank (bank A) and a staging bank (bank B). Incoming firmware updates write to bank B while bank A continues running. When the write is complete, the bootloader verifies the bank B image signature before executing a swap. If the signature check fails, bank B is marked invalid and the node continues running bank A, so a running image is never replaced by an unverified one. After swapping to bank B, the node must send a first-boot confirmation back to the gateway within a configurable grace period (default 120 seconds). If the node does not confirm in time, the watchdog timer expires and the bootloader rolls back to bank A automatically.
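The stage, verify, swap, and confirm-or-rollback sequence can be modeled as a small state machine. The following is a toy model under stated assumptions: the class and method names are invented for illustration, time is passed in explicitly instead of read from a hardware timer, and signature verification is reduced to a boolean.

```python
from enum import Enum


class Bank(Enum):
    A = "A"
    B = "B"


class DualBankNode:
    """Toy model of the stage, verify, swap, confirm-or-rollback sequence."""

    def __init__(self, grace_period_s: float = 120.0):
        self.active = Bank.A
        self.staged_valid = False
        self.grace_period_s = grace_period_s
        self.swap_time = None  # set while a first-boot confirmation is pending

    def stage(self, signature_ok: bool) -> None:
        """Record the result of verifying the image written to bank B."""
        self.staged_valid = signature_ok

    def try_swap(self, now: float) -> bool:
        """Swap only if the staged image verified; otherwise keep running bank A."""
        if not self.staged_valid:
            return False
        self.active = Bank.B
        self.swap_time = now
        return True

    def confirm(self, now: float) -> bool:
        """First-boot confirmation; accepted only inside the grace period."""
        if self.swap_time is None or now - self.swap_time > self.grace_period_s:
            return False
        self.swap_time = None
        return True

    def tick(self, now: float) -> None:
        """Watchdog check: roll back to bank A if confirmation is overdue."""
        if self.swap_time is not None and now - self.swap_time > self.grace_period_s:
            self.active = Bank.A
            self.swap_time = None
            self.staged_valid = False
```

A node that swaps at `now=0.0` and never confirms is back on bank A after the first `tick` past 120 seconds, while a node that confirms inside the window stays on bank B no matter how many later ticks occur.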

The cryptographic signing chain: firmware images are signed with an ECDSA-P256 private key held in HSM storage in the MeshOS build pipeline. The corresponding public key is provisioned into each node's secure storage (in the MK-NODE's ARM TrustZone-backed key store) during factory programming. The bootloader verifies the image signature before any swap. A network-delivered image that does not carry a valid signature from the provisioned key chain is rejected silently — no error state, no partial write, no reboot loop. The node stays on its current firmware.

Staged Rollout: Managing 1,000 Nodes Without Disruption

Distributing a firmware update to 1,000 nodes simultaneously is not a good idea. Beyond the RF congestion from simultaneous DFU transmission (each node drawing from the multicast stream in the same time window), there is the operational risk of discovering a firmware bug that affects all 1,000 nodes at once. Staged rollout — distributing to a small percentage of nodes, monitoring for issues, then expanding — is the standard practice, and MeshOS enforces it.

A typical rollout campaign structure in MeshOS: Stage 1 — 20 nodes, distributed across floors and node types, 24-hour soak. Monitor first-boot confirmation rates, system health metrics (mesh hop counts, RSSI averages), and application-level behavior (are temperature readings still arriving at expected intervals?). Stage 2 — 100 nodes, 12-hour soak. Stage 3 — 250 nodes. Stage 4 — remaining nodes. Each stage has a configurable success threshold (default: 95% of targeted nodes must confirm successful boot before advancing to next stage). If a stage falls below threshold, the campaign pauses and alerts the operator.
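The stage gate reduces to a simple comparison per stage. The sketch below assumes the stage sizes from the example campaign above and the default 95% threshold; `Stage`, `stage_passes`, and `run_campaign` are hypothetical names, and soak timers and health metrics are left out.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    targeted: int       # nodes the stage tried to update
    confirmed: int = 0  # nodes that sent a first-boot confirmation


def stage_passes(stage: Stage, threshold: float = 0.95) -> bool:
    """True when the share of confirmed first boots meets the threshold."""
    if stage.targeted == 0:
        return False
    return stage.confirmed / stage.targeted >= threshold


def run_campaign(stages: list[Stage], threshold: float = 0.95) -> str:
    """Advance stage by stage; pause at the first stage below threshold."""
    for stage in stages:
        if not stage_passes(stage, threshold):
            return f"paused at {stage.name}"
    return "completed"
```

With 1,000 nodes split 20 / 100 / 250 / 630, a second stage that confirms only 94 of 100 nodes pauses the campaign before the remaining 880 nodes are touched, which is the containment the staging exists to provide.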

Rollout throughput depends on image size and mesh density. A 512 KB firmware image delivered over BT Mesh DFU to 100 nodes on a dense floor takes approximately 8–12 minutes under normal RF conditions. At 1,000 nodes across multiple floors with good relay coverage, expect 2–4 hours of total transfer time for a full rollout at moderate concurrency, not counting the soak periods between stages. Running the campaign during off-hours (overnight for a typical office building) keeps DFU radio traffic from competing with occupant-facing systems; the running bank itself continues operating normally throughout the transfer.

Live HVAC and Lighting: Operating Under OTA Traffic

One of the more nuanced engineering decisions in implementing the DFU spec is how to share radio bandwidth between the OTA distribution traffic and the live sensor and control traffic. On a BT Mesh network carrying lighting control commands and occupancy state updates, the DFU multicast traffic competes for the same bandwidth on the 1 Mbps BLE advertising channels. If the DFU distributor sends at maximum rate, lighting response latency can degrade noticeably: scene transitions that normally complete in 200 ms may take 600 ms during peak DFU transmission windows.

MeshOS's DFU campaign configuration includes a traffic throttle setting that limits DFU chunk transmission rate to a configurable percentage of channel capacity — default is 40%, leaving 60% for live application traffic. At 40% DFU throttle, a 512 KB image distribution to 100 nodes takes about 18 minutes instead of 8 minutes, but lighting and sensor traffic remains within normal latency budgets. For critical systems — hospital ward lighting, data center environmental monitoring — you can reduce the DFU throttle further, or restrict DFU transmission windows to off-peak hours configured in the campaign schedule.
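One common way to implement this kind of rate cap is a token bucket, sketched below. The source does not say which mechanism MeshOS uses, so treat this as an illustration of the idea only: `ChunkThrottle`, its parameters, and the notion of channel capacity measured in chunks per second are assumptions.

```python
class ChunkThrottle:
    """Token bucket that caps DFU chunk sends at a fraction of channel capacity."""

    def __init__(self, channel_chunks_per_s: float, dfu_share: float = 0.40,
                 burst: float = 1.0):
        if not 0.0 < dfu_share <= 1.0:
            raise ValueError("dfu_share must be in (0, 1]")
        self.rate = channel_chunks_per_s * dfu_share  # tokens added per second
        self.burst = burst                            # maximum stored tokens
        self.tokens = burst
        self.last = 0.0

    def try_send(self, now: float) -> bool:
        """Return True and spend a token if a chunk may be sent at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond the burst cap.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The distributor would call `try_send` before each chunk and hold the chunk back when it returns `False`. Lowering `dfu_share` stretches the transfer time and leaves more airtime for lighting and sensor traffic, which is the trade-off described above.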

We are not saying you should always throttle OTA distribution to the slowest possible rate. In a warehouse deployment with no lighting or HVAC control traffic — only periodic sensor reporting — running DFU at 80% channel utilization is fine. Match the throttle to the operational sensitivity of the live traffic, not to a blanket rule. MeshOS's traffic monitoring shows real-time channel utilization per gateway, which gives the operator the data to make that call.

What Fails in the Field (and How to Prepare)

The most common OTA failure mode we see: nodes that successfully receive and stage the firmware image, execute the swap, but fail first-boot confirmation because the new firmware version requires a configuration migration that was not accounted for in the update package. This is not a DFU protocol failure — it is an application-layer issue. The node reboots into the new firmware, discovers that its stored configuration schema has changed in the new version, and enters a degraded state because the migration code was not included or failed silently.

The mitigation: every firmware update that changes stored configuration schemas must include a migration handler that runs before normal boot operation, converts the existing configuration to the new schema, and writes the migrated configuration before the application layer initializes. MeshOS's firmware package manifest includes a config_migration_version field that the bootloader checks against the node's current configuration version; a mismatch without a migration handler present causes the node to reject the update before swapping. This check runs before the ECDSA signature verification step, so nodes on incompatible configuration versions show as "update rejected — migration required" in the campaign dashboard, not as update failures.
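The pre-swap check and the migration step might look like the following sketch. Only the `config_migration_version` field comes from the manifest described above; the dictionary-based manifest, the handler registry keyed by stored schema version, and the example `v1_to_v2` handler are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical registry type: a handler upgrades a config by exactly one version.
Migration = Callable[[dict], dict]


def can_accept_update(node_config_version: int, manifest: dict,
                      migrations: dict[int, Migration]) -> bool:
    """Pre-swap check: every version step up to the target needs a handler."""
    target = manifest["config_migration_version"]
    if target < node_config_version:
        return False  # downgrades are out of scope for this sketch
    return all(v in migrations for v in range(node_config_version, target))


def migrate_config(config: dict, target_version: int,
                   migrations: dict[int, Migration]) -> dict:
    """Run handlers in order, before the application layer initializes."""
    current = dict(config)
    while current["version"] < target_version:
        before = current["version"]
        current = migrations[before](current)
        if current["version"] <= before:
            raise RuntimeError("migration handler did not advance the version")
    return current


def v1_to_v2(cfg: dict) -> dict:
    """Example handler: rename a field and bump the schema version."""
    out = {k: v for k, v in cfg.items() if k != "report_s"}
    out["report_interval_s"] = cfg["report_s"]
    out["version"] = 2
    return out
```

A node on schema version 1 with no registered handler fails `can_accept_update` and never swaps, so it keeps running its current firmware and configuration instead of booting into a degraded state.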

Planning for OTA reliability is the kind of detail that separates a mesh platform built for facilities from one built for demos. At 50 nodes, a 2% failure rate is one node. At 1,000 nodes, it is 20 nodes requiring manual recovery visits in a building where some of them are ceiling-mounted in mechanical spaces or above suspended ceilings. The dual-bank rollback, staged campaign, and configuration migration infrastructure exist to push the number of nodes needing manual recovery as close to zero as a production deployment requires.