When someone asks which IoT architecture is more resilient under failure conditions, the answer depends entirely on what "failure" means in your facility. A lab environment where a node reboot causes a 30-second data gap is very different from a cold-chain warehouse where a missed temperature reading during a compressor fault means spoiled inventory. The architecture question is not theoretical — it has a cost attached to it.

I have spent the last several years building mesh firmware and working through exactly these failure scenarios in real facility deployments. Here is a direct comparison of the two approaches, including the failure modes that vendor proposals tend to leave out.

How Hub-and-Spoke IoT Works

Hub-and-spoke is the default architecture for most first-generation IoT deployments. Sensors and devices connect to a central gateway — the "hub" — which aggregates their data and forwards it to a cloud platform. The hub handles all the protocol translation, all the data formatting, and all the connectivity management. End devices are simple; the hub is complex.

This architecture has real advantages. It is easy to design, because the hub is a well-defined piece of hardware with a known behavior. It is straightforward to troubleshoot, because data flow is linear: device → hub → cloud → dashboard. And it is relatively inexpensive to deploy at small scale, because you need one gateway rather than a distributed set of nodes.

The problem shows up when you stress-test the gateway.

The Single-Point Failure Problem

In a hub-and-spoke topology, the hub is a single point of failure. When it goes offline — power interruption, hardware fault, firmware crash, network outage — every device connected to it goes dark simultaneously. The cloud platform sees a data gap. The dashboard shows stale readings. And depending on how your alerting is configured, you may not know the hub is down until you notice that temperature alerts have been suspiciously quiet.
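One way to close that blind spot is to alert on silence itself rather than waiting for a reading to cross a threshold. The sketch below is a minimal staleness check, not a description of any particular platform: the function name, the gateway IDs, and the 90-second threshold are all hypothetical.

```python
import time
from typing import Dict, List, Optional

# Hypothetical threshold: a gateway silent for longer than this is treated as down.
STALE_AFTER_SECONDS = 90.0


def find_silent_gateways(
    last_seen: Dict[str, float],
    now: Optional[float] = None,
    stale_after: float = STALE_AFTER_SECONDS,
) -> List[str]:
    """Return the IDs of gateways whose newest message is older than stale_after seconds."""
    if now is None:
        now = time.time()
    return sorted(gw for gw, ts in last_seen.items() if now - ts > stale_after)


# Illustrative timestamps (seconds); the gateway names are made up.
last_seen = {"gw-dock": 990.0, "gw-cold-room": 700.0}
print(find_silent_gateways(last_seen, now=1000.0))
```

Run on a schedule from somewhere that does not depend on the gateway being monitored, a check like this turns "suspiciously quiet" into an explicit alert.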

This is not a hypothetical failure mode. In our experience deploying sensor infrastructure across facilities in New England, gateway outages account for a disproportionate share of monitoring gaps. Putting 40 Modbus-connected sensors behind a single gateway does not spread the risk across 40 devices: the gateway fails at its own rate, and when it does, all 40 sensors go silent together. At facilities where environmental monitoring feeds compliance records, a 45-minute gateway outage can create a documentation gap that triggers an audit finding.

Recovery time in a hub-and-spoke deployment depends on human intervention: noticing the gap, diagnosing the cause, rebooting or replacing the hardware, and re-establishing cloud connectivity. The median recovery time from a gateway failure that requires a technician visit typically falls between 2 and 4 hours.

How Mesh Topology Handles the Same Failure

In a mesh topology, every node maintains a routing table and is capable of acting as a data relay for its neighbors. There is no single gateway that all traffic must pass through. When a node goes offline, its neighbors detect the route failure and negotiate alternative paths. The routing update propagates through the mesh and telemetry delivery resumes through the new path — automatically, without human intervention.
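The rerouting behavior can be illustrated with a small graph search. This is a centralized sketch for intuition only: a real mesh computes routes in a distributed way on each node, and nothing here represents actual mesh firmware. The node names, the link map, and the node called "uplink" (standing in for whichever node has a backhaul connection) are hypothetical.

```python
from collections import deque
from typing import AbstractSet, Dict, List, Optional


def shortest_path(
    links: Dict[str, List[str]],
    source: str,
    sink: str,
    offline: AbstractSet[str] = frozenset(),
) -> Optional[List[str]]:
    """Breadth-first search for the fewest-hop path, skipping offline nodes."""
    if source in offline or sink in offline:
        return None
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            # Walk the predecessor chain back to the source.
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in offline and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no surviving route


# Hypothetical five-node mesh with two independent routes to the uplink.
links = {
    "sensor": ["a", "b"],
    "a": ["sensor", "uplink"],
    "b": ["sensor", "c"],
    "c": ["b", "uplink"],
    "uplink": ["a", "c"],
}

print(shortest_path(links, "sensor", "uplink"))                 # ['sensor', 'a', 'uplink']
print(shortest_path(links, "sensor", "uplink", offline={"a"}))  # ['sensor', 'b', 'c', 'uplink']
```

When node "a" drops out, delivery continues over a longer route. Only when every route is cut, here by losing both "a" and "c", does the search return no path.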

In practice, here is what that looks like numerically. Each Meshkindle node updates its routing table every 15 seconds. When a node disappears from the network, adjacent nodes detect the route failure at the next routing update cycle and begin negotiating alternatives. Path convergence (the time from detection to restored telemetry delivery on the new route) typically completes within 12 seconds of the routing update that detects the failure. In the worst case the node fails just after an update, so detection takes up to 15 seconds and convergence up to 12 more. Wall-clock time from node failure to restored data delivery is therefore under 30 seconds in most deployment configurations.
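The arithmetic behind that figure is short enough to write down. The sketch below uses the 15-second update interval and 12-second convergence time quoted above; the function name is hypothetical, and the midpoint case assumes failures are equally likely at any point in the update cycle, which is our simplification.

```python
ROUTING_UPDATE_INTERVAL_S = 15.0      # routing table update period, from the text
CONVERGENCE_AFTER_DETECTION_S = 12.0  # typical upper bound on convergence, from the text


def recovery_time_s(seconds_until_next_update: float) -> float:
    """Time from node failure to restored delivery: wait for detection, then converge."""
    if not 0.0 <= seconds_until_next_update <= ROUTING_UPDATE_INTERVAL_S:
        raise ValueError("wait must fall within one routing update interval")
    return seconds_until_next_update + CONVERGENCE_AFTER_DETECTION_S


best_case = recovery_time_s(0.0)                           # failure right before an update
worst_case = recovery_time_s(ROUTING_UPDATE_INTERVAL_S)    # failure right after an update
midpoint = recovery_time_s(ROUTING_UPDATE_INTERVAL_S / 2)  # failure halfway through a cycle

print(best_case, midpoint, worst_case)  # 12.0 19.5 27.0
```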

That 30-second recovery window stands against the 2–4 hour human-intervention recovery window in hub-and-spoke deployments. For a facility monitoring 200 sensors across three zones, that is the difference between losing under half a minute of telemetry from the affected nodes and losing hours of readings from every sensor behind the failed gateway.

Coverage Geometry: Why Physical Topology Matters

There is a geometry problem in hub-and-spoke deployments that becomes visible in large facilities. A single gateway has a finite RF coverage radius, typically 30–100 meters for an 802.15.4 radio, depending on obstruction density. In a warehouse with steel racking, concrete columns, and refrigerated compartments, the effective radius shrinks considerably. Covering 150,000 square feet with a hub-and-spoke architecture requires either multiple gateways (each representing a separate single point of failure) or extending the range with signal repeaters (each repeater adding latency and complexity).

Mesh topology addresses coverage geometry differently. Each node extends the network's effective range by acting as a relay for nodes beyond the gateway's direct coverage radius. A 20-node mesh deployment can cover the same floor area as a hub-and-spoke system with 8 gateways, with better path redundancy and lower individual-node cost. And because any node can relay traffic for any other node, there is no architectural equivalent of "losing a gateway and all its children."

Performance Under Partial Failure

One of the questions we get frequently from facility operators evaluating mesh deployments is: what happens when 20% of the nodes go offline simultaneously? It is a fair question — in a real facility, power events, RF interference from large motors, or a firmware update in progress can take multiple nodes offline at the same time.

The answer depends on node density. A mesh with adequate node density (generally one node per 800–1,200 square feet in a facility with standard obstructions) maintains telemetry delivery even with 30% of nodes simultaneously unavailable, which is well past the 20% scenario operators usually ask about. The routing algorithm finds paths around the gaps, with slightly increased hop counts and modestly higher per-packet latency.

In contrast, a hub-and-spoke deployment with 30% of its gateways offline loses coverage in exact proportion: 30% of the facility goes dark, with no automatic recovery path.
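The contrast can be explored with a toy simulation. The sketch below is an illustration under loud assumptions, not a model of any real deployment: nodes sit on a regular grid, each node can hear neighbors within `radio_range` grid steps, the uplink node at position (0, 0) stays powered, and failures are chosen uniformly at random. All names and parameters are hypothetical.

```python
import random
from collections import deque
from typing import Tuple

Node = Tuple[int, int]


def reachable_fraction(
    rows: int, cols: int, offline_fraction: float, radio_range: int = 1, seed: int = 0
) -> float:
    """Fraction of surviving mesh nodes that still have a multi-hop path to the uplink."""
    if not 0.0 <= offline_fraction <= 1.0:
        raise ValueError("offline_fraction must be between 0 and 1")
    rng = random.Random(seed)
    uplink: Node = (0, 0)  # assumption: the uplink node itself stays online
    candidates = [(r, c) for r in range(rows) for c in range(cols) if (r, c) != uplink]
    offline = set(rng.sample(candidates, round(offline_fraction * len(candidates))))
    alive = (set(candidates) - offline) | {uplink}

    # Breadth-first flood outward from the uplink across surviving nodes.
    seen = {uplink}
    queue = deque([uplink])
    while queue:
        r, c = queue.popleft()
        for dr in range(-radio_range, radio_range + 1):
            for dc in range(-radio_range, radio_range + 1):
                neighbor = (r + dr, c + dc)
                if neighbor in alive and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)

    survivors = len(alive) - 1  # exclude the uplink itself
    return (len(seen) - 1) / survivors if survivors else 0.0


def hub_and_spoke_coverage(offline_fraction: float) -> float:
    """Sensors behind a dead gateway have no alternative path, so loss is proportional."""
    return 1.0 - offline_fraction


mesh = reachable_fraction(rows=10, cols=10, offline_fraction=0.30, radio_range=2, seed=1)
print(f"mesh: {mesh:.0%} of surviving nodes still reach the uplink")
print(f"hub-and-spoke: {hub_and_spoke_coverage(0.30):.0%} of sensors still reporting")
```

The mesh figure depends on the seed, the grid size, and above all on `radio_range`, which plays the role of node density here: with a range of one step, the same failure rate can cut off more nodes. That sensitivity is the point of the density guideline.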

Where Hub-and-Spoke Still Makes Sense

It would not be accurate to say mesh is always the right answer. Hub-and-spoke is a reasonable choice in a small, well-defined environment: a single server room with 12 sensors and a reliable power infrastructure, or a compact office suite where the gateway is on the same UPS circuit as the critical equipment it monitors. When the scale is small and the gateway's power reliability is high, the single-point failure risk is manageable.

Hub-and-spoke also has a lower up-front hardware cost at small scale. One gateway serving 20 sensors costs less than a 6-node mesh serving the same sensors. The break-even — where mesh's resilience value justifies its hardware cost differential — typically occurs around 40–60 sensors or 10,000 square feet of coverage area, depending on the facility's failure tolerance and compliance requirements.

For facilities above that threshold, or facilities where a monitoring gap has direct compliance or spoilage consequences, mesh topology is the architecture that actually delivers the uptime that the IoT deployment was installed to provide.

Routing Overhead: The Technical Trade-Off

Mesh topology is not free. Maintaining a distributed routing table, propagating routing updates, and negotiating path failover all consume node processing time and radio bandwidth. In a dense mesh with 50+ nodes, routing protocol overhead can consume 10–15% of available channel capacity. That overhead is the cost of resilience — it is the structural reason that the network can heal itself.

For most facility sensor applications (temperature, humidity, vibration, power draw), the data rates are low enough that routing overhead is not a practical constraint. A vibration sensor reporting at 2 Hz generates roughly 8 bytes per reading, or about 16 bytes per second of payload; that traffic remains a very small fraction of the channel's capacity even after routing overhead is subtracted. The bandwidth cost of mesh routing only becomes a real consideration in applications with continuous high-frequency data streams, such as audio or video, which are outside the use case for facility environmental monitoring.
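A back-of-the-envelope budget shows the scale involved. The 250 kbit/s figure below is the nominal rate of the 2.4 GHz IEEE 802.15.4 PHY and is our assumption, since the band is not specified here; the calculation counts sensor payload only and ignores frame headers, acknowledgements, and retries, so treat it as an order-of-magnitude sketch with hypothetical function names.

```python
NOMINAL_CHANNEL_BPS = 250_000  # assumed: nominal 2.4 GHz IEEE 802.15.4 data rate


def payload_bps(reading_bytes: int, readings_per_second: float) -> float:
    """Raw sensor payload in bits per second (no framing or retries)."""
    return reading_bytes * 8 * readings_per_second


def share_of_usable_channel(
    bps: float,
    channel_bps: float = NOMINAL_CHANNEL_BPS,
    routing_overhead: float = 0.15,  # upper end of the 10-15% range quoted above
) -> float:
    """Fraction of the channel left after routing overhead that this traffic consumes."""
    usable_bps = channel_bps * (1.0 - routing_overhead)
    return bps / usable_bps


vibration_bps = payload_bps(reading_bytes=8, readings_per_second=2)
print(f"payload: {vibration_bps} bit/s")
print(f"share of usable channel: {share_of_usable_channel(vibration_bps):.4%}")
```

Even with the full 15% reserved for routing, one such sensor's payload occupies well under a tenth of a percent of the remaining nominal capacity.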

The architecture decision comes down to this: hub-and-spoke is simpler and cheaper to deploy at small scale, but its failure mode is total loss of visibility in the affected zone. Mesh is more complex to design correctly, but its failure mode is graceful degradation with automatic recovery. For a facility operator whose monitoring system needs to be on when the facility needs it most — during an equipment fault, a power event, or an environmental excursion — the failure mode is the decision criterion that matters most.