Cross-cloud business continuity for an ERP, and what we'd skip in 2026.
Our 2022 BCM design for a SG real-estate agency's ERP — primary on AWS, secondary on Azure, JSON sync APIs, a Windows resync service — passed an SGX listing audit and has run ever since. Here is what we'd cut, what we'd keep, and the SG listing-readiness angle.
In 2022 we designed and implemented a Business Continuity Management plan for a Singapore real-estate agency’s ERP, ahead of an SGX listing audit. The agency manages 2,000+ active agents and processes commission calculations daily. The architecture: primary on AWS, secondary on Azure, JSON web APIs forwarding every Create / Update / Delete asynchronously, a Windows backend service running every 10 minutes to repair sync drift, plus IP restrictions and SMS / Authenticator MFA.
It passed the audit. It still runs. Here’s what we’d build differently if we got the same brief in 2026.
What the original got right
- Cross-cloud, not cross-region. A region failure on a single provider is not the only thing that takes you down — provider-wide control-plane incidents do too. Cross-cloud insulated the agency against a category of outages region-redundancy doesn’t.
- Async write-forwarding API. Rather than synchronous replication, the primary writes locally and queues a forwarded write to the secondary. The primary stays fast; the secondary catches up.
- A self-healing reconciliation service. A timer that walks tables every 10 minutes, comparing checksums, was the thing that saved us from silent drift.
- MFA on top of IP allow-listing. Belt-and-braces; an audit-defensible posture.
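The async write-forwarding pattern above can be sketched in a few lines — an in-memory illustration under our own naming, not the agency's actual code; the queue stands in for whatever transport carries forwarded writes to the secondary:

```python
import queue
import threading

class PrimaryStore:
    """Primary writes locally first, then enqueues a forwarded write.

    The caller never waits on the secondary; a background worker drains
    the outbox and replays each write against the secondary, so the
    primary stays fast and the secondary catches up. (Illustrative
    sketch only -- class and method names are ours.)
    """

    def __init__(self, secondary):
        self.data = {}
        self.secondary = secondary
        self.outbox = queue.Queue()
        threading.Thread(target=self._forward, daemon=True).start()

    def upsert(self, key, value):
        self.data[key] = value                    # local write: fast path
        self.outbox.put(("upsert", key, value))   # async forward

    def delete(self, key):
        self.data.pop(key, None)
        self.outbox.put(("delete", key, None))

    def _forward(self):
        while True:
            op, key, value = self.outbox.get()
            if op == "upsert":
                self.secondary[key] = value
            else:
                self.secondary.pop(key, None)
            self.outbox.task_done()

secondary = {}
primary = PrimaryStore(secondary)
primary.upsert("deal-42", {"commission": 1500})
primary.outbox.join()   # tests only: wait for the secondary to catch up
assert secondary["deal-42"]["commission"] == 1500
```

The `outbox.join()` is for the demo; in the real pattern the secondary is simply eventually consistent, which is exactly why a reconciliation pass existed.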
What we’d change in 2026
1. Move the sync layer to a managed message bus
We hand-rolled the sync API in 2022 because the managed options at the time felt expensive for the volume. In 2026 we'd use Amazon MQ / EventBridge on the AWS side and Azure Service Bus on the Azure side, with a small adapter in between. Cheaper, more reliable, and with built-in dead-letter handling that we previously had to write ourselves.
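For flavour, here is the shape of the dead-letter logic a managed bus gives you natively (via redrive / dead-letter policies) and that we maintained by hand in 2022 — a stdlib sketch with illustrative names, not the production code:

```python
def deliver_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times; park failures in a DLQ.

    Managed buses (SQS, Azure Service Bus) implement this for you;
    this sketch only shows the logic we had to own ourselves.
    Returns the dead-letter list for inspection / replay.
    """
    dead_letters = []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                break                      # delivered; next message
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append({"message": msg, "error": str(exc)})
    return dead_letters
```

The painful parts in practice were the ones this sketch omits: persistence of the DLQ across restarts, backoff between attempts, and alerting — all table stakes on a managed bus.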
2. Drop the 10-minute reconciliation, mostly
The reconciliation service ran every 10 minutes and was, in retrospect, fixing the same handful of write-forwarding bugs in our own code. Once the bugs were squashed and the sync layer sat on a managed bus, the reconciliation service ran for a year without finding drift. We now run it weekly, as an audit signal, not as a corrective control.
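The reconciliation pass itself is simple: walk each table, checksum rows on both sides, and repair mismatches. A minimal sketch (helper names are ours, not the Windows service's) whose return value doubles as the audit signal — an empty list per run is the evidence you want on file:

```python
import hashlib
import json

def row_checksum(row):
    """Stable digest of a row; key order must not affect the result."""
    payload = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def reconcile(primary, secondary):
    """Make the secondary match the primary; return the keys that drifted.

    Copies rows that are missing or whose checksum differs, and removes
    rows the primary no longer has. Dicts stand in for tables here.
    """
    drifted = []
    for key, row in primary.items():
        if key not in secondary or row_checksum(secondary[key]) != row_checksum(row):
            secondary[key] = dict(row)
            drifted.append(key)
    for key in list(secondary.keys() - primary.keys()):
        del secondary[key]
        drifted.append(key)
    return drifted
```

On real tables you would checksum in pages rather than row-by-row, but the control is the same: a second, independent path that can prove the copies agree.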
3. Treat the secondary as test traffic, not just standby
In 2022 the secondary cluster ran cold — bits and bytes intact, but no real users hitting it. In 2024 we changed that: a small percentage of read-heavy traffic is routed to the secondary in normal operation. The secondary is now known to work every day, not just on the day we need it.
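Routing a slice of read traffic to the secondary can be as simple as a deterministic hash split — a sketch under assumed names, not the production router. Hashing the user id (rather than choosing randomly per request) pins each user to one side, which makes any replication-lag symptom reproducible for that user:

```python
import hashlib

def pick_read_replica(user_id, secondary_pct=5):
    """Send roughly secondary_pct% of readers to the secondary.

    The SHA-256 of the user id is reduced to a bucket 0-99; buckets
    below the threshold read from the secondary. Deterministic, so a
    given user always lands on the same side between deploys.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "secondary" if bucket < secondary_pct else "primary"
```

Writes still go to the primary only; this exercises the secondary's read path and its data freshness every day.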
4. SMS-based MFA is no longer acceptable
In 2022 we offered SMS or Authenticator. In 2026, SMS-based MFA is below the bar for any workload with this much PII. We migrated the agency to a TOTP-only posture in 2023 and would not start a new build with SMS today.
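TOTP itself is small enough to show in full: the RFC 6238 algorithm needs nothing beyond the Python standard library. A sketch for illustration — in production, lean on the platform's MFA service rather than rolling your own:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, at=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, truncated.

    secret_b32 is the base32 secret the authenticator app enrolled;
    `at` overrides the clock for testing. Returns a zero-padded code.
    """
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((at if at is not None else time.time()) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)
```

The point of showing it: a TOTP code is derived locally from a shared secret and the clock, so unlike SMS there is no delivery channel to hijack.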
5. Add agent-identity to the audit story
If the ERP has any AI-driven feature (we added a draft-commission-statement helper in 2024), the agent gets its own scoped identity and its own audit log entry per action. MGF-aligned by construction; useful for audit; cheap to add early.
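The per-action audit entry is just disciplined structured logging: the AI helper acts under its own service identity with an explicit scope, never a borrowed human credential, so an auditor can separate its writes from any human agent's. A sketch with hypothetical field and identity names:

```python
import json
import time
import uuid

def audit_entry(actor, scope, action, target):
    """One append-only record per agent action.

    Field names are illustrative. The essentials: a distinct actor id
    for the AI agent (e.g. "svc:draft-helper"), the scope it was
    granted, and the specific target it touched.
    """
    return {
        "entry_id": str(uuid.uuid4()),
        "ts": time.time(),
        "actor": actor,     # e.g. "svc:draft-helper", never a human's login
        "scope": scope,     # e.g. "commission:draft"
        "action": action,   # e.g. "draft_statement"
        "target": target,   # e.g. "agent-1042/2024-06"
    }

def append_audit(log_path, entry):
    """Append one JSON line; JSONL keeps the log greppable and diffable."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Adding this on day one costs a helper function; retrofitting it after an auditor asks costs a quarter.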
The listing-readiness angle
This client’s BCM was driven by an SGX listing audit. A few things we’ve learned since about what auditors actually ask for SG listings:
- They will ask to see the failover runbook, not just the architecture diagram. Have one.
- They will ask when you last actually ran the failover. “Never, but we’re confident it works” is the wrong answer. Run a tabletop or a real failover at least annually.
- They will ask about data residency for SG-resident PII. Cross-cloud is fine if both regions are SG. Many of the painful audit moments come from a secondary that quietly placed data outside SG.
What we cut from the original
- The 10-minute reconciliation service as a recommendation. It is overkill for a clean architecture; it was patching our own bugs.
- SMS as an MFA option, full stop.
- The implied claim that BCM is “done” once the architecture is shipped. It isn’t. It’s done when the failover has been exercised end-to-end and a CFO has watched it happen.
What carries over unchanged
The instinct: cross-cloud over cross-region for high-stakes workloads, async over sync where the latency budget allows, and the discipline of treating the secondary as a real production environment.
— wGrow studio · migrated from Team-Notes #59