Cross-cloud business continuity for an ERP, and what we'd skip in 2026.
Our 2022 BCM design for a SG real-estate agency's ERP — primary on AWS, secondary on Azure, JSON sync APIs, a Windows resync service — passed an SGX listing audit and has run ever since. Here is what we'd cut, what we'd keep, and the SG listing-readiness angle.
In 2022 we designed and implemented a Business Continuity Management plan for a Singapore real-estate agency’s ERP, ahead of an SGX listing audit. The agency manages 2,000+ active agents and processes commission calculations daily. The architecture: primary on AWS, secondary on Azure, JSON web APIs forwarding every Create / Update / Delete asynchronously, a Windows backend service running every 10 minutes to repair sync drift, plus IP restrictions and SMS / Authenticator MFA.
It passed the audit. It still runs. Here’s what we’d build differently if we got the same brief in 2026.
What the original got right
- Cross-cloud, not cross-region. A region failure on a single provider is not the only thing that takes you down — provider-wide control-plane incidents do too. Cross-cloud insulated the agency against a category of outages region-redundancy doesn’t.
- Async write-forwarding API. Rather than synchronous replication, the primary writes locally and queues a forwarded write to the secondary. The primary stays fast; the secondary catches up.
- A self-healing reconciliation service. A timer that walks tables every 10 minutes, comparing checksums, was the thing that saved us from silent drift.
- MFA on top of IP allow-listing. Belt-and-braces; an audit-defensible posture.
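The async write-forwarding pattern above can be sketched in a few lines — an in-memory illustration under our own naming, not the agency's actual code; the queue stands in for whatever transport carries forwarded writes to the secondary:

```python
import queue
import threading

class PrimaryStore:
    """Primary writes locally first, then enqueues a forwarded write.

    The caller never waits on the secondary; a background worker drains
    the outbox and replays each write against the secondary, so the
    primary stays fast and the secondary catches up. (Illustrative
    sketch only -- class and method names are ours.)
    """

    def __init__(self, secondary):
        self.data = {}
        self.secondary = secondary
        self.outbox = queue.Queue()
        threading.Thread(target=self._forward, daemon=True).start()

    def upsert(self, key, value):
        self.data[key] = value                    # local write: fast path
        self.outbox.put(("upsert", key, value))   # async forward

    def delete(self, key):
        self.data.pop(key, None)
        self.outbox.put(("delete", key, None))

    def _forward(self):
        while True:
            op, key, value = self.outbox.get()
            if op == "upsert":
                self.secondary[key] = value
            else:
                self.secondary.pop(key, None)
            self.outbox.task_done()

secondary = {}
primary = PrimaryStore(secondary)
primary.upsert("deal-42", {"commission": 1500})
primary.outbox.join()   # tests only: wait for the secondary to catch up
assert secondary["deal-42"]["commission"] == 1500
```

The `outbox.join()` is for the demo; in the real pattern the secondary is simply eventually consistent, which is exactly why a reconciliation pass existed.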
What we’d change in 2026
1. Move the sync layer to a managed message bus
We hand-rolled the sync API in 2022 because the managed options at the time felt expensive for the volume. In 2026 we'd use Amazon MQ / EventBridge on the AWS side and Azure Service Bus on the Azure side, with a small adapter in between. Cheaper, more reliable, and with built-in dead-letter handling that we previously had to write ourselves.
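For flavour, here is the shape of the dead-letter logic a managed bus gives you natively (via redrive / dead-letter policies) and that we maintained by hand in 2022 — a stdlib sketch with illustrative names, not the production code:

```python
def deliver_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times; park failures in a DLQ.

    Managed buses (SQS, Azure Service Bus) implement this for you;
    this sketch only shows the logic we had to own ourselves.
    Returns the dead-letter list for inspection / replay.
    """
    dead_letters = []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                break                      # delivered; next message
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append({"message": msg, "error": str(exc)})
    return dead_letters
```

The painful parts in practice were the ones this sketch omits: persistence of the DLQ across restarts, backoff between attempts, and alerting — all table stakes on a managed bus.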
2. Drop the 10-minute reconciliation, mostly
The reconciliation service ran every 10 minutes and was, in retrospect, fixing the same handful of write-forwarding bugs in our own code. Once the bugs were squashed and the sync layer sat on a managed bus, the reconciliation service ran for a year without finding drift. We now run it weekly, as an audit signal, not as a corrective control.
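The reconciliation pass itself is simple: walk each table, checksum rows on both sides, and repair mismatches. A minimal sketch (helper names are ours, not the Windows service's) whose return value doubles as the audit signal — an empty list per run is the evidence you want on file:

```python
import hashlib
import json

def row_checksum(row):
    """Stable digest of a row; key order must not affect the result."""
    payload = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def reconcile(primary, secondary):
    """Make the secondary match the primary; return the keys that drifted.

    Copies rows that are missing or whose checksum differs, and removes
    rows the primary no longer has. Dicts stand in for tables here.
    """
    drifted = []
    for key, row in primary.items():
        if key not in secondary or row_checksum(secondary[key]) != row_checksum(row):
            secondary[key] = dict(row)
            drifted.append(key)
    for key in list(secondary.keys() - primary.keys()):
        del secondary[key]
        drifted.append(key)
    return drifted
```

On real tables you would checksum in pages rather than row-by-row, but the control is the same: a second, independent path that can prove the copies agree.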
3. Treat the secondary as test traffic, not just standby
In 2022 the secondary cluster ran cold — bits and bytes intact, but no real users hitting it. In 2024 we changed that: a small percentage of read-heavy traffic is routed to the secondary in normal operation. The secondary is now known to work every day, not just on the day we need it.
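Routing a slice of read traffic to the secondary can be as simple as a deterministic hash split — a sketch under assumed names, not the production router. Hashing the user id (rather than choosing randomly per request) pins each user to one side, which makes any replication-lag symptom reproducible for that user:

```python
import hashlib

def pick_read_replica(user_id, secondary_pct=5):
    """Send roughly secondary_pct% of readers to the secondary.

    The SHA-256 of the user id is reduced to a bucket 0-99; buckets
    below the threshold read from the secondary. Deterministic, so a
    given user always lands on the same side between deploys.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "secondary" if bucket < secondary_pct else "primary"
```

Writes still go to the primary only; this exercises the secondary's read path and its data freshness every day.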
4. SMS-based MFA is no longer acceptable
In 2022 we offered SMS or Authenticator. In 2026, SMS-based MFA is below the bar for any workload with this much PII. We migrated the agency to a TOTP-only posture in 2023 and would not start a new build with SMS today.
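TOTP itself is small enough to show in full: the RFC 6238 algorithm needs nothing beyond the Python standard library. A sketch for illustration — in production, lean on the platform's MFA service rather than rolling your own:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, at=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, truncated.

    secret_b32 is the base32 secret the authenticator app enrolled;
    `at` overrides the clock for testing. Returns a zero-padded code.
    """
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((at if at is not None else time.time()) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)
```

The point of showing it: a TOTP code is derived locally from a shared secret and the clock, so unlike SMS there is no delivery channel to hijack.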
5. Add agent-identity to the audit story
If the ERP has any AI-driven feature (we added a draft-commission-statement helper in 2024), the agent gets its own scoped identity and its own audit log entry per action. MGF-aligned by construction; useful for audit; cheap to add early.
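The per-action audit entry is just disciplined structured logging: the AI helper acts under its own service identity with an explicit scope, never a borrowed human credential, so an auditor can separate its writes from any human agent's. A sketch with hypothetical field and identity names:

```python
import json
import time
import uuid

def audit_entry(actor, scope, action, target):
    """One append-only record per agent action.

    Field names are illustrative. The essentials: a distinct actor id
    for the AI agent (e.g. "svc:draft-helper"), the scope it was
    granted, and the specific target it touched.
    """
    return {
        "entry_id": str(uuid.uuid4()),
        "ts": time.time(),
        "actor": actor,     # e.g. "svc:draft-helper", never a human's login
        "scope": scope,     # e.g. "commission:draft"
        "action": action,   # e.g. "draft_statement"
        "target": target,   # e.g. "agent-1042/2024-06"
    }

def append_audit(log_path, entry):
    """Append one JSON line; JSONL keeps the log greppable and diffable."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Adding this on day one costs a helper function; retrofitting it after an auditor asks costs a quarter.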
The listing-readiness angle
This client’s BCM was driven by an SGX listing audit. A few things we’ve learned since about what auditors actually ask for SG listings:
- They will ask to see the failover runbook, not just the architecture diagram. Have one.
- They will ask when you last actually ran the failover. “Never, but we’re confident it works” is the wrong answer. Run a tabletop or a real failover at least annually.
- They will ask about data residency for SG-resident PII. Cross-cloud is fine if both regions are SG. Many of the painful audit moments come from a secondary that quietly placed data outside SG.
What we cut from the original
- The 10-minute reconciliation service as a recommendation. It is overkill for a clean architecture; it was patching our own bugs.
- SMS as an MFA option, full stop.
- The implied claim that BCM is “done” once the architecture is shipped. It isn’t. It’s done when the failover has been exercised end-to-end and a CFO has watched it happen.
What carries over unchanged
The instinct: cross-cloud over cross-region for high-stakes workloads, async over sync where the latency budget allows, and the discipline of treating the secondary as a real production environment.
— wGrow studio · migrated from Team-Notes #59