Field note · Migrated 2026-03-30 · 6 min

Cross-cloud business continuity for an ERP, and what we'd skip in 2026.

Our 2022 BCM design for a Singapore real-estate agency's ERP (primary on AWS, secondary on Azure, JSON sync APIs, Windows resync service) passed an audit and has run for more than three years. Here is what we'd cut, what we'd keep, and the SG listing-readiness angle.

In 2022 we designed and implemented a Business Continuity Management plan for a Singapore real-estate agency’s ERP, ahead of an SGX listing audit. The agency manages 2,000+ active agents and processes commission calculations daily. The architecture: primary on AWS, secondary on Azure, JSON web APIs forwarding every Create / Update / Delete asynchronously, a Windows backend service running every 10 minutes to repair sync drift, plus IP restrictions and SMS / Authenticator MFA.

It passed the audit. It still runs. Here’s what we’d build differently if we got the same brief in 2026.

What the original got right

  • Cross-cloud, not cross-region. A region failure on a single provider is not the only thing that takes you down — provider-wide control-plane incidents do too. Cross-cloud insulated the agency against a category of outages region-redundancy doesn’t.
  • Async write-forwarding API. Rather than synchronous replication, the primary writes locally and queues a forwarded write to the secondary. The primary stays fast; the secondary catches up.
  • A self-healing reconciliation service. A timer that walks tables every 10 minutes, comparing checksums, was the thing that saved us from silent drift.
  • MFA on top of IP allow-listing. Belt-and-braces; an audit-defensible posture.
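The async write-forwarding pattern in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the original sync API: `WriteForwarder`, `ChangeEvent`, and the `send_to_secondary` callable are hypothetical names, the dictionary stands in for the primary database, and the in-process queue is an assumption (a production outbox would need to be durable).

```python
import json
import queue
import threading
from dataclasses import asdict, dataclass


@dataclass
class ChangeEvent:
    op: str        # "create", "update", or "delete"
    table: str
    key: str
    payload: dict


class WriteForwarder:
    """Apply writes locally, then forward them to the secondary in the background."""

    def __init__(self, send_to_secondary):
        self._send = send_to_secondary   # hypothetical callable that ships JSON to the secondary
        self._outbox = queue.Queue()     # assumption: in-memory only, not durable
        self.local = {}                  # stand-in for the primary database
        self.failed = []                 # events the secondary did not accept
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, event: ChangeEvent) -> None:
        # Commit on the primary first; the caller never waits on the secondary.
        if event.op == "delete":
            self.local.pop((event.table, event.key), None)
        else:
            self.local[(event.table, event.key)] = event.payload
        self._outbox.put(event)

    def _drain(self) -> None:
        while True:
            event = self._outbox.get()
            try:
                self._send(json.dumps(asdict(event)))
            except Exception:
                # Keep the event so a later reconciliation pass can repair the gap.
                self.failed.append(event)
            finally:
                self._outbox.task_done()

    def flush(self) -> None:
        """Block until every queued event has been attempted (useful in tests)."""
        self._outbox.join()
```

A caller would invoke `write(...)` on every Create / Update / Delete and return to the user immediately; anything that lands in `failed` is exactly the kind of gap a reconciliation pass exists to close.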

What we’d change in 2026

1. Move the sync layer to a managed message bus

We hand-rolled the sync API in 2022 because the managed options at the time felt expensive for the volume. In 2026 we’d use Amazon MQ or EventBridge on the AWS side and Azure Service Bus on the Azure side, with a small adapter in between. That would be cheaper and more reliable, and it comes with the dead-letter handling we had to write ourselves.
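To make "dead-letter handling" concrete, here is a rough sketch of what the adapter between the two buses has to do when a publish keeps failing. It deliberately uses no cloud SDK: `Relay` and its `publish` callable are hypothetical stand-ins, and the retry count is an assumed default. Managed buses provide this behaviour natively, which is the point of the change.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Relay:
    """Hypothetical adapter: take a message from one bus, publish it to the other."""

    publish: Callable[[dict], None]   # assumed publish function for the target bus
    max_attempts: int = 3             # assumed retry budget
    dead_letters: list = field(default_factory=list)

    def handle(self, message: dict) -> bool:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.publish(message)
                return True
            except Exception as exc:
                last_error = exc
        # Retries exhausted: park the message rather than drop it or block the stream.
        self.dead_letters.append(
            {"message": message, "error": repr(last_error), "attempts": self.max_attempts}
        )
        return False
```

The design choice worth noting is that a poisoned message is set aside with its error attached, so the rest of the stream keeps flowing and someone can inspect and replay the parked items later.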

2. Drop the 10-minute reconciliation, mostly

The reconciliation service ran every 10 minutes and was, in retrospect, fixing the same handful of write-forwarding bugs in our own code. Once the bugs were squashed and the message bus was managed, the reconciliation service ran for a year without finding drift. We now run it weekly, as an audit signal, not as a corrective control.
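For readers who have not built one, a checksum-based drift check can be sketched as below. This is an illustration under assumptions, not the Windows service itself: `row_checksum` and `find_drift` are hypothetical helpers, and each table is assumed to fit in memory as a `{primary_key: row}` dictionary.

```python
import hashlib
import json


def row_checksum(row: dict) -> str:
    """Hash a row in a canonical form so column order does not matter."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(primary: dict, secondary: dict) -> dict:
    """Compare two {primary_key: row} snapshots of the same table."""
    missing = sorted(primary.keys() - secondary.keys())   # never arrived on the secondary
    extra = sorted(secondary.keys() - primary.keys())     # deleted on primary, still on secondary
    changed = sorted(
        key
        for key in primary.keys() & secondary.keys()
        if row_checksum(primary[key]) != row_checksum(secondary[key])
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

Run as an audit signal, the interesting output is an empty report: three empty lists week after week is evidence that the forwarding path is clean.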

3. Treat the secondary as test traffic, not just standby

In 2022 the secondary cluster ran cold — bits and bytes intact, but no real users hitting it. In 2024 we changed that: a small percentage of read-heavy traffic is routed to the secondary in normal operation. The secondary is now known to work every day, not just on the day we need it.
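One way to send a small, stable slice of reads to the secondary is to hash a request or session identifier into a bucket. The sketch below assumes a hypothetical `pick_backend` function and a 5% default, neither of which comes from the original system; writes always stay on the primary.

```python
import hashlib


def pick_backend(request_id: str, is_read: bool, secondary_percent: int = 5) -> str:
    """Route a fixed percentage of reads to the secondary; everything else to the primary."""
    if not is_read:
        return "primary"
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return "secondary" if bucket < secondary_percent else "primary"
```

Hashing rather than picking at random keeps the same identifier on the same backend between requests. Because replication is asynchronous, reads served this way can lag the primary slightly, so this fits read-heavy pages that tolerate a short delay.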

4. SMS-based MFA is no longer acceptable

In 2022 we offered SMS or Authenticator. In 2026, SMS-based MFA is below the bar for any workload with this much PII. We migrated the agency to a TOTP-only posture in 2023 and would not start a new build with SMS today.
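For context on what a TOTP-only posture checks, here is a compact sketch of time-based one-time passwords as specified in RFC 6238 (HMAC-SHA1, 30-second steps), using only the Python standard library. The function names and the one-step tolerance window are illustrative assumptions, not the agency's implementation; a production system would normally rely on a vetted library and also handle secret storage and rate limiting.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Compute the RFC 6238 code for the time step containing `at` (default: now)."""
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, candidate: str, at: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps either side (clock skew)."""
    now = time.time() if at is None else at
    return any(
        hmac.compare_digest(totp(secret, now + drift * step, step), candidate)
        for drift in range(-window, window + 1)
    )
```

With the RFC 6238 test secret `b"12345678901234567890"` and `at=59`, the 8-digit code is `94287082`, which matches the published test vector.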

5. Add agent-identity to the audit story

If the ERP has any AI-driven feature (we added a draft-commission-statement helper in 2024), the agent gets its own scoped identity and its own audit log entry per action. MGF-aligned by construction; useful for audit; cheap to add early.
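A scoped agent identity with one audit entry per action might look like the sketch below. All names here (`AgentIdentity`, `AuditedAgent`, the scope strings, the log fields) are hypothetical; the audit sink is a plain list standing in for whatever append-only store the ERP already uses.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str        # distinct from any human user account
    scopes: frozenset    # the only actions this agent may perform


class AuditedAgent:
    def __init__(self, identity: AgentIdentity, audit_log: list):
        self.identity = identity
        self.audit_log = audit_log   # assumption: any append-only sink; a list here

    def act(self, action: str, resource: str) -> bool:
        """Record every attempt, allowed or denied, before reporting the decision."""
        allowed = action in self.identity.scopes
        self.audit_log.append(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": self.identity.agent_id,
            "actor_type": "ai_agent",
            "action": action,
            "resource": resource,
            "allowed": allowed,
        }))
        return allowed
```

Logging denied attempts as well as allowed ones is deliberate: an auditor can then see both what the agent did and what it tried to do outside its scope.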

The listing-readiness angle

This client’s BCM was driven by an SGX listing audit. A few things we’ve learned since about what auditors actually ask for SG listings:

  • They will ask to see the failover runbook, not just the architecture diagram. Have one.
  • They will ask when you last actually ran the failover. “Never, but we’re confident it works” is the wrong answer. Run a tabletop or a real failover at least annually.
  • They will ask about data residency for SG-resident PII. Cross-cloud is fine if both providers' regions are in Singapore. Many of the painful audit moments come from a secondary that quietly placed data outside SG.
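The residency question in the last bullet lends itself to an automated check. The sketch below assumes a hypothetical resource inventory (a list of dictionaries with `name`, `provider`, and `region`); the two region identifiers are the Singapore regions for AWS (`ap-southeast-1`) and Azure (`southeastasia`).

```python
# Regions treated as Singapore-resident for each provider.
SG_REGIONS = {
    "aws": {"ap-southeast-1"},     # AWS Asia Pacific (Singapore)
    "azure": {"southeastasia"},    # Azure Southeast Asia (Singapore)
}


def residency_violations(resources: list) -> list:
    """Return the names of resources that sit outside the Singapore regions."""
    return [
        r["name"]
        for r in resources
        if r["region"].lower() not in SG_REGIONS.get(r["provider"].lower(), set())
    ]


# Hypothetical inventory; in practice this would come from each provider's resource listing.
inventory = [
    {"name": "erp-db-primary", "provider": "aws", "region": "ap-southeast-1"},
    {"name": "erp-db-secondary", "provider": "azure", "region": "southeastasia"},
    {"name": "erp-backup-vault", "provider": "azure", "region": "eastasia"},
]
```

Running `residency_violations(inventory)` on this example flags only `erp-backup-vault`. Backups, replicas, and log stores are worth including in the inventory, since those are the places a secondary tends to drift out of region unnoticed.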

What we cut from the original

  • The 10-minute reconciliation service as a recommendation. It is overkill for a clean architecture; it was patching our own bugs.
  • SMS as an MFA option, full stop.
  • The implied claim that BCM is “done” once the architecture is shipped. It isn’t. It’s done when the failover has been exercised end-to-end and a CFO has watched it happen.

What carries over unchanged

The instinct: cross-cloud over cross-region for high-stakes workloads, async over sync where the latency budget allows, and the discipline of treating the secondary as a real production environment.

— wGrow studio · migrated from Team-Notes #59