For many years, annual disaster recovery (DR) testing has been the accepted way for organisations, particularly in regulated sectors, to demonstrate that systems and data can be recovered following a major outage or cyber incident.
Yet, despite most annual tests passing, confidence in recoverability often erodes rapidly between those exercises. The problem is not intent. It is context.
The operating environments that DR testing was designed for no longer exist.
Modern technology estates change continuously. Cloud platforms evolve weekly. Identity rules, integrations and security controls drift quietly in the background. Data protection architectures shift. External dependencies appear and disappear. Against that backdrop, a recovery test performed once a year inevitably creates a growing confidence gap.
Under both the UK operational resilience regime and DORA, firms are expected to demonstrate that they can remain within impact tolerances under severe but plausible scenarios. That expectation assumes ongoing confidence in recovery capability, not point-in-time reassurance.
Why Annual DR Testing Struggles to Keep Pace
In practice, annual testing falls short for three simple reasons.
First, evidence decays.
The further you move away from a test date, the less representative the result becomes of the current environment.
Second, change outpaces validation.
Most organisations experience dozens of material changes between annual tests, few of which trigger any re-validation of recovery paths.
Third, risk is concentrated.
Issues are typically discovered late, during scheduled exercises, rather than when they first emerge.
The uncomfortable question is no longer whether the last DR test passed. It is whether that result still provides meaningful confidence today.
A Recovered Application Is Not a Recovered Service
Once we accept that annual testing is infrequent relative to the pace of change, a second challenge becomes harder to ignore.
What exactly are we proving when a DR test passes?
In many organisations, success is declared when an application comes back online. Servers are available. Services start. Monitoring looks clean. But restore success is not the same thing as service recovery. Availability is a technical state. Resilience is a business outcome.
Applications do not deliver value simply by running. They depend on data, access controls, integrations, sequencing and timing. If those elements are missing, degraded or misaligned, users may be unable to operate, transact or serve customers, even though the technology appears healthy.
This distinction matters because operational resilience is about the continuity of Important Business Services, not infrastructure uptime.
In practice, many DR tests stop too early. They confirm that systems can be restored, but do not always validate that:
- users can authenticate as expected
- data is accessible and usable
- integrations behave correctly
- service performance remains within tolerance
A restored application that users cannot operate is not resilient; it is merely available.
As services become more complex, resilience assurance must move beyond binary “up or down” outcomes toward evidence that services genuinely function.
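The usability checks described above can be sketched as a simple validation harness. This is a minimal illustration, not any vendor's product or API; the check names and outcomes are entirely hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str

def validate_recovered_service(
    checks: dict[str, Callable[[], tuple[bool, str]]]
) -> list[CheckResult]:
    """Run each post-restore check; the service only counts as
    recovered if every check passes, not merely if hosts are up."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check()
        except Exception as exc:  # a crashing check is a failed check
            passed, detail = False, f"check raised: {exc}"
        results.append(CheckResult(name, passed, detail))
    return results

# Hypothetical checks mirroring the list above (all stubbed here):
checks = {
    "users_can_authenticate": lambda: (True, "test login succeeded"),
    "data_is_usable": lambda: (True, "row counts match baseline"),
    "integrations_respond": lambda: (False, "payments API returned 503"),
    "performance_within_tolerance": lambda: (True, "p95 latency in target"),
}

results = validate_recovered_service(checks)
service_recovered = all(r.passed for r in results)
print(f"service recovered: {service_recovered}")  # one failed check fails the service
```

The point of the sketch is the shape of the control: availability is one check among several, and a single failed usability check means the service is not recovered, however healthy the infrastructure looks.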
From Periodic Reassurance to Continuous Confidence
Whilst regulation has not kept pace with the speed and complexity of modern systems (legacy regulatory frameworks remain static and increasingly performative), the technical capability to automate continuous, evidence-driven recovery testing now exists.
It is now possible to test recovery from real backups and snapshots on a recurring basis, without waiting for annual DR cycles. More importantly, this can be done in ways that validate not just technical restoration, but actual service usability.
When recovery testing becomes repeatable, organisations begin to learn how systems and services really behave under recovery conditions. Over time, generic recovery plans evolve into proven playbooks, grounded in observed outcomes.
This does not require testing everything all the time. A practical starting point is to focus on Important Business Services and repeatedly validate that the systems and data supporting them can still be recovered within tolerance as environments change.
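One way to operationalise "not everything all the time" is a simple revalidation policy: retest an Important Business Service when its recovery evidence is stale or when material changes have accumulated since the last successful validation. The sketch below is illustrative only; the thresholds and service names are assumptions, not a regulatory requirement.

```python
from datetime import date, timedelta

# Assumed policy thresholds (illustrative, tune to your own risk appetite):
MAX_EVIDENCE_AGE = timedelta(days=30)   # evidence older than this is stale
MAX_UNVALIDATED_CHANGES = 5             # material changes since last validation

def needs_revalidation(last_validated: date, changes_since: int, today: date) -> bool:
    """Flag a service whose recovery evidence has decayed, either
    through time passing or through unvalidated change volume."""
    stale = (today - last_validated) > MAX_EVIDENCE_AGE
    drifted = changes_since >= MAX_UNVALIDATED_CHANGES
    return stale or drifted

today = date(2025, 6, 1)
# Hypothetical services: (last successful validation, material changes since)
services = {
    "payments": (date(2025, 5, 20), 2),  # recent evidence, few changes
    "ledger":   (date(2025, 3, 1), 1),   # evidence has gone stale
    "billing":  (date(2025, 5, 25), 7),  # heavy change volume
}

due = [name for name, (validated, changes) in services.items()
       if needs_revalidation(validated, changes, today)]
print(due)  # → ['ledger', 'billing']
```

Even a crude policy like this turns "evidence decays" and "change outpaces validation" from abstract risks into a concrete, auditable trigger for the next test.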
For senior leaders, this transforms recovery testing from a scheduled exercise into an operational control, one that exposes degradation early, supports credible self-assessment and stands up to audit and regulatory scrutiny.
Resilience improves when recoverability is known, not assumed.
A Closer Look at IBM Power Estates
For organisations running IBM Power environments, whether IBM i, AIX or Linux, these challenges are often even more pronounced.
Business-critical workloads commonly sit at the heart of daily operations. Despite the stability and longevity of these platforms, they are not immune to modern cyber risk, silent data corruption or configuration drift.
Many Power estates rely heavily on replication technologies such as storage-based replication, PowerHA or tools like Mimix. Replication has an important role, but it also has a blind spot.
Replication faithfully copies whatever lands on the primary system, including corrupted or manipulated data.
Whether through SQL-level attacks, privilege escalation (as demonstrated by Silent Signal and discussed by IT Jungle) or silent corruption during batch processing, organisations are often unaware that their "good" copy is no longer good at all.
This is where operational resilience diverges sharply from traditional DR thinking.
Replication supports availability. Only validated, offline copies provide confidence that recovery is actually possible.
Why Testing Matters So Much on IBM i
Restoring an application is like getting all the musicians back on stage. Recovering the service is making sure they are playing in tune, in time and following the same score.
On IBM i, this goes well beyond a successful IPL. A system can start cleanly while critical components, such as access paths, are still rebuilding or have failed silently. When that happens, the application is technically “up”, but users cannot work.
If users cannot work, the business cannot operate.
Regular validation turns a hopeful plan, “this is what we think will work”, into a proven playbook, “this is what we know works”. In Power environments, the complexity of integrations, sequencing and data dependencies makes this essential.
Operational Resilience for IBM Power: Your Blueprint for Zero Downtime
Download this eBook today and learn how to future-proof your IBM Power workloads with predictable, automated and audit-ready recovery assurance.
Clean Rooms, Immutable Copies and Proven Recovery
Modern Power estates, on-premises or in cloud environments, can now take advantage of:
- immutable snapshots
- logical air-gapped copies
- clean-room recovery LPARs for safe testing
- isolated cloud-based recovery environments with granular access controls
But capability alone does not create resilience. Without regular validation, immutable copies are simply unknown quantities. Recoverability remains unproven.
Turning Confidence into an Operational Control
Across retail, financial services and manufacturing, a consistent pattern is emerging:
- backups and snapshots are being created, but not tested
- corruption is often discovered months after it occurs
- recovery times vary widely and are frequently far slower than expected
At Celerity, we increasingly see organisations complementing their annual DR exercises with automated, repeatable recovery validation from real snapshots, not as a compliance activity but as a practical response to operational reality.
This approach:
- validates recovery using real data and real environments
- confirms application and service usability, not just system availability
- produces continuous, audit-ready evidence aligned to DORA, NIS2 and operational resilience frameworks
- works alongside replication rather than replacing it
Replication gets you availability. Validated recovery gives you certainty.
Moving Beyond Annual Reassurance
Annual DR testing will continue to play an important role. But on its own, it increasingly delivers retrospective reassurance rather than real-time confidence. Resilience does not come from passing a test once. It comes from knowing that recovery still works, today.
For organisations running IBM Power workloads in particular, recoverability is too important to leave to assumption. If your last snapshot has not been validated, your recovery plan is, at best, a hypothesis. The challenge is no longer whether resilience should be proven. It is whether confidence can realistically be kept current.
Share Your Thoughts
If this resonates with your own experience of disaster recovery and operational resilience, I’d love to hear your perspective.
If you’d like to explore how you can build real, continuously-proven cyber resilience across your organisation, feel free to get in touch or book a conversation with me.
Let’s turn recoverability from an assumption into a measurable, operational control.