ICT Disaster Recovery Planning

Tabled: 29 November 2017

Audit overview

Information and communications technology (ICT) systems are critical for the operations of government agencies. Agencies depend on them to:

  • deliver public services—including essential services—to the community
  • efficiently and effectively manage operations
  • fulfil their statutory obligations.

To make sure their systems remain available and continue to operate reliably, agencies must be able to recover and restore them in the event of a disruption—such as an event that interrupts access to premises, to the data that systems rely on, or to the systems themselves. Further, agencies need to be able to recover and restore their systems within a time frame that reflects the business-critical nature of each system.

ICT disaster recovery is the process for recovering systems following a major disruption. ICT disaster recovery planning forms part of an agency's wider business continuity strategy.

Managing disaster recovery risk presents special challenges. The likelihood of a major disaster or significant disruption is generally low, often remote—but the consequences of a system failure that cannot be restored could be significant or even catastrophic.

Without effective disaster recovery capability, agencies risk:

  • extended disruption or inability to deliver public services that depend on systems
  • inability to recover systems and restore lost data
  • subsequent financial loss to themselves and the Victorian economy
  • reputational damage, including loss of community confidence in the effective delivery of government services.

Agencies can reduce the likelihood of disruption events; however, this approach can require significant investment compared to the direct costs of responding to a disruption when it occurs. It can therefore be challenging for agencies to determine the balance between focusing on preventative actions and planning to manage the consequences of possible disruptions.

In this audit, we examined disaster recovery at Victoria Police and four departments that provide essential government services—the Department of Economic Development, Jobs, Transport and Resources (DEDJTR), the Department of Environment, Land, Water and Planning (DELWP), the Department of Health and Human Services (DHHS) and the Department of Justice and Regulation (DJR).

We assessed whether their ICT disaster recovery processes are likely to be effective in the event of a disruption.

Conclusion

At present, none of the agencies we audited have sufficient assurance that they can recover and restore all of their critical systems to meet business requirements in the event of a disruption.

They do not have sufficient and necessary processes to identify, plan and recover their systems following a disruption. Compounding this is the relatively high number of obsolete ICT systems all agencies are still using to deliver some of their critical business functions. This both increases the likelihood of disruptions through hardware and software failure or external attack, and makes recovery more difficult and costly. These circumstances place critical business functions and the continued delivery of public services at an unacceptably high risk should a disruption occur.

Agencies are only just beginning to fully understand the importance of comprehensively identifying and prioritising their business functions, maintaining the ICT systems that support these functions, and establishing recovery arrangements to maintain continuity of service.

They need to significantly improve and develop well‑resourced and established processes that fully account for and can efficiently recover the critical business functions of agencies following a disruption.

Findings

Business impact analysis

None of the agencies' business impact analysis (BIA) processes are robust enough to identify and prioritise critical business functions and the recovery requirements for related ICT systems. The maturity of agencies' processes varies, and there are several common weaknesses:

  • not all business functions and related ICT systems are clearly identified and prioritised
  • systems' recovery requirements are only assessed in isolation, and system dependency requirements are not identified and considered
  • systems' recovery requirements determined by the business have not been aligned with ICT service delivery and system recovery capabilities.
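
To illustrate what assessing recovery requirements with dependencies in mind could look like, the following is a minimal sketch in Python. The system names, dependencies and recovery times are hypothetical and are not drawn from any audited agency. It derives a recovery order in which supporting systems come first, and flags cases where a system depends on another system that has a longer recovery time requirement.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical register: each system maps to the systems it depends on.
depends_on = {
    "case_management": {"database", "identity_service"},
    "database": {"storage"},
    "identity_service": set(),
    "storage": set(),
}

# Hypothetical recovery time requirements in hours, as set by the business.
rto_hours = {
    "case_management": 4,
    "database": 8,
    "identity_service": 2,
    "storage": 4,
}

# Recovery order: every system appears after the systems it depends on.
recovery_order = list(TopologicalSorter(depends_on).static_order())

# A conflict exists when a dependency may take longer to recover than the
# system relying on it is allowed to be unavailable.
conflicts = [
    (system, dependency)
    for system, dependencies in depends_on.items()
    for dependency in sorted(dependencies)
    if rto_hours[dependency] > rto_hours[system]
]

print(recovery_order)
print(conflicts)
```

In this made-up example, the case management system has a four-hour requirement but relies on a database with an eight-hour requirement, which is the kind of misalignment that assessing systems in isolation would miss.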

Agencies are either not performing BIA periodically, or their BIA does not have defined trigger events that prompt them to revise the analysis in response to changes at the agency—for example, a different operating environment, new services or an altered risk profile.

We measured agencies' BIA processes against the globally accepted model outlined in COBIT Process Assessment Model: Using COBIT 5, 2013 (the COBIT 5 model). This model assesses the capability of the processes using the scale shown in Figure A.

Figure A
COBIT 5 capability levels and descriptions

Capability level | Description
Capability rating 0 out of 5: Incomplete | Process is not in place or cannot achieve its objective
Capability rating 1 out of 5: Performed | Process is in place and achieves its purpose
Capability rating 2 out of 5: Managed | Process is implemented in a managed way and appropriately controlled and maintained
Capability rating 3 out of 5: Established | Process is implemented using a defined process that is capable of achieving its outcomes
Capability rating 4 out of 5: Predictable | Process operates consistently within defined limits to achieve its outcomes
Capability rating 5 out of 5: Optimised | Process is continuously improved to meet relevant current and projected enterprise goals

Source: VAGO, based on COBIT 5 ISO/IEC 15504 capability levels.

Figure B shows our assessment of the capability of the agencies' BIA processes.

Figure B
Capability of audited agencies' BIA process

Criterion | DEDJTR | DELWP | DHHS | DJR | Victoria Police
BIA process | 1 out of 5 | 2 out of 5 | 2 out of 5 | 3 out of 5 | 1 out of 5

Source: VAGO.

Without a robust BIA, agencies have difficulty determining which systems need disaster recovery capability and in what order they should recover systems. The immaturity of these BIA processes means agencies risk not being able to identify all the systems that support their critical business functions. Further, they risk not having the necessary disaster recovery capability to ensure that their ICT systems can provide continuous service or be recovered rapidly following a disruption.

In this report, our assessment is based on the critical systems that agencies have identified.

Disaster recovery processes

None of the agencies' disaster recovery processes are robust enough to effectively and efficiently recover all critical systems in the event of a disruption. Agencies' disaster recovery processes show similar degrees of capability.

We used the COBIT 5 model to assess agencies' disaster recovery processes, as shown in Figure C.

Figure C
Capability of audited agencies' disaster recovery processes

Criterion | DEDJTR | DELWP | DHHS | DJR | Victoria Police
Disaster recovery processes | 1 out of 5 | 1 out of 5 | 1 out of 5 | 1 out of 5 | 1 out of 5

Source: VAGO.

Across all the audited agencies, we identified that disaster recovery processes require improvement:

  • Agencies do not have an established, coordinated department-wide approach to ICT disaster recovery planning—instead, management of disaster recovery is decentralised and managed by individual business divisions.
  • Not all systems that support critical business functions have disaster recovery plans (84 out of 222 systems). Agencies have not performed a risk assessment to determine which critical systems need a disaster recovery plan or identified appropriate continuity processes for when systems are unavailable.

Disaster recovery testing

No agency is performing functional disaster recovery tests for all systems that support critical business functions and, when agencies do conduct testing, they are not performing it consistently.

No agency's functional disaster recovery testing verifies whether the agency can recover systems to meet the two key recovery objectives:

  • recovery time—the target time required for the recovery of an ICT system after a disruption
  • recovery point—the point in time to which an agency must restore data after a disruption, for example, restoring data to the end of the previous day's processing.

Agencies cannot verify whether their systems are able to meet these recovery objectives because their BIA fails to determine them.
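
As an illustration of what checking a functional test against the recovery time and recovery point objectives could involve, the following is a minimal sketch in Python. The class, field names and sample figures are hypothetical. The sketch assumes that the BIA has already set both objectives for the system being tested.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTest:
    """Timestamps recorded during a hypothetical functional recovery test."""
    disruption_at: datetime      # when the simulated disruption began
    restored_at: datetime        # when the system was back in service
    data_restored_to: datetime   # point in time of the most recent data recovered


def meets_objectives(test: RecoveryTest,
                     recovery_time_objective: timedelta,
                     recovery_point_objective: timedelta) -> dict:
    """Compare the measured recovery time and data loss with the objectives."""
    recovery_time = test.restored_at - test.disruption_at
    data_loss = test.disruption_at - test.data_restored_to
    return {
        "recovery_time_met": recovery_time <= recovery_time_objective,
        "recovery_point_met": data_loss <= recovery_point_objective,
    }


# Hypothetical test: six hours to restore, data recovered to the end of the
# previous day's processing (nine hours before the disruption).
example = RecoveryTest(
    disruption_at=datetime(2017, 6, 1, 9, 0),
    restored_at=datetime(2017, 6, 1, 15, 0),
    data_restored_to=datetime(2017, 6, 1, 0, 0),
)
result = meets_objectives(
    example,
    recovery_time_objective=timedelta(hours=4),
    recovery_point_objective=timedelta(hours=24),
)
print(result)  # {'recovery_time_met': False, 'recovery_point_met': True}
```

In this made-up case the data was restored to an acceptable point, but the system took longer to recover than the business requirement allows. Neither comparison is possible when the objectives have not been defined.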

Figure D shows the number of disaster recovery plans agencies have developed and tested for the systems that support critical business functions. Most do not have disaster recovery plans.

Figure D
Critical systems and disaster recovery plans

Column chart showing the critical systems that have tested disaster recovery plans

Note: DRP = disaster recovery plan.

Source: VAGO.

Without having disaster recovery plans and testing them regularly, agencies risk not being able to recover systems in a timely way because of a lack of guidance for staff on what is required to bring systems back online. As a result, critical government services—such as criminal justice and policing operations—may be unavailable for longer than is necessary, depending on the scale of the disruption.

Disaster recovery training

None of the agencies provide enough training to staff with specific disaster recovery roles and responsibilities to equip them with the knowledge and skills needed to manage the recovery of a system after a disruption. Active participation in disaster recovery tests and theoretical training are key tools for developing staff skills and experience.

Data centre arrangements

Victoria Police hosts its ICT systems in the same building as its operations. There is a risk that disruptions affecting the operational site—such as a fire—will also affect its systems. Victoria Police has other data centre facilities, which can mitigate the risk and provide systems with the required redundancy capability—the duplication of a system to increase its reliability and minimise downtime in the event of disruption.

Victoria Police is currently in the process of relocating its systems to a separate data centre facility, which it expects to complete by 2020. Victoria Police intends to enhance the disaster recovery capability of all critical systems by 31 December 2018, in preparation for the data centre relocation. However, the risk will remain until then.

Other audited agencies host most of their systems at purpose-built data centres operated by CenITex, a government body that provides centralised ICT support. In addition, third-party providers host a small number of their systems.

Redundancy of outsourced systems

Agencies need to consider effective redundancy capability to increase their systems' reliability and availability. Six of the seven government departments and their associated agencies outsource the hosting of the majority of their systems to CenITex (the Department of Education and Training hosts and maintains most of its systems in-house).

CenITex submitted a paper to the Victorian Secretaries' Board in November 2016 highlighting the recoverability limitations if one of its data centres was unavailable. Only 36 per cent of the 25 most important systems identified by agencies that are hosted by CenITex have secondary standby systems to provide a full and rapid recovery of systems, as shown in Figure E.

Figure E
Critical systems hosted at CenITex

Column chart showing the number of critical systems hosted by CenITex

Source: VAGO, based on data from CenITex.

Thirteen of the remaining systems have no redundancy capability—including systems that provide services for criminal justice, marine safety and bushfire management.
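
The counts implied by these figures can be worked through as a short Python calculation. This is a sketch for illustration only. The 25 systems, the 36 per cent share and the 13 systems without redundancy come from the text above. The final line is simply the remainder, and this section does not describe the redundancy status of those systems.

```python
total_systems = 25          # most important systems hosted by CenITex
percent_with_standby = 36   # share with secondary standby systems
no_redundancy = 13          # systems with no redundancy capability

# Integer arithmetic avoids floating-point rounding.
with_standby = total_systems * percent_with_standby // 100
remaining = total_systems - with_standby

# Systems that are neither fully covered by standby nor listed as having no
# redundancy; their status is not stated in this section.
unspecified = remaining - no_redundancy

print(with_standby, remaining, unspecified)  # 9 16 3
```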

Agencies intend to reassess these 25 most important systems, review their order of priority, and identify the estimated investment required to establish and maintain an appropriate level of redundancy. No date has been set for this activity to occur.

Obsolescence in systems

In the audited agencies, 41 per cent of the systems that support critical business functions are obsolete. Figure F shows the number of obsolete systems supporting critical business functions such as financial management, child protection and management of criminal justice, based on information provided by the audited agencies.

Figure F
Obsolete critical systems in the audited agencies

Column chart showing the number of current critical systems versus the number of obsolete critical systems

Source: VAGO.

At 79 per cent, Victoria Police has the highest percentage of systems that are obsolete, and DELWP has the lowest at 26 per cent.

The high rate of system obsolescence across all agencies is because:

  • agencies do not maintain detailed registers of their systems with enough version information to enable effective monitoring
  • agencies only consider and review obsolescence when a system is approaching or is already at the end of its life
  • maintaining software and hardware compatibility across a variety of technology platforms is complex and difficult—software components are often heavily customised, which inhibits the upgrade process due to potentially high upgrade costs
  • life cycle planning of systems is inadequate and often not performed regularly enough to ensure that systems are refreshed on a regular basis.
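
As an illustration of the first two points, the following minimal sketch in Python shows how a system register that records version and end-of-support information could support routine monitoring, rather than review only at end of life. The register fields, the sample entries and the 12-month warning window are hypothetical assumptions, not a description of any agency's register.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RegisterEntry:
    """One hypothetical row in a system register."""
    system: str
    component: str
    version: str
    end_of_support: date  # date vendor support ends for this version


def classify(entry: RegisterEntry, today: date,
             warning_window: timedelta = timedelta(days=365)) -> str:
    """Label an entry by comparing its end-of-support date with today."""
    if entry.end_of_support < today:
        return "obsolete"
    if entry.end_of_support - today <= warning_window:
        return "approaching end of life"
    return "current"


# Hypothetical entries for illustration only.
register = [
    RegisterEntry("payroll", "operating system", "v5", date(2015, 7, 1)),
    RegisterEntry("licensing", "database", "v11", date(2018, 3, 31)),
    RegisterEntry("records", "application server", "v9", date(2022, 1, 1)),
]

review_date = date(2017, 11, 29)
status = {e.system: classify(e, review_date) for e in register}
print(status)
```

Running a check like this on a schedule would surface systems approaching end of life while there is still time to plan and fund an upgrade.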

All agencies have identified obsolescence in systems as one of the key risks in their enterprise risk register. To manage the risk, agencies are implementing programs to upgrade and replace obsolete systems, although these are not occurring frequently enough and often only when systems are approaching their end of life.

When government agencies run systems that are close to or beyond their end of life, they increase the risk of these systems not being fit for purpose and, consequently, the risk of poor or degraded service delivery. Systems that operate on obsolete hardware or software present a significant disaster recovery risk, because of the limited availability of hardware spare parts, vendor technical support, and staff knowledge and skill. At worst, agencies risk catastrophic equipment failure, extended outage of public services, and exploitation of vulnerable systems by computer virus attacks.

Recommendations

We recommend that the Department of Economic Development, Jobs, Transport and Resources, the Department of Environment, Land, Water and Planning, the Department of Health and Human Services, the Department of Justice and Regulation and Victoria Police:

  1. appoint a team of suitably qualified and experienced professionals to form a collaborative disaster recovery working group to:
    • provide advice and technical support
    • share lessons learnt based on disaster recovery tests and exercises
    • coordinate disaster recovery requirements for resources shared between agencies
    • identify, develop, implement and manage initiatives that may impact multiple agencies
    • coordinate funding requests to ensure critical investments and requirements are prioritised
  2. perform a gap analysis on their disaster recovery requirements and resource capabilities to determine the extent of the capability investment that will be required
  3. develop disaster recovery plans for the systems that support critical business functions and test these plans according to the disaster recovery test program
  4. provide advice and training to staff on:
    • newly developed frameworks, policies, standards and procedures to increase awareness and adoption as needed
    • specific disaster recovery systems
  5. establish system obsolescence management processes to:
    • identify and manage systems at risk of becoming obsolete, those that will soon have insufficient support or those that will be difficult to manage when they become obsolete
    • enable strategic planning, life-cycle optimisation and the development of long-term business cases for system life-cycle support
    • provide executive with information to allow risk-based investment decisions to be made.

We recommend that the Department of Economic Development, Jobs, Transport and Resources, the Department of Health and Human Services, the Department of Justice and Regulation and Victoria Police:

  1. set up disaster recovery frameworks to provide guidelines and minimum standards for ICT disaster recovery planning, including:
    • developing a strategy to establish the minimum levels of readiness and appropriate governance oversight
    • establishing the requirements, frequency and format of disaster recovery tests based on systems' criticality
    • establishing policies, standards and procedures for a consistent approach.

We recommend that the Department of Environment, Land, Water and Planning:

  1. update its business impact analysis to identify:
    • system dependencies for critical business functions
    • requirements for the system recovery time objective and recovery point objective
  2. determine a recovery strategy for systems that support critical business functions.

We recommend that the Department of Health and Human Services:

  1. update its Business Continuity Policy to require business units to consult with system owners and the Business Technology and Information Management group as part of the business impact analysis process, to validate the maximum allowable outage and recovery time objectives
  2. update the business impact analysis process to identify system dependencies for critical business functions
  3. determine a recovery strategy for systems that support critical business functions.

We recommend that the Department of Justice and Regulation:

  1. update its Crisis and Continuity Policy to require business units to consult with system owners and the Knowledge, Information and Technology Services group as part of the business impact analysis process, to validate the maximum allowable outage and recovery time objectives
  2. develop a framework to assist business units to determine the criticality of business functions and identify disaster recovery requirements
  3. determine a recovery strategy for systems that support critical business functions
  4. update the business impact analysis process to include components that:
    • evaluate and rank the criticality of business functions
    • analyse impacts caused by disruption to critical business functions.

Responses to recommendations

We have consulted with DEDJTR, DELWP, DHHS, DJR and Victoria Police, and we considered their views when reaching our audit conclusions. As required by section 16(3) of the Audit Act 1994, we gave a draft copy of this report to those agencies and asked for their submissions and comments. We also provided a copy of the report to the Department of Premier and Cabinet.

The following is a summary of those responses. The full responses are included in Appendix A.

All of the audited agencies accepted the recommendations. DEDJTR, DHHS, DJR and Victoria Police provided detailed action plans on how they have begun to address our recommendations and the time frames for these activities. DELWP noted the findings in the report. It outlined its work to assess its ICT assets and systems under its ICT Criticality Framework and will work closely with the other audited agencies to enhance its disaster recovery planning capabilities.
