Disaster recovery and business continuity are both of the utmost importance. Stephen Withers outlines some positive and negative examples of organisations' efforts in these areas and categorises what is hot, warm and cold when it comes to recovery.
The fire that destroyed the Australian National University's Mount Stromlo Observatory on Saturday 18 January 2003 was a very public disaster, but the ANU had planned for the possibility.
The 100 or so researchers were relocated to the main Acton campus and issued with IP phones incorporating three-port ethernet switches which reduced contention for the limited ports in the temporarily shared offices. They were at work on Monday 20 January.
While the observatories were destroyed, the microwave terminal written off and Telstra services cut, the research buildings were only damaged. In seven days, the microwave hardware was replaced, new fibre installed and the link fully commissioned. "That was remarkable," says communications manager John McGee, who credits "a very good relationship with our vendors."
The computer rooms survived the fire, as did the recently refurbished structured cabling system. By Monday 3 February, IP phones had been deployed and communications fully restored to the site. "We achieved a three-week recovery, which was remarkable," says McGee.
The large astronomical data sets had been protected by copying them to a StorageTek automated library at the main campus. "You don't rely on your C: drive - proper file storage and management is vital," says McGee.
BC and DR
Business continuity (BC) planning is about determining what's needed to keep the organisation in business following a disaster.
Ric Hallgren, general manager of Dowling Consulting, believes the key question is "What do I need to do to keep my customers?" That provides a starting point for BC planning.
Once a BC plan is in place, you can address the disaster recovery (DR) issues. "Disaster recovery is the IT aspect of business continuity," says Mark Heers, ANZ marketing and alliances director at NetApp.
Plan and test
Analyst firm Gartner warns that companies moving towards the 'real-time enterprise' model must recognise that even a four- to 24-hour outage would cause irreparable damage, and plan accordingly.
Yet planning alone isn't enough. David Blumanis, American Power Conversion's data centre consultant, previously held management positions at some of Australia's biggest IT operations and managed three disasters ("they never hit the media," he says). He believes most organisations' DR plans just sit on the shelf: "About 90% of Australian businesses don't test their disaster recovery plan," he says, suggesting that the figure is closer to 100% for SMEs.
Testing needn't be disruptive, says Datacom director Andrew Coombe. One client - a shipping line - deliberately fails over to its DR system at a Datacom location for one week in every six months.
Cost and risk must be balanced. Michael Cunningham, solution services and support director at Hitachi Data Systems, says BC is difficult enough, but "more difficult, however, are continuity goals and objectives which coincide with budget reality."
Outsourcer Datacom recommends the ITIL approach to DR. According to senior consultant Chris Madden, this essentially identifies what is most important to the business and where IT is involved.
When Datacom worked through this process with Simsmetal, the metal recycler's CIO and CFO initially presented the executive team with an off-site recovery plan that would get everything working the next day. Further discussions revealed that while some systems really were needed quickly, the remainder could be recovered over a few weeks.
RTO & RPO
The two key considerations are the recovery time objective (RTO) and the recovery point objective (RPO), says Heers. The first tells you how quickly you need to get a system running again after a disaster, the second how much data you are prepared to lose. An internet banking system might have an RTO of 15 minutes for processing transactions, but it might not be necessary to provide account balances so soon after a failure. However, the RPO would be close to zero as no transactions should be lost. "The data's the critical thing, not necessarily the processing options," he says.
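The relationship between the two objectives and an actual recovery setup can be sketched in a few lines of Python. This is an illustration only: the class and function names are hypothetical, the RPO of "close to zero" is simplified to exactly zero, and the restore times and copy intervals are made-up figures rather than measurements from any organisation mentioned here. The worst-case data loss is approximated by the interval between copies.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objectives:
    rto: timedelta  # how quickly the system must be running again
    rpo: timedelta  # how much data (measured in time) may be lost


@dataclass
class Capability:
    restore_time: timedelta   # time a restore takes, ideally measured in a test
    copy_interval: timedelta  # gap between copies; approximates worst-case loss


def check(objectives: Objectives, capability: Capability) -> dict:
    """Report whether a recovery setup meets each objective."""
    return {
        "rto_met": capability.restore_time <= objectives.rto,
        "rpo_met": capability.copy_interval <= objectives.rpo,
    }


# Assumed figures for the internet banking example: RTO 15 minutes, RPO zero.
banking = Objectives(rto=timedelta(minutes=15), rpo=timedelta(0))

# Two hypothetical setups: nightly tape backup versus a synchronously
# replicated standby (copy interval treated as zero).
nightly_tape = Capability(restore_time=timedelta(hours=24),
                          copy_interval=timedelta(hours=24))
sync_standby = Capability(restore_time=timedelta(minutes=5),
                          copy_interval=timedelta(0))

print(check(banking, nightly_tape))  # {'rto_met': False, 'rpo_met': False}
print(check(banking, sync_standby))  # {'rto_met': True, 'rpo_met': True}
```

The point of keeping the two checks separate is the one Heers makes: a system can miss one objective while meeting the other, and each objective drives a different part of the design.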
Hot, warm or cold?
There are three categories of recovery sites, says Interactive Group managing director Christopher Ride: hot, warm and cold. The shorter the RTO, the hotter the recovery site must be.
A hot site uses replication to keep its copy of the data right up to date. Synchronous replication provides true real-time updates; asynchronous replication allows greater distances between sites or operation across lower-capacity links by allowing the recovery site to fall slightly behind the primary. Applications are in place and running so operations can fail over instantly.
Symantec uses its own high availability/DR products to fail over from its Sydney operations to sites in the UK and the US, says systems engineering manager Paul Lancaster.
Synchronous replication requires a very fast connection, such as a DWDM fibre link, says Brendan Park, strategy director at Optus subsidiary Uecomm, but means virtually no data will be lost during a failover.
For most applications, a 100 Mb ethernet over fibre service will allow near real-time replication for key applications with less important data trickling down to the secondary site during quiet periods.
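As a rough guide to what a link of that size can carry, the arithmetic can be sketched as follows. The function name and the 80% efficiency figure are assumptions made for illustration (real throughput depends on protocol overhead, latency and other traffic), and "100 Mb" is read here as 100 megabits per second.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal rate that is
    actually usable; 0.8 is a placeholder, not a measured value.
    """
    megabits = data_gb * 1000 * 8            # decimal gigabytes to megabits
    usable_mbps = link_mbps * efficiency     # effective rate in megabits/second
    seconds = megabits / usable_mbps
    return seconds / 3600


# At an assumed 80% efficiency, a 100 Mbit/s link moves about 36 GB per hour.
print(round(transfer_hours(36, 100), 2))   # 1.0
# A hypothetical 500 GB of less urgent data would take roughly 14 hours.
print(round(transfer_hours(500, 100), 1))  # 13.9
```

On these assumptions, the daily change in key applications has to stay well under that hourly rate for replication to remain near real-time, which is why less important data is left to quiet periods.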
Uecomm offers three services: fully managed ethernet, DWDM and dark fibre. The advantage of dark fibre in a DR situation is that "customers can do whatever they want with it," says Park. "They have total control."
KAZ, Telstra's IT services subsidiary, is involved in outsourced DR. Using a mixed model, Macmahon Holdings (a mining and construction contracting company) located its recovery site within Telstra's Perth facility. Macmahon remains responsible for the design, development and maintenance of the systems while Telstra provides the co-location facility and the communication infrastructure.
"Macmahon has been upgrading its data centre over the past year, but recognising the need to protect our systems and data even further, we decided to develop an off-site disaster recovery centre, to store synchronised copies of corporate data and systems," says CIO Jason Cowie.
A warm site provides the equipment and infrastructure needed to run the applications, but the software and data must be loaded before operations can resume. Coombe says making test and development hardware do double duty as DR systems can slash restart times. Relocated from the main data centre, these systems are trickle-fed data from the primary systems, and production applications can then be brought up in a few hours when needed.
Warm recovery needn't be excessively slow. When smoke from a fire at Nufarm's agricultural chemical plant at Laverton in Melbourne's west reached the headquarters server room and damaged the equipment, IT manager Geoff Jackson followed his DR plan and took a set of backup tapes to IBM's Business Recovery Centre in the outer eastern suburbs. One day's data was temporarily lost, but was recovered from Nufarm's iSeries system the following day after the hardware had been taken to the recovery centre and cleaned.
"We'd taken our back-up tapes and rebuilt the server at the disaster centre before, so we understood what we needed to do to get systems running again," says Jackson. "Sure, it was an IT disaster... [but] it wasn't a commercial disaster."
A cold site is a bare facility: you may need to bring in the hardware before you can start to restore programs and data. This is clearly time consuming, but is relatively cheap and so may be appropriate for processing that can be delayed for an extended period.
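The rule that a shorter RTO calls for a hotter site can be expressed as a simple lookup. The thresholds below are hypothetical and chosen only to mirror the rough timescales described above (near-instant for hot, hours for warm, extended periods for cold); a real decision would also weigh cost and risk.

```python
from datetime import timedelta

# Hypothetical cut-offs for illustration; not figures from any vendor.
HOT_MAX_RTO = timedelta(hours=1)
WARM_MAX_RTO = timedelta(days=2)


def recommend_site(rto: timedelta) -> str:
    """Map a recovery time objective to a site category."""
    if rto <= HOT_MAX_RTO:
        return "hot"    # replicated data, applications already running
    if rto <= WARM_MAX_RTO:
        return "warm"   # hardware in place, software and data to be loaded
    return "cold"       # bare facility, hardware brought in after the event


print(recommend_site(timedelta(minutes=15)))  # hot
print(recommend_site(timedelta(hours=8)))     # warm
print(recommend_site(timedelta(weeks=3)))     # cold
```

Applied system by system, a lookup like this reflects the Simsmetal experience: only the systems with genuinely short RTOs need the most expensive category of site.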
Off-site backup
Remote backup is not solely the province of larger companies. For example, KAZ offers a generic PC backup service to replicated EMC servers in Sydney and Melbourne, while HP has a similar service called SURE PC. Organisations operating from multiple locations (eg, branch offices) can do their own online backup between sites.
Trinity College operates four different schools at three sites near Adelaide. Each has its own IT infrastructure, but a centralised backup system was required to minimise administrative effort. The college selected CommVault Systems Galaxy with a 1 TB disk array and a 600 GB tape library to do the job.
Network manager Mark Collis says the college's four IT staff can manage the whole process without needing to involve office staff at the remote locations. This simplicity leads to significant cost savings, and finding any lost file and restoring it to its rightful place quickly and painlessly is a "trivial exercise", he says.
Ronnie Altit, Dimension Data's national practice manager - data centre solutions, says not enough organisations are considering centralising and consolidating their backup environments, which can "save companies money, reduce the complexity of managing their backup systems and reduce the time to recovery point for an organisation post disaster."
Not just servers and data
Accommodating staff also needs to be tackled. One possibility is to make reciprocal arrangements with another company, or to take space in serviced offices. Another is to contract with a DR site that has sufficient seats for your key staff.
Interactive Group provides DR sites in Sydney, Melbourne and Brisbane. Ride says the company's focus is on restoring clients' IT and call centre operations in the event of a disaster: "If you've got those elements right, you're eight-tenths of the way."
The centres can accommodate three clients simultaneously, and access is guaranteed within two to 12 hours of declaring a disaster. Interactive's facilities include fully equipped call centres with up to 160 seats, with another 500 seats available from a partner company if needed. Earlier this year, one of Interactive's customers - a Melbourne-based superannuation fund with a small call centre - experienced a PABX failure. Following its disaster plan, the fund had its phone numbers switched to Interactive's recovery facility, had the standby PABX loaded with the correct configuration, and relocated its staff.
"Within four hours they were once again able to take calls from customers, minimising the damage caused to the business by the loss of voice communications," says Nicholas Asscher, Interactive's state manager - business relationships.
Data recovery
As a last resort, you may need to turn to a data recovery specialist. "Not every company puts in software that will automatically back up a [desktop or notebook] to the network," says Guy Riddle, manager of CBL Data Recovery's Sydney operation. While most data stored on servers is well protected, accidents can still happen, says Riddle.
