I can remember using the popular word processor WordStar back in 1982, when it ran on IBM PC-DOS 1.1.
Today, the challenges that you would face if you were asked to take an old, 320 KB floppy disk and print an antique WordStar letter would be immense. How could you find a working diskette drive to read the 320 KB diskette? Even if you could, how long would it take you to find a pre-Y2K version of the software, operating system and hardware in order to meet the request? Would the printer protocols used then support the printer you have today, or would you need to consider that also?
Thankfully, WordStar has a fan following which has pretty much kept it alive, and so some parts of this challenge may not be so difficult. However, any other less popular or home-grown applications might not be so lucky.
Now consider attempting to access data recorded in the 1950s, 1960s, or 1970s, such as geophysical seismic data written to 21-track magnetic tape by a TI-961 tape drive in the 1960s.
Today’s new compliance laws and eDiscovery mandates are changing the way data is retained and required to be accessed. Many organisations are realising that they must retain data for far longer than was originally envisaged, and they need to retrieve the data in a usable format rapidly to support legal requests.
The example above has us looking back 26 years, but what about the future? If we decided that new data created today must be retained for 100 years into the future, how would we go about it? There are a number of considerations.
The Storage Networking Industry Association (SNIA) has a group of interested parties called the Data Management Forum (DMF), which is grappling with this precise problem. The DMF has formed the 100 Year Archive Task Force (100YrATF), which is a global, multi-agency group working to define best practices and storage standards for long-term digital information retention.
The DMF has decided on three things:
- This is a real requirement that companies face.
- This problem is hard to solve.
- It isn’t going to get easier to solve in the future.
In addition to this wisdom, the DMF has divided the problem into two basic categories:
- Physical Layer Challenges: What media will the data be stored on and how will the data be migrated as the media becomes obsolete?
- Logical Layer Challenges: Even if you could keep the bytes alive, how would you represent the data in a human-readable form?
As challenging as it sounds, the physical layer can be solved with access to the right skills and resources. We are still reading tapes recorded in the 1960s, and occasionally earlier.
Solving the logical layer challenge is considerably tougher. There are two common approaches at the moment, and one other hope for the future.
The first current method is to encapsulate the data in an XML format. XML is self-describing, which makes it a reasonably robust format. Of course, XML encapsulation suits some data formats better than others. It works best for data that already has a human-readable form, such as spreadsheets and documents, but doesn’t work well for complex databases. This means that as you write to the archive you will need to create human-readable forms for all data that could possibly be called for retrieval.
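As a rough illustration of what XML encapsulation can look like, the sketch below wraps the text of a letter in a self-describing envelope together with a checksum. It is only a minimal sketch: the element names (`archiveRecord`, `metadata`, `sourceFormat` and so on) and the helper functions `encapsulate` and `extract` are invented for this example and do not come from any archival standard.

```python
import hashlib
import xml.etree.ElementTree as ET


def encapsulate(text: str, title: str, source_format: str, created: str) -> str:
    """Wrap human-readable text in a self-describing XML envelope.

    The layout is hypothetical. It assumes line endings are already
    normalised to "\n", because XML parsers rewrite carriage returns.
    """
    record = ET.Element("archiveRecord")
    meta = ET.SubElement(record, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "sourceFormat").text = source_format
    ET.SubElement(meta, "created").text = created
    # Record a checksum so a future reader can confirm the text is intact.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    ET.SubElement(meta, "sha256").text = digest
    ET.SubElement(record, "content").text = text
    return ET.tostring(record, encoding="unicode")


def extract(xml_text: str) -> str:
    """Read the text back and check it against the recorded checksum."""
    record = ET.fromstring(xml_text)
    text = record.findtext("content")
    expected = record.findtext("metadata/sha256")
    if hashlib.sha256(text.encode("utf-8")).hexdigest() != expected:
        raise ValueError("content does not match the recorded checksum")
    return text


letter = "Dear Sir,\nPlease find the invoice enclosed."
envelope = encapsulate(letter, "Invoice letter", "WordStar", "1982-06-01")
print(envelope)
```

The point of the envelope is that a reader decades from now needs nothing more than a text viewer to see what the record is, what produced it, and what it says.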
The second option comes from Adobe Systems, which has enhanced its PDF format with a version designed for longevity. Called PDF/A, it is a stable archival format whose features are locked in at roughly PDF 1.4, the version introduced with Acrobat 5. PDF/A is slowly being validated by the market, with application support appearing from mid-2006. Microsoft has a downloadable plug-in for Office 2007 which supports the creation of PDF/A output.
The biggest challenge comes with files where the human-readable form hides the underlying rules, and the need for the rules is almost as key as the data. The best example of this is a spreadsheet where the cells are most often not data, but calculations, sometimes based on other data, which in itself could be hidden. To date, this problem has no simple solution.
The other emerging standard, being defined by SNIA/DMF, is XAM. This standard is still very much in development and is not yet usable in any form. It frees the application layer from the underlying storage and so insulates applications from changes that may occur in the future. Its core focus is on CAS (content addressable storage) devices, which are tightly data-centric.
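For readers unfamiliar with content addressable storage, the idea can be sketched in a few lines. The toy class below is not the XAM interface and makes no claim about how any CAS product works internally; the name `ContentStore` and the choice of SHA-256 are assumptions made purely for illustration. What it shows is the principle: an object's address is derived from its content rather than from where it happens to be stored.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (illustrative only, not XAM)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is a digest of the content itself, so the same bytes
        # always get the same address, whatever device holds them.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hashing on retrieval doubles as an integrity check.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored object no longer matches its address")
        return data


store = ContentStore()
address = store.put(b"seismic trace 001")
print(address)
print(store.get(address))
```

Because the application holds only the address, the bytes can be moved to new media without the application needing to know, which is the kind of independence from the storage layer described above.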
When considering long-term archival requirements, the clear message is to:
- Obtain help from someone who has experience in the long-term archival business.
- Decide on a standard for your logical layer archive, be it PDF/A, XML or XAM, or a combination to suit your data types.
- Consider the workload of converting all of your documents into an archival format: the effort will be extensive.
- Find a provider who can keep the bytes alive on current storage technology and migrate the data at intervals to suit the storage solutions you have deployed.
* Guy Holmes is the Director of SpectrumData.