ANDY BAILEY discusses archiving – the space and the time involved.
Last month I promised to investigate archive implementation for the media facility, as this is a key ingredient of many media IT strategies and remains one of the biggest headaches for the operational manager.
The rapid progress of transmission technology has been discussed already in this series, and storage technology has advanced on a similar scale. Some parallel improvements are also noteworthy: from a consumer's point of view, file sizes for audio have been reduced through improved encoding techniques, foremost among them the ubiquitous MP3 algorithm. Similar breakthroughs can be identified in visual media too, with DivX arguably leading the way; also noteworthy are RealNetworks' and Windows Media's attempts to reduce the bandwidth requirements for streamed transmission of A/V content. These techniques all induce some form of loss, however, and so are not used in a professional capacity, except perhaps to deliver a preview or rush of a piece, or to deliver material to the consumer in a broadcast capacity. Instead, audio is more likely to be stored as 16-bit PCM sampled at 44.1kHz, the well-known CD Red Book standard as purchased in your local record shop; as computer files, examples are the Macintosh Audio Interchange File Format (AIFF) and Windows WAV types. Those watching the standards themselves may already know that Philips Electronics and Sony Corporation have announced that licensees of the current CD format will have the option of extending their agreements to include the Super Audio CD standard, at the same royalty levels currently paid for CD Audio. Super Audio CD is to become an appendix to the Red Book CD specification, as well as being issued in its own right as the Scarlet Book.
If material is already stored in digital format, determining file sizes is a matter of multiplying the data rate by the duration and adding some overhead. For audio formats this is relatively straightforward, and there is a wide choice of formats depending on the QoS demanded by the end application. At the top end of the range, professional video is still enormous by today's standards, generating around 1GB of data a minute. Again, many formats are available depending upon the QoS requirements of the task.
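The arithmetic for uncompressed audio can be sketched in a few lines. This is a minimal back-of-envelope calculation using the Red Book parameters mentioned above; the function name is my own, for illustration only.

```python
# Back-of-envelope file-size arithmetic for uncompressed PCM audio.
# The parameters match the CD Red Book standard discussed in the text.

def audio_bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate, expressed in bytes per minute."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * 60

# CD Red Book audio: 44.1kHz, 16-bit, stereo
cd_minute = audio_bytes_per_minute(44_100, 16, 2)
print(cd_minute / 1_000_000)  # roughly 10.6MB per minute
```

At around 10.6MB per minute for stereo CD-quality audio, the gulf to professional video's 1GB a minute is plain to see.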
Once the file parameters have been identified, some other numbers need to be collected: the throughput of your facility (its commercial capacity measured as a quantity of data), and the length of time that an archive is kept current. Also consider the processes for backing up the data and for off-line archiving, as there are configuration limitations that need to be designed around.
Experience shows that increased availability of storage results in increased data production, and so basing a size estimate on current throughput may give a misleading picture. Generally this behaviour is a result of more flexible production processes, and so it may be worth limiting what is saved to the archive using procedural configuration. In larger facilities, the type of data worth saving will reflect the business itself: examples might be mixdown audio only, tracks as data, projects at a particular stage, or single samples and raw data such as MIDI files. The point of this part of the design is that it should be possible to continue working where you left off following a full failure/recovery cycle of the system.
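Putting those numbers together, a first archive sizing estimate might look like the sketch below. The growth factor reflects the behaviour just described, where easier storage encourages more data production; all the figures and the function name are illustrative assumptions, not recommendations.

```python
# A rough archive sizing sketch: capacity = daily throughput x retention,
# padded by a growth factor because easier storage tends to increase
# how much data is actually produced. Illustrative figures only.

def archive_capacity_gb(daily_throughput_gb, retention_days, growth_factor=1.5):
    """Estimated on-line archive capacity required, in GB."""
    return daily_throughput_gb * retention_days * growth_factor

# e.g. a facility producing 40GB a day, keeping the archive current for 90 days
print(archive_capacity_gb(40, 90))  # 5400.0 GB, i.e. around 5.4TB
```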
Taking a back-up of a system degrades its performance, as processor and other resources are diverted to the job. Back-ups also take a long time, as tape machines are slow, and so should be performed in the nightly 'maintenance window' where one is available. These days, demands for 100 percent system availability have spawned technologies such as Storage Area Networks, which offer increased resilience and storage availability: by allowing two processors to handle the same file system, one processor can run the back-up job while the other handles production as normal.
Another technology brought about by such demand is Hierarchical Storage Management (HSM). The exact nature of these systems varies, but the idea is to create layers of storage, each layer taking more time to access than the one above it, as conceptualised in Figure 1.
At the top of the hierarchy is on-line disk storage, and at the bottom are the off-site data back-ups. In between the extremes are other devices such as Write Once Read Many (WORM) jukeboxes containing Magneto Optical (MO) storage, CD or DVD writables, and tape silos for DAT, DLT, or Betacam tapes, for example. At the Streaming Media show this year, JVC showed a DVD-RAM jukebox capable of storing 5.64TB of data, with retrieval speeds of around 4.5 seconds claimed, thanks to new robotics that allow the cartridge to be flipped while it is being carried to the drive.
Figure 2 shows an example configuration using such a jukebox in a near-line capacity. In this example, an archive of recently used data is kept on the local drives of a SAN: 3TB of capacity is available, mirrored and RAID protected. Three processors operate on the same disk space, each performing a different process: indexing, back-up, and search/retrieve. Although the JVC jukebox could be seen simply as an additional drive letter on the server, in this case the HSM product moves files invisibly off the on-line storage, leaving a small placeholder that looks like the original file to the operating system. When this placeholder is accessed, the HSM product intervenes, retrieves the file from the jukebox, and replaces the placeholder with the original, which is then sent to the client; the secondary near-line storage is therefore never accessed directly. This clumsy-sounding procedure takes around 20 to 30 seconds for any file held in the jukebox.
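The placeholder mechanism can be illustrated with a small sketch. This is a toy model of the idea, not how any real HSM product is implemented: a commercial system does the interception invisibly inside the file-system driver, whereas here a stub file and the function names are hypothetical stand-ins.

```python
# A minimal sketch of the HSM placeholder idea: the on-line volume keeps
# a tiny stub for each migrated file, and a read through the HSM layer
# triggers a recall from near-line storage before data reaches the client.
# File layout and names are hypothetical, for illustration only.
import shutil
from pathlib import Path

STUB_SUFFIX = ".hsm-stub"  # the stub records where the real file went

def migrate(path: Path, nearline_dir: Path) -> None:
    """Move a file to near-line storage, leaving a stub in its place."""
    target = nearline_dir / path.name
    shutil.move(str(path), str(target))
    path.with_suffix(path.suffix + STUB_SUFFIX).write_text(str(target))

def open_with_recall(path: Path) -> bytes:
    """Return file contents, recalling from near-line storage if stubbed."""
    stub = path.with_suffix(path.suffix + STUB_SUFFIX)
    if stub.exists():                          # intercept: file was migrated
        nearline_copy = Path(stub.read_text())
        shutil.move(str(nearline_copy), str(path))  # restore the original
        stub.unlink()
    return path.read_bytes()
```

The recall is what accounts for the 20 to 30 second delay in practice: the jukebox robotics must fetch and load the platter before the file can be copied back.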
The HSM software sweeps the on-line storage at regular intervals, using a user-defined mask to identify candidates for removal to near-line storage. Each file that is retrieved stays on-line until its 'Last Accessed Date' exceeds the limit set for the sweep. A good HSM product, such as Veritas NetBackup Storage Migrator, will offer options such as the last accessed date as well as the date the file was created. Settings can be as low as one hour, meaning that a snapshot is taken regularly and files are protected within the back-up hierarchy. As mentioned previously, HSM software vendors support various configurations of secondary storage, and the Veritas offering includes a third layer in the hierarchy.
As an example of an HSM system, the Veritas software is comprehensive and worth examining. To implement it, a jukebox such as the JVC is connected to the server via a SCSI-2 cable. A tape back-up mechanism such as a DLT drive is also required; the machine we configured had an additional tape drive for nightly system back-ups, although this caused compatibility problems in early versions of the software. Pre-installation checks locate the necessary devices, and the software is then pointed at a disk on which the HSM is to be configured. Under NT, the software runs as a service and is configured using the HSM manager software. The default settings, which specify a three-day cycle, are fine to start with.
In operation, a drive letter is mapped to the on-line HSM volume and files are indexed on a daily basis. Once a file is saved on the HSM volume, it is subject to the sweep batch job. On our system this was set to run nightly, checking the 'Last Accessed Date', or the 'Last Modified Date' if this wasn't available. The sweep job removes any file where this parameter is older than ten days and saves it off to the jukebox. Every time a single platter (a single side of a disc) is full, the contents of that platter are saved off to tape. Each night's tape is brought up to the same level as the previous tape in the cycle before any platters are saved off. Three tapes are defined in the default setting, and new tapes can be defined to make up a business week. As soon as a tape is full, the system defines a replacement and asks for a clean tape. Multi-tape cartridges are supported, allowing hands-off running for some time. All of these parameters can be reduced, and it is possible to have the system sweep every hour. However, the sweep job takes a lot of system resources, so it would be advisable to move this part of the process away from the user-access machine, perhaps onto a separate processor by way of a SAN configuration, although it was not possible to test this. Figure 3 shows how this would work.
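The candidate-selection step of that sweep can be sketched as follows. This is a simplified illustration, assuming the file system's access time stands in for the 'Last Accessed Date'; a real product such as Storage Migrator applies a richer user-defined mask and then handles the migration to platters itself.

```python
# A sketch of the nightly sweep's selection step: walk the on-line
# volume and pick migration candidates whose last-accessed time exceeds
# the configured age (ten days in the default cycle described above).
# Uses the file system's atime as the 'Last Accessed Date'.
import time
from pathlib import Path

def sweep_candidates(volume: Path, max_age_days: float = 10.0):
    """Yield files on the volume not accessed within max_age_days."""
    cutoff = time.time() - max_age_days * 86_400
    for entry in volume.rglob("*"):
        if entry.is_file() and entry.stat().st_atime < cutoff:
            yield entry
```

Run hourly instead of nightly, a scan like this over a large volume is exactly the resource-hungry job that the article suggests moving off the user-access machine.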
Once all the platters in the jukebox are full, old platters can be swapped for new ones, or more jukeboxes added to the SCSI chain. If a file is requested that is not on the platters, a warning appears and the file can either be restored from tape or the platter replaced. For the bigger operation, ADIC are now shipping libraries supporting many tape formats, including Betacam, some of which scale up to 500TB.
The point, however, is to feel secure that all the data could be replaced, which the HSM provides, while at the same time offering capacity at better access times. Although the HSM system achieves this, there is still a huge number of files to wade through in order to find the one you want. To address this, Veritas include the Verity search engine, which can display results as a web page. This took longer to configure and required some script writing to sort out. Here, sympathetic design of the file and folder structures helped, since it imposes some order on the information.
Next month's instalment starts with a brief explanation of how to organise data.