June 25, 2013

Deduplication and data lifecycle management

While there is no denying that deduplication can be a tremendously helpful tool for backup administrators, there have been certain areas in which deduplication has historically proven to be inefficient.

For example, many deduplication solutions have ignored the fact that some organizations perform backups that are more sophisticated than a basic backup, which merely creates a redundant copy of production data. Larger organizations may use multiple tiers of backup storage, in which data is retained for varying lengths of time. Additionally, business continuity requirements may mean that some of the backed up or archived data may reside off-premises, either in an alternate data center or in the cloud.

To see how the deduplication process works in such a situation, imagine that an organization creates disk-based backups to an on-premises backup server which stores 30 days' worth of backups. Then, the backup server's contents are replicated to an off-site backup appliance and workflows are in place to move aging backup data onto less expensive storage. Let's say 120 days' worth of backups are stored in an off-site data center and that a full two years' worth of backups are stored in the cloud.
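
To make that tiering scheme concrete, the retention policy in this example could be sketched as a simple data structure. The tier names and fields below are purely illustrative and are not any vendor's configuration format:

    # Hypothetical retention tiers for the example above; names and fields
    # are illustrative only, not any particular product's configuration.
    RETENTION_POLICY = [
        {"tier": "on-premises backup server", "medium": "disk",  "retention_days": 30},
        {"tier": "off-site data center",      "medium": "disk",  "retention_days": 120},
        {"tier": "cloud archive",             "medium": "cloud", "retention_days": 730},
    ]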

Data is deduplicated as it is written to the on-premises backup server. The replica backup server is a mirror of the primary backup server, so deduplicated data can be sent to the replica without the need for rehydration.

However, when data is sent to the off-site, long-term data repositories, things start to get messy. The off-site storage is not a mirror of the backup server, so the backup server's contents cannot simply be replicated. Instead, the data must be rehydrated before it can be sent to the off-site storage. And because the data is being sent off site, it will likely need to be deduplicated on the source side before it travels over the wire. In essence, deduplicated data is being rehydrated only to be deduplicated once again.
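
To illustrate just how much redundant work that involves, here is a rough Python sketch of the naive tier-migration path. The fixed-size chunking, SHA-256 signatures and dictionary chunk stores are simplifying assumptions, not a depiction of how any particular product is implemented:

    import hashlib

    CHUNK_SIZE = 64 * 1024  # illustrative fixed-size chunking

    def rehydrate(chunk_refs, chunk_store):
        """Rebuild the full backup stream from its deduplicated chunks."""
        return b"".join(chunk_store[ref] for ref in chunk_refs)

    def dedupe(data, chunk_store):
        """Re-chunk and re-hash the stream, storing each unique chunk once."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            signature = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(signature, chunk)  # keep only unseen chunks
            refs.append(signature)
        return refs

    def migrate_without_dash(chunk_refs, source_store, target_store):
        # Step 1: rehydrate the deduplicated backup into its full form.
        full_stream = rehydrate(chunk_refs, source_store)
        # Step 2: deduplicate the very same data all over again before it is
        # sent over the wire; this is the double work described above.
        return dedupe(full_stream, target_store)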

Because of these inefficiencies, backup and deduplication vendors have developed various solutions to make the process of moving data to a long-term repository more efficient. CommVault, for example, provides a technology known as Deduplication Accelerated Streaming Hash, or DASH for short.

DASH makes it possible to move deduplicated data across storage tiers without the need for rehydration. DASH provides the means for creating deduplication-aware secondary copies of your data. The copy process can be either disk-read optimized or network-read optimized.

For a disk-read optimized copy, signatures are read from the source disk's metadata and sent to the destination media agent, which compares them against the signatures in the destination deduplication database.

If the signature already exists in the destination database, then the data also exists in the destination, so there is no need to transmit the data again. Instead, only the signature references are transmitted. If, on the other hand, no signature is found, then the data is presumed to be new and so the new data is transmitted to the destination, and the destination database is updated.
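
In greatly simplified terms, that comparison logic might be sketched as follows, with plain dictionaries standing in for the source metadata, the chunk stores and the destination deduplication database. This is an illustration of the general technique, not CommVault's actual implementation:

    def disk_read_optimized_copy(source_signatures, source_store, dest_db, dest_store):
        """Copy a deduplicated backup by comparing signatures, transmitting
        data only for signatures the destination has never seen."""
        chunks_sent = 0
        for signature in source_signatures:      # read from source metadata,
            if signature in dest_db:             # not from the data itself
                dest_db[signature] += 1          # only a reference is recorded
            else:
                dest_store[signature] = source_store[signature]  # new chunk crosses the wire
                dest_db[signature] = 1
                chunks_sent += 1
        return chunks_sent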

FalconStor takes a somewhat similar approach with its File Interface Deduplication System (FDS) solution. The product makes use of a global deduplication repository, which makes WAN-optimized replication to remote datacenters possible.

Network-read optimized copy operations are ideal for low-bandwidth environments, but they are more I/O intensive than disk-read optimized copy operations. Whereas a disk-read optimized copy reads signatures from the primary disk's metadata, a network-read optimized copy reads the data itself from the primary disk and generates signatures on the fly, which are then sent to the destination media agent for comparison.
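
Under the same simplifying assumptions as the earlier sketches, the network-read optimized variant might look something like this. Notice that the data itself is read from the primary disk and hashed at the source, which is where the additional I/O comes from, while only signatures and previously unseen chunks need to travel over the network:

    import hashlib

    def network_read_optimized_copy(chunk_refs, source_store, dest_db, dest_store):
        """Read the data itself, generate signatures on the fly and ship only
        the chunks the destination is missing."""
        chunks_sent = 0
        for ref in chunk_refs:
            chunk = source_store[ref]                      # extra read I/O at the source
            signature = hashlib.sha256(chunk).hexdigest()  # signature generated here
            if signature in dest_db:
                dest_db[signature] += 1                    # destination already has it
            else:
                dest_store[signature] = chunk              # only new chunks cross the WAN
                dest_db[signature] = 1
                chunks_sent += 1
        return chunks_sent
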
Conclusion

Historically, deduplication has not worked very well in environments in which workflows move data among storage tiers according to retention requirements. The process of moving data typically requires the data to be rehydrated prior to the move operation. New technologies such as CommVault's DASH make it possible to copy deduplicated data to a secondary storage tier without first rehydrating the data.
