The Case for Online Dedupe

By Michael
January 14, 2011

With data growth challenging IT departments everywhere, deduplication (getting rid of redundant data throughout a system) is gaining popularity. Because the process can hurt system performance, though, it has historically been applied to data that’s already been archived, typically in a virtual tape library. That’s changing as dedupe algorithms and disk controller hardware improve, although offline versus online dedupe remains a hot subject of debate among storage pros.

Organizations worried about the performance hit from deduplication have gravitated toward performing the process offline. One drawback to that approach is that duplicate data must be written to disk before it can be deduplicated. Real-time dedupe, on the other hand, sucks up CPU and I/O resources, but it relieves some of the demand on hard disk controllers because all those duplicated files aren’t being written to and reread from a drive.
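To make the trade-off concrete, here is a minimal sketch of the online (inline) approach, assuming fixed-size 4 KB blocks and SHA-256 fingerprints. The class name `InlineDedupeStore` and its fields are hypothetical, and a dictionary stands in for the disk; real products differ in chunking, indexing and collision handling.

```python
import hashlib


class InlineDedupeStore:
    """Toy block store that checks for duplicates BEFORE writing (hypothetical)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}       # fingerprint -> block bytes (stands in for the disk)
        self.files = {}        # file name -> ordered list of fingerprints
        self.disk_writes = 0   # number of blocks actually written

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            # Hashing costs CPU time on every write...
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:
                # ...but only blocks never seen before reach the disk.
                self.blocks[fingerprint] = block
                self.disk_writes += 1
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read(self, name):
        # Rebuild the file from its list of block fingerprints.
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = InlineDedupeStore()
report = b"A" * 4096 + b"B" * 4096           # two distinct blocks
store.write("report_v1.doc", report)
store.write("report_v1_copy.doc", report)    # exact duplicate of the first file
print(store.disk_writes)                     # 2, not 4
```

The second file costs hashing time but no block writes, which is the relief on the disk side described above. An offline process would have written all four blocks first and reclaimed two of them later.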

Online deduplication works better for some applications than others. For example, it works well in a virtual server or virtual desktop environment because virtual disk images often share many of the same disk, operating system and application files. Home directories shared by users are also a fertile area for online dedupe because lots of redundant data is created as users pass around documents for revisions. Very large files, transactional databases, encrypted files and compressed files like PDFs and JPGs won’t benefit much from dedupe.
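A rough way to see why the workload matters is to estimate a dedupe ratio (logical blocks divided by unique blocks) for different kinds of data. The sketch below assumes fixed-size 4 KB blocks; the `dedupe_ratio` helper and the sample data are made up for illustration, with random bytes standing in both for a shared base image and for encrypted content.

```python
import hashlib
import os


def dedupe_ratio(data: bytes, block_size: int = 4096) -> float:
    """Logical blocks divided by unique blocks, using fixed-size chunking."""
    fingerprints = [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]
    if not fingerprints:
        return 1.0
    return len(fingerprints) / len(set(fingerprints))


# Hypothetical data: ten virtual disks that share an 8-block base image,
# each with 2 blocks of its own unique data.
base_image = os.urandom(4096 * 8)
vm_disks = b"".join(base_image + os.urandom(4096 * 2) for _ in range(10))

# Random bytes stand in for encrypted data: no two blocks match.
encrypted = os.urandom(len(vm_disks))

print(round(dedupe_ratio(vm_disks), 2))    # about 3.57 (100 blocks, 28 unique)
print(round(dedupe_ratio(encrypted), 2))   # 1.0, nothing to reclaim
```

Running something like this against a sample of real volumes before buying anything gives a first-order sense of whether a workload is a good dedupe candidate.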

When considering deduplication, it’s wise to look for a flexible solution. “At a minimum, the system should allow different deduplication configurations for each storage volume,” Kurt Marko recently wrote at Processor.com. “Better yet are dynamic algorithms that can adjust deduplication parameters on the fly based on the underlying data and current system resource consumption.”
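As a sketch of what per-volume settings plus a simple dynamic rule might look like, consider the following. Everything here is an assumption for illustration: the `VolumePolicy` fields, the thresholds and the volume names do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass
class VolumePolicy:
    """Hypothetical per-volume dedupe settings."""
    inline_dedupe: bool
    max_cpu_load: float = 0.80   # above this load, defer dedupe to protect performance
    min_ratio: float = 1.2       # below this observed ratio, dedupe isn't worth the effort


def should_dedupe_inline(policy: VolumePolicy, cpu_load: float, observed_ratio: float) -> bool:
    """Decide on the fly, from current load and the data seen so far."""
    if not policy.inline_dedupe:
        return False
    if cpu_load > policy.max_cpu_load:
        return False
    return observed_ratio >= policy.min_ratio


# Different configurations for different volumes.
policies = {
    "vdi_images": VolumePolicy(inline_dedupe=True),
    "oltp_db": VolumePolicy(inline_dedupe=False),
}

print(should_dedupe_inline(policies["vdi_images"], cpu_load=0.50, observed_ratio=3.0))  # True
print(should_dedupe_inline(policies["vdi_images"], cpu_load=0.95, observed_ratio=3.0))  # False
print(should_dedupe_inline(policies["oltp_db"], cpu_load=0.10, observed_ratio=3.0))     # False
```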

He reminded administrators: “The primary goal of any deduplication implementation is to make optimal use of available storage capacity at a total cost, including any software licenses and administrative overhead, that’s cheaper than merely adding additional array space.”
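That cost test is easy to run as back-of-the-envelope arithmetic. The sketch below compares buying raw capacity against buying less capacity plus dedupe licenses and administration; the function name and every dollar figure are hypothetical placeholders to be replaced with real quotes.

```python
def compare_costs(logical_tb, dedupe_ratio, cost_per_tb, license_cost, admin_cost):
    """Return (dedupe_is_cheaper, raw_cost, dedupe_cost) for a given data set."""
    # Option 1: just buy enough array space for everything.
    raw_cost = logical_tb * cost_per_tb
    # Option 2: buy space for the deduplicated data, plus licenses and admin time.
    dedupe_cost = (logical_tb / dedupe_ratio) * cost_per_tb + license_cost + admin_cost
    return dedupe_cost < raw_cost, raw_cost, dedupe_cost


# Hypothetical numbers: 100 TB of data at $1,000 per TB,
# $20,000 in licenses and $10,000 in administrative overhead.
print(compare_costs(100, 2.0, 1000, 20000, 10000))  # (True, 100000, 80000.0)

# Same costs, but the data only dedupes at 1.2:1, so adding disks is cheaper.
cheaper, raw, dedupe = compare_costs(100, 1.2, 1000, 20000, 10000)
print(cheaper)  # False
```

The break-even point depends heavily on the dedupe ratio the data actually achieves, which is one more reason to measure it before committing.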

I’d like to know what you think. Is online deduplication an option for you? How do you handle dedupe? Does it always come down to a balanced approach?
