Thursday, December 31, 2009

Online Hierarchical Storage Manager for Linux

Intel, SanDisk, and Samsung are investing billions of dollars in SSD technology and manufacturing capacity. Unfortunately, because building manufacturing facilities is extremely expensive, SSD manufacturing capacity is not likely to exceed HDD manufacturing capacity for at least 10 years, and it may take 20 years or more. Most data center workloads lean heavily toward database applications, which generate random read/write disk activity. For random read/write activity, the performance of an SSD is 10x to 100x that of a single rotational disk. Unfortunately, the cost is also 10x to 100x that of a single rotational disk.

Because SSD manufacturing capacity is limited, most applications are going to remain on rotational disks for the foreseeable future. We have developed OHSM to allow SSD and traditional HDD storage (including RAID) to be merged seamlessly into a single operational environment, thus leveraging SSD performance while using only a modest amount of SSD capacity.

In an OHSM-enabled environment, data is migrated between high-performance SSD storage and traditional storage based on user-defined policies. If widely deployed, OHSM can therefore improve computer performance significantly without a commensurate increase in cost. Because OHSM is being developed as open source software, it also eliminates the licensing issues and costs involved in using storage management software. Because OHSM works online, data is migrated without any file system downtime and without any changes to the existing namespace.

Online Hierarchical Storage Manager (OHSM) is the first attempt at an enterprise-level open source data storage manager that automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices and copy data to faster disk drives when needed. In effect, OHSM turns the fast disk drives into caches for the slower mass storage devices. Data center administrators set policies that determine which data can safely be moved to slower devices and which data should stay on the fast devices. When data is moved manually instead, the data center suffers downtime and changes to the namespace.

Policy rules specify both initial allocation destinations and relocation destinations as priority-ordered lists of placement classes. Files are allocated in the first placement class in the list if free space permits, in the second class if no free space is available in the first, and so forth.
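The priority-ordered fallback described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OHSM's actual code: the `PlacementClass` type, the `allocate` function, and the capacities are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PlacementClass:
    """Hypothetical stand-in for a storage tier (not an OHSM structure)."""
    name: str
    capacity: int  # total space, in arbitrary units
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


def allocate(size, classes):
    """Place a file in the first class of the priority list that has room."""
    for pc in classes:
        if pc.free() >= size:
            pc.used += size
            return pc.name
    return None  # no class in the list has enough free space


ssd = PlacementClass("ssd", capacity=100)
hdd = PlacementClass("hdd", capacity=1000)

# The first file fits on the SSD tier; the second does not, so it falls
# through to the next class in the list.
placements = [allocate(60, [ssd, hdd]), allocate(60, [ssd, hdd])]
```

The order of the list is the policy: putting `ssd` first means the fast tier is filled before the slow one is touched.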

The policies are divided into two broad categories: allocation policies and relocation policies. Allocation policies come into play whenever a new file is created on the file system. The allocation of the physical blocks is decided by the policies set by the administrators. If none of the criteria match, the file falls back to the default allocation policy used by the file system. Relocation policies, in contrast, take effect at different time intervals, as and when the administrator enforces them. Because the relocation of data happens at a lower level than the file system, it is completely hidden from file system users. Deciding which data is eligible for relocation does require a complete file system scan, but such scans are not frequent.
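As a rough sketch of how an allocation policy with a default fallback might look, the example below matches a new file against an ordered list of rules. The rule format (file name patterns) and the names `ALLOCATION_RULES`, `DEFAULT_PLACEMENT`, and `placement_for` are assumptions made for illustration; the post does not say which criteria OHSM rules actually use.

```python
from fnmatch import fnmatch

# Hypothetical rules: (file name pattern, priority-ordered placement classes).
ALLOCATION_RULES = [
    ("*.db", ["ssd", "hdd"]),  # database files prefer the fast tier
    ("*.log", ["hdd"]),        # logs go straight to the slow tier
]

# Stands in for the file system's own default allocation policy.
DEFAULT_PLACEMENT = ["hdd"]


def placement_for(filename):
    """Return the placement list of the first matching rule, else the default."""
    for pattern, classes in ALLOCATION_RULES:
        if fnmatch(filename, pattern):
            return classes
    return DEFAULT_PLACEMENT
```

With these rules, `placement_for("orders.db")` selects the SSD-first list, while a file such as `notes.txt` matches nothing and lands on the default.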

Fundamentally, enterprises organize their digital information as hierarchies (directories) of files. Files are usually closely associated with a business purpose: documents, tables of transaction records, images, audio tracks, and other digital business objects are all conveniently represented as files, each with a business value. Files are therefore obvious objects around which to optimize storage and I/O cost and performance.

In a typical HSM scenario, data files that are frequently used are stored on disk drives but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user does reuse a file that is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the available disk storage, but since only rarely used files are on tape, most users will usually not notice any slowdown.
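The age-based selection in this scenario can be sketched as follows. The 90-day threshold is an assumed stand-in for "a few months", and `select_for_migration`, the paths, and the timestamps are hypothetical; a real HSM would read access times from the file system rather than from a dictionary.

```python
from datetime import datetime, timedelta


def select_for_migration(last_access, now, max_idle=timedelta(days=90)):
    """Return the paths whose last access is older than max_idle, sorted."""
    return sorted(
        path for path, accessed in last_access.items()
        if now - accessed > max_idle
    )


now = datetime(2009, 12, 31)
last_access = {
    "/data/report.doc": datetime(2009, 12, 1),   # used recently, stays on disk
    "/data/archive.tar": datetime(2009, 6, 1),   # idle for months, candidate for tape
}
candidates = select_for_migration(last_access, now)
```

Recalling a file from tape is the reverse step: on access, the file is copied back to disk and its last-access time is refreshed, so it drops out of the candidate list.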
