File System Input/Output

The Challenges

During 2010, the file system I/O challenges were embodied in increased HDD capacity without a corresponding performance increase; increasing numbers of file system clients, file counts, and I/O operations per process; the introduction of flash SSDs (solid-state disks) into the storage hierarchy, with the prospect of other NVMs (non-volatile memories) to come; the necessity for power efficiency; and the necessity to limit data movement. The File I/O thrust is currently focused on understanding how the challenges are changing and what to do to address them in 2011 and beyond.

The Areal Density Challenge

Hard disk drive (HDD) areal densities are increasing at a rate that far exceeds the rate at which data ingestion or extraction can occur. As illustrated in Figure 1, the number of bits stored on a HDD is increasing by X^2, while the rate of getting bits off it is only increasing by X. Furthermore, the total seek time (head seek to track + rotational latency) is remaining constant at 4-10 ms as the areal density increases. As the density increases, the number of operations, and consequently the number of seeks, increases. So we have to do more seeking at the same speed.

Figure 1: Graphic view of the areal density challenge
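
The X^2 versus X relationship above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical baseline drive of 1 TB at 100 MB/s (illustrative numbers that do not come from this report); the function names are likewise invented for illustration. It shows that the time to stream the whole drive grows by a factor of X at each density step.

```python
def full_drive_read_time(capacity_bytes, bandwidth_bytes_per_s):
    """Seconds needed to stream every byte off the drive once."""
    return capacity_bytes / bandwidth_bytes_per_s


def scale_drive(capacity_bytes, bandwidth_bytes_per_s, x):
    """Apply one density step: capacity grows by x^2, bandwidth by x."""
    return capacity_bytes * x ** 2, bandwidth_bytes_per_s * x


# Hypothetical baseline drive (assumed values, not measurements).
base_capacity = 1_000_000_000_000   # 1 TB
base_bandwidth = 100_000_000        # 100 MB/s

base_time = full_drive_read_time(base_capacity, base_bandwidth)
new_capacity, new_bandwidth = scale_drive(base_capacity, base_bandwidth, x=2)
new_time = full_drive_read_time(new_capacity, new_bandwidth)

# With x = 2 the drive holds 4x the data but takes 2x as long to read in full.
print(base_time, new_time, new_time / base_time)
```

Because seek time stays roughly constant, the same reasoning suggests that random access to the larger drive fares no better than streaming does.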

At DISKCON 2010 in September, the discussion of maintaining high areal density increases supported keeping HDDs the predominant storage device. Attendees focused on how to relegate NAND Flash SSDs to part of the enterprise device market. They do not seem to be looking to embrace the change promised by technologies such as NVM, and are instead focused on sustaining the increase of bit densities. SATA HDDs are being released that include integrated SSDs, with a controller that has the intelligence to store high-access data on the SSD and lesser-accessed data on the HDD. A few speakers suggested that the HDD industry focus a bit more R&D on performance over areal density increases, but that was not the prevailing opinion of the attendees.

The Increasing Client & I/O Quantity Challenge

In the HPC environment data is spread across an ever increasing number of HDDs without using their full capacities. This creates resilience and power consumption challenges as there is more hardware to manage and keep running in order to meet the bandwidth and IOPS requirements of large-scale HPC applications.

The processor manufacturers’ answer to Moore’s Law is to add more processor cores to the die and more dies to the motherboard. Figure 2 depicts this trend. The number of cores that are in a computer, a node, is growing faster than the node’s memory density. Consequently, the problems are being broken up into smaller chunks and more processes do smaller I/O operations.

Figure 2 : Increasing cores per die and dies per node

Because the number of file system clients continues to increase and each is doing a larger number of I/O operations, an increasing burden is placed on the metadata and data servers of file systems. Figure 3 and Figure 4 illustrate this challenge.

Figure 3 : Each client performs more file system I/O operations

Figure 4 : Larger number of clients further exacerbates increased file system I/O operation challenge

New Device Integration Challenge

There is a value consideration in using all SSD versus SSD/HDD hybrid versus all HDD. The three important cost measures are $/MB/s (dollars per megabyte per second, or how much the bandwidth costs), $/IOPS (dollars per I/O operation per second, or how much I/O operation count costs), and $/GB (dollars per gigabyte, or how much the storage space costs). It is important to understand how I/O is done on a HDD. Figure 5 shows how a HDD is organized. Tracks are the concentric circles on the HDD surface. Sectors are the arcs of a track that lie between the straight lines dividing the HDD surface. Typically the sectors are 512 bytes, but soon will be 4 KiB. Reads and writes are always done in whole numbers of sectors. A partial sector read returns all of the sector data, and a partial sector write reads all of the sector data, updates the new bytes, and writes the updated sector data back to the HDD.

Figure 5 : Anatomy of a HDD
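
The partial-sector write behavior described above is a read-modify-write cycle. The following is a minimal in-memory sketch of that cycle, not a model of any real drive or driver: the `SectorDevice` class and its method names are hypothetical, and only the 512-byte sector size comes from the text.

```python
SECTOR_SIZE = 512  # bytes, per the text; 4 KiB sectors are expected later


class SectorDevice:
    """Toy model of a device that only transfers whole sectors."""

    def __init__(self, num_sectors, sector_size=SECTOR_SIZE):
        self.sector_size = sector_size
        self.sectors = [bytes(sector_size) for _ in range(num_sectors)]
        self.sector_reads = 0
        self.sector_writes = 0

    def read_sector(self, index):
        self.sector_reads += 1
        return self.sectors[index]

    def write_sector(self, index, data):
        assert len(data) == self.sector_size
        self.sector_writes += 1
        self.sectors[index] = bytes(data)

    def write_bytes(self, offset, payload):
        """Write an arbitrary byte range, one sector at a time."""
        pos = 0
        while pos < len(payload):
            index, start = divmod(offset + pos, self.sector_size)
            chunk = payload[pos:pos + self.sector_size - start]
            if len(chunk) == self.sector_size:
                # Full, aligned sector: the old contents are not needed.
                self.write_sector(index, chunk)
            else:
                # Partial sector: read it all, patch the new bytes, write it back.
                buf = bytearray(self.read_sector(index))
                buf[start:start + len(chunk)] = chunk
                self.write_sector(index, buf)
            pos += len(chunk)


dev = SectorDevice(num_sectors=4)
dev.write_bytes(offset=500, payload=b"x" * 20)  # straddles sectors 0 and 1
print(dev.sector_reads, dev.sector_writes)
```

In this sketch a 20-byte write that straddles a sector boundary costs two sector reads and two sector writes, which is one reason many small, unaligned I/O operations are expensive on a HDD.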

To address these measures, products have been released during 2010 that use NAND flash-based SSDs as part of a storage hierarchy, both as an independent component and integrated into the HDD itself. NAND Flash-based solutions are expected to have a limited lifetime as a technology because of limits on scalability and endurance and because of relatively slow access times. It is believed that NAND Flash will not scale below 22 nm, because after that point there are so few electrons in a cell that determining the state of the cell becomes unreliable. One may only write to NAND Flash so many times before it will no longer work. That number of times is < 10^6.

Figure 6 shows how a SSD is organized. Pages are collections of 4 KiB of Flash memory. Blocks are collections of pages that total 256 KiB. Reads always return a whole number of pages, and a partial page read returns all the page data. Writes always erase the block and write a certain number of pages. Especially efficient implementations try to pre-erase blocks. Vendors use smart controllers to maximize the life of the drive, but this is not sufficient to support applications with continuously streaming data, which require at least 10^9 writes.

Figure 6 : Anatomy of a SSD
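
The page and block sizes above imply how much work a naive in-place page update could cost. The following is a minimal sketch, assuming a controller with no remapping, wear leveling, or pre-erased blocks (real controllers use such techniques precisely to avoid this case). The `FlashBlock` class is hypothetical; only the 4 KiB and 256 KiB sizes come from the text.

```python
PAGE_SIZE = 4 * 1024       # 4 KiB pages, per the text
BLOCK_SIZE = 256 * 1024    # 256 KiB blocks, per the text
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE


class FlashBlock:
    """Toy model of one flash block updated naively in place."""

    def __init__(self):
        self.pages = [bytes(PAGE_SIZE)] * PAGES_PER_BLOCK
        self.erase_count = 0
        self.pages_programmed = 0

    def update_page(self, page_index, data):
        """Erase the whole block, then program every page back."""
        assert len(data) == PAGE_SIZE
        contents = list(self.pages)        # copy out the current pages
        contents[page_index] = data        # patch the one page that changed
        self.erase_count += 1              # whole-block erase
        self.pages = contents              # reprogram all pages
        self.pages_programmed += PAGES_PER_BLOCK


block = FlashBlock()
block.update_page(3, b"\xff" * PAGE_SIZE)
pages_requested = 1
write_amplification = block.pages_programmed / pages_requested
print(PAGES_PER_BLOCK, block.erase_count, write_amplification)
```

Under these assumptions, changing one 4 KiB page programs 64 pages and consumes one erase cycle of the block, which is why endurance and pre-erasing matter.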

Finally, while NAND Flash is faster than HDD at reading and writing values to the drive, it is orders of magnitude slower than emerging technologies like PCM (phase-change memory). Even technologies like PCM have issues, such as read/write imbalance: it takes longer to write a value to PCM than to read it. Figure 7 shows how a PCM device is organized. A cell holds 1 bit (in SLC) or more (in MLC) of PCM. Cell banks are collections of PCM cells, where each cell can be individually read or written. Reads can be any arbitrary number of bytes. Writes only affect the cells of the byte(s) written.

Figure 7 : Anatomy of a PCM Device
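
The three device anatomies differ in the smallest unit that a write disturbs. The following is a rough comparison, assuming the simplest possible model of each device (whole sectors on a HDD, whole erase blocks on a SSD with no controller remapping, exact bytes on PCM). The function names are hypothetical and the sizes are those given in the text.

```python
SECTOR_SIZE = 512               # HDD sector size from the text
FLASH_BLOCK_SIZE = 256 * 1024   # SSD erase block size from the text


def units_spanned(offset, length, unit_size):
    """Number of fixed-size units touched by bytes [offset, offset + length)."""
    first = offset // unit_size
    last = (offset + length - 1) // unit_size
    return last - first + 1


def bytes_rewritten(device, offset, length):
    """Bytes physically rewritten for a write of `length` > 0 bytes (naive model)."""
    if device == "hdd":
        return units_spanned(offset, length, SECTOR_SIZE) * SECTOR_SIZE
    if device == "ssd":
        return units_spanned(offset, length, FLASH_BLOCK_SIZE) * FLASH_BLOCK_SIZE
    if device == "pcm":
        return length  # only the cells of the bytes written are affected
    raise ValueError(f"unknown device: {device}")


for device in ("hdd", "ssd", "pcm"):
    print(device, bytes_rewritten(device, offset=1000, length=16))
```

For a 16-byte update this naive model rewrites 512 bytes on the HDD, 262144 bytes on the SSD, and 16 bytes on PCM, which illustrates why byte-addressable NVM suits small I/O operations.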

Future technologies, like Race Track or CNT (carbon nanotube) memory, promise improvements over PCM not only in scalability and endurance, but also balanced, faster access times.  In the longer-term view, these emerging and future technologies show promise to change how HPC systems are built. The technology area that is most directly applicable to the FSIO thrust is that of NVM. NVM components are expected to migrate first onto the disk drive bus (SATA and SAS), then onto the motherboard peripheral device bus (PCIe), and finally onto the motherboard memory or high-speed buses, either supplementing or replacing DRAM. Furthermore, these NVM components are expected to have near-DRAM speeds with small disk drive densities initially, and progress to better-than-DRAM speeds and densities of the largest current disk drives.

The FSIO thrust will investigate how applications behave with such devices and consider how application architectures may need to change to take advantage of them as they progress from the disk drive bus to the peripheral bus to the memory/high-speed buses. Once the NVM components make it to the memory/high-speed buses, another challenge is before us. Already processors become idle waiting for data to be transferred from DRAM to the CPU. The thrust will want to investigate ways to move processing out to the NVM to limit data movement to only that data needed for further computing on the CPU.

UCSD NVM Emulation

New NVM technologies, such as PCM, STT (spin transfer torque) MRAM (magnetoresistive random access memory), memristors, and CNT (carbon nanotube) memory, will appear in future systems. They are nearing DRAM performance, with lower power consumption and improved density as process technology scales. Figure 8 depicts such a system.

Figure 8 : Storage system architecture of the future

An extensive suite of test applications has been assembled and used to evaluate this work. The workloads include three types of applications: microbenchmarks, data-intensive applications, and memory-intensive applications.

The microbenchmarks include XDD, building the Linux kernel, patching the Linux kernel, and an on-disk sort. These measure basic device and file system performance.

The data-intensive applications include the OLTP workload from Sysbench and a computational biology database supplied by SDSC. These measure system performance for structured data access.

The memory-intensive applications include the NAS parallel benchmarks, GAMESS, and DGEMM. These applications do not perform significant explicit I/O, but we use data sets much larger than the available memory, causing them to page frequently. They allow us to evaluate the devices’ suitability as a paging store and stress-test the NVM system for scalability issues.

In the future all storage will look like memory, and the current storage architectures and use methods, designed for disks where all disk I/O goes through the operating system, will not be adequate.

The operating system and the file system destroy all the performance gains from fast, non-volatile memories. The solution is to bypass the operating system with a transactional storage system that uses memory-style management for non-volatile storage. Its abstractions will provide read and write access and automatic storage management without involving the operating system. In addition, UCSD is investigating how to preserve the strong protection guarantees provided by the file and operating system.

TAMU Hybrid I/O with PCM/SSD/HDD Combinations

NAND Flash SSDs have different characteristics than HDDs. PCM, and other NVMs, have different characteristics than NAND Flash SSDs. Nevertheless, Linux device drivers and I/O schedulers assume any disk device is a spinning magnetic HDD. Optimal performance for a SSD or PCM requires understanding the device materials and how controllers manage them.

TAMU has looked at cache replacement policies that minimize the writebacks going to PCM main memory. Because writes to PCM are more expensive than reads in both latency and power consumption, reducing write traffic increases PCM lifetime and improves performance and energy consumption.

TAMU’s initial approach devised a number of cache replacement policies. These new policies favor retaining dirty blocks in the last-level cache longer than clean blocks in order to reduce the writeback traffic. The results show that these policies can reduce the writeback traffic by about 20-25% on average and improve energy consumption by about 20% without any impact on memory system performance.

TAMU is now exploring integrating a PCM memory model into current architectural simulators to explore more thoroughly the implications of the current work. TAMU is also starting some work on novel cache organizations to further reduce the writeback traffic to PCM-based main memories.

TAMU developed a new cache organization that allowed more writeback reductions while keeping the miss rate low. This work was recently submitted to the HPCA conference. Adaptive policies that take application behavior into account to reduce the writebacks are being examined.

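
The idea of retaining dirty blocks longer than clean blocks can be illustrated with a toy cache. The following is a minimal sketch, assuming a small fully associative cache and a simple "evict the oldest clean block first" rule. It is not TAMU's actual policy or cache organization, and the class names and access trace are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache that counts writebacks on eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # address -> dirty flag, oldest first
        self.writebacks = 0

    def access(self, address, is_write):
        if address in self.blocks:
            dirty = self.blocks.pop(address) or is_write
        else:
            if len(self.blocks) >= self.capacity:
                victim = self._choose_victim()
                if self.blocks.pop(victim):
                    self.writebacks += 1   # dirty victim goes back to memory
            dirty = is_write
        self.blocks[address] = dirty       # (re)insert as most recently used

    def _choose_victim(self):
        return next(iter(self.blocks))     # plain LRU: oldest block


class DirtyAwareLRU(LRUCache):
    """Evicts the oldest clean block first, keeping dirty blocks longer."""

    def _choose_victim(self):
        for address, dirty in self.blocks.items():
            if not dirty:
                return address
        return next(iter(self.blocks))     # all dirty: fall back to plain LRU


# Hypothetical trace of (address, is_write) pairs.
trace = [("A", True), ("B", False), ("C", False),
         ("A", True), ("D", False), ("A", True)]

plain, aware = LRUCache(capacity=2), DirtyAwareLRU(capacity=2)
for address, is_write in trace:
    plain.access(address, is_write)
    aware.access(address, is_write)
print(plain.writebacks, aware.writebacks)
```

On this trace the plain LRU cache evicts the dirty block A once, while the dirty-aware variant never writes it back. Blocks still dirty at the end are not flushed or counted, and a policy like this can raise the miss rate on other traces, so the sketch says nothing about the measured 20-25% reduction.
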
Changes to the file system will almost certainly be required. TAMU is exploring ideas on how to build new file systems for NVMs that can be directly attached to the memory bus. To this end, TAMU is looking at the potential of memory-bus-attached PCM in the future. Some initial exploration of virtual-memory-style page management for file system page allocation and management is taking place. TAMU hopes to be able to report some more concrete ideas and possibly some initial results by the end of the first quarter of 2011.

TAMU has carried out some initial work on getting the memory management infrastructure to work with the file system. Some tests of the developed code are being conducted. There are a number of issues that need to be resolved. TAMU is trying to minimize the impact of write ordering on performance and is exploring a number of novel ideas to speed up file system performance on byte-addressable NVMs.