Hierarchy storage in Tachyon
Jie.huang@intel.com, haoyuan.li@gmail.com, mingfei.shi@intel.com

Contents
Introduction
Design consideration
  Feature overview
  Usage design
System architecture
  Read/Write workflow in hierarchy storage
  Components
Non-goal
Constraints and Design tradeoffs
Future plans

Introduction

In most cases, memory is insufficient to hold all of the hot data, so some cached data must be flushed to the next storage level when space runs out. Currently, Tachyon handles this with the CACHE_THROUGH write type: part of the data is kept in memory while the entire data set is persisted to the underlying file system, which has enough space. This mechanism keeps the whole data set accessible and reliable when memory cannot hold it, but it brings two cost and performance drawbacks:
1. It can introduce network overhead when the underlying file system is distributed and keeps multiple replicas.
2. If the entire data set (both the cached and non-cached parts) is flushed to HDD in local file system mode, it causes a large performance penalty.
The ideal approach is to split the data into two parts: one part in main memory and the rest on high-speed secondary storage devices with larger capacity. Leveraging a storage hierarchy provides a trade-off between capacity and performance: fast devices such as SSDs can serve as a secondary tier that balances between memory and HDD. Cached data can then be flushed not only within memory but also out to external storage, tier by tier. All newly incoming data is cached on the first tier; when that tier runs out of space, old data is swapped out to the succeeding storage tier. These are the major reasons to introduce hierarchy storage in Tachyon.
Design consideration

Feature overview

In general, hierarchy storage in Tachyon introduces several more storage layers beyond the existing single in-memory cache tier. Newly arriving data is always cached on the top-level storage (such as memory) for speed. When that level runs out of space, the old data is swapped out to its successor, which is expected to have more capacity but lower read/write performance. To retrieve cached data, the end user can read block files from any storage layer in the hierarchy. This increases the total cache space while trading off some read/write performance.

[Figure: storage tiers, e.g. Mem -> SSD -> HDD -> ...]

Usage design

To support hierarchy storage, Tachyon introduces two sets of configuration items, covering the worker conf and the common conf respectively.

Worker conf

1. Configure the storage tiers and their storage directory lists. The admin specifies the storage hierarchy for each worker with the following configuration items. Each storage level must have at least one storage folder. The maximum number of storage levels per worker is 5; you can enable 1 to N (N <= 5) storage tiers in order, and any non-configured layer is omitted. The storage directories within a single layer are delimited by commas. If the folders do not exist, the worker tries to pre-create them during the initialization phase.

data.level1.dirs=dir1,dir2;  # high speed layer / small capacity
data.level2.dirs=dir1,dir2;  # medium speed layer / medium capacity
data.level3.dirs=dir1,dir2;  # low speed layer / large capacity

2. Configure the upper bound of storage capacity. Each storage layer has a specific
quota for every storage dir, with per-dir values separated by commas. If the number of quota items does not match the number of storage folders, the configuration is invalid.

data.level1.dir.quota=100G,200G;  # high speed layer / small capacity
data.level2.dir.quota=500G,400G;  # medium speed layer / medium capacity
data.level3.dir.quota=1T,1T;      # low speed layer / large capacity

3. Configure the storage tier's alias. The hierarchy storage introduces several storage layers; some of them may be local, and some may leverage memory or fast storage devices. In order to tell whether the current block is in memory or not, and to stay compatible with previous Tachyon versions, a storage alias must be specified for each tier. The default alias for each tier is "unknown". Reading this alias also translates the storage level into a readable name.

data.level1.dir.alias=mem;
data.level2.dir.alias=ssd;
data.level3.dir.alias=hdd;
data.level4.dir.alias=hdfs;
data.level5.dir.alias=s3;

The storage aliases are pre-defined in Tachyon (mem, ssd, hdd, hdfs, s3, clusterfs). If you specify an unknown type, you need to register it beforehand; otherwise the Tachyon hierarchy store won't recognize it. All data stored in the mem layer is regarded as cached in memory, and IsInMemory() returns true.

System architecture

Read/Write workflow in hierarchy storage

Every storage layer is expressed as a StorageTier. Each worker maintains an array of StorageTiers, and each tier's child tier can be retrieved from the array. Every StorageTier consists of several StorageDirs, and every StorageDir requires either a StorageBlockReader or a StorageBlockWriter to read and write block files. Different Reader and Writer implementations define the concrete read/write behavior. The StorageDir selects the proper StorageBlockReader or StorageBlockWriter by analyzing the scheme of its dir path.
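The StorageTier/StorageDir layering described above can be sketched as follows. This is a minimal illustration: the class and interface names come from the document, but all fields, constructors, and method signatures here are simplified assumptions, not Tachyon's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Generic block reader/writer interfaces, per the Components section.
interface StorageBlockReader {
    byte[] readByteBuffer(int offset, long length);
}

interface StorageBlockWriter {
    int appendCurrentBuffer(byte[] buf, int offset, int length);
}

// A storage container within one tier, tracking its own quota and usage.
class StorageDir {
    final String dirPath;
    final long quotaBytes;
    long usedBytes;

    StorageDir(String dirPath, long quotaBytes) {
        this.dirPath = dirPath;
        this.quotaBytes = quotaBytes;
    }

    long freeBytes() {
        return quotaBytes - usedBytes;
    }
}

// One storage layer; tiers form a linked-list-like structure, each
// pointing to its successor (null for the last level).
class StorageTier {
    final int level;    // 0 = top (fastest) tier
    final String alias; // e.g. "mem", "ssd", "hdd"
    final List<StorageDir> dirs = new ArrayList<>();
    StorageTier nextTier;

    StorageTier(int level, String alias) {
        this.level = level;
        this.alias = alias;
    }
}
```

A worker would keep only a reference to the top-level tier and walk `nextTier` links when space runs out, mirroring the swap-out chain described below.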
Write

When writing cache data to the local worker, the client first requests a block id from the master, and then requests storage space for that block id from the local worker. The worker always tries to satisfy the request in the very first storage tier (level 0):
a) Randomly pick one StorageDir and check whether it can fit the requested block size.
b) If it can, return that StorageDir to the client.
c) Otherwise, probe the neighboring StorageDirs in turn until one can fit the requested block size, and return that one.
d) If no StorageDir in the layer has enough space, the worker swaps some block files out (based on a certain elimination algorithm) to the child storage tier, until the free space is big enough for the incoming block file.
e) The successor layer then repeats the same behavior, starting from a).
f) If the last storage layer is reached, the eliminated files are evicted (or an OutOfSpace exception is thrown).

[Figure: write workflow - 1. RequestSpace and get StorageDir; 2. getblockwriter().appendcurrentbuffer. The client talks to the master and worker; StorageTier_0..StorageTier_3 swap out blocks downward (with heartbeat and space counter), ending in eviction to the distributed file system.]

Read

When reading data from a StorageTier, the client obtains the (extended) blockinfo to get the location and storage information (i.e., storage ids). The client sends that storage information along with the block id to the local worker and gets back a StorageDir, then obtains the corresponding StorageBlockReader and reads the requested data. If no data is found on the local node, the client sends the request to a remote node and receives the data remotely as usual; the remote worker prepares the requested data and sends it out through the DataServer. If the data read from the network needs to be re-cached on the local worker, the workflow follows the write operation described above.
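The tier-by-tier space request described in the write workflow (steps a through d) can be sketched as follows. The free-bytes array representation and the `pickDir` helper are hypothetical simplifications for illustration, not Tachyon's actual allocator.

```java
import java.util.Random;

// Sketch of dir selection within one storage tier: start at a random
// StorageDir (step a), then scan neighbors (step c); if none fits,
// signal the caller to evict to the child tier (step d).
class TierAllocator {
    private static final Random RAND = new Random();

    /**
     * Returns the index of a dir whose free space can fit blockSize,
     * starting from a random dir and wrapping around; -1 if none fits.
     */
    static int pickDir(long[] freeBytes, long blockSize) {
        int start = RAND.nextInt(freeBytes.length); // a) random pickup
        for (int i = 0; i < freeBytes.length; i++) {
            int idx = (start + i) % freeBytes.length; // c) probe neighbors
            if (freeBytes[idx] >= blockSize) {
                return idx; // b) success: this dir fits the block
            }
        }
        return -1; // d) no dir fits: caller must swap out to the child tier
    }
}
```

When `pickDir` returns -1, the worker would run its elimination algorithm, free enough space (or descend to the next tier), and retry, repeating until the last tier evicts to the underlying file system.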
[Figure: read workflow - the client gets the (client)blockinfo from the master, then calls getblockreader().readbytebuffer against the worker's StorageTier_0..StorageTier_3, backed by the distributed file system.]

Components

Data store

1. StorageTier - the storage tier, which manages the cached blocks, the storage containers, and their corresponding information (e.g., capacity, used/free space, etc.). The StorageTier is a linked-list-like structure: every StorageTier points to its next StorageTier instance, unless it is the last storage level. The WorkerStorage only keeps a reference to the frontier (i.e., the top-level StorageTier) and requests storage space and caches/frees blocks through it.

2. StorageDir - the storage container in each hierarchical layer, which provides the basic data manipulations and migrations. The cache data readers and writers are instantiated based on the scheme of the StorageDir (more details in the next bullet, StorageBlockReader & StorageBlockWriter). It also handles data migration between layers, or between containers within the same storage tier. For example, moving or copying a block file from one layer to another is common during swap-out and eviction, so StorageDir provides a common way to migrate data from one container to another. If the source and destination StorageDirs share the same implementation, it can call the file system's existing move/copy APIs directly; if not, it reads from one StorageDir and writes the input data to the destination.

3. StorageBlockReader & StorageBlockWriter - the generic cache reader and writer interfaces, which define the block read/write APIs. To read a cached block, the implementer needs to provide ByteBuffer readbytebuffer(int offset, long length); to
write the block file, the implementer needs to provide int appendcurrentbuffer(byte[] buf, int offset, int length). StorageBlockReader and StorageBlockWriter can be customized to support different storage systems, such as local file systems, shared file systems, etc.

[Class diagram: StorageTier (fields mworkerconf, mstoragelevel, mstoragedirs, mstoragedircapcities, mstoragedirfreespace, mstoragedirusedspace, mnexstoragetier; methods requestspace(), getstoragedir(), getstoragefile(), freeblocks(), storagetiereviction()) uses StorageDir (field mstoragedirname; methods getblockwriter(), getblockreader(), getfilepath(), _copyblock(), getblocklength(), existsblock(), deleteblock(), moveblock(), copyblock()), which in turn uses the StorageBlockReader (readbytebuffer()) and StorageBlockWriter (appendcurrentbuffer()) interfaces, implemented by e.g. StorageBlockReaderLocalFS.]

Storage level information

Hierarchy storage returns additional storage level information that hints whether a block is mem-local, ssd-local, and so on. This helps the computation scheduler allocate resources more purposefully for better performance. The existing blockinfo is extended by:

1. BlockInfo.getStorageIds - returns a list of storage ids for all storing nodes. Each storage id consists of the storage level and its alias, formatted as storagelevel_alias. For example, a storage id of 0_mem means the block is stored in the first storage tier, which is the memory layer. Other wrappers can feed the storage information back to a particular computation framework through its own APIs; for example, following the API design of Hadoop or Spark to report whether a block file is memory-local based on the StorageId.

2. BlockInfo.IsInMemory - returns true if the block is stored in the storage tier aliased mem.

Non-goal

The hierarchy won't replace the underlying file system, which acts as the persistent storage pool beneath the Tachyon caching tier.
Even if a hierarchy storage layer is set up on a persistent device, it is still regarded as temporary cache storage.
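The storagelevel_alias format from the Storage level information section can be illustrated with a minimal parser. The StorageId class below and its helper method are hypothetical; only the "level_alias" format itself and the mem-means-in-memory rule come from this document.

```java
// Parses a storage id such as "0_mem" or "1_ssd" into its storage
// level and alias, per the storagelevel_alias convention.
class StorageId {
    final int level;
    final String alias;

    StorageId(String id) {
        int sep = id.indexOf('_');
        this.level = Integer.parseInt(id.substring(0, sep));
        this.alias = id.substring(sep + 1);
    }

    /** Mirrors BlockInfo.IsInMemory: true only for the tier aliased "mem". */
    boolean isInMemory() {
        return "mem".equals(alias);
    }
}
```

A scheduler-side wrapper could apply the same parsing to decide whether a task's input block is mem-local before placing the task.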
Constraints and Design tradeoffs

1. The blockinfo only knows which storage layer a file goes to. When reading data from that layer, the worker needs to find the block by checking its StorageDirs one by one. This may not be very efficient, but it avoids extra burden on the master node: since all blockinfo is kept on the master, recording more detail there would increase the master's memory usage.

2. Currently, the master node only maintains the memory usage status for each worker. After adding the hierarchy storage mechanism, we need to add usage messages for each storage layer on every worker. One solution is to maintain only the total usage information, as before; the other is to report more details for each node.

Future plans

None of the following future plans will be covered in the first phase's implementation.

1. Async eviction: add a threshold to avoid long blocking during space sweeping; for example, if the free space drops below 20%, sweep data asynchronously.
2. Add more elimination algorithms besides LRU, and make them pluggable.
3. Work as a read cache. The user can choose the promote strategy: none, exclusive, or inclusive. None means the user reads from the storage layer where the data originally lives. Exclusive means there is no data overlap between storage tiers: when data is promoted to memory, it is removed from its original residence. Conversely, the inclusive strategy keeps re-cached (promoted) data in its original place as well, so there are two copies.
a) NONE: read from the existing storage tier directly.
b) KEEP (inclusive): promote to RAM and also keep the original copy.
c) SWAPIN (exclusive): promote to RAM and delete the original copy.
4. Dynamic data movement between storage layers based on statistics of hot and cold data. This helps increase the cache hit ratio in the higher layers and decrease data exchange between layers.
5. Write/Read to/from a user-specified storage layer.
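As a sketch of future plan 2 (pluggable elimination algorithms), an eviction-policy interface with an LRU implementation might look like the following. The interface and class names are assumptions for illustration, not part of the current design; the LRU behavior is realized with an access-ordered LinkedHashMap.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical pluggable policy: the worker reports block accesses and
// asks for a victim when a tier needs to swap blocks to its child tier.
interface EvictionPolicy {
    void onAccess(long blockId);
    Long pickVictim(); // block to swap out, or null if nothing is cached
}

class LruPolicy implements EvictionPolicy {
    // accessOrder=true keeps the least recently used entry at the front
    private final LinkedHashMap<Long, Boolean> order =
        new LinkedHashMap<>(16, 0.75f, true);

    public void onAccess(long blockId) {
        order.put(blockId, Boolean.TRUE); // insert or refresh recency
    }

    public Long pickVictim() {
        Iterator<Long> it = order.keySet().iterator();
        if (!it.hasNext()) {
            return null; // nothing cached in this tier
        }
        Long victim = it.next(); // least recently used entry comes first
        it.remove();
        return victim;
    }
}
```

Making the policy an interface like this would let admins swap in LFU or size-aware algorithms per tier without touching the swap-out machinery.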