Optimizing Ext4 for Low Memory Environments
Theodore Ts'o
November 7, 2012
Agenda
- Status of Ext4
- Why do we care about low memory environments: cloud computing
- Optimizing ext4 for low memory environments
- Conclusion
Ext4 Status
- Now stable in the most common configurations
- Some distributions are planning on replacing ext[23] with ext4
- New features recently added to ext4:
  - Punch system call
  - Metadata checksumming
  - Online resizing for > 16TB file systems
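The punch operation above can be exercised from the shell via fallocate(1); a minimal sketch (the file path is illustrative, and the underlying file system must support hole punching, as ext4 with extents does):

```shell
# Create a 16k file of zeroes, then punch out the middle 8k.
dd if=/dev/zero of=/tmp/punch-demo bs=4096 count=4 status=none
fallocate --punch-hole --offset 4096 --length 8192 /tmp/punch-demo

# The logical size is unchanged; only the backing blocks are freed.
stat --format='size: %s  blocks: %b' /tmp/punch-demo
```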
Advantages of ext4
- Modern file system that is still reasonably simple
- Lines of code as a proxy for complexity (as of 3.6.5):
  - Minix: 2,441
  - Ext2: 9,703
  - Ext3: 19,304
  - Ext4: 41,249
  - Btrfs: 88,189
  - XFS: 94,591
Advantages of ext4
- Modern file system that is still reasonably simple
- Portions of the code base are (relatively) stable and time-tested:
  - Userspace utilities
  - Journal block layer (jbd2; also used by OCFS2)
Advantages of ext4
- Modern file system that is still reasonably simple
- Portions of the code base are (relatively) stable and time-tested
- Incremental development instead of rip-and-replace
- Well-understood performance characteristics
Disadvantages of ext4
- Incremental development means that certain design decisions are very hard to change:
  - Fixed inode table
  - Bitmap-based allocations
  - 32-bit inode numbers
- RAID support is currently extremely weak
- Lack of sexy new features:
  - Compression
  - Filesystem-level snapshots (use thin-provisioned snapshots instead)
  - FS-aware RAID and LVM
Common Ext4 Use Cases
- Default file system for desktops / servers
  - Distributions may change this choice in the future
- Android devices (Honeycomb / Ice Cream Sandwich)
- Cloud storage servers
Rise of Cloud Computing
- Or grid computing, utility computing, etc.
- Challenges:
  - Usability: how to deliver something useful to the user?
    - SaaS
    - PaaS
    - Custom programming for cloud/grid/utility computing
  - Security: public vs. private clouds?
  - Economics: is it really cheaper at the end of the day?
Rise of Cloud Computing
- Or grid computing, utility computing, etc.
- The economics of cloud computing:
  - Really big, efficient data centers
  - More efficient use of servers
    - Traditional servers often don't use their resources efficiently: CPU, disk, networking bandwidth
- To make the cloud economics work, it is important to pack a lot of jobs onto a smaller number of servers:
  - Virtualization
  - Containers
Using Resources Efficiently in File Systems
- Restricted memory means less caching is available for:
  - Data blocks
  - Metadata blocks
- Block allocation bitmaps are the big problem
  - When they get pushed out of memory, unlink() and fallocate() times get long
- Surprisingly, CPU can be a problem too
  - Especially for PCIe-attached flash (high IOPS)
  - Plenty of other uses for the CPU (e.g., transcoding video formats)
  - Also important for large-scale macro benchmarks (TPC-C)
Restricted Memory Is a Problem for Copy-on-Write File Systems, Too
- Suggestion from the ZFS OpenSolaris list:
  "If you are using a laptop and not serving anything and performance is not a major concern and you're free to reboot whenever you want, then you can survive on 2G of ram. But a server presumably DOES stuff and you don't want to reboot frequently. I'd recommend 4G minimally, 8G standard, and if you run any applications (databases, web servers, symantec products) then add more."
  http://permalink.gmane.org/gmane.os.solaris.opensolaris.zfs/44928
A Short Aside About Latency
- Avoiding latency makes users happy
  "Fast is better than slow. We know your time is valuable, so when you're seeking an answer on the web you want it right away and we aim to please. We may be the only people in the world who can say our goal is to have people leave our homepage as quickly as possible... And we continue to work on making it all go even faster."
  From Google's "Ten things we know to be true"
A Short Aside About Latency
- Avoiding latency makes users happy
- A few slow requests slow down the requests behind them...
A Short Aside About Latency
- Avoiding latency makes users happy
- A few slow requests slow down the requests behind them
- A few slow operations effectively slow down their peers in a distributed computation
Optimizing ext4 for low-memory environments
- No-journal mode
- Smarter metadata caching
No-Journal Mode for Ext4
- General principle: don't pay for features you don't need
- A review of cluster storage at Google:
  - The hardware:
    - Thousands of machines in a data center
    - Tens of thousands of disks
  - GFS as a clustered file system
    - Replication at the clustered file system level (so we can survive loss of machines)
    - Checksumming done by the clustered file system (the end-to-end principle)
No-Journal Mode for Ext4
- General principle: don't pay for features you don't need
- A review of cluster storage at Google
- Journaling is not free
Journaling Is Not Free
[Bar chart: FFSB large file creates, 2 CPUs, direct I/O; transactions per second (y-axis 2000.00-2350.00) for ext4 vs. ext4 no-journal]
No-Journal Mode for Ext4
- General principle: don't pay for features you don't need
- A review of cluster storage at Google
- Journaling is not free
- No-journal mode was one of the first Google changes to ext4
  - Wanted the improvements of extents, delayed allocation, etc.
  - Google had chosen not to use ext3 since journaling had significant costs
  - Ext4 in no-journal mode is the best of both worlds
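No-journal mode can be selected at mkfs time or toggled later with tune2fs. A minimal sketch, using a scratch image file (path and size are illustrative) rather than a real block device so it runs without root:

```shell
# Build a small ext4 image with the journal feature disabled.
truncate -s 64M /tmp/ext4-nojournal.img
mke2fs -q -F -t ext4 -b 4096 -O ^has_journal /tmp/ext4-nojournal.img

# Verify that the has_journal feature flag is absent:
dumpe2fs -h /tmp/ext4-nojournal.img | grep -i 'features'

# On an existing (unmounted) file system the journal can be removed with:
#   tune2fs -O ^has_journal /dev/sdXN
```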
Improving Metadata Caching
- Small inodes
  - Ext2 only supported 128-byte inodes
  - Ext3/ext4 support larger inodes; 256 bytes is the default
    - Used to store extended attributes
    - Also used to store subsecond timestamps in ext4
  - Smaller inodes mean more inodes per block, which makes a huge difference in memory-limited environments
Effects of 128-Byte Inodes
[Bar chart: FFSB large file creates, 2 CPUs, direct I/O; transactions per second (y-axis 1900.00-2500.00) for ext4, ext4 with 128-byte inodes, ext4 no-journal, and ext4 with 128-byte inodes + no journal]
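The inode size is fixed at mkfs time via -I. A sketch (scratch image file and path are illustrative) showing a file system formatted with 128-byte inodes:

```shell
# Format a scratch image with 128-byte inodes instead of the 256-byte default.
truncate -s 64M /tmp/ext4-128i.img
mke2fs -q -F -t ext4 -b 4096 -I 128 /tmp/ext4-128i.img

# Confirm the inode size:
dumpe2fs -h /tmp/ext4-128i.img | grep -i 'inode size'

# 4096 / 128 = 32 inodes per 4k block, vs. 16 with 256-byte inodes,
# so the same amount of cache memory holds twice as many inodes.
```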
Improving Metadata Caching
- Small inodes
- Free block statistics for each block group
  - Ext4 now caches the size of the largest available free block
  - This allows a block group to be evaluated without needing to consult the block bitmap
Improving Metadata Caching
- Small inodes
- Free block statistics for each block group
- Inode extent information
  - Ext4's on-disk format uses 12 bytes per extent
    - 4 extents fit in the inode
    - 340 fit in a 4k extent tree leaf block
  - An extent can map a maximum of 128M
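The figures above follow directly from the 12-byte on-disk extent record and the 12-byte extent header; a quick arithmetic check (the 4k block size and 15-bit extent length are assumed from the slide's numbers):

```shell
# A 4k extent tree leaf holds a 12-byte header plus 12-byte extent records:
echo $(( (4096 - 12) / 12 ))               # extents per leaf block -> 340

# An extent's length field covers at most 32768 blocks; with 4k blocks
# that is the 128M maximum per extent:
echo $(( 32768 * 4096 / (1024 * 1024) ))   # MiB per max-length extent -> 128
```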
Improving Metadata Caching
- Small inodes
- Free block statistics for each block group
- Inode extent information
- Internal "bigextent" patch at Google
  - An in-memory b-tree which collapses adjacent extents
  - Originally written because cache line misses were measurable while searching the on-disk representation on PCIe-attached flash
  - Takes less memory than a 4k extent block in most cases
  - Will be going upstream soon
Conclusion
- The General Purpose File System Myth
General Purpose File System Myth?
- There can only be one!
General Purpose File System Myth?
- There can only be one!
  - Too hard for users to choose
  - File systems used to be used for many things at the same time
- But... workloads are different
  - Design tradeoffs; optimizing for one workload can compromise another
- How did this myth survive for so long?
  - Many workloads did not stress the file system
  - File systems were simpler, with fewer features
  - Servers were run more inefficiently, with more idle resources
Conclusion
- General Purpose File System Myth
- Future ext4 work:
  - Extent status tree (provides SEEK_HOLE/SEEK_DATA support)
  - Inline data
  - RAID stripe awareness
    - Can also be used to make ext4 erase-block aware for eMMC devices with primitive flash translation layers
  - Atomic msync() (Terence Kelly and Stan Park at HP)
Conclusion
- General Purpose File System Myth
- Future ext4 work
- Remember to optimize the entire storage stack:
  - Functionality at the block device layer: thin-provisioned snapshots, dm-cache / bcache
  - Optimizing userspace: the SQLite library, applications
  - Improving abstractions up and down the storage stack
Thank You!