Practical Online Filesystem Checking and Repair

Similar documents
COS 318: Operating Systems

CS3210: Crash consistency. Taesoo Kim

Fault Isolation and Quick Recovery in Isolation File Systems

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

File-System Implementation

Alternatives to Big Backup

ViewBox: Integrating Local File System with Cloud Storage Service

We mean.network File System

LVM2 data recovery. Milan Brož LinuxAlt 2009, Brno

EMC DATA DOMAIN DATA INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY

Chapter 11: File System Implementation. Operating System Concepts with Java 8 th Edition

Case Study: Quick data recovery using HOT SWAP trick in Data Compass

Quo vadis Linux File Systems: An operations point of view on EXT4 and BTRFS. Udo Seidel

File Systems Management and Examples

<Insert Picture Here> Btrfs Filesystem

Ryusuke KONISHI NTT Cyberspace Laboratories NTT Corporation

Hosting Users Guide 2011

Network File System (NFS) Pradipta De

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Linux Filesystem Comparisons

Taking Linux File and Storage Systems into the Future. Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated

Hyperoo 2 User Guide. Hyperoo 2 User Guide

File Management. Chapter 12

Silent data corruption in SATA arrays: A solution

Lecture 18: Reliable Storage

09'Linux Plumbers Conference

The BRU Advantage. Contact BRU Sales. Technical Support. T: F: W:

Scaling Graphite Installations

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY Operating System Engineering: Fall 2005

Two Parts. Filesystem Interface. Filesystem design. Interface the user sees. Implementing the interface

Storage and File Systems. Chester Rebeiro IIT Madras

ProTrack: A Simple Provenance-tracking Filesystem

Chip Coldwell Senior Software Engineer, Red Hat

Data Storage - II: Efficient Usage & Errors

Optimizing Ext4 for Low Memory Environments

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Practical issues in DIY RAID Recovery

Jenesis Software - Podcast Episode 3

NSS Volume Data Recovery

Resolving network file speed & lockup problems

NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases

Microsoft SQL Server Guide. Best Practices and Backup Procedures

Recommended Backup Strategy for FileMaker Server 7, 8, 9 and 10 for Macintosh & Windows Updated February 2009

How to Choose your Red Hat Enterprise Linux Filesystem

Linux Powered Storage:

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

The Windows File Articles -> Software Oct , 00:45 (UTC+0)

Special Report: 5 Mistakes Homeowners Make When Selling A House. And The Simple Tricks To Avoid Them!

Database Maintenance Essentials

File System Management

A block based storage model for remote online backups in a trust no one environment

DISK DEFRAG Professional

Lab 2 : Basic File Server. Introduction

Transactions and ACID in MongoDB

Development at the Speed and Scale of Google. Ashish Kumar Engineering Tools

The Linux Virtual Filesystem

Chapter 3: Building Your Active Directory Structure Objectives

Chapter 11 I/O Management and Disk Scheduling

StorageCraft ShadowStream User Guide StorageCraft Copyright Declaration

Introduction. What is RAID? The Array and RAID Controller Concept. Click here to print this article. Re-Printed From SLCentral

Troubleshooting PCs What to do when you don't know what to do!

Simple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1)

Solution to the problem

APPLICATION VIRTUALIZATION TECHNOLOGIES WHITEPAPER

Backup and Disaster Recovery Restoration Guide

MailEnable Connector for Microsoft Outlook

5 Reasons Your Business Needs Network Monitoring

Chapter 11: File System Implementation. Operating System Concepts 8 th Edition

Exercise 2 : checksums, RAID and erasure coding

Rsync: The Best Backup System Ever

Encrypted File Systems. Don Porter CSE 506

Centralized Mac Home Directories On Windows Servers: Using Windows To Serve The Mac

Secure application programming in the presence of side channel attacks. Marc Witteman & Harko Robroch Riscure 04/09/08 Session Code: RR-203

Recommended Backup Strategy for FileMaker Pro Server 7/8 for Macintosh & Windows Updated March 2006

Creating and Managing Shared Folders

Analysis of Algorithms I: Binary Search Trees

Configuring ehealth Application Response to Monitor Web Applications

Network Monitoring with Xian Network Manager

Massive Data Storage

Five Reasons Your Business Needs Network Monitoring

Oracle Cluster File System on Linux Version 2. Kurt Hackel Señor Software Developer Oracle Corporation

File System Design for and NSF File Server Appliance

Backup & Restore Instructions

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Status and Direction of Kernel Development

Using 2Can. There are three basic steps involved in migrating all of your data from your BlackBerry to your Android phone:

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

PARALLELS CLOUD STORAGE

SOLIDWORKS FILE MANAGEMENT

Release Notes OPC-Server V3 Alarm Event for High Availability

Naming vs. Locating Entities

The Win32 Network Management APIs

PCI vs. PCI Express vs. AGP

Transcription:

Practical Online Filesystem Checking and Repair Daniel Phillips Samsung Research America (Silicon Valley) d.phillips@partner.samsung.com 1 2013 SAMSUNG Electronics Co.

Why we want online checking: Fsck time at tens of minutes and heading higher SSD arrive and fsck algorithms got better But still an issue for: Most laptops and workstations Your bulk storage Data centers Everybody has had the fsck experience at just the wrong time. 2 2013 SAMSUNG Electronics Co.

Waiting for the next fsck 3 2013 SAMSUNG Electronics Co.

"You should strongly consider the consequences of disabling mount-count-dependent checking entirely. Bad disk drives, cables, memory, and kernel bugs could all corrupt a filesystem without marking the filesystem dirty or in error." -- tune2fs man page 4 2013 SAMSUNG Electronics Co.

What about checksumming? Easy to implement Relatively efficient Increases confidence in your data Not a substitute for repair! Does not tell you much about what is corrupted Eventually a checksum will certainly fail. Then what? 5 2013 SAMSUNG Electronics Co.

Reasons for not attempting online checking: SSD makes the problem go away 20 times smaller, 20 times faster = 400 times less annoying Handheld revolution changed the game Phone reboots are relatively rare Hide the problem in the cloud Let somebody else worry about it It s not easy We have more pressing issues Maybe it won t happen to me 6 2013 SAMSUNG Electronics Co.

It s like changing a tire without stopping the car. 7 2013 SAMSUNG Electronics Co.

8 2013 SAMSUNG Electronics Co.

Reasons for doing it: Sweetens the availability equation for data centers Ultimately saves money Longer practical uptime Even a small stall is annoying on a phone It s a challenging engineering problem The last big unsolved problem in storage? It s a great research problem Carve your initials in the history of storage technology! 9 2013 SAMSUNG Electronics Co.

10 2013 SAMSUNG Electronics Co.

The purpose of this work is to remove "we don't know how to do it" as a reason for not doing it. 11 2013 SAMSUNG Electronics Co.

Damage that really hurts: Block marked free when it isn't Corruption propagates rapidly Multiple references to block Becomes a double free Lose block of pointers high in the tree Can lose the entire filesystem A constant danger for CoW filesystem designs Disconnected directory tree Where did my files go?? 12 2013 SAMSUNG Electronics Co.

How do you check a filesystem while it is changing? Frontend/Backend separation is a big help Can still write to cache while backend suspended Way better than freezing at VFS level It has to be incremental Does not have to be fast Must not kill performance Topology could change at any time Algorithms must tolerate this Global structure much harder to check than local From here on, just consider global checks 13 2013 SAMSUNG Electronics Co.

Global structure we need to check Physical structure: block leaks and double references Logical structure: directory connectivity and loops inode leaks and link counts This is hard. 14 2013 SAMSUNG Electronics Co.

How do we do it offline? Check physical structure Data structure: Shadow bitmap One bit for each volume block Algorithm: 1) Walk tree and mark off blocks in shadow Blocks already marked are double referenced 2) Compare bitmaps to shadow bitmaps Blocks free in shadow but not in bitmap are lost Blocks free in bitmap but not in shadow are future corruption 15 2013 SAMSUNG Electronics Co.

How do we do it offline? Check logical structure Data structure: Link count hash map for all inodes Algorithm: 1) Walk directories in directory tree order Increment entry target in hash Already seen directories are loops 2) Walk inode table comparing link counts to hash Unreferenced inodes are lost Excess references are future double frees 16 2013 SAMSUNG Electronics Co.

Online checking is way harder. Let s try to make it easier... 17 2013 SAMSUNG Electronics Co.

Big idea: a new purpose for an old idea Block groups! Already need them for allocation algorithms Introduce a far map of far pointers per group Far pointers are supposed to be relatively rare Allocation policy tries to enforce this Low level pointer changes update far map Update far map only if destination changes High level filesystem algorithms otherwise unchanged 18 2013 SAMSUNG Electronics Co.

More about far maps Far map entry fields: Source group Destination offset Pointer type Inode table tree File data tree Data extent Block Count (Note: no source offset) 19 2013 SAMSUNG Electronics Co.

Monotonic progress... a recurring theme Need stable enumerations: By block group By inode number Examples of unstable enumerations: Filesystem tree order unstable for physical checks Directory order unstable for namespace checks 20 2013 SAMSUNG Electronics Co.

The master plan: 1) Sweep bottom to top by block groups Check block group and mark good 2) Sweep bottom to top in inode order Check directory path to root and mark good 3) Clear checked bits and do it again. 21 2013 SAMSUNG Electronics Co.

Hermetic accounting per group Inodes + far map entries local subtree roots Walk local subtrees shadow map If shadow map matches bitmap the group is good What does that give us? Less than a second to read group into cache Old school hard disk Usually no visible stall Frontend updates continue in cache Already a nontrivial level of integrity checking 22 2013 SAMSUNG Electronics Co.

the Atlas experiment 23 2013 SAMSUNG Electronics Co.

A short break for some philosophy. We rely on far maps to do incremental group consistency How do we know the far maps are consistent? We don t. But what is consistency, really? An inconsistent system has ambiguous interpretations Choose the one that does the least harm How do we know an updated group is still consistent? We don t. Consistency has a shelf life. A group marked good just means we trust it more We trust far maps if we can read them Great place for a checksum! 24 2013 SAMSUNG Electronics Co.

Online filesystem repair Rebuilding far maps online Start with all empty far maps Live far pointer updates can continue Do not complain about missing ones Walk itable per group Inode number to group correspondence Rely on inode structures, bitmaps and partial far maps Expect non hermetic accounting Any used blocks not accounted for are "in the fog" Keep a map of fogged regions by group Eventually find a far pointer to earlier fog region Recursively resolve other fogged regions 25 2013 SAMSUNG Electronics Co.

Incremental directory connectivity and loops Can t walk in directory order because tree can mutate during walk Walk in inode order instead For each directory inode, walk parent pointers to root To find loops, keep hash of directories already seen Stop at first directory marked good At each step, verify parent references child Store name attribute per directory Not as efficient as directory order walk, but we have more time because it s online 26 2013 SAMSUNG Electronics Co.

Inode reference counts and leaks Walk in inode table order Update hash of inode reference counts Mirror link count updates to hash when source inode lies below walk cursor Otherwise, just like the offline check 27 2013 SAMSUNG Electronics Co.

Online filesystem repair Repair is the easy part. The hard part is deciding what is broken Helpful cache model: Tree rooted in dirty cache Repair the cache view of the filesystem Commit delta when things look good Prefer to trust structure over bitmaps Be quick to mark referenced blocks used Be slow to mark unreferenced blocks free Recreate missing inodes from open files If things get ugly we can always go offline 28 2013 SAMSUNG Electronics Co.

Online filesystem repair Low level freespace scan Directory was lost and user wants data back! We should probably go offline, but... Why not try hunting through free space and see what we can find? Caveat: our running system might overwrite what we are looking for OK, it looks bad, let s just go offline Then we need to deal with... 29 2013 SAMSUNG Electronics Co.

Online filesystem repair litter 30 2013 SAMSUNG Electronics Co.

Online filesystem repair Which one is the real one? 31 2013 SAMSUNG Electronics Co.

Online filesystem repair Litter Copy on write filesystem designs all create litter Fixed metadata position like Ext4 is a big advantage Introduce uptags that we can scan for: Magic number type tag Delta counter Owner Logical position To save space, just store low order bits This is all about improving probabilities 32 2013 SAMSUNG Electronics Co.

Online filesystem repair Conclusion Add a little metadata, get a lot of results Repairing far map metadata is tricky but doable Repair is easy if we know what is wrong Even basic physical checks are already useful Runtime overhead looks small Desktop, servers and data centers benefit The only thing standing in the way: a lot of hard work 33 2013 SAMSUNG Electronics Co.

Questions? Daniel Phillips Samsung Research America (Silicon Valley) d.phillips@partner.samsung.com 34 2013 SAMSUNG Electronics Co.