ProTrack: A Simple Provenance-tracking Filesystem

Somak Das
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
das@mit.edu

Abstract

Provenance describes a file's history. A user may want to discover the origin of a program on his system and study its source code, or follow the propagation of mistakes in presentation slides and fix them. ProTrack is a simple yet effective provenance-tracking filesystem. Our approach divides a file into application-specific parts and stores provenance metadata in a hash-table. We believe querying file ancestry is ProTrack's most important feature, so we focus on optimizing traversals of persistent, on-disk data. The system design reduces forward and backward searches to fast lookups from within an inode. By storing parent/child links twice, we sacrifice some storage for speed, but being able to follow and manipulate dependency paths in either direction increases the responsiveness of our system.

I. DESIGN

A. File representation

The main components of ProTrack on disk are the storage of versions, parts, provenance metadata, and parent directories for each file (see Figure 2).

Versioning filesystem. To accurately track each file's provenance over time, the system needs an underlying filesystem that stores all versions of each file over time. Thus we build ProTrack on top of a versioning filesystem (VFS). We assume that the VFS creates separate inodes for different versions of the same file.

File parts. ProTrack supports partitioning a file into a set of parts. Each part represents an application-defined unit of the file, such as a slide in a PowerPoint presentation or an individual compressed file in a Zip archive. This enables fine-grained provenance tracking for the user: instead of hunting through an entire slide deck, he is able to quickly look up a particular slide's source. We implement parts by taking hints from the Unix FS.
Each file's inode in Unix stores a set of device addresses pointing to data blocks on disk, where the last three point to indirect blocks for larger files. The on-disk inode in ProTrack stores an array of 32 numbered parts, each of which is a pointer to a Unix-style set of device addresses. The array index corresponds to the part number. Since some applications (like unmodified ones) will not define parts, Part 0 is reserved for file data that does not belong to any part. For files that need more parts, the system uses Unix's strategy of indirect layers: the last (32nd) entry points to an indirect array of parts. This strategy does not adversely affect performance because, in practice, a device cache mechanism eliminates most of the indirect fetches.

Provenance metadata. Each file stores its own provenance as lists of direct parents and children. In particular, the inode is augmented with a pointer to an int->provinfo hash-table on disk (with twice as many slots as parts and linear probing). This data structure, with amortized O(1) cost per lookup, maps the file's part numbers to their corresponding provenance information: linked-lists of immediate ancestors and descendants, specified by (inode number, part number) pairs (see Figure 1).

Fig. 1: Pseudocode for the ProvInfo data structure, which contains linked-lists of ProvNodes.

    struct ProvInfo {
        ProvNode *parents;
        ProvNode *children;
    }

    struct ProvNode {
        int inum, partnum;
        ProvNode *next;
    }

Parent directories. Provenance data stores inode and part numbers, but those are not user-friendly. While Unix supports converting file paths to i-numbers, the system should support reverse lookups too, from i-numbers to human-readable paths with filenames. Our simple and space-efficient solution is to store only the i-numbers of the parent directories of the file's hard links inside the inode (note that parent directories refer to directories containing the file, not provenance) and recreate absolute paths with them.

B. API implementation

The implementation of ProTrack's system introduces new part and provenance operations, as well as added functionality to existing file I/O operations. We designed the application programming interface to be minimal, so as to not introduce redundancy between calls.

File I/O. ProTrack adds extra tracking functionality to the internals of standard, unchanged file I/O calls to support provenance. This allows an unmodified program, like a file copying utility, to update ancestor and descendant data as accurately as possible. When a process executes, the system updates provenance metadata by linking the files being read (parents) to the files being written to (children). The Unix kernel
today tracks a process's open files in an in-memory file descriptor table. ProTrack augments this existing table with read and write flags for each file (simple bits representing 0 for unflagged and 1 for flagged, see Figure 3). Since the process will probably call read() and write() many times, the system, as a performance optimization, updates a written file's provenance only once: when the process calls close() on it. The update consists of adding the files that were read to that file's list of immediate ancestors, using the link_prov() system call. The following file operations act like file I/O methods from Unix, with some additions:

open(..) adds the file to the process's file descriptor table, with read and write unflagged.

read(..) flags the file as read.

write(..) flags the file as written.

close(..) updates the provenance of the file if it was written, making it dependent on all the files the process read (files in the table with read flagged) except those that are the file itself or are redundant with existing parents.

The extended link() and unlink() system calls, which affect hard links, ensure that a file's list of parent directories is up-to-date. Since they use a lookup procedure that traverses absolute paths, they can monitor the inode number of the directory containing that file.

link(name, link_name) adds a new parent directory's i-number by examining the absolute path of link_name.

unlink(name) removes an old parent directory's i-number by examining the absolute path of name.

Part operations. ProTrack's part operations provide a simple interface for manipulating file parts, using Unix's existing file descriptors as a handle:

create_part(fd) allocates a new part containing addresses pointing to data blocks. It returns the unique part number for the application to use for later operations.

delete_part(fd, partnum) first updates the provenance data of the file part's parents and children and then frees the part and its corresponding data blocks.
The part number is now available for reuse.

seek_part(fd, partnum) sets the file cursor to the beginning of part number partnum. Since fd then points to the correct location, the application can use the standard read() and write() system calls.

Fig. 2: The contents of an inode in ProTrack. Note that it is augmented with file parts as well as information about provenance and parent directories.

Fig. 3: A file descriptor table for a process copying src.file to dest.file. Flagged states are shaded green. As the blue arrow indicates, ProTrack's file I/O methods will eventually set the destination as the source's child and the source as the destination's parent.

Provenance operations. The supported provenance operations allow applications, along with system calls like close(), to update and query file history:

read_prov(fd, partnum, numlevels) finds the files on which the file part depends (specifically, its ancestors up to the specified number of levels) and returns a list of their i-numbers/parts. For example, if numlevels = 1, then calling read_prov() just returns the file's direct parents. Recall that Part 0 is special, so calling the function with partnum = 0 returns all the ancestors of that file as a whole.

search_prov(fd, partnum, numlevels) similarly finds the files which depend on the file part (its descendants down to the specified number of levels) and returns a list.

link_prov(parentinum, parentpartnum, childinum, childpartnum) adds the child file part to the parent's list of immediate descendants and adds the parent file part to the child's list of immediate ancestors, ensuring a complete, bidirectional mapping.

unlink_prov(parentinum, parentpartnum, childinum, childpartnum) does the opposite of link_prov, removing the link in both directions.

inode_number_to_paths(inum) converts an inode number to a list of human-readable paths that have hard links to the inode. It quickly resolves the i-number using the procedure in Figure 4.
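
The provenance operations listed above can be illustrated with a minimal sketch. It is not ProTrack's implementation: it assumes two in-memory dictionaries in place of the on-disk int->provinfo hash-tables, takes i-numbers directly instead of file descriptors, and does not model the special meaning of Part 0. The helper name _search is hypothetical.

```python
from collections import deque

# Assumption: in-memory stand-ins for the on-disk provenance hash-tables.
# Keys and values are (inode number, part number) pairs, as in Figure 1.
parents = {}   # file part -> list of immediate ancestors
children = {}  # file part -> list of immediate descendants


def link_prov(parent_inum, parent_part, child_inum, child_part):
    """Record the link in both directions, skipping duplicates."""
    p, c = (parent_inum, parent_part), (child_inum, child_part)
    if c not in children.setdefault(p, []):
        children[p].append(c)
    if p not in parents.setdefault(c, []):
        parents[c].append(p)


def unlink_prov(parent_inum, parent_part, child_inum, child_part):
    """Remove the link from both lists, if it is present."""
    p, c = (parent_inum, parent_part), (child_inum, child_part)
    if c in children.get(p, []):
        children[p].remove(c)
    if p in parents.get(c, []):
        parents[c].remove(p)


def _search(start, numlevels, edges):
    """Hypothetical helper: breadth-first search up to numlevels hops."""
    visited = {start}            # prevents revisiting a file part
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == numlevels:   # do not expand beyond the level limit
            continue
        for nxt in edges.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found


def read_prov(inum, partnum, numlevels):
    """Ancestors of a file part, nearest level first."""
    return _search((inum, partnum), numlevels, parents)


def search_prov(inum, partnum, numlevels):
    """Descendants of a file part, nearest level first."""
    return _search((inum, partnum), numlevels, children)
```

In this sketch, after link_prov(10, 0, 20, 0) and link_prov(20, 0, 30, 0), the call read_prov(30, 0, 1) returns only the direct parent (20, 0), while read_prov(30, 0, 2) also returns (10, 0). Because every link is stored twice, search_prov() walks the children lists with the same helper and never has to scan other files' parents.
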
Fig. 4: Pseudocode for the inode_number_to_paths() system call

    inode_number_to_paths(inum)
        foreach parent directory stored in the inum inode
            Initialize current i-number = inum, current directory = parent directory
            loop
                Look up the current i-number in the current directory
                Prepend the filename found to the beginning of the path
                if the current directory's parent is itself (root), exit the loop
                Update current i-number = current directory's i-number, current directory = current directory's parent
        return all unique paths found to inum

Calls to read_prov() and search_prov() perform a backward and forward search, respectively, to traverse ancestors and descendants. Conceptually, the references to parents and children organize the VFS files into a graph. Thus the breadth-first search algorithm enables ProTrack to find all dependencies. With optimizations such as recording visited inodes to prevent unnecessary revisits, the search is time- and space-efficient.

File versions and garbage collection. For correct behavior, different versions of the same file over time have different provenance data. Since the VFS stores them as separate inodes, ProTrack stores the versions' provenance in separate locations. When the VFS creates a new version of a file, it must call ProTrack to maintain the right relationships. ProTrack then copies the old file version's parents by value into the new version. An alternate implementation would just keep a reference to the old version, but copying by value is preferable: it does not rely on the old version staying alive in the FS, and it does not create long chains of files pointing to older versions of themselves. Our system does not assume a continuous VFS that keeps old versions of a file, because, in practice, many versions will be deleted to reclaim disk space; ProTrack implements garbage collection based on this observation.
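
The two version hooks in this subsection can be sketched as follows: copying parents by value when a version is created and, as described next, joining a deleted version's ancestors to its descendants. This is an illustration under stated assumptions, not ProTrack's code: the names on_new_version and on_delete_version are hypothetical, and sets held in memory stand in for the on-disk linked-lists.

```python
from collections import defaultdict

# Assumption: in-memory provenance graph keyed by (inode number, part number).
parents = defaultdict(set)
children = defaultdict(set)


def link(parent, child):
    """Store the link twice, once in each direction."""
    children[parent].add(child)
    parents[child].add(parent)


def on_new_version(old, new):
    """Hypothetical hook: copy the old version's parents by value."""
    for p in list(parents.get(old, ())):
        link(p, new)


def on_delete_version(node):
    """Hypothetical hook: join ancestors to descendants, then drop node."""
    ups = parents.pop(node, set())
    downs = children.pop(node, set())
    ups.discard(node)
    downs.discard(node)
    for p in ups:                 # remove dangling references to node
        children[p].discard(node)
    for c in downs:
        parents[c].discard(node)
    for p in ups:                 # every ancestor now points at every descendant
        for c in downs:
            if p != c:
                link(p, c)
```

Because the new version receives its own copy of the parent list, deleting the old version later does not disturb it, and no version ever has to be reached through a chain of older versions.
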
When the VFS decides to delete an old file version (not necessarily when the reference count decrements to zero), it must call ProTrack to immediately update the provenance of that file's dependencies. ProTrack then dynamically joins the file's ancestors to its descendants. (The same also applies to individual file parts.) This just-in-time garbage collection policy eliminates the need to schedule a periodic process that does the same, and thus always keeps ProTrack consistent with the VFS.

II. ANALYSIS

A. Usage scenarios

The following examples illustrate the behavior of ProTrack in four particular use cases.

Copying files. ProTrack correctly updates the provenance of a file even when the copy utility has not been modified to be provenance-aware. A program that copies file src to dest (from source to destination) continuously reads from src and writes to dest, which its file descriptor table records via the read/write flags. When it closes dest, ProTrack will correctly infer that dest depends on src and link them together as child and parent. We made the design decision to copy by reference (to src) instead of copying the provenance information by value. As a result, the user is able to track all the locations he copied src to.

Compiling software. Similar to copy, ProTrack has full native support for the Make utility that builds executables and libraries from source code. Because Make reads source code and writes to intermediate files (like C object files), the intermediates become the source code's children. The resulting binaries, to which Make writes machine language, become children of the source and intermediate files because both were read during the compilation process.

Copying PowerPoint slides. ProTrack introduces simple interfaces to read/write file parts and provenance data that provenance-aware applications can use. For example, PowerPoint can create a part for each slide, uniquely identified by a part number, and manipulate it with part operations.
When the application detects that the user copied a slide from one presentation to another, it can:
1) Call link_prov() with the parent part and the child part.
2) Track the descendants of a particular slide by calling search_prov() with its PowerPoint file and part number, since ProTrack stores the provenance metadata of individual parts.
3) Use inode_number_to_paths() to produce human-readable paths of the descendant slides and display them to the user.
The paths, along with part numbers, even allow PowerPoint to open parts in the background and show previews of those slides.

Handling Zip archives. Using read_prov() and link_prov() on the files to be compressed, Zip can read, compress, decompress, and write their provenance data to preserve parent/child relationships between files in the Zip archive. We put the functionality of (a) filtering the lists of ancestors for files in the
archive and (b) marshalling provenance data in the application. First, Zip reads the parents of each file part, and if any of them will also go into the archive, then it stores the (file part, parent) relationship, replacing the i-numbers with unique archive IDs. Second, the un-zip application sees the provenance data as pairs of IDs, maps the archive IDs to the new i-numbers, and calls link_prov() to set up both parent and child relationships between each pair on disk. Due to the file I/O system calls, the Zip archive itself is dependent on all the files it compressed, but the application ensures that the uncompressed files are not. Thus, this design propagates provenance by value.

B. Performance analysis

We analyze the system performance using the assumptions in Table I.

Table I: Storage and timing assumptions

    Hard disk capacity                    500GB
    Block size                            0.5kB
    Number of files stored by the VFS     1,000,000
    Parts per file                        32
    Parents per file part                 8
    Children per file part                8
    Parent directories per file           4
    CPU cycle                             0.1ns
    Memory access                         10ns
    Hard disk seek                        10ms
    Hard disk latency per block           0.1ms

Storage analysis. Overall, ProTrack uses less than 2% of the total disk space, even when storing the history of all file versions and bidirectional parent/child relationships, and assuming that every file uses 32 parts. The part data is the same as file data, so we do not consider it as ProTrack overhead. Note that the current system does not use any in-memory data structures apart from the two bits added to each row of the file descriptor tables, which is negligible.

Time analysis. Even though the provenance data is on disk, ProTrack stores the parent/child links twice, and these bidirectional links speed up the system call times. We now analyze two workloads, assuming hard disk drives, to demonstrate the efficacy of our design. For a large FS, it should take a provenance-tracking system at most a few seconds to find all descendants of some file if there are only several (say, 8).
In ProTrack, search_prov() performs a breadth-first search that easily meets this target (see Table III).

Table III: Timing the search_prov() system call

    Operation                                               Worst-case time
    foreach descendant inode, including the start           (8 + 1) ×
    Seek the inode, then seek its provenance metadata       (2 × 10ms seek +
    Read its provenance metadata                            [0.1ms/0.5kB] × 8kB latency)
      (16B × 32 × (8 + 8) = 8kB full hash-table)
    Total:                                                  0.19s

We can convert the returned i-numbers to paths for an application within the reference time as well. If each path passes through 16 folders, and each Unix folder is ~2kB on average, then inode_number_to_paths(), without memoization, returns 8 paths in just 8 files × 16 folders/absolute path × (10ms + [0.1ms/0.5kB] × 2kB) = 1.3s.

ProTrack also sustains solid transfer rates in a continuous file copy workload. A typical 10kB file copy takes (2 × 10ms) + ([0.1ms/0.5kB] × [2 × 10kB]) = 24ms, disregarding the buffers in memory. ProTrack adds (2 × 10ms) + ([0.1ms/0.5kB] × [8kB full hash-table + ~0kB new hash-table]) = 22ms of overhead, increasing the time to 46ms. The rate is thus 22 file copies per second. If the user were continuously deleting the original file as well, then the dominating step in the garbage collection would involve seeking the provenance of the old file's parents and the new file (and joining them). Consequently the file transfer rate, assuming 8 parents, would be 5 file copies/second.

Tradeoffs and scalability considerations. We faced two major design tradeoffs while building ProTrack. The first was interface simplicity vs. additional functionality. Our implementation of parts, using only an array of addresses, stays close to Unix and minimizes storage overhead. We did not want file parts to overly complicate the system's programming interface.
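
The storage and timing estimates in this section follow from Table I by simple arithmetic, and the short script below reproduces them as a check. It is a sketch that uses only the paper's stated assumptions: the first two storage rows of Table II are recomputed, the hash-table and parent-directory costs are taken directly from Table II, and the 8kB hash-table size is the rounded value used in Table III. All variable names are illustrative.

```python
# Assumptions from Table I (the paper's figures, not measurements).
SEEK_MS = 10.0                   # hard disk seek
LATENCY_MS_PER_KB = 0.1 / 0.5    # 0.1 ms per 0.5 kB block
DISK_MB = 500_000                # 500 GB, counting 1 GB as 1000 MB
FILES = 1_000_000
PARTS = 32
LINKS_PER_PART = 8 + 8           # parents + children per file part

# Storage: first two rows of Table II recomputed (1 MB = 10**6 bytes).
part_pointers_mb = 4 * PARTS * FILES / 1e6                     # 4 bytes/address
prov_metadata_mb = 16 * PARTS * LINKS_PER_PART * FILES / 1e6   # 16 bytes/link
hash_overhead_mb = 32            # cost as stated in Table II
parent_dirs_mb = 4               # cost as stated in Table II
total_mb = (part_pointers_mb + prov_metadata_mb
            + hash_overhead_mb + parent_dirs_mb)
disk_share = total_mb / DISK_MB

HASH_TABLE_KB = 8                # 16 B x 32 x 16, rounded to 8 kB as in Table III

# search_prov(): 8 descendants plus the starting file, two seeks and one read each.
search_ms = (8 + 1) * (2 * SEEK_MS + LATENCY_MS_PER_KB * HASH_TABLE_KB)

# inode_number_to_paths(): 8 files, 16 folders per path, ~2 kB per folder.
paths_ms = 8 * 16 * (SEEK_MS + LATENCY_MS_PER_KB * 2)

# Copying a 10 kB file: plain copy, then ProTrack's overhead on close().
copy_ms = 2 * SEEK_MS + LATENCY_MS_PER_KB * (2 * 10)
overhead_ms = 2 * SEEK_MS + LATENCY_MS_PER_KB * HASH_TABLE_KB
copies_per_second = 1000 / (copy_ms + overhead_ms)

print(f"storage: {total_mb:.0f} MB ({disk_share:.2%} of disk)")
print(f"search_prov: {search_ms / 1000:.2f} s")
print(f"inode_number_to_paths: {paths_ms / 1000:.1f} s")
print(f"copy: {copy_ms + overhead_ms:.1f} ms, {copies_per_second:.0f} copies/s")
```

The unrounded overhead is 21.6ms, so the script arrives at 45.6ms per copy where the text, rounding the overhead up to 22ms, reports 46ms; both give roughly 22 copies per second.
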
Table II: Disk space usage

    Item                   Size                        Number                      Cost
    File part pointers     4 bytes/address             32 × 1,000,000              128MB
    Provenance metadata    16 bytes/i-number + part    32 × (8 + 8) × 1,000,000    8,192MB
    Hash-table overhead    32 bytes/file part          32 × 1,000,000              32MB
    Parent directories     8 bytes/i-number            4 × 1,000,000               4MB
    Total:                                                                         8,356MB

Aligning with the end-to-end principle, extra features
like naming parts with freeform strings are left out, although a particular application can easily define and change part names inside the file.

The second was performance vs. on-disk storage. We sacrifice storage for duplicate parent/child links and a garbage collection policy that immediately joins the deleted file's parents to its children. This cost is offset by the speedups in bidirectional searches and full consistency between ProTrack and the VFS.

Still, a main problem when accessing provenance is disk performance. We could complicate the system by storing fixed-size provenance in the inode's xattrs to reduce the disk seeks. But seeks are becoming less of an issue as more users switch to solid-state drives, which effectively have zero seek times. Thus, we predict that the next step forward will be introducing in-memory data structures for provenance.

III. CONCLUSIONS

In summary, ProTrack offers simple yet fully functional support for provenance-tracking operations. The filesystem enables file history, versions, and parts for end users. It can support anticipated workflows such as building a PowerPoint presentation or compiling source code, and meets all performance specifications. Future directions for this work include expanding provenance to multiple machines and improving performance with specialized provenance caches. Additional considerations that must be handled in the future are machine failures and security concerns, but otherwise, ProTrack is ready to be implemented.