Using Data Classification to Manage File Servers

Adi Oltean, Senior SDE, Microsoft Corporation
Ran Kalach, Principal Dev Manager, Microsoft Corporation
Agenda
- Customer challenges
- Solution: File Classification
  - Manage data based on business value
  - Grow the ecosystem in classification solutions
- File Classification Infrastructure
  - The classification pipeline
  - Aggregation, conflict resolution
  - Incremental classification
- Challenges, Mitigations & Best Practices
- Conclusions
Customer challenges: file servers
- Storage growth
- Storage cost
- Data sharing and search
- Compliance
- Security and information leakage
- Increasing data management needs / many data management tools: HSM, Security, Archive, Backup, Encryption, Replication, Expiration
File shares and business requirements
Business / IT
- Need a share per project
- Make sure high business impact files do not leak out
- Back up files with personal information to an encrypted store
- Expire low business impact files created three years ago and not touched for a year
Some time later
Classify and apply policy
Step 1: Classify data
- Classification methods:
  - Manual
  - IT scripts
  - Line-of-business application
  - Automatic classification (by location, content, or owner)
Step 2: Apply policy based on classification
- Actions based on classification: Backup, Archive, HSM, Reports, Expiration, Replication, Security, Encryption, Search, Leakage prevention
File shares and business requirements
Business / IT
Classification properties: Personal Information, Business Impact
- Need a share per project
- Make sure high business impact files do not leak out
- Back up files with personal information to an encrypted store
- Expire low business impact files created three years ago and not touched for a year
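As a rough illustration of how the requirements on this slide might turn into policy checks once files carry classification properties, here is a minimal Python sketch. The `FileInfo` record, its field names, and both predicate functions are hypothetical and are not part of FCI. It also assumes that "not touched" means "not accessed"; only the three-year and one-year thresholds come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FileInfo:
    """Hypothetical record: basic file times plus classification properties."""
    path: str
    created: datetime
    last_accessed: datetime
    business_impact: str          # "LBI", "MBI" or "HBI"
    personal_information: bool

def should_expire(f: FileInfo, now: datetime) -> bool:
    """Low business impact, created 3+ years ago, untouched for 1+ year."""
    return (
        f.business_impact == "LBI"
        and now - f.created >= timedelta(days=3 * 365)
        and now - f.last_accessed >= timedelta(days=365)
    )

def needs_encrypted_backup(f: FileInfo) -> bool:
    """Files with personal information go to the encrypted backup store."""
    return f.personal_information

now = datetime(2010, 1, 1)
old_lbi = FileInfo(r"x:\proj\notes.txt", datetime(2006, 1, 1),
                   datetime(2008, 6, 1), "LBI", False)
old_hbi = FileInfo(r"x:\proj\plan.doc", datetime(2006, 1, 1),
                   datetime(2008, 6, 1), "HBI", True)

print(should_expire(old_lbi, now), should_expire(old_hbi, now))
print(needs_encrypted_backup(old_hbi))
```

The point of the sketch is that each policy reads only classification properties and basic file times, so the same checks apply regardless of which share or folder a file lives in.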
Customer benefits - Summary
Apply policies based on classification = manage data based on business value!
Reduce cost
- Expire files to reduce storage purchasing needs
- Move files to less expensive storage
- Optimize backup SLAs
- Replicate only business-related files
Manage risk
- Find sensitive files on public servers
- Watermark documents
- Keep files containing personal information encrypted in backup
- Apply rights management to high secrecy files
- Comply with retention policies
File Classification Infrastructure
APIs for external applications
- Get classification properties
- Set classification properties
Stages
- Discover data
- Extract classification properties
- Classify data
- Store classification properties
- Apply policy based on classification
File Classification extensibility points
Classification pipeline: an example
This is an example of a pipeline setup with one storage module and two classifiers.
Stages (phase: component)
- discovery: Scanner (gets basic file properties)
- load properties: Office Storage [Load]
- classification: Folder Classifier, Content Classifier
- save properties: Office Storage [Save]
- run policies: Reporting Engine
Property bag object
- Each component passes property bags to the next one
- Property bags can cross processes
- Security checks are performed on cross-process data transfers
Hosting
- Classification Runtime Process
- Most modules are hosted within a separate hosting process
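To make the flow of property bags concrete, here is a minimal single-process Python sketch of the example pipeline. Everything in it is an illustrative assumption: the dictionary used as a property bag, the in-memory `STORE` and `CONTENT` tables, and the stage functions are hypothetical stand-ins, not FCI modules. Cross-process hosting and the associated security checks are left out.

```python
from typing import Callable, Dict, List

PropertyBag = Dict[str, str]                   # stand-in for the property bag object
Stage = Callable[[PropertyBag], PropertyBag]

STORE: Dict[str, Dict[str, str]] = {}          # hypothetical in-memory property store
CONTENT = {r"x:\hr\review.txt": "salary review, SSN 123-45-6789"}  # fake file contents
REPORT: List[str] = []

def scanner(bag: PropertyBag) -> PropertyBag:
    # Discovery: gather basic file properties.
    bag["Folder"] = bag["Path"].rsplit("\\", 1)[0]
    return bag

def storage_load(bag: PropertyBag) -> PropertyBag:
    # Load properties saved by an earlier run, if any.
    bag.update(STORE.get(bag["Path"], {}))
    return bag

def folder_classifier(bag: PropertyBag) -> PropertyBag:
    if bag["Folder"].lower() == r"x:\hr":
        bag["Department"] = "HR"
    return bag

def content_classifier(bag: PropertyBag) -> PropertyBag:
    if "SSN" in CONTENT.get(bag["Path"], ""):
        bag["PersonalInformation"] = "Yes"
    return bag

def storage_save(bag: PropertyBag) -> PropertyBag:
    # Persist only classification properties, not the scanner's basics.
    STORE[bag["Path"]] = {k: v for k, v in bag.items()
                          if k not in ("Path", "Folder")}
    return bag

def reporting(bag: PropertyBag) -> PropertyBag:
    REPORT.append(f'{bag["Path"]}: PII={bag.get("PersonalInformation", "No")}')
    return bag

PIPELINE: List[Stage] = [scanner, storage_load, folder_classifier,
                         content_classifier, storage_save, reporting]

def run_pipeline(path: str) -> PropertyBag:
    bag: PropertyBag = {"Path": path}
    for stage in PIPELINE:          # each component passes the bag to the next
        bag = stage(bag)
    return bag

result = run_pipeline(r"x:\hr\review.txt")
print(result)
```

Keeping every stage to the same "bag in, bag out" shape is what lets storage modules and classifiers be swapped or added without changing the rest of the pipeline.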
Aggregation and Conflict Resolution
Problem:
- A classification rule may provide a value that conflicts with the value already stored in the file
- Two classification rules may provide conflicting values for the same property
Example:
- Admin creates a Business Impact property with possible values (LBI, MBI, HBI)
- A file previously classified as MBI is copied to a folder x:\foo
- The Folder rule for x:\foo classifies all files as LBI
- The Content classifier scans the file and classifies it as HBI
- What is the correct value?
Solution: provide several types of classification rules
- Default: the rule runs only if the property is not present in the file
- Otherwise: rules can either explicitly aggregate or overwrite previously stored properties
- Value aggregation depends on the property type
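The three rule behaviours can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `apply_rule` function and its mode names are hypothetical, and the choice that aggregating an ordered property keeps the highest value is an assumption made for this example, since the slide says only that aggregation depends on the property type.

```python
from typing import Optional

ORDER = ["LBI", "MBI", "HBI"]   # lowest to highest business impact

def apply_rule(current: Optional[str], proposed: str, mode: str) -> str:
    """Combine a stored value with a rule's proposed value.

    mode is one of:
      "default"   - rule applies only if no value is stored yet
      "overwrite" - rule replaces whatever is stored
      "aggregate" - keep the higher of the two (assumed behaviour)
    """
    if mode == "default":
        return proposed if current is None else current
    if mode == "overwrite":
        return proposed
    if mode == "aggregate":
        if current is None:
            return proposed
        return max(current, proposed, key=ORDER.index)
    raise ValueError(f"unknown rule mode: {mode}")

# The slide's example: stored MBI, folder rule says LBI, content rule says HBI.
value = "MBI"
value = apply_rule(value, "LBI", "default")     # folder rule does not run
value = apply_rule(value, "HBI", "aggregate")   # content rule raises the value
print(value)
```

With these particular rule types the file ends up as HBI; had the folder rule been configured to overwrite and run last, the result would have been LBI, which is why the rule type has to be an explicit administrator choice.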
Incremental Classification
Goal: minimize re-classification of already classified files
- Crucial for scalability (large number of files)
Automatic classification (scheduled)
- Cache classification results in an ADS (alternate data stream)
  - The ADS contains a hash of certain file properties (last-modify-time, file-path, file-id, etc.)
  - The ADS contains the last classification time
  - Allows determining whether the cached classification is up to date
- Re-classify the file only if:
  - The file changed or was added since the previous classification (hash is different), or
  - A rule has changed since the previous classification, or
  - The configuration of a classifier has been updated since the previous classification
Get Property API (on-demand)
- If the cache is present and up to date, return cached properties
- Otherwise (out-of-date classification), the application can choose:
  - Accuracy: classify the file on the fly
  - Performance: return stored properties
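The freshness check and the on-demand choice can be sketched as follows. This is a simplified Python model, not the FCI implementation: `CacheEntry`, `props_hash`, `needs_reclassification`, and `get_properties` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the cache is an ordinary object rather than an alternate data stream.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class CacheEntry:
    """Hypothetical stand-in for what the slide says is cached in the ADS."""
    props_hash: str               # hash of selected file properties
    classified_at: float          # last classification time
    properties: Dict[str, str]    # cached classification results

def props_hash(path: str, last_modified: float, file_id: int) -> str:
    raw = f"{path}|{last_modified}|{file_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def needs_reclassification(cache: Optional[CacheEntry], current_hash: str,
                           rules_changed_at: float,
                           classifier_changed_at: float) -> bool:
    if cache is None:                            # file was added
        return True
    if cache.props_hash != current_hash:         # file changed
        return True
    if rules_changed_at > cache.classified_at:   # a rule changed
        return True
    return classifier_changed_at > cache.classified_at  # classifier reconfigured

def get_properties(cache: Optional[CacheEntry], current_hash: str,
                   rules_changed_at: float, classifier_changed_at: float,
                   classify_now: Callable[[], Dict[str, str]],
                   prefer_accuracy: bool) -> Dict[str, str]:
    stale = needs_reclassification(cache, current_hash,
                                   rules_changed_at, classifier_changed_at)
    if not stale:
        return cache.properties                  # cache is up to date
    if prefer_accuracy:
        return classify_now()                    # classify on the fly
    return cache.properties if cache else {}     # return what is stored

h1 = props_hash(r"x:\foo\a.doc", 1000.0, 42)
entry = CacheEntry(h1, classified_at=2000.0, properties={"BusinessImpact": "MBI"})
h2 = props_hash(r"x:\foo\a.doc", 3000.0, 42)     # same file, modified later
print(needs_reclassification(entry, h1, 1500.0, 1500.0))
print(needs_reclassification(entry, h2, 1500.0, 1500.0))
```

Because the check compares only a short hash and two timestamps, a scheduled run can skip unchanged files without reading their content.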
Challenges, Mitigations & Best Practices
1 - Performance
- Content classification is expensive (I/O, CPU)
  - Must optimize to scan & classify only when needed
  - Must be able to cache results
- Minimize performance impact on the host of the data being classified
  - Classify on another machine
  - When classifying locally, throttle machine resource usage and back off when the machine becomes non-idle
- Be smart about how you schedule classification; support pause/resume
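One simple way to combine "back off when the machine becomes non-idle" with pause/resume is to process files only while an idleness check passes and to return a checkpoint. The Python sketch below assumes a hypothetical `host_is_idle` callback supplied by the caller; how idleness is actually measured is outside its scope.

```python
from typing import Callable, List, Sequence

def classify_while_idle(files: Sequence[str],
                        classify: Callable[[str], None],
                        host_is_idle: Callable[[], bool],
                        resume_from: int = 0) -> int:
    """Classify files starting at resume_from until the host gets busy.

    Returns the index of the first file that was NOT classified, which the
    caller stores as a checkpoint and passes back in to resume later.
    """
    index = resume_from
    while index < len(files) and host_is_idle():
        classify(files[index])
        index += 1
    return index

files = ["a.doc", "b.doc", "c.doc", "d.doc"]
done: List[str] = []

# Simulated host that becomes busy after two files have been processed.
idle_answers = iter([True, True, False])
checkpoint = classify_while_idle(files, done.append, lambda: next(idle_answers))

# Later, when the host is idle again, continue from the checkpoint.
finished = classify_while_idle(files, done.append, lambda: True,
                               resume_from=checkpoint)
print(checkpoint, finished, done)
```

Checking idleness once per file keeps the sketch short; a real scheduler would likely also limit I/O and CPU within a single file's classification.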
Challenges, Mitigations & Best Practices
2 - Accuracy
- Automatic classification can almost never be 100% accurate
- Tune your rules for false positives / false negatives according to the scenario
  - Example: when securing files, lean toward false positives; when expiring files, lean toward false negatives
- Policy execution: be able to revert in case of classification error
  - Example: back up files one last time just before you expire them
- Examine classification results periodically
  - Modify your rules or classifiers until they're optimized for your data set
- Enable manual classification
- Have a clear and consistent policy for aggregating and resolving conflicts
- Support flexible rules that allow tuning by administrator or application
- One answer doesn't fit all!
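A periodic review of classification results can be as simple as comparing automatic results with a manually classified sample and counting the two kinds of error. The Python sketch below is illustrative only: `error_counts` and the sample data are hypothetical, and a boolean "is sensitive" property is assumed for simplicity.

```python
from typing import Dict, Tuple

def error_counts(automatic: Dict[str, bool],
                 manual: Dict[str, bool]) -> Tuple[int, int]:
    """Return (false_positives, false_negatives).

    Only files present in both maps are compared; the manual value is
    treated as the ground truth.
    """
    false_pos = false_neg = 0
    for path, truth in manual.items():
        if path not in automatic:
            continue
        guess = automatic[path]
        if guess and not truth:
            false_pos += 1
        elif truth and not guess:
            false_neg += 1
    return false_pos, false_neg

# "Is this file sensitive?" according to the rules vs. a manual review.
automatic = {"a.doc": True, "b.doc": True, "c.doc": False, "d.doc": False}
manual = {"a.doc": True, "b.doc": False, "c.doc": True, "d.doc": False}

fp, fn = error_counts(automatic, manual)
print(fp, fn)
```

Which count matters more depends on the policy the property feeds: for a security policy a missed sensitive file (false negative) is the costly error, while for expiration wrongly marking a file as disposable is the one to avoid.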
Challenges, Mitigations & Best Practices
3 - Real-time Classification and Policies
- Some policies require real-time or near-real-time execution
  - Example: removing a confidential file from an unsecured share
- Solution: event-based classification
  - File-system activity can be a trigger
  - Needs a hook into file-system operations (many implementation options exist)
- Consider
  - Classifying only when the file content is stable
  - Avoiding overloading the server with overly aggressive classification
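"Classify only when the file content is stable" can be approximated by waiting for a quiet period after the last change event. The Python sketch below assumes that some file-system hook calls `on_change`; the `StableFileTrigger` class and the quiet-period value are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List

class StableFileTrigger:
    """Queue changed files and release them once they have been quiet."""

    def __init__(self, quiet_seconds: float) -> None:
        self.quiet_seconds = quiet_seconds
        self._last_change: Dict[str, float] = {}

    def on_change(self, path: str, now: float) -> None:
        # Called by the file-system hook; a new change restarts the wait.
        self._last_change[path] = now

    def due(self, now: float) -> List[str]:
        # Files unchanged for at least quiet_seconds are ready to classify.
        ready = sorted(p for p, t in self._last_change.items()
                       if now - t >= self.quiet_seconds)
        for path in ready:
            del self._last_change[path]
        return ready

trigger = StableFileTrigger(quiet_seconds=30)
trigger.on_change("report.doc", now=0)
trigger.on_change("report.doc", now=20)    # still being written
trigger.on_change("notes.txt", now=5)

print(trigger.due(now=40))   # notes.txt is stable, report.doc is not yet
print(trigger.due(now=60))   # report.doc has now been quiet long enough
```

Coalescing repeated change events into one pending entry per file also limits how often the server classifies a file that is saved many times in a row.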
Examples of FCI-enabled solutions
Solution: Classification solutions
  Example: An LOB app that maintains special classification rules for the PII data it generates
Solution: Custom classifiers that extract metadata from files
  Example: A medical imaging classifier extracts embedded metadata from scanned images
Solution: Custom storage modules that load/store custom metadata in files
  Example: Load/store metadata in your custom file formats (example: videos)
Solution: Add classification awareness to existing data management solutions
  Example: A backup app can have special backup policies for HBI data
Solution: Build intelligent policy-based data management solutions
  Example: Define a policy to automatically encrypt HBI data
Opportunities for you
Why participate in the File Classification Infrastructure ecosystem?
- Use FCI for existing software
  - Enhance existing data-producing apps to also attach classification to generated files (e.g., LOB applications)
  - Enhance existing data management apps to consume classification
- Use FCI for new software solutions
  - Develop solutions on top of FCI
  - Develop components for the FCI ecosystem: classifiers, storage modules
How can I develop against it?
- File Classification Infrastructure can be consumed through a rich, scriptable COM API
- FCI can be extended using C++/C# code or PowerShell scripts
When can I start?
- Now: FCI is part of the latest Server releases (starting with Windows Server 2008 R2)
More information about FCI
General information
- Home page: http://www.microsoft.com/windowsserver2008/en/us/fci.aspx
- Team blog: http://blogs.technet.com/filecab
- API documentation on MSDN: http://msdn.microsoft.com/en-us/library/bb972746(vs.85).aspx
Sample code
- Windows SDK: http://msdn.microsoft.com/en-us/windows/bb980924.aspx
  - Sample FCI clients (C++, C#)
  - Sample classifiers (C++, C#)
- Code Gallery: http://code.msdn.microsoft.com/fci