Fight fire with fire when protecting sensitive data

White paper by Yaniv Avidan
Published: January 2016

We live in an era in which both routine and non-routine tasks are automated: a diagnostic capsule visualizes and detects gastrointestinal disorders; autonomous cars sense their surroundings; drones zip through trees at 30 mph, navigated from a desktop or mobile device; and you can view the inside of your home on your smartphone from anywhere in the world and control your heating, lighting or alarm systems. Yet data security, despite its importance, still relies far too much on human intervention and activity, with the inherent risk of error or malicious action. It is time that we utilised existing technology, such as that offered by MinerEye, to safeguard our intellectual property.

Tomorrow's battles

Modern warfare has changed and shifted. Cybersecurity plays an increasingly important role: we are exposed to sophisticated attacks from unexpected malware and insider threats, which force us to act fast and change our defense methodology. There are two main cornerstones to that strategy:

a) Identifying the most valuable assets, which are usually the company's sensitive data and IP;
b) Increasing control over the sensitive data and improving intelligence.

These steps will help companies get ready for tomorrow's battles by buying time until a reaction and response to an attack can take place. Today's enterprise data landscape compels companies to do so: data changes constantly, formats and file types vary on a daily basis, and enormous volumes of data are generated by the minute.

Newest cyber threat will be data manipulation

A Cyber Armageddon, long imagined in Washington as a catastrophic event of digitally triggered damage to physical infrastructure, is less likely than cyber operations that will change or manipulate data.
US Director of National Intelligence James Clapper, at the House Intelligence Committee, September 10, 2015.
Growing challenges to sensitive data within organizations are generated by many factors, such as cloud, mobile, social and the Internet of Things (IoT). The environmental changes caused by these factors increase the gap between optimal data security and traditional data security. Digital businesses are going to feel this gap widening at an exponential rate. This puts security teams in a constant state of program restructuring to keep up with the pace of change. We are in an era with a real need for automated data security solutions that address the challenges to the data security team's efficiency. Security program owners are faced with two options: scale their programs to keep up, or risk becoming increasingly less effective at protecting corporate sensitive data. Becoming less effective at reducing risk certainly isn't an option, but scaling a team to keep up may also be a non-option. It's time we automate the identification, classification and remediation of threats to sensitive data.
Learn and scale

When it comes to data, a foundational component of automating security is understanding your organization's full digital footprint, without static definitions. You can't secure what you can't see, and you can't take inventory of information assets that are in constant flux. This is the root of the problem. Without dynamic discovery and categorization of data, any further attempt to extend automated security to the entire digital footprint of an organization will fall short. The key is to design a solution that continually learns, matches and updates the digital footprint of sensitive data in an organization, while scaling to cover additional areas where data is stored and integrating with existing data security systems.

Data profiling techniques are mostly used in the industry to make sure that the data in structured databases is consistent, accurate and reliable enough to support business activity. The data profiling process comprises structure discovery, data discovery and relationship discovery. Data profiling is performed using a tool that a) automates the discovery process; b) helps uncover the characteristics of the data; c) helps uncover the relationships between data sources.

Traditional data profiling techniques face a substantial barrier when it comes to unstructured data such as files, documents, images, computer aided designs (CAD), programs etc. Structure discovery sometimes lacks a reference: an employment agreement in one company differs in structure from that of another; insurance policy agreements vary between insurers; the same goes for the CAD methodologies of next-generation processing units, or any other binary-based intellectual property. The solution lies in a conceptual change: set the reference for data profiling as sets of examples of files that are considered sensitive in an organization, and apply pattern recognition and machine learning techniques to support structure discovery, data discovery and relationship discovery.
MinerEye has developed a proprietary technology that matches data to a reference set using advanced techniques applied in computer vision and machine learning. This set of algorithms reads the bytes of a given file and represents its content by creating, on the fly, a single mathematical vector that is constant in size and very small (a few kilobytes) for each file. From this point on, the system uses this vector, called a signal, which is extracted from every file it scans, for its clustering and further analysis tasks. This capability allows the system to process unprecedented amounts of data. The system does not move files from their original storage location; rather, it reads their byte stream, creates the vector on the fly and sends it to the server. This saves enormous network load when learning the patterns and attributes of sensitive data. The signal can be refreshed on a monthly basis, since the data doesn't change at a pace that would affect its accuracy in the correlation process. Additionally, the monthly scan is incremental.

Protecting data starts with being able to dynamically identify it and understand the way it behaves. The challenge CISOs face when entering a data protection program is identifying the start and end points of such a program. Current solutions and technology constraints compel the organization to spend huge amounts of money and resources to define and identify sensitive data.

Another challenge is the volume of data and the rate at which it is being created. One assumption that can be made on the way to solving this is that most of the data within an organization is duplicated, with slight changes to content, across time, business units and business processes. There is a lot of reuse taking place when generating reports, presentations, code files, computer aided designs etc.; just think how many times the "save as" operation is performed in a company each working day. This means that most of the data can be grouped into groups of similar versions.

The research team at MinerEye tested this assumption by running its technology over large and heterogeneous data sets taken from several organizations. The analysis of the results showed that files whose content had physically changed by as much as 20% were matched by the system to their original version with very high confidence (NCC > 90%). The results also showed that over 90% of the clusters the system generated fall in the region of more than 80% correlation. This technological characteristic is key to turning data categorization into a manageable process compared to any other method. Transforming the flat structure of data, as it is distributed across storages, shares and desktops, into a hierarchy of similarity clusters provides a manageable way to visualize huge amounts of similar data.
Additionally, this solution scales when large amounts of incoming data arrive in the scanned and clustered repositories. The next step would be to package the technology into a product that can connect to data storages of different types, scan them and correlate data clusters to a given set of confidential files. The value proposition of such a system would be the automation of sensitive data identification and the triggering of protective measures based on an internal predefined policy.
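The idea of a fixed-size signal compared by correlation can be illustrated with a minimal sketch. This is not MinerEye's proprietary algorithm, which the paper does not disclose: it assumes a plain byte-value histogram as a stand-in for the signal, and it assumes that NCC stands for normalized cross-correlation. The names `signal`, `ncc` and `SIGNAL_SIZE` are hypothetical.

```python
import math
from collections import Counter

SIGNAL_SIZE = 256  # hypothetical: one bin per byte value, constant for any file size


def signal(data: bytes) -> list[float]:
    """Stand-in signal: normalized byte-value histogram of the content."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for empty input
    return [counts.get(value, 0) / total for value in range(SIGNAL_SIZE)]


def ncc(a: list[float], b: list[float]) -> float:
    """Normalized cross-correlation of two equal-length vectors, in [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return numerator / denominator if denominator else 0.0


# A document, a "save as" copy with roughly its last 20% rewritten,
# and unrelated binary content.
original = b"Quarterly revenue report for the finance team. " * 40
cut = int(len(original) * 0.8)
edited = original[:cut] + b"Updated figures for the third quarter follow. " * 8
unrelated = bytes(range(128, 256)) * 16

print("edited vs original:   ", round(ncc(signal(original), signal(edited)), 3))
print("unrelated vs original:", round(ncc(signal(original), signal(unrelated)), 3))
```

Only the 256-value vector would need to leave the machine that holds the file, which is the property the text relies on to keep network load low. A byte histogram ignores the order of the content, so a real signal would have to capture more structure than this sketch does.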
Fight fire with fire

MinerEye's product is a virtual machine that encapsulates the technology discussed in detail above. The system offers a funneled two-step process that lays the foundations for optimized data protection: profile sensitive data, and trigger action on data.

Profiling the data is a foundational step in deploying the automated classification process. This step comprises the following building blocks:

a) Clustering data and matching examples of sensitive documents / data elements to clusters;
b) Visualizing data clusters to provide an efficient tagging process and user experience.

When a new file is identified during the system's continuous scan, it is automatically matched to a cluster. Once matched, the cluster's tag is propagated to the new file. If a new file does not match any of the clusters, the system creates a new cluster and a tagging task for the user. The system's GUI enables the end user to manually upload new examples of confidential files; the system classifies them on the spot, correlating them with the relevant cluster. These new examples are visualized by the system as tagging tasks for the user until they are completed and archived. This unique process gives the end user an experience of convergence in discovering and categorizing sensitive data.

Tagging is done manually over clusters, at a resolution chosen by the end user (the data security analyst). The GUI supports this tagging process by visualizing additional metadata about the files in the analyzed cluster, such as creating / modifying accounts, file formats, file names and file locations. Some of the supporting information is aggregated in order to make more sense of a cluster or even to recommend a tag for it, e.g. file share rate, top 3 dominant attributes, most common textual elements in the cluster, file preview for pictorial formats etc.
[Figure: Cluster map analysis UI and examples of confidential files in the task panel]
[Figure: Cluster drill-down and metadata-driven tagging through the user interface]
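The match-or-create flow described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the product's implementation: each cluster is represented by the signal of its first file, similarity is a normalized cross-correlation, and the 0.8 threshold as well as the names `Cluster`, `ClusterIndex`, `add_file` and `tag` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

MATCH_THRESHOLD = 0.8  # hypothetical cut-off for "same cluster"


def ncc(a: list[float], b: list[float]) -> float:
    """Normalized cross-correlation of two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0


@dataclass(eq=False)  # compare clusters by identity, not by field values
class Cluster:
    representative: list[float]          # signal of the first file seen
    tag: Optional[str] = None            # set manually by the analyst
    files: list[str] = field(default_factory=list)


class ClusterIndex:
    def __init__(self) -> None:
        self.clusters: list[Cluster] = []
        self.tagging_tasks: list[Cluster] = []  # clusters awaiting a manual tag

    def add_file(self, path: str, sig: list[float]) -> Optional[str]:
        """Match a scanned file to a cluster and return the propagated tag, if any."""
        best = max(self.clusters, key=lambda c: ncc(c.representative, sig), default=None)
        if best is not None and ncc(best.representative, sig) >= MATCH_THRESHOLD:
            best.files.append(path)
            return best.tag  # tag propagation to the new file
        # No match: open a new cluster and queue a tagging task for the user.
        cluster = Cluster(representative=list(sig), files=[path])
        self.clusters.append(cluster)
        self.tagging_tasks.append(cluster)
        return None

    def tag(self, cluster: Cluster, tag: str) -> None:
        """Record the analyst's tag and close the pending task."""
        cluster.tag = tag
        if cluster in self.tagging_tasks:
            self.tagging_tasks.remove(cluster)
```

In use, the first file of a kind returns no tag and leaves a tagging task; once the analyst tags that cluster, later files with a similar signal receive the tag from `add_file` without further manual work.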
Triggering systems to action is the output of the iterative profiling process and comprises two basic capabilities in the platform:

a) A set of APIs to interact with external security systems and consumers upon identification of sensitive data;
b) A machine learning module that profiles the behavior of sensitive data clusters and generates outlier incidents on sensitive data.

Most DLP and IAM systems are configured to protect several folders that contain sensitive data. These systems apply a pre-configured policy to identical files found by their discovery process in external systems such as endpoints, shares, email and other proxy servers. Some even enhance this policy enforcement task by searching for regular expressions defined in their internal dictionaries. Cloud access security brokers apply access permissions or application locks based on predefined parameters that indicate the nature of the data being shared or moved. Naturally, in a hyper-dynamic data environment, this is not sufficient to cover all data in terms of volume and types. The system uses its APIs and the quality incidents it creates to take the throughput of the above security systems to the next level.

[Figure: integrations with external security systems. Labels: Triggering fingerprinting task in DLP system; Identifying data privacy breach in email and file servers; Enhancing NGFW rules execution; Identity & Access mgt.; Triggering access control over identified sensitive data; Enriching SIEM/SOC incidents correlation; Data privacy across cloud storages; Triggering CASB data encryption task; Triggering proxy server rules]

The system uses the protected folders of DLP and IAM systems as a learning set to discover new locations of similar data that couldn't be discovered by any other system, and automatically triggers a fingerprinting task or an access control policy within these systems over those new locations.
The system can execute an encryption or application lock command within a cloud access security broker (CASB) upon identification of data that couldn't be identified by the CASB. The same goes for firewall rule sets and proxy policies. The system enriches SIEM and SOC platforms with outlier incidents focused on sensitive data and provides a holistic view of the risk to data within the organization. The system's APIs enable querying the database for the historic movements of a specific confidential file, for forensic purposes in support of data breach investigations.

To build a behavioral model of sensitive data clusters, the system focuses more intense scans on identified sensitive data clusters (tagged clusters) in order to collect and log the attributes that make up the behavior model of those clusters. The model holds patterns of sensitive content behavior such as changes to physical location, accounts accessing the data, devices and applications involved, data share rates etc. The system identifies behavior that is an outlier to the model and visualizes additional information that may explain why it was considered an outlier. The incident GUI provides a timeline view of an outlier event that is generated directly over sensitive data. The system also provides additional information about the top 3 data elements that triggered the incident.
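The paper does not describe how the behavioral model decides that an event is an outlier. As one possible illustration only, the sketch below swaps in a simple z-score test against each attribute's logged history for a tagged cluster and reports the top 3 deviating attributes. The attribute names, the threshold of 3 standard deviations and the function `top_outlier_attributes` are all assumptions.

```python
from statistics import mean, pstdev


def top_outlier_attributes(
    history: dict[str, list[float]],
    current: dict[str, float],
    z_threshold: float = 3.0,   # hypothetical cut-off
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Return up to top_n attributes whose current value deviates most from history."""
    scores = []
    for name, values in history.items():
        mu = mean(values)
        sigma = pstdev(values)
        deviation = abs(current[name] - mu)
        if sigma > 0:
            z = deviation / sigma
        else:
            # A constant history: any change at all is treated as maximally unusual.
            z = float("inf") if deviation > 0 else 0.0
        if z >= z_threshold:
            scores.append((name, z))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Logged daily attributes of one tagged cluster (illustrative values).
history = {
    "share_rate": [2, 3, 2, 3, 2, 3],
    "accessing_accounts": [5, 5, 6, 5, 6, 5],
    "storage_locations": [1, 1, 1, 1, 1, 1],
}
today = {"share_rate": 40, "accessing_accounts": 5, "storage_locations": 1}

print(top_outlier_attributes(history, today))
```

Here only the share rate departs from its history, so it is the single attribute reported as driving the incident; an empty result would mean that no incident is raised for the cluster.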
System use cases

Data leakage prevention and data access control systems can be triggered by the system to activate a fingerprinting or encryption task over similar data.

[Figure: data leakage use case. Labels: Cloud storage; Data discovery; Websense DLP; Trigger fingerprinting task; Scan and discover similar data; Network shares; Data risk assessment; Finance data; Intellectual property; Customer data; Learn patterns; Databases; Data Leakage]

Data privacy is supported through visualization of transfers of similar private data across entities and geographies.

[Figure: data privacy use case. Labels: Intellectual property protection; Data Privacy; Network segregation monitoring; Chief Privacy Officer]

Applications can be classified when migrating to the cloud by matching the data patterns of their queries to sensitive database tables.

[Figure: application classification use case. Labels: Application classification; Cloud data control; Forensics; Databases; Cloud environment; Chief Technology Officer]