DATA OBFUSCATION

What is data obfuscation?

Data obfuscations break up the data structures used in a program and encrypt its literals. The method includes modifying inheritance relations, restructuring arrays, and so on. Because these transformations thoroughly change a program's data structures, they can make the obfuscated code so complicated that recreating the original source code becomes impractical.

Data storage obfuscations change the type of storage for variables. One example is converting a local variable into a global variable. The obfuscator ensures that different methods use the variable at different times, but that none of them use it at the same time.

A data encoding obfuscation changes the way a program interprets stored data. For example, you can replace every assignment that initializes an index variable i with the expression 8*i+3. When the code needs to use the index value, the obfuscator inserts the expression (i-3)/8. Finally, instead of incrementing the variable by one, you add eight to the value. In short, the obfuscation scales and offsets the index from the desired value and only computes the real index when it is about to be used.

A data aggregation obfuscation alters how data is grouped together in memory. An example is turning a 2D array into a 1D array or vice versa. The basic idea is to change the familiar conceptual mapping to a less common in-memory representation, so that it is more difficult for a person to understand your algorithms. For example, a chessboard is often modeled in a program as a matrix, but changing it to a one-dimensional array works just as well for the CPU.

A data ordering obfuscation changes how data is ordered. In C-based languages, it is common to see the ith element of a collection of data accessed by indexing to position i in an array. A data ordering obfuscation would instead determine the index of the data in the array by calling some function f(i). Again, this simply rearranges the storage of information in a way that less closely matches the normal conceptual model.
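The data encoding obfuscation described above can be sketched in C. This is only an illustration under stated assumptions: the function names (encode_index, decode_index, sum_plain, sum_encoded) are hypothetical and do not come from any particular obfuscator. The loop variable is kept in the encoded form 8*i+3, advanced by eight, and decoded with (e-3)/8 only at the point of use.

```c
#include <assert.h>
#include <stddef.h>

/* Encoded form of an index: the stored value is 8*i + 3 rather than i. */
size_t encode_index(size_t i) { return 8 * i + 3; }

/* Recover the real index only at the point of use. */
size_t decode_index(size_t e) {
    assert(e >= 3 && (e - 3) % 8 == 0); /* must be a valid encoded value */
    return (e - 3) / 8;
}

/* Plain loop: sum the first n elements of a. */
int sum_plain(const int *a, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Same loop with the index kept in encoded form and stepped by 8. */
int sum_encoded(const int *a, size_t n) {
    int total = 0;
    for (size_t e = encode_index(0); e < encode_index(n); e += 8)
        total += a[decode_index(e)];
    return total;
}
```

Both functions visit the same elements in the same order, so they return the same sum; only the values held in the loop variable differ.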
Understanding a simple algorithm, such as sorting the elements of an array, is easy. Applying a simple data transformation to such an algorithm can make it hard for someone to understand the code. We will apply data transformations to the following piece of code:

    for(i=0;i<10;i++)
        for(j=i;j<10;j++)
            if(a[j]>a[i])
                swap(a[i],a[j]);

Aggregation

The first data transformation we would like to discuss is restructuring arrays. Arrays can be split, merged, folded or flattened. Here we merge two or more arrays into one. Applying this transformation to our example forces an attacker to evaluate the details of the algorithm in order to understand it. The test and swap lines are transformed into the following code, assuming that a is the array stored on the odd indices of the interleaved array:

    if(a[2*j+1]>a[2*i+1])
        swap(a[2*j+1],a[2*i+1]);

Finding similar transformations for arrays is not hard, and neither is implementing them in the right tool. In TXL, however, type information is already difficult to obtain, which makes this data transformation impossible to apply in a safe way. Modifying a data structure, for example, requires locating every instance of that data structure. On a parse tree this is non-trivial, because the same name might be used in different scopes for different data structures. While the parse tree does contain sufficient information to deduce the type of a data structure, it is more straightforward to perform this on an intermediate representation which contains a symbol table.
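The array-merging transformation can be checked with a small C sketch. It assumes a second array b that occupies the even indices of the merged array; b, the helper names, and the merged array m are hypothetical and only serve the illustration. The transformed sort touches only the odd slots, so it orders the elements of a exactly as the original loop does while leaving b untouched.

```c
#include <assert.h>
#include <stddef.h>

#define N 10

/* Swap two ints in place. */
void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Original loop: sorts a[0..N-1] into descending order. */
void sort_plain(int a[N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = i; j < N; j++)
            if (a[j] > a[i])
                swap_int(&a[i], &a[j]);
}

/* Merge two arrays: b goes on the even indices, a on the odd indices. */
void interleave(const int a[N], const int b[N], int m[2 * N]) {
    assert(a && b && m);
    for (size_t k = 0; k < N; k++) {
        m[2 * k] = b[k];
        m[2 * k + 1] = a[k];
    }
}

/* Transformed loop: the same sort, with a living at m[2*k+1]. */
void sort_merged(int m[2 * N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = i; j < N; j++)
            if (m[2 * j + 1] > m[2 * i + 1])
                swap_int(&m[2 * i + 1], &m[2 * j + 1]);
}
```

After calling interleave and then sort_merged, reading the odd slots of m gives the same sequence that sort_plain produces on a.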
Ordering

An obfuscation transformation which reorders arrays is not difficult in SUIF either. A symbol table is at our disposal, so each pointer to the array is known, which makes finding all accesses to the array straightforward. The indices used to access the array can be changed by a function f that maps the original position i to its new position in the reordered array. The test and swap lines of our example are changed into the following code, which no longer orders the array in memory as the original program does. However, because every index in the program is changed in the same way, the resulting code stays functionally equivalent to the original:

    if(a[f(j)]>a[f(i)])
        swap(a[f(i)],a[f(j)]);

Storage and encoding

Data flow optimizations such as common subexpression elimination and constant propagation are able to undo very trivial data obfuscations. For example, when the constant 10 is split into the subexpression 2+8, constant propagation will undo the transformation. Non-trivial data obfuscations such as those shown above survive the compilation process because they change the context of the program. Because a compiler only has optimizing transformations at its disposal, it is unable to undo such context-changing data transformations. On the other hand, variable splitting is a deoptimizing transformation, and applying it should take into account the optimizations performed by the compiler.

We had a look at binary obfuscators and found that none of them implemented non-trivial data transformations. Only trivial data transformations such as constant splitting are implemented at the binary level, and without further obfuscation a subsequent optimization run could remove them. It is not astonishing that binary obfuscators only contain trivial data transformations, as the types of data structures are lost during compilation. Passing extra information along to perform such transformations at the binary level is feasible, but labor-intensive and rather artificial when the same transformations can be applied at the source code level and then survive the compiler optimizations.
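As an illustration of the reordering transformation from the Ordering subsection, the sketch below routes every array access through an index mapping. The mapping f(i) = (7*i + 3) mod 10 is an assumption chosen for the example; any one-to-one mapping of 0..9 onto itself would serve, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 10

/* Hypothetical index mapping; one-to-one on 0..N-1 because gcd(7, 10) == 1. */
size_t f(size_t i) {
    assert(i < N);
    return (7 * i + 3) % N;
}

/* Swap two ints in place. */
void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Store logical element i in physical slot f(i). */
void store_reordered(const int logical[N], int physical[N]) {
    for (size_t i = 0; i < N; i++)
        physical[f(i)] = logical[i];
}

/* The original descending sort with every index routed through f. */
void sort_reordered(int a[N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = i; j < N; j++)
            if (a[f(j)] > a[f(i)])
                swap_int(&a[f(i)], &a[f(j)]);
}

/* Read the logical view back out of the reordered storage. */
void load_reordered(const int physical[N], int logical[N]) {
    for (size_t i = 0; i < N; i++)
        logical[i] = physical[f(i)];
}
```

The physical array never appears sorted in memory, but reading it back through f yields the same descending sequence as the untransformed loop.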
Why would you want to merely obfuscate data, rather than use a strong encryption algorithm? A good example is an audit report on a medical system. This report may be generated for an external auditor and contain sensitive information. The auditor will be examining the report for information that indicates possible cases of fraud or abuse. Assume that management has required that names, Social Security Numbers and other personal information should not be available to the auditor except on an as-needed basis. The data needs to be presented to the auditor in a way that allows all of it to be examined, so that patterns in the data can be detected.

Encryption would be a poor choice in this case, as the data would be rendered into byte values outside the range of printable ASCII characters and would be impossible to read. A better choice might be to obfuscate the data with a simple substitution cipher. While this is not considered encryption, it may be suitable for this situation.

When the auditor finds a possible case of abuse, he will need the real name and SSN of the party involved. He could obtain this by calling a customer service representative at the insurance company that supplied the report and asking for the real information. The obfuscated data is read to the customer service rep, who then enters it into an application that supplies the real data. Here the importance of using pronounceable characters becomes very clear. Strong encryption would render this impossible.

Here is some simple example code to do the obfuscation:

    create or replace package obfs as
        function obfs( clear_text_in in varchar2 ) return varchar2;
        pragma restrict_references( obfs, WNPS, WNDS );

        function unobfs( obfs_text_in in varchar2 ) return varchar2;
        pragma restrict_references( unobfs, WNPS, WNDS );
    end obfs;
    /

    create or replace package body obfs as
        -- Substitution alphabets: each character of xlate_from is replaced
        -- by the character at the same position in xlate_to.
        -- As printed, xlate_to repeats some letters, so unobfs is only
        -- guaranteed to invert obfs for digit strings such as SSNs.
        xlate_from varchar2(62) := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
        xlate_to   varchar2(62) := 'nopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyzabcdefghijklm';

        function obfs( clear_text_in in varchar2 ) return varchar2 is
        begin
            return translate( clear_text_in, xlate_from, xlate_to );
        end obfs;

        function unobfs( obfs_text_in in varchar2 ) return varchar2 is
        begin
            return translate( obfs_text_in, xlate_to, xlate_from );
        end unobfs;
    end obfs;
    /

Here is some sample output:

    SSN        OBFS SSN
    ---------- ----------
    540407786  srnrnuuvt
    542800170  srpvnnoun
    542802063  srpvnpntq
    541466830  srorttvqn

As you can see, it wouldn't be very difficult to decipher this scheme given enough data. A somewhat more effective method involves chopping the text into segments and rearranging it as well as obfuscating it. Below is some sample output from this algorithm.

    SSN        OBFS
    ---------- ----------
    540407786  &24B23B&Z
    542800170  -4B*23&&&
    542802063  -4Z&23-&_
    541466830  *2_423ZZ&

While this is still not encryption, this data would be more difficult to decipher without the key. Source code for this in PL/SQL is available at the URL provided at the end of this article.

Another way to hide sensitive data is through masking. This is different from the previous example in that the clear text cannot be reconstructed from the displayed data. This is useful in situations where it is only necessary to display a portion of the data. A good case for this method is the receipts printed at gas stations and convenience stores. When a purchase is made with a credit card, the last 4 digits of the credit card number are often displayed as clear text, while the rest of the number is masked with a series of X's.

    Slop 'n' Slurp 1 Stop Shop
    5/25/2000 8:53 P.M.

    Football Burrito      1     2.49     2.49
    Premium Gasoline   12.5    1.699    21.24
                                        =====
                                        23.73
This method can also be used for reports where the person reading the report requires only a portion of the sensitive data. It is also commonly used for the account numbers on printed transactions from ATMs.
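A minimal C sketch of the masking approach follows. The function name mask_number and its parameters are assumptions made for this illustration, not part of the article's PL/SQL code. It replaces every character except the last few with X, so the clear text cannot be recovered from the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `number` into `out`, replacing every character except the last
 * `visible` ones with 'X'. `out_size` must leave room for the terminator.
 */
void mask_number(const char *number, size_t visible, char *out, size_t out_size) {
    size_t len = strlen(number);
    assert(out_size > len); /* caller must supply a large enough buffer */
    for (size_t k = 0; k < len; k++)
        out[k] = (k + visible < len) ? 'X' : number[k];
    out[len] = '\0';
}
```

With a made-up 16-digit number such as "4111111111111111" and visible set to 4, the result is twelve X characters followed by "1111". Inputs no longer than visible are copied unchanged, which a production version would likely want to handle differently.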