Diffusion and Data Compression for Data Security
A.J. Han Vinck, University of Duisburg-Essen
April 2013
Vinck@iem.uni-due.de
content
- Why is diffusion important?
- Why is data compression important?
- Unicity distance
- Time to discover a secret
- Source coding principle
- How data compression works
- Zipf's law
Diffusion - transposition
HOW: rearrange the symbols in the data without changing the symbols themselves, i.e. the frequency of the symbols remains the same.
GOAL: destroy the relations between symbols and make the data more difficult to analyze!
ANALYSIS: index of coincidence, finding periods.
example of diffusion
A scytale is a tool used to perform a transposition cipher.
http://www.youtube.com/watch?v=veh0knztljy&feature=related
Confusion and diffusion in AES
General round structure:
- Substitute bytes (substitution)
- Shift rows (transposition)
- Mix columns
- Add round key
The same equipment can be used to decipher.
http://www.youtube.com/watch?v=mlzxpkdXP58
Data compression
The goal of data compression is to create
- a compact representation of the data to be encrypted
- independent symbols
Decompression gives the original data back!
Data compression
Source coding in message encryption (1)
Part 1, Part 2, ..., Part n (for example, every part 56 bits); dependency exists between the parts of the message.
Each part is enciphered with the key and deciphered at the receiver: n cryptograms, and dependency exists between the cryptograms.
Attacker: n cryptograms to analyze for a particular message of n parts.
Source coding in message encryption (2)
Part 1, Part 2, ..., Part n (for example, every part 56 bits).
An n-to-1 source encoder compresses the message before it is enciphered with the key: 1 cryptogram; deciphering followed by source decoding restores Part 1, Part 2, ..., Part n.
Attacker:
- 1 cryptogram to analyze for a particular message of n parts
- assume a data compression factor of n-to-1
Hence, less material for the same message!
The position of crypto in a communication model
source → analogue-to-digital conversion → compression/reduction → security → error protection → from bit to signal
Source coding
Two principles:
- data reduction: remove irrelevant data (lossy, gives errors)
- data compression: present data in a compact (short) way (lossless)
Transmitter side: original data → remove irrelevance → relevant data → compact description.
Receiver side: unpack → original data.
Illustration lossless/lossy
[Figure: original data and its reconstruction, for lossless and for lossy compression.]
What do we want (need)?
All data symbols to be enciphered must occur with equal probability and be independent of each other.
Example
Suppose we have a dictionary with 30,000 words. These can be numbered (encoded) with 15 bits. If the average word length is 5 letters, we need on average 3 bits per letter.
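A two-line check of this arithmetic (a minimal sketch; compare with log2 26 ≈ 4.7 bits for a uniformly chosen letter):

```python
# 2**15 = 32768 >= 30000, so 15 bits suffice to number every word;
# with 5 letters per word on average, that is 15 / 5 = 3 bits per letter.
from math import ceil, log2

words = 30_000
bits_per_word = ceil(log2(words))      # -> 15
print(bits_per_word, bits_per_word / 5)
```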
This can happen
Letter frequency of the Vigenère cipher
[figure]
How to compress? (binary)
source $x = (x_1, x_2, \ldots, x_N)$, $x_i \in \{0,1\}$
- # of 0s $= f_0 N$, # of 1s $= f_1 N$; $F = (f_0, f_1)$ is the composition of $x$
- Then, the number of different vectors $x$ for a given $F$ is
  $|x_F| = \binom{N}{f_0 N} = \frac{N!}{(f_0 N)!\,(f_1 N)!}$
- and the number of bits/symbol needed to represent $x$ is
  $\frac{1}{N}\log_2 |x_F| \approx -\sum_{i=0}^{1} f_i \log_2 f_i + \frac{\log_2 N}{N} \to -\sum_{i=0}^{1} f_i \log_2 f_i$ (entropy!)
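The approximation can be checked numerically; a minimal sketch (function names are my own) comparing the exact cost per symbol with the entropy for a fixed composition:

```python
# Check: (1/N) * log2( C(N, f0*N) ) approaches -f0*log2(f0) - f1*log2(f1)
# as N grows, for a fixed composition F = (f0, f1).
from math import comb, log2

def bits_per_symbol(N, f0):
    k = round(f0 * N)                  # number of zeros in x
    return log2(comb(N, k)) / N        # exact enumerative cost per symbol

def entropy(f0):
    f1 = 1 - f0
    return -f0 * log2(f0) - f1 * log2(f1)

for N in (10, 100, 1000, 10000):
    print(N, round(bits_per_symbol(N, 0.3), 4), round(entropy(0.3), 4))
```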
en- and decoding
source x (N letters) → encoder → [F (composition), lexicographical index for x] → decoder → x
For large N, $f_i \to p_i$ and thus $-\sum_{i=0}^{1} f_i \log_2 f_i$ is equal to the Shannon entropy!
To transmit the value of F, we need $\frac{1}{N}\log_2(N+1)$ bits/output letter → 0 for large N.
Lexicographical en- and decoding is a solved problem in computer science.
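A minimal sketch of such a lexicographical (enumerative) en- and decoder, with my own function names, following the standard enumerative-coding scheme; it can also be used for the exercise on the next slide:

```python
# Enumerative coding for binary sequences of a fixed composition.
from math import comb

def enum_encode(x):
    """Index of x among all sequences with the same number of ones,
    in lexicographic order (e.g. 001 < 010 < 100)."""
    index, ones = 0, x.count(1)
    for pos, bit in enumerate(x):
        remaining = len(x) - pos - 1
        if bit == 1:
            index += comb(remaining, ones)  # skip all sequences with a 0 here
            ones -= 1
    return index

def enum_decode(index, n, ones):
    """Inverse mapping: rebuild the length-n sequence from its index."""
    x = []
    for pos in range(n):
        remaining = n - pos - 1
        c = comb(remaining, ones)           # sequences with a 0 at this position
        if ones and index >= c:
            x.append(1); index -= c; ones -= 1
        else:
            x.append(0)
    return x

x = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
i = enum_encode(x)
assert enum_decode(i, len(x), x.count(1)) == x
```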
exercise
For sequences of length 12 with 4 ones and 8 zeros, give the lexicographical index of the sequence 0 0 1 0 1 0 0 1 0 1 0 0.
What is the sequence that belongs to the index 52?
Binary entropy
$\lim_{n\to\infty} \frac{1}{n}\log_2 \binom{n}{pn} = h(p)$, i.e. $\binom{n}{pn} \approx 2^{n h(p)}$
Interpretation: let a binary sequence contain $pn$ ones; then we can specify each such sequence with $\log_2 2^{n h(p)} = n\,h(p)$ bits.
Homework: prove the approximation using $\ln N! \approx N \ln N$ for $N$ large. Use also the change of base: $\log_a x = y \Leftrightarrow \log_b x = y \log_b a$.
The Stirling approximation: $N! \approx \sqrt{2\pi N}\, N^N e^{-N}$.
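A sketch of how the homework derivation can go, using only $\ln N! \approx N \ln N$ (the terms that Stirling's $-N$ would add cancel anyway, since $pn + (1-p)n = n$):

```latex
\begin{align*}
\ln \binom{n}{pn} &= \ln n! - \ln (pn)! - \ln \big((1-p)n\big)! \\
&\approx n\ln n - pn\ln(pn) - (1-p)n\ln\big((1-p)n\big) \\
&= -pn\ln p - (1-p)n\ln(1-p)
   \qquad \text{(the $\ln n$ terms cancel)} \\
&= n\,h(p)\ln 2 ,
\end{align*}
% so that (1/n) log2 C(n, pn) -> h(p), using log2 x = ln x / ln 2.
```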
The binary entropy: $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$
Note: $h(p) = h(1-p)$
[Plot of h(p) for 0 ≤ p ≤ 1: h(0) = h(1) = 0, maximum h(1/2) = 1.]
references
- Information theory books
- MPEG, JPEG
Application to text: symbols are words
The distribution of words follows the law of Zipf (1935): let $f_n$ denote the frequency of the n-th most frequent word; then $f_n = A/n$.
English: A = 0.1 for M = 12,366 words; $-\sum_i f_i \log_2 f_i \approx 9.72$ bits per word; the average word length is 4.5 letters.
The number of bits/letter ≈ 9.72/4.5 ≈ 2.16.
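These figures can be reproduced directly from $f_n = 0.1/n$; a minimal check:

```python
# Zipf estimate for English: word probabilities p_n = 0.1/n for the
# 12,366 most frequent words (constants from the slide).
from math import log2

A, M = 0.1, 12366
p = [A / n for n in range(1, M + 1)]
print(sum(p))                                # ~1.00: (A, M) is consistent
H_word = -sum(pn * log2(pn) for pn in p)     # entropy per word
print(H_word)                                # ~9.72 bits/word
print(H_word / 4.5)                          # ~2.16 bits/letter
```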
Zipf's law
A web site with many references and applications:
http://linkage.rockefeller.edu/wli/zipf/index_ru.html
Web sites rank-ordered by their popularity [figure]
Unicity distance (3)
Idea: for a stream cipher, after some length L the plaintext and key stream can be determined uniquely from the cipher stream.
The smallest value of L where this is possible is called the UNICITY DISTANCE U.
A necessary condition: $|M|^L \times |K| \le |C|^L$, where $|\cdot|$ means cardinality (or # of).
(Otherwise, when $|M|^L \times |K| > |C|^L$, some plaintexts give the same cipher.)
Unicity distance (4)
From $L\log|M| + \log|K| \le L\log|C|$ we have $L(\log|C| - \log|M|) \ge \log|K|$, and thus:
$L \ge U = \frac{\log|K|}{\log|C| - \log|M|} = \frac{\log|K|}{\log|C| - R}$, where $R = \log|M|$ per symbol is the rate of the message source.
IMPORTANT TO NOTE: $\log|C| - R$ is the redundancy of the source sequence!
For low redundancy, U goes to infinity.
A probabilistic approach (Hellman)
Picture: $|M|^L$ messages and $|K|$ keys on the left, $|C|^L$ cryptograms on the right; an arrow from each (message, key) pair to the resulting cryptogram. Equally probable messages, equally probable keys.
Let $z(c_i)$ be the number of arrows entering $c_i$; then $P(c_i) = \frac{z(c_i)}{|M|^L \times |K|}$.
We used: # of outgoing arrows = # of incoming arrows, i.e. $\sum_{i=1}^{|C|^L} z(c_i) = |M|^L \times |K|$.
The average number of (message, key) pairs per observed cryptogram is
$\bar z = \sum_{i=1}^{|C|^L} z(c_i) P(c_i) = \frac{\sum_i z^2(c_i)}{|M|^L \times |K|} \ge \frac{|M|^L \times |K|}{|C|^L}$.
Proof: with $a = \sum_{i=1}^{n} z(c_i)$ and $n = |C|^L$, consider $\sum_{i=1}^{n} \big[z(c_i) - \frac{a}{n}\big]^2 \ge 0$, which gives $\sum_i z^2(c_i) \ge \frac{a^2}{n}$.
$\bar z = 1$ gives the same result as before (one unique pair M, C).
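A toy simulation illustrating the bound $\bar z \ge |M|\,|K|/|C|$ (my own construction: L = 1, each key a random injective map from messages to cryptograms):

```python
# Random toy cipher: |M| messages, |K| keys, |C| cryptograms.
import random
from collections import Counter

M, K, C = 8, 4, 16                 # assumed toy sizes, M <= C
random.seed(1)
keys = [random.sample(range(C), M) for _ in range(K)]   # key k: m -> keys[k][m]

z = Counter()
for k in range(K):
    for m in range(M):
        z[keys[k][m]] += 1         # arrows entering each cryptogram

total = M * K                      # sum of z(c) over all c
zbar = sum(zc * zc for zc in z.values()) / total
print(zbar, ">=", total / C)       # Hellman: zbar >= |M||K|/|C| = 2.0
```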
Examples: unicity distance (5)
Assume that the German language has a rate R of 2 bits per letter.
- Then, for a substitution cipher with 26! keys, or a permutation cipher with period 26 (26! keys), we have:
  $U \approx \frac{\log 26!}{\log 26 - 2} \approx 32$
- For a Vigenère cipher of key length 80 we have:
  $U \approx \frac{80 \log 26}{\log 26 - 2} \approx 140$
- Try to find U for the DES.
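Recomputing the two examples (a minimal sketch; all logarithms base 2):

```python
# Unicity distances for the slide's examples, with German rate R = 2 bits/letter.
from math import lgamma, log, log2

R = 2.0
logC = log2(26)                        # log2 of the cipher alphabet size

log_26_fact = lgamma(27) / log(2)      # log2(26!) via the gamma function
print(log_26_fact / (logC - R))        # substitution cipher: ~32.7

print(80 * logC / (logC - R))          # Vigenere, key length 80: ~139
# DES (56-bit key) is left as the exercise on the slide.
```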
Conclusion: unicity distance (6)
It is important to make the value of R as high as possible to obtain a large U.
Hence: source compression before encryption is important for secure communications.
Note added: given the message to the analyst, the value of R = 0. Hence, given the ciphertext and plaintext,
$U \approx \frac{\log|K|}{\log|M|}$.
Professor James L. Massey
A GREAT SCIENTIST and TEACHER! MOTTO: SIMPLE but SOLID
1999 Marconi Fellowship Award citation:
"For theoretical and practical contributions to cryptography and related coding problems; teacher and mentor to a generation of scientists and technologists"
Professor Massey made significant advances in forward-error-correcting codes, multi-user communications, and cryptographic systems. In addition, Professor Massey is known for his contributions to the field of engineering education. He is currently an Adjunct Professor at the University of Lund, Sweden.
Data compression (M-ary)
source $x = (x_1, x_2, \ldots, x_N)$, $x_i \in \{1, 2, \ldots, M\}$
- Suppose that a source generates N independent M-ary symbols.
- The frequency of symbol i is $f_i$, and thus $f_i N$ symbols i occur in x.
- We call $F = (f_1, f_2, \ldots, f_M)$ the composition of x.
- Then, the number of different vectors x for a given F is
  $|x_F| = \binom{N}{f_1 N,\, f_2 N,\, \ldots,\, f_M N} = \frac{N!}{(f_1 N)!\,(f_2 N)!\,\cdots\,(f_M N)!}$
- and the number of bits/symbol needed to represent x is
  $\frac{1}{N}\log_2 |x_F| \approx -\sum_{i=1}^{M} f_i \log_2 f_i$ (entropy!)
en- and decoding
source x (N letters) → encoder → [F (composition), lexicographical index for x] → decoder → x
For large N, $f_i \to p_i$ and thus $-\sum_{i=1}^{M} f_i \log_2 f_i$ is equal to the Shannon entropy!
To transmit the value of F, we need $\frac{M-1}{N}\log_2(N+1)$ bits/output letter → 0 for large N.
Lexicographical en- and decoding is a solved problem in computer science.
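A sketch combining the two M-ary slides: the index cost per symbol (from the multinomial coefficient), the empirical entropy, and the vanishing cost of describing the composition F. The composition below is my own example:

```python
# M = 3 symbols, N = 1000 letters; counts are f_i * N for each symbol i.
from math import lgamma, log, log2

counts = [500, 300, 200]
N, M = sum(counts), len(counts)

# (1/N) * log2( N! / prod (f_i N)! ), via lgamma to avoid huge integers
log2_mult = lgamma(N + 1) / log(2) - sum(lgamma(c + 1) / log(2) for c in counts)
index_bits = log2_mult / N

H = -sum(c / N * log2(c / N) for c in counts)   # -sum f_i log2 f_i
overhead = (M - 1) / N * log2(N + 1)            # bits/letter to send F

print(index_bits, H, overhead)                  # ~1.475  ~1.485  ~0.020
```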