Basics of information theory and information complexity
1 Basics of information theory and information complexity: a tutorial. Mark Braverman, Princeton University. June 1
2 Part I: Information theory
Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.
[Figure: Alice sends data to Bob over a communication channel.]
3 Quantifying information
Information is measured in bits. The basic notion is Shannon's entropy. The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable. For a discrete variable:
H(X) ≔ Σ_x Pr[X = x] · log(1/Pr[X = x]).
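The definition above is easy to check numerically; the following is a minimal sketch (the dictionary representation of a distribution is just an illustrative choice).

```python
from math import log2

def entropy(dist):
    """Shannon entropy, in bits, of a distribution given as {outcome: probability}."""
    return sum(p * log2(1.0 / p) for p in dist.values() if p > 0)

print(entropy({"H": 0.5, "T": 0.5}))          # a fair coin: 1.0 bit
print(entropy({"x": 1.0}))                    # a constant: 0.0 bits
print(entropy({i: 1 / 8 for i in range(8)}))  # uniform on 8 values: 3.0 bits
```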
4 Shannon's entropy
Important examples and properties:
If X = x₀ is a constant, then H(X) = 0.
If X is uniform on a finite set S of possible values, then H(X) = log |S|.
If X is supported on at most n values, then H(X) ≤ log n.
If Y is a random variable determined by X, then H(Y) ≤ H(X).
5 Conditional entropy
For two (potentially correlated) variables X, Y, the conditional entropy of X given Y is the amount of uncertainty left in X given Y:
H(X|Y) ≔ E_{y∼Y} H(X|Y=y).
One can show H(XY) = H(Y) + H(X|Y). This important fact is known as the chain rule.
If X ⊥ Y, then H(XY) = H(X) + H(Y|X) = H(X) + H(Y).
6 Example
X = (B₁, B₂, B₃), Y = (B₁⊕B₂, B₂⊕B₄, B₃⊕B₄, B₅), where B₁, …, B₅ ∼ U{0,1} are independent. Then:
H(X) = 3; H(Y) = 4; H(XY) = 5;
H(X|Y) = 1 = H(XY) − H(Y); H(Y|X) = 2 = H(XY) − H(X).
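The numbers in this example can be confirmed by brute force over all 2⁵ assignments of the underlying bits; in the sketch below the helper names are mine.

```python
from itertools import product
from math import log2

def H(f, assignments):
    """Entropy of f(bits) when the bits are uniform: brute-force f's distribution."""
    counts = {}
    for bits in assignments:
        v = f(bits)
        counts[v] = counts.get(v, 0) + 1
    n = len(assignments)
    return sum((c / n) * log2(n / c) for c in counts.values())

bits5 = list(product((0, 1), repeat=5))
X = lambda b: (b[0], b[1], b[2])
Y = lambda b: (b[0] ^ b[1], b[1] ^ b[3], b[2] ^ b[3], b[4])
XY = lambda b: (X(b), Y(b))

print(H(X, bits5), H(Y, bits5), H(XY, bits5))  # 3.0 4.0 5.0
# Chain rule: H(X|Y) = H(XY) - H(Y) = 1 and H(Y|X) = H(XY) - H(X) = 2.
```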
7 Mutual information
[Figure: Venn diagram for the example X = (B₁, B₂, B₃), Y = (B₁⊕B₂, B₂⊕B₄, B₃⊕B₄, B₅), showing the regions H(X|Y), I(X;Y), and H(Y|X) inside H(X) and H(Y).]
8 Mutual information
The mutual information is defined as
I(X;Y) ≔ H(X) − H(X|Y) = H(Y) − H(Y|X):
by how much does knowing X reduce the entropy of Y? Always non-negative: I(X;Y) ≥ 0.
Conditional mutual information: I(X;Y|Z) ≔ H(X|Z) − H(X|YZ).
Chain rule for mutual information: I(XY;Z) = I(X;Z) + I(Y;Z|X).
Simple intuitive interpretation.
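For finite joint distributions, mutual information can be computed directly from the identity I(X;Y) = H(X) + H(Y) − H(XY); a small sketch:

```python
from math import log2

def H(dist):
    """Entropy (bits) of {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, i):
    m = {}
    for pair, p in joint.items():
        m[pair[i]] = m.get(pair[i], 0.0) + p
    return m

def I(joint):
    """I(X;Y) = H(X) + H(Y) - H(XY) for a joint distribution {(x, y): p}."""
    return H(marginal(joint, 0)) + H(marginal(joint, 1)) - H(joint)

print(I({(0, 0): 0.5, (1, 1): 0.5}))                      # fully correlated bits: 1.0
print(I({(a, b): 0.25 for a in (0, 1) for b in (0, 1)}))  # independent bits: 0.0
```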
9 Example: a biased coin
A coin with an ε bias toward heads or tails is tossed several times. Let B ∈ {H, T} be the direction of the bias, and suppose that a priori both options are equally likely: H(B) = 1.
How many tosses are needed to find B? Let T₁, …, T_k be a sequence of tosses. Start with k = 2.
10 What do we learn about B?
I(B;T₁T₂) = I(B;T₁) + I(B;T₂|T₁)
= I(B;T₁) + I(BT₁;T₂) − I(T₁;T₂)
≤ I(B;T₁) + I(BT₁;T₂)
= I(B;T₁) + I(B;T₂) + I(T₁;T₂|B)
= I(B;T₁) + I(B;T₂) = 2·I(B;T₁).
Similarly, I(B;T₁…T_k) ≤ k·I(B;T₁). To determine B with constant accuracy, we need 0 < c ≤ I(B;T₁…T_k) ≤ k·I(B;T₁), so k = Ω(1/I(B;T₁)).
11 Kullback-Leibler (KL) divergence
A distance measure between distributions on the same space (not symmetric, so not a metric). Plays a key role in information theory.
D(P‖Q) ≔ Σ_x P[x] · log(P[x]/Q[x]).
D(P‖Q) ≥ 0, with equality when P = Q.
Caution: D(P‖Q) ≠ D(Q‖P)!
12 Properties of KL divergence
Connection to mutual information: I(X;Y) = E_{y∼Y} D(X|_{Y=y} ‖ X). If X ⊥ Y, then X|_{Y=y} = X, and both sides are 0.
Pinsker's inequality: ‖P − Q‖₁ = O(√(D(P‖Q))).
Tight: D(B_{1/2+ε} ‖ B_{1/2}) = Θ(ε²).
13 Back to the coin example
I(B;T₁) = E_{b∼B} D(T₁|_{B=b} ‖ T₁) = D(B_{1/2±ε} ‖ B_{1/2}) = Θ(ε²).
Hence k = Ω(1/I(B;T₁)) = Ω(1/ε²).
"Follow the information learned from the coin tosses." This can be done using combinatorics, but the information-theoretic language is more natural for expressing what's going on.
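Both the asymmetry caution and the Θ(ε²) behaviour of the biased-coin divergence are easy to observe numerically; a short sketch:

```python
from math import log2

def kl(P, Q):
    """D(P || Q) in bits; assumes Q[x] > 0 wherever P[x] > 0."""
    return sum(p * log2(p / Q[x]) for x, p in P.items() if p > 0)

# Asymmetry: D(P||Q) != D(Q||P) in general.
P, Q = {"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}
print(kl(P, Q), kl(Q, P))

# D(B_{1/2+eps} || B_{1/2}) = Theta(eps^2): the ratio to eps^2 stays bounded.
fair = {"H": 0.5, "T": 0.5}
for eps in (0.1, 0.01, 0.001):
    d = kl({"H": 0.5 + eps, "T": 0.5 - eps}, fair)
    print(eps, d / eps**2)  # approaches 2/ln 2 ~ 2.885 as eps -> 0
```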
14 Back to communication
The reason information theory is so important for communication is that information-theoretic quantities readily operationalize. We can attach an operational meaning to Shannon's entropy: H(X) ≈ the cost of transmitting X.
Let C(X) be the (expected) cost of transmitting a sample of X.
15 Is H(X) = C(X)? Not quite.
Let the trit T ∼ U{1,2,3}. An optimal prefix code (say 1 → 0, 2 → 10, 3 → 11) gives C(T) = 5/3 ≈ 1.67 > H(T) = log 3 ≈ 1.58.
It is always the case that C(X) ≥ H(X).
16 But H(X) and C(X) are close
Huffman's coding: C(X) ≤ H(X) + 1.
This is a compression result: a low-information message is turned into a short one.
Therefore: H(X) ≤ C(X) ≤ H(X) + 1.
17 Shannon's noiseless coding
The cost of communicating many copies of X scales as H(X). Shannon's source coding theorem: let C(Xⁿ) be the cost of transmitting n independent copies of X. Then the amortized transmission cost is
lim_{n→∞} C(Xⁿ)/n = H(X).
This equation gives H(X) its operational meaning.
18 H(X) operationalized
[Figure: samples X₁, …, X_n, … are sent over the communication channel at a cost of H(X) per copy.]
19 H(X) is nicer than C(X)
H(X) is additive for independent variables. Let T₁, T₂ ∼ U{1,2,3} be independent trits. Then
H(T₁T₂) = log 9 = 2 log 3, but
C(T₁T₂) = 29/9 ≈ 3.22 < C(T₁) + C(T₂) = 5/3 + 5/3 = 10/3 ≈ 3.33.
H also works well with concepts such as channel capacity.
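The expected costs above (5/3 for one trit, 29/9 for a pair) can be reproduced with a short Huffman sketch; summing the merged weights is a standard way to get the expected codeword length without building the code tree explicitly.

```python
import heapq
from fractions import Fraction

def huffman_cost(probs):
    """Expected codeword length C of an optimal prefix (Huffman) code."""
    heap = [(p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    cost, label = 0, len(probs)
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        cost += p1 + p2  # each merge pushes its two subtrees one level deeper
        heapq.heappush(heap, (p1 + p2, label))
        label += 1
    return cost

one_trit = huffman_cost([Fraction(1, 3)] * 3)   # 5/3  ~ 1.67 (> log 3 ~ 1.58)
two_trits = huffman_cost([Fraction(1, 9)] * 9)  # 29/9 ~ 3.22 (< 2 * 5/3 = 10/3)
print(one_trit, two_trits)
```

Note that two_trits < 2 * one_trit, exactly the failure of additivity for C that the slide points out.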
20 Proof of Shannon's noiseless coding
n·H(X) = H(Xⁿ) ≤ C(Xⁿ) ≤ H(Xⁿ) + 1.
The equality is additivity of entropy; the upper bound is compression (Huffman).
Therefore lim_{n→∞} C(Xⁿ)/n = H(X).
21 Operationalizing other quantities
Conditional entropy H(X|Y) (cf. the Slepian-Wolf theorem):
[Figure: X₁, …, X_n are sent over the communication channel at a cost of H(X|Y) per copy, to a receiver who already holds Y₁, …, Y_n.]
22 Operationalizing other quantities
Mutual information I(X;Y):
[Figure: a party holding X₁, …, X_n lets the other party sample Y₁, …, Y_n over the communication channel at a cost of I(X;Y) per copy.]
23 Information theory and entropy
Allows us to formalize intuitive notions. Operationalized in the context of one-way transmission and related problems. Has nice properties (additivity, chain rule, ...).
Next, we discuss extensions to more interesting communication scenarios.
24 Communication complexity
Focus on the two-party randomized setting.
[Figure: Alice holds X, Bob holds Y, and they share randomness R; together A and B implement a functionality F(X,Y), e.g. F(X,Y) = "X = Y?".]
25 Communication complexity
Goal: implement a functionality F(X,Y). A protocol π(X,Y) computing F(X,Y) exchanges messages m₁(X,R), m₂(Y,m₁,R), m₃(X,m₁,m₂,R), ... until both players know F(X,Y).
Communication cost = number of bits exchanged.
26 Communication complexity
Numerous applications/potential applications (some will be discussed later today). Considerably more difficult to obtain lower bounds than for transmission (though still much easier than for other models of computation!).
27 Communication complexity
(Distributional) communication complexity with input distribution μ and error ε: CC(F,μ,ε); error ε w.r.t. μ.
(Randomized/worst-case) communication complexity: CC(F,ε); error ≤ ε on all inputs.
Yao's minimax: CC(F,ε) = max_μ CC(F,μ,ε).
28 Examples
X, Y ∈ {0,1}ⁿ. Equality: EQ(X,Y) ≔ 1_{X=Y}.
CC(EQ,ε) = O(log(1/ε)); CC(EQ,0) ≥ n.
29 Equality
F is "X = Y?". μ is a distribution where w.p. ½ X = Y and w.p. ½ (X,Y) are uniformly random.
Protocol: Alice sends MD5(X) [128 bits]; Bob replies "X = Y?" [1 bit].
Error? Only when X ≠ Y but the hashes collide. This shows that CC(EQ, μ, ε) ≤ 129 for ε on the order of 2⁻¹²⁸.
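A sketch of this hash-and-compare protocol, using Python's hashlib (representing the inputs as byte strings is my choice):

```python
import hashlib

def equality_protocol(x: bytes, y: bytes) -> bool:
    """Alice sends MD5(x) -- 128 bits; Bob answers "X = Y?" -- 1 bit (129 bits total).
    The answer is wrong only if x != y but the two hashes collide."""
    alice_message = hashlib.md5(x).digest()  # 16 bytes = 128 bits on the channel
    return alice_message == hashlib.md5(y).digest()

print(equality_protocol(b"same input", b"same input"))  # True
print(equality_protocol(b"same input", b"different"))   # False
```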
30 Examples
X, Y ∈ {0,1}ⁿ. Inner product: IP(X,Y) ≔ Σᵢ XᵢYᵢ (mod 2).
CC(IP,0) = n − o(n). In fact, using information complexity: CC(IP,ε) = n − o_ε(n).
31 Information complexity
Information complexity IC(F,ε) is to communication complexity CC(F,ε) as Shannon's entropy H(X) is to transmission cost C(X).
32 Information complexity
The smallest amount of information Alice and Bob need to exchange to solve F. How is information measured? The communication cost of a protocol is the number of bits exchanged; the information cost of a protocol is the amount of information revealed.
33 Basic definition 1: the information cost of a protocol
Prior distribution: (X,Y) ∼ μ. Alice holds X, Bob holds Y; running the protocol π produces the transcript Π.
IC(π,μ) ≔ I(Π;Y|X) + I(Π;X|Y)
= (what Alice learns about Y) + (what Bob learns about X).
34 Example
F is "X = Y?". μ is a distribution where w.p. ½ X = Y and w.p. ½ (X,Y) are random. Protocol: MD5(X) [128 bits]; "X = Y?" [1 bit].
IC(π,μ) = I(Π;Y|X) + I(Π;X|Y) = 65.5 bits
= (what Alice learns about Y) + (what Bob learns about X).
35 The prior μ matters a lot for information cost!
If μ = 1_{(x,y)} is a singleton, then IC(π,μ) = 0.
36 Example
F is "X = Y?". μ is a distribution where (X,Y) are just uniformly random. Same protocol: MD5(X) [128 bits]; "X = Y?" [1 bit].
IC(π,μ) = I(Π;Y|X) + I(Π;X|Y) = 128 bits
= (what Alice learns about Y) + (what Bob learns about X).
37 Basic definition 2: information complexity
Communication complexity: CC(F,μ,ε) = min_{π computes F with error ε} ‖π‖.
Analogously: IC(F,μ,ε) ≔ inf_{π computes F with error ε} IC(π,μ).
(An infimum, rather than a minimum, is needed here.)
38 Prior-free information complexity
Using minimax we can get rid of the prior. For communication, we had CC(F,ε) = max_μ CC(F,μ,ε).
For information: IC(F,ε) ≔ inf_{π computes F with error ε} max_μ IC(π,μ).
39 Ex: the information complexity of Equality
What is IC(EQ,0)? Consider the following protocol, with X ∈ {0,1}ⁿ held by Alice and Y ∈ {0,1}ⁿ by Bob. The players publicly sample a non-singular matrix A ∈ Z₂ⁿˣⁿ with rows A₁, …, A_n. In round i, Alice sends Aᵢ·X and Bob replies Aᵢ·Y. Continue for n steps, or until a disagreement is discovered.
40 Analysis (sketch)
If X ≠ Y, the protocol will terminate in O(1) rounds on average, and thus reveal O(1) information. If X = Y, the players only learn the fact that X = Y (≤ 1 bit of information). Thus the protocol has O(1) information complexity for any prior μ.
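To see the O(1) behaviour empirically, here is a simulation of a simplified variant in which each round uses an independent random parity rather than the rows of one non-singular matrix; when X ≠ Y, each round exposes the disagreement with probability 1/2, so the expected number of rounds is about 2, independently of n.

```python
import random

def rounds_until_disagreement(x, y, rng):
    """Round i: both players announce <a_i, input> mod 2 for a shared random a_i;
    stop at the first disagreement (or give up after n rounds)."""
    n = len(x)
    for r in range(1, n + 1):
        a = [rng.randint(0, 1) for _ in range(n)]
        px = sum(ai & xi for ai, xi in zip(a, x)) % 2
        py = sum(ai & yi for ai, yi in zip(a, y)) % 2
        if px != py:
            return r
    return n

rng = random.Random(0)
n, trials = 32, 2000
x = [rng.randint(0, 1) for _ in range(n)]
y = x[:]
y[0] ^= 1  # x and y differ in exactly one position
avg = sum(rounds_until_disagreement(x, y, rng) for _ in range(trials)) / trials
print(avg)  # concentrates around 2
```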
41 Operationalizing IC: information equals amortized communication
Recall [Shannon]: lim_{n→∞} C(Xⁿ)/n = H(X).
Turns out: lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n = IC(F,μ,ε), for ε > 0. [Error ε allowed on each copy.]
For ε = 0: lim_{n→∞} CC(Fⁿ,μⁿ,0⁺)/n = IC(F,μ,0).
[Whether lim_{n→∞} CC(Fⁿ,μⁿ,0)/n matches is an interesting open problem.]
42 Information = amortized communication
lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n = IC(F,μ,ε). Two directions: ≤ and ≥.
Compare the entropy proof: n·H(X) = H(Xⁿ) ≤ C(Xⁿ) ≤ H(Xⁿ) + 1 (additivity of entropy; compression via Huffman).
43 The ≤ direction
lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n ≤ IC(F,μ,ε).
Start with a protocol π solving F whose IC(π,μ) is close to IC(F,μ,ε). Show how to compress many copies of π into a protocol whose communication cost is close to its information cost. More on compression later.
44 The ≥ direction
lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n ≥ IC(F,μ,ε).
Use the fact that CC(Fⁿ,μⁿ,ε) ≥ IC(Fⁿ,μⁿ,ε), together with additivity of information complexity:
IC(Fⁿ,μⁿ,ε)/n = IC(F,μ,ε).
45 Proof: additivity of information complexity
Let T₁(X₁,Y₁) and T₂(X₂,Y₂) be two two-party tasks, e.g. "solve F(X,Y) with error ε w.r.t. μ". Then
IC(T₁⊗T₂, μ₁×μ₂) = IC(T₁,μ₁) + IC(T₂,μ₂).
The direction ≤ is easy; ≥ is the interesting direction.
46 IC(T₁,μ₁) + IC(T₂,μ₂) ≤ IC(T₁⊗T₂, μ₁×μ₂)
Start from a protocol π for T₁⊗T₂ with prior μ₁×μ₂ whose information cost is I. Show how to construct two protocols, π₁ for T₁ with prior μ₁ and π₂ for T₂ with prior μ₂, with information costs I₁ and I₂ respectively, such that I₁ + I₂ = I.
47 From π(X₁X₂, Y₁Y₂), construct:
π₁(X₁,Y₁): publicly sample X₂ ∼ μ₂; Bob privately samples Y₂ ∼ μ₂|X₂; run π(X₁X₂, Y₁Y₂).
π₂(X₂,Y₂): publicly sample Y₁ ∼ μ₁; Alice privately samples X₁ ∼ μ₁|Y₁; run π(X₁X₂, Y₁Y₂).
48 Analysis of π₁
In π₁ (public X₂, Bob privately samples Y₂ ∼ μ₂|X₂):
Alice learns about Y₁: I(Π;Y₁|X₁X₂). Bob learns about X₁: I(Π;X₁|Y₁Y₂X₂).
I₁ = I(Π;Y₁|X₁X₂) + I(Π;X₁|Y₁Y₂X₂).
49 Analysis of π₂
In π₂ (public Y₁, Alice privately samples X₁ ∼ μ₁|Y₁):
Alice learns about Y₂: I(Π;Y₂|X₁X₂Y₁). Bob learns about X₂: I(Π;X₂|Y₁Y₂).
I₂ = I(Π;Y₂|X₁X₂Y₁) + I(Π;X₂|Y₁Y₂).
50 Adding I₁ and I₂
I₁ + I₂ = I(Π;Y₁|X₁X₂) + I(Π;X₁|Y₁Y₂X₂) + I(Π;Y₂|X₁X₂Y₁) + I(Π;X₂|Y₁Y₂)
= I(Π;Y₁|X₁X₂) + I(Π;Y₂|X₁X₂Y₁) + I(Π;X₂|Y₁Y₂) + I(Π;X₁|Y₁Y₂X₂)
= I(Π;Y₁Y₂|X₁X₂) + I(Π;X₁X₂|Y₁Y₂)    (chain rule, applied twice)
= I.
51 Summary
Information complexity is additive. Operationalized via "information = amortized communication":
lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n = IC(F,μ,ε).
It seems to be the right analogue of entropy for interactive computation.
52 Entropy vs. information complexity

                  Entropy                   IC
Additive?         Yes                       Yes
Operationalized   lim_n C(Xⁿ)/n             lim_n CC(Fⁿ,μⁿ,ε)/n
Compression?      Huffman: C(X) ≤ H(X)+1    ???!
53 Can interactive communication be compressed?
Is it true that CC(F,μ,ε) ≤ IC(F,μ,ε) + O(1)?
Less ambitiously: CC(F,μ,O(ε)) = O(IC(F,μ,ε))?
(Almost) equivalently: given a protocol π with IC(π,μ) = I, can Alice and Bob simulate π using O(I) communication?
Not known in general.
54 Direct sum theorems
Let F be any functionality. Let C(F) be the cost of implementing F, and let Fⁿ be the functionality of implementing n independent copies of F. The direct sum problem: does C(Fⁿ) ≥ n·C(F), at least approximately? (In most cases it is obvious that C(Fⁿ) ≤ n·C(F).)
55 Direct sum: randomized communication complexity
Is it true that CC(Fⁿ,μⁿ,ε) = Ω(n·CC(F,μ,ε))?
Is it true that CC(Fⁿ,ε) = Ω(n·CC(F,ε))?
56 Direct product: randomized communication complexity
Direct sum: CC(Fⁿ,μⁿ,ε) = Ω(n·CC(F,μ,ε))?
Direct product: CC(Fⁿ,μⁿ,1−(1−ε)ⁿ) = Ω(n·CC(F,μ,ε))?
57 Direct sum for randomized CC and interactive compression
Direct sum: CC(Fⁿ,μⁿ,ε) = Ω(n·CC(F,μ,ε))?
In the limit: n·IC(F,μ,ε) = Ω(n·CC(F,μ,ε))?
Interactive compression: CC(F,μ,ε) = O(IC(F,μ,ε))?
The same question!
58 The big picture
[Diagram: IC(F,μ,ε) equals IC(Fⁿ,μⁿ,ε)/n by additivity (= direct sum for information); IC(Fⁿ,μⁿ,ε)/n equals lim CC(Fⁿ,μⁿ,ε)/n since information = amortized communication; closing the remaining gap from CC(F,μ,ε) down to IC(F,μ,ε) is interactive compression, equivalently the direct sum question for communication.]
59 Current results for compression
A protocol π that has C bits of communication, conveys I bits of information over prior μ, and works in r rounds can be simulated:
using O(I + r) bits of communication;
using O(√(I·C)) bits of communication;
using 2^{O(I)} bits of communication;
if μ = μ_X × μ_Y, using O(I · polylog C) bits of communication.
60 Their direct sum counterparts
CC(Fⁿ,μⁿ,ε) = Ω(n^{1/2}·CC(F,μ,ε)).
CC(Fⁿ,ε) = Ω(n^{1/2}·CC(F,ε)).
For product distributions μ = μ_X × μ_Y, CC(Fⁿ,μⁿ,ε) = Ω(n·CC(F,μ,ε)).
When the number of rounds is bounded by r ≪ n, a direct sum theorem holds.
61 Direct product
The best one can hope for is a statement of the type:
CC(Fⁿ, μⁿ, 1 − 2^{−O(n)}) = Ω(n·IC(F,μ,1/3)).
Can prove: CC(Fⁿ, μⁿ, 1 − 2^{−O(n)}) = Ω(n^{1/2}·CC(F,μ,1/3)).
62 Proof 2: compressing a one-round protocol
Say Alice speaks: IC(π,μ) = I(M;X|Y). Recall KL divergence:
I(M;X|Y) = E_{XY} D(M_{XY} ‖ M_Y) = E_{XY} D(M_X ‖ M_Y).
Bottom line: Alice has M_X; Bob has M_Y; the goal is to sample from M_X using ≈ D(M_X ‖ M_Y) communication.
63 The dart board
Interpret the public randomness as a sequence of random points ("darts") in U × [0,1], where U is the universe of all possible messages. The first dart under the histogram of a distribution M is distributed according to M.
[Figure: a histogram over u₁, u₂, … ∈ U with bar heights q_{u₁}, q_{u₂}, …]
64-66 Proof idea
Sample from M_X using O(log(1/ε) + D(M_X ‖ M_Y)) communication, with statistical error ε.
[Figures, slides 64-66: the shared darts are drawn against both histograms M_X and M_Y. Alice selects u₄, the first dart under the histogram of M_X, and sends hashes h₁(u₄), h₂(u₄), … of it; Bob checks the hashes against his candidate darts under M_Y, then under the doubled histogram 2·M_Y, and so on, until a unique candidate consistent with all of Alice's hashes (up to h_{log 1/ε}) remains.]
67 Analysis
If M_X(u₄) ≈ 2^k · M_Y(u₄), then the protocol will reach round k of doubling. There will be ≈ 2^k candidates, so about k + log(1/ε) hashes are needed to narrow down to one. The contribution of u₄ to the cost is
M_X(u₄) · (log(M_X(u₄)/M_Y(u₄)) + log(1/ε)),
and D(M_X ‖ M_Y) = Σ_u M_X(u) · log(M_X(u)/M_Y(u)). Done!
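The claim underpinning this scheme, that the first dart under the histogram is an exact sample, can be checked empirically. The sketch below demonstrates only that sampling step (the hashing that lets Bob locate the same dart is omitted), on a toy three-element universe of my choosing.

```python
import random

def first_dart_under(dist, darts):
    """The u-coordinate of the first shared dart (u, height) lying under dist's
    histogram is an exact sample from dist."""
    for u, height in darts:
        if height < dist[u]:
            return u
    return None  # vanishingly unlikely given enough darts

rng = random.Random(1)
universe = ["a", "b", "c"]
M_X = {"a": 0.5, "b": 0.3, "c": 0.2}

trials = 20000
counts = {u: 0 for u in universe}
for _ in range(trials):
    darts = [(rng.choice(universe), rng.random()) for _ in range(200)]
    counts[first_dart_under(M_X, darts)] += 1
print({u: round(counts[u] / trials, 3) for u in universe})  # close to M_X
```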
68 External information cost
(X,Y) ∼ μ; an external observer Charlie watches the transcript Π of the protocol π.
IC^ext(π,μ) ≔ I(Π;XY) = what Charlie learns about (X,Y).
69 Example
F is "X = Y?". μ is a distribution where w.p. ½ X = Y and w.p. ½ (X,Y) are random. Protocol: MD5(X); "X = Y?".
IC^ext(π,μ) = I(Π;XY) = 129 bits = what Charlie learns about (X,Y).
70 External information cost
It is always the case that IC^ext(π,μ) ≥ IC(π,μ).
If μ = μ_X × μ_Y is a product distribution, then IC^ext(π,μ) = IC(π,μ).
71 External information complexity
IC^ext(F,μ,ε) ≔ inf_{π computes F with error ε} IC^ext(π,μ).
Can it be operationalized?
72 Operational meaning of IC^ext?
Conjecture: zero-error communication scales like external information:
lim_{n→∞} CC(Fⁿ,μⁿ,0)/n = IC^ext(F,μ,0)?
Recall: lim_{n→∞} CC(Fⁿ,μⁿ,0⁺)/n = IC(F,μ,0).
73 Example: transmission with a strong prior
X, Y ∈ {0,1}. μ is such that X ∼ U{0,1}, and X = Y with a very high probability (say 1 − 1/√n). F(X,Y) = X is just the "transmit X" function. Clearly, π should just have Alice send X to Bob.
IC(F,μ,0) = IC(π,μ) = H(1/√n) = o(1).
IC^ext(F,μ,0) = IC^ext(π,μ) = 1.
74 Example: transmission with a strong prior
IC(F,μ,0) = IC(π,μ) = H(1/√n) = o(1); IC^ext(F,μ,0) = IC^ext(π,μ) = 1.
CC(Fⁿ,μⁿ,0⁺) = o(n), while CC(Fⁿ,μⁿ,0) = Ω(n).
Other examples, e.g. the two-bit AND function, fit into this picture.
75 Additional directions
Information complexity; interactive coding; information theory in TCS.
76 Interactive coding theory
So far we have focused the discussion on noiseless coding. What if the channel has noise? [What kind of noise?] In the non-interactive case, each channel has a capacity C.
77 Channel capacity
The amortized number of channel uses needed to send X over a noisy channel of capacity C is H(X)/C. This decouples the task from the channel!
78 Example: the binary symmetric channel
Each bit gets independently flipped with probability ε < 1/2.
[Figure: 0 → 0 and 1 → 1 each with probability 1 − ε; 0 → 1 and 1 → 0 each with probability ε.]
One-way capacity: 1 − H(ε).
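The capacity formula 1 − H(ε) is a one-liner to evaluate; a small sketch:

```python
from math import log2

def binary_entropy(eps):
    """H(eps) in bits."""
    if eps in (0.0, 1.0):
        return 0.0
    return -(eps * log2(eps) + (1 - eps) * log2(1 - eps))

def bsc_capacity(eps):
    """One-way capacity 1 - H(eps) of the binary symmetric channel."""
    return 1 - binary_entropy(eps)

print(bsc_capacity(0.0))             # 1.0: a noiseless channel
print(bsc_capacity(0.5))             # 0.0: a fully random channel carries nothing
print(round(bsc_capacity(0.11), 2))  # ~0.5: roughly two channel uses per bit
```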
79 Interactive channel capacity
It is not clear that one can decouple the channel from the task in such a clean way; interactive capacity is much harder to calculate/reason about.
Example: the binary symmetric channel with flip probability ε. One-way capacity: 1 − H(ε). Interactive capacity (for simple pointer jumping, [Kol-Raz 13]): 1 − Θ(√(H(ε))).
80 Information theory in communication complexity and beyond
A natural extension would be to multi-party communication complexity. There has been some success in the number-in-hand case. What about the number-on-forehead model? Explicit bounds for log n players would imply explicit ACC⁰ circuit lower bounds.
81 Naïve multi-party information cost
[Figure: three players on-forehead: A sees (Y,Z), B sees (X,Z), C sees (X,Y).]
IC(π,μ) ≔ I(Π;X|YZ) + I(Π;Y|XZ) + I(Π;Z|XY).
82 Naïve multi-party information cost
IC(π,μ) = I(Π;X|YZ) + I(Π;Y|XZ) + I(Π;Z|XY) doesn't seem to work: secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost. (Although these constructions require the players to share private channels/randomness.)
83 Communication and beyond
The rest of today: data structures; streaming; distributed computing; privacy. Exact communication complexity bounds. Extended formulations lower bounds. Parallel repetition?
84 Thank You!
85 Open problem: computability of IC
Given the truth table of F(X,Y), the prior μ, and ε, compute IC(F,μ,ε).
Via IC(F,μ,ε) = lim_{n→∞} CC(Fⁿ,μⁿ,ε)/n one can compute a sequence of upper bounds, but the rate of convergence as a function of n is unknown.
86 Open problem: computability of IC
One can compute IC_r(F,μ,ε), the r-round information complexity of F, but the rate of convergence as a function of r is unknown.
Conjecture: IC_r(F,μ,ε) − IC(F,μ,ε) = O_{F,μ,ε}(1/r²).
This is the known relationship for the two-bit AND function.
More informationMath 4310 Handout - Quotient Vector Spaces
Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable
More informationDiscrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2
CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2 Proofs Intuitively, the concept of proof should already be familiar We all like to assert things, and few of us
More informationPrivacy and Security in the Internet of Things: Theory and Practice. Bob Baxley; bob@bastille.io HitB; 28 May 2015
Privacy and Security in the Internet of Things: Theory and Practice Bob Baxley; bob@bastille.io HitB; 28 May 2015 Internet of Things (IoT) THE PROBLEM By 2020 50 BILLION DEVICES NO SECURITY! OSI Stack
More informationBayesian logistic betting strategy against probability forecasting. Akimichi Takemura, Univ. Tokyo. November 12, 2012
Bayesian logistic betting strategy against probability forecasting Akimichi Takemura, Univ. Tokyo (joint with Masayuki Kumon, Jing Li and Kei Takeuchi) November 12, 2012 arxiv:1204.3496. To appear in Stochastic
More informationMetric Spaces. Chapter 1
Chapter 1 Metric Spaces Many of the arguments you have seen in several variable calculus are almost identical to the corresponding arguments in one variable calculus, especially arguments concerning convergence
More information1 Inner Products and Norms on Real Vector Spaces
Math 373: Principles Techniques of Applied Mathematics Spring 29 The 2 Inner Product 1 Inner Products Norms on Real Vector Spaces Recall that an inner product on a real vector space V is a function from
More informationSolutions to Homework 10
Solutions to Homework 1 Section 7., exercise # 1 (b,d): (b) Compute the value of R f dv, where f(x, y) = y/x and R = [1, 3] [, 4]. Solution: Since f is continuous over R, f is integrable over R. Let x
More informationMIMO CHANNEL CAPACITY
MIMO CHANNEL CAPACITY Ochi Laboratory Nguyen Dang Khoa (D1) 1 Contents Introduction Review of information theory Fixed MIMO channel Fading MIMO channel Summary and Conclusions 2 1. Introduction The use
More information1. First-order Ordinary Differential Equations
Advanced Engineering Mathematics 1. First-order ODEs 1 1. First-order Ordinary Differential Equations 1.1 Basic concept and ideas 1.2 Geometrical meaning of direction fields 1.3 Separable differential
More informationFACTORING LARGE NUMBERS, A GREAT WAY TO SPEND A BIRTHDAY
FACTORING LARGE NUMBERS, A GREAT WAY TO SPEND A BIRTHDAY LINDSEY R. BOSKO I would like to acknowledge the assistance of Dr. Michael Singer. His guidance and feedback were instrumental in completing this
More informationST 371 (IV): Discrete Random Variables
ST 371 (IV): Discrete Random Variables 1 Random Variables A random variable (rv) is a function that is defined on the sample space of the experiment and that assigns a numerical variable to each possible
More informationCOMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012
Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about
More information11 Multivariate Polynomials
CS 487: Intro. to Symbolic Computation Winter 2009: M. Giesbrecht Script 11 Page 1 (These lecture notes were prepared and presented by Dan Roche.) 11 Multivariate Polynomials References: MC: Section 16.6
More informationPUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.
PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include
More informationNo: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics
No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationChapter 1 Introduction
Chapter 1 Introduction 1. Shannon s Information Theory 2. Source Coding theorem 3. Channel Coding Theory 4. Information Capacity Theorem 5. Introduction to Error Control Coding Appendix A : Historical
More information1 Formulating The Low Degree Testing Problem
6.895 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 5: Linearity Testing Lecturer: Dana Moshkovitz Scribe: Gregory Minton and Dana Moshkovitz In the last lecture, we proved a weak PCP Theorem,
More informationQuotient Rings and Field Extensions
Chapter 5 Quotient Rings and Field Extensions In this chapter we describe a method for producing field extension of a given field. If F is a field, then a field extension is a field K that contains F.
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationUnified Lecture # 4 Vectors
Fall 2005 Unified Lecture # 4 Vectors These notes were written by J. Peraire as a review of vectors for Dynamics 16.07. They have been adapted for Unified Engineering by R. Radovitzky. References [1] Feynmann,
More informationEntropy and Mutual Information
ENCYCLOPEDIA OF COGNITIVE SCIENCE 2000 Macmillan Reference Ltd Information Theory information, entropy, communication, coding, bit, learning Ghahramani, Zoubin Zoubin Ghahramani University College London
More informationExact Nonparametric Tests for Comparing Means - A Personal Summary
Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola
More informationCollinear Points in Permutations
Collinear Points in Permutations Joshua N. Cooper Courant Institute of Mathematics New York University, New York, NY József Solymosi Department of Mathematics University of British Columbia, Vancouver,
More informationTHE NUMBER OF REPRESENTATIONS OF n OF THE FORM n = x 2 2 y, x > 0, y 0
THE NUMBER OF REPRESENTATIONS OF n OF THE FORM n = x 2 2 y, x > 0, y 0 RICHARD J. MATHAR Abstract. We count solutions to the Ramanujan-Nagell equation 2 y +n = x 2 for fixed positive n. The computational
More information9231 FURTHER MATHEMATICS
CAMBRIDGE INTERNATIONAL EXAMINATIONS GCE Advanced Subsidiary Level and GCE Advanced Level MARK SCHEME for the October/November 2012 series 9231 FURTHER MATHEMATICS 9231/21 Paper 2, maximum raw mark 100
More informationALMOST COMMON PRIORS 1. INTRODUCTION
ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type
More informationPrinciple of Data Reduction
Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then
More informationGames of Incomplete Information
Games of Incomplete Information Jonathan Levin February 00 Introduction We now start to explore models of incomplete information. Informally, a game of incomplete information is a game where the players
More information1 The Line vs Point Test
6.875 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 5: Low Degree Testing Lecturer: Dana Moshkovitz Scribe: Gregory Minton and Dana Moshkovitz Having seen a probabilistic verifier for linearity
More informationReview Horse Race Gambling and Side Information Dependent horse races and the entropy rate. Gambling. Besma Smida. ES250: Lecture 9.
Gambling Besma Smida ES250: Lecture 9 Fall 2008-09 B. Smida (ES250) Gambling Fall 2008-09 1 / 23 Today s outline Review of Huffman Code and Arithmetic Coding Horse Race Gambling and Side Information Dependent
More informationTriangle deletion. Ernie Croot. February 3, 2010
Triangle deletion Ernie Croot February 3, 2010 1 Introduction The purpose of this note is to give an intuitive outline of the triangle deletion theorem of Ruzsa and Szemerédi, which says that if G = (V,
More informationAn Introduction to Competitive Analysis for Online Optimization. Maurice Queyranne University of British Columbia, and IMA Visitor (Fall 2002)
An Introduction to Competitive Analysis for Online Optimization Maurice Queyranne University of British Columbia, and IMA Visitor (Fall 2002) IMA, December 11, 2002 Overview 1 Online Optimization sequences
More informationStrong Parallel Repetition Theorem for Free Projection Games
Strong Parallel Repetition Theorem for Free Projection Games Boaz Barak Anup Rao Ran Raz Ricky Rosen Ronen Shaltiel April 14, 2009 Abstract The parallel repetition theorem states that for any two provers
More information2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
More informationFixed Point Theorems
Fixed Point Theorems Definition: Let X be a set and let T : X X be a function that maps X into itself. (Such a function is often called an operator, a transformation, or a transform on X, and the notation
More information1 Review of Least Squares Solutions to Overdetermined Systems
cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares
More informationHOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)!
Math 7 Fall 205 HOMEWORK 5 SOLUTIONS Problem. 2008 B2 Let F 0 x = ln x. For n 0 and x > 0, let F n+ x = 0 F ntdt. Evaluate n!f n lim n ln n. By directly computing F n x for small n s, we obtain the following
More informationMaximum Likelihood Estimation
Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for
More informationDifferentiation and Integration
This material is a supplement to Appendix G of Stewart. You should read the appendix, except the last section on complex exponentials, before this material. Differentiation and Integration Suppose we have
More informationChapter 3. Cartesian Products and Relations. 3.1 Cartesian Products
Chapter 3 Cartesian Products and Relations The material in this chapter is the first real encounter with abstraction. Relations are very general thing they are a special type of subset. After introducing
More information1 if 1 x 0 1 if 0 x 1
Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or
More informationLecture 4: BK inequality 27th August and 6th September, 2007
CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound
More information