Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson
Talk held by Goran Doychev
Selected Topics in Information Security and Cryptography Seminar
1 / 30
Overview
1. How does VoIP work?
2. Recognizing previously seen phrases
3. Recognizing phrases without example utterances
4. Evaluation
2 / 30
How does VoIP work?
Control channel: SIP, XMPP, Skype (negotiates IP ports, supported codecs, etc.)
Voice data: RTP over UDP
Speech codec: GSM, G.728, iSAC, Speex
4 / 30
Operation of a Codec
The audio stream is sampled at 8000 or 16000 samples per second (Hz).
The n most recent samples are compressed into one packet (usually 20 ms of audio).
Example:
16 kHz audio source: n = 320 samples per packet
8 kHz audio source: n = 160 samples per packet
5 / 30
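The sample-rate arithmetic above can be sketched directly (a minimal illustration; the function name is mine, not from any codec API):

```python
def samples_per_packet(sample_rate_hz: int, frame_ms: int = 20) -> int:
    # n = samples per second * packet duration in seconds
    return sample_rate_hz * frame_ms // 1000

# The two cases from the slide:
print(samples_per_packet(16000))  # -> 320
print(samples_per_packet(8000))   # -> 160
```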
Operation of a Codec (2)
Brute-force search over the entries in a codebook of audio vectors: find the one that most closely reproduces the audio packet.
Example: the digital representation 01001110 of an audio packet is looked up in the codebook

In       | Out
01001010 | 0110
01001110 | 0111
01011001 | 1000
01011010 | 1001
01011110 | 1010

and the matching index 0111 is output.
6 / 30
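The brute-force codebook search can be sketched as follows (a toy illustration with made-up vectors; real codecs search trained codebooks of audio vectors):

```python
def quantize(frame, codebook):
    """Return the index of the codebook entry that most closely
    reproduces the audio frame (minimum squared error)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

# Toy codebook of 3-sample audio vectors; only the index is transmitted.
codebook = [
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
    [1.0, 1.0, 1.0],
]
print(quantize([0.4, 0.6, 0.5], codebook))  # -> 1
```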
Operation of a Codec (3)
Quality of sound depends on the number of entries in the codebook.
Classification of coders according to bit-rate:

Category          | Bit-rate range
High bit-rate     | > 15 kbps
Medium bit-rate   | 5 to 15 kbps
Low bit-rate      | 2 to 5 kbps
Very low bit-rate | < 2 kbps
7 / 30
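The table maps directly onto a small function (a sketch; the name and the handling of the exact 5 and 15 kbps boundaries are my own reading of the table):

```python
def coder_category(bit_rate_kbps: float) -> str:
    """Classify a speech coder by its bit rate, following the table above."""
    if bit_rate_kbps > 15:
        return "high"
    if bit_rate_kbps >= 5:
        return "medium"
    if bit_rate_kbps >= 2:
        return "low"
    return "very low"

print(coder_category(8))  # -> medium
```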
Variable Bit Rate
Variable bit rate (VBR): the codec adaptively chooses a bit rate for each packet, balancing audio quality against bandwidth.
In a two-way conversation, each speaker is silent about 63% of the time.
8 / 30
Variable Bit Rate (2)
LEAKAGE: the chosen bit rate depends on the encoded data.
E.g., Speex encodes vowel sounds (aa, aw) at a higher bit rate than fricative sounds (f, s).
9 / 30
Problem Description
Given:
- utterances of n phrases (phrase 1, phrase 2, phrase 3, ...)
- the packet sizes of one of the phrases, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k)
Goal: recognize which phrase produced the observed packet sizes, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k) -> "the phrase"
11 / 30
Profile Hidden Markov Model (HMM)
Match states: expected distribution of packet sizes at each position in the sequence.
Insert states: emit packets according to some (uniform) distribution; allow insertion of additional packets.
Delete states: silent states; allow omitting packets.
12 / 30
Building a Profile HMM
Initially:
- set the Match state emission probabilities to the uniform distribution
- set the transition probabilities so that Match is the most likely transition
Train the HMM using example utterances:
- Apply the Baum-Welch algorithm: iteratively improves the probability of the training sequences.
- Baum-Welch only finds a locally optimal set of parameters, so apply simulated annealing.
- Apply Viterbi training to further refine the parameters.
13 / 30
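The initial model can be sketched as a chain of Match/Insert/Delete triples with uniform Match emissions (a simplified skeleton; the state and field names are mine, and real training would then re-estimate these values with Baum-Welch):

```python
def make_profile_hmm(length, packet_sizes):
    """One Match/Insert/Delete triple per position, Match emissions uniform."""
    uniform = {s: 1.0 / len(packet_sizes) for s in packet_sizes}
    return [
        {
            "match":  {"emit": dict(uniform)},  # per-position size distribution
            "insert": {"emit": dict(uniform)},  # extra packets, uniform emission
            "delete": {"emit": None},           # silent state: skips a position
        }
        for _ in range(length)
    ]

hmm = make_profile_hmm(3, packet_sizes=[3, 5, 7, 8])
print(len(hmm))  # -> 3 (one state triple per position in the phrase)
```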
Searching for a Phrase
Changes to the model:
Random state: emits packets according to a uniform distribution; matches packets that are not part of the phrase of interest.
Profile Start/End states: match the start/end of the phrase; from Profile Start, the transition to the first Match state is the most likely.
15 / 30
Searching for a Phrase (2)
Apply the Viterbi algorithm: find the most likely sequence of states that explains the observed packet sizes.
A "hit": a subsequence of states that belongs to the profile part of the model.
Evaluate a hit's goodness: for the packet lengths l_i, ..., l_j of the phrase of interest,

  score_{i,j} = log( Pr[l_i, ..., l_j | Profile] / Pr[l_i, ..., l_j | Random] )

Discard hits whose score falls below a threshold.
16 / 30
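The log-odds score can be sketched as follows (a toy version that scores Match emissions only; a real profile HMM hit also includes transition, Insert, and Delete probabilities, and all names and numbers here are illustrative):

```python
import math

def log_odds_score(lengths, match_probs, random_prob):
    """score = log( Pr[lengths | Profile] / Pr[lengths | Random] )."""
    profile = sum(math.log(match_probs[k][l]) for k, l in enumerate(lengths))
    random_ = len(lengths) * math.log(random_prob)
    return profile - random_

# Per-position packet-size distributions for a 3-packet phrase:
match_probs = [{5: 0.8, 7: 0.2}, {7: 0.9, 5: 0.1}, {3: 0.7, 8: 0.3}]

good = log_odds_score([5, 7, 3], match_probs, random_prob=0.25)
bad = log_odds_score([7, 5, 8], match_probs, random_prob=0.25)
print(good > 0, bad < 0)  # -> True True (keep the first hit, discard the second)
```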
Phrase Models from Phonemes
Phonemes: sounds like b, ch, t, s, aa, aw (English has 40 to 60 phonemes).
Idea: words are built up from concatenated phonemes, so model phonemes instead.
Advantages:
- Flexibility
- Cheaper: no example utterances of the target phrase are needed
18 / 30
Problem Description
Given:
- recordings of all phonemes: aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.
- the packet sizes of a phrase, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k)
Goal: recognize the phrase, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k) -> "the phrase"
19 / 30
Phrase Models from Phonemes (2)
Straightforward method:
1. build HMMs for phonemes
2. concatenate them to build word HMMs
3. concatenate word HMMs into a phrase HMM

But pronunciation varies between dialects, e.g. for "the phrase":
American English: (5k,7k,1k,8k,12k,2k,1k) -> (dh,ah),(f,r,ey,z) -> ("the"),("phrase")
Scottish English: (5k,7k,1k,8k,10k,2k,1k) -> (dh,ah),(f,r,eh,z) -> ("the"),("frese"?)
20 / 30
Problem Description
Given:
- recordings of all phonemes: aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.
- the packet sizes of a phrase, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k)
- a phonetic pronunciation dictionary
Goal: recognize the phrase, e.g. (5k, 7k, 3k, 8k, 12k, 2k, 1k) -> "the phrase"
21 / 30
Phrase Models from Phonemes (3)
Advanced method:
- build an initial profile HMM for the phrase (as before)
- train it using a synthetic training set
- search for the phrase (as before)

Synthetic training set:
phrase: "the phrase"
split into words: "the", "phrase"
create list of phonemes: dh ah f r ey z
replace with packet sizes: 9k 20k 5k 8k 14k 3k

Improved model: use diphones and triphones instead of whole words.
22 / 30
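The synthetic-training-set construction can be sketched like this (the dictionary entries and packet sizes are made up for illustration; the real attack uses a phonetic pronunciation dictionary and packet sizes observed from phoneme recordings):

```python
# Hypothetical pronunciation dictionary and phoneme -> packet-size table.
PRON_DICT = {"the": ["dh", "ah"], "phrase": ["f", "r", "ey", "z"]}
PHONEME_SIZES = {"dh": 9, "ah": 20, "f": 5, "r": 8, "ey": 14, "z": 3}

def synthetic_utterance(phrase):
    """phrase -> words -> phonemes -> expected packet sizes."""
    return [
        PHONEME_SIZES[phoneme]
        for word in phrase.lower().split()
        for phoneme in PRON_DICT[word]
    ]

print(synthetic_utterance("the phrase"))  # -> [9, 20, 5, 8, 14, 3]
```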
Experimental Setup
Use the TIMIT continuous speech corpus; concatenate sentences into conversations.
Training of the HMMs with:
- the TIMIT pronunciation dictionary ("proper" American English)
- the PRONLEX pronunciation dictionary (more colloquial English)
24 / 30
Evaluation Metrics
Recall: probability that the algorithm finds the phrase.
Precision: probability that a reported match is correct.
25 / 30
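Concretely, both metrics reduce to ratios of match counts (the counts below are illustrative, chosen so the numbers resemble the overall results reported in the talk):

```python
def precision_recall(tp, fp, fn):
    """tp: correct matches reported; fp: wrong matches reported;
    fn: phrases present but not found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 100 phrases present, 102 matches reported, 51 correct.
p, r = precision_recall(tp=51, fp=51, fn=49)
print(round(p, 2), round(r, 2))  # -> 0.5 0.51
```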
Results of the Experiment
Overall: recall 51%, precision 50%.
Some phrases were found with high accuracy:
"Young children should avoid exposure to contagious diseases." (recall = 0.99, precision = 1)
High deviation of results across individual speakers.
26 / 30
Robustness to Noise
Using pink noise:
- energy logarithmically distributed across the range of human hearing
- harder for noise-removal algorithms to filter out

sound | noise | recall | precision
100%  |   -   |  .51   |   .50
 90%  |  10%  |  .39   |   .40
 75%  |  25%  |  .23   |   .22

Even in the presence of noise, an attacker can identify an alarming number of the phrases.
27 / 30
Mitigation Techniques
Padding packets to a coarser granularity:

granularity           | recall | precision | overhead
multiples of 128 bits |  0.15  |   0.16    |  8.81%
multiples of 256 bits |  0.04  |   0.04    |  16.5%

These tests used continuous speech; in practice conversations are about 63% idle time, so the relative overhead is even greater.
28 / 30
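Padding and its bandwidth cost can be sketched as follows (the packet sizes are invented; the 8.81% and 16.5% figures above come from the authors' experiments, not from this toy computation):

```python
def pad_to_multiple(packet_bits: int, block_bits: int) -> int:
    """Round a packet length up to the next multiple of block_bits."""
    return -(-packet_bits // block_bits) * block_bits  # ceiling division

def overhead(packets, block_bits):
    """Extra bandwidth as a fraction of the original traffic."""
    original = sum(packets)
    padded = sum(pad_to_multiple(p, block_bits) for p in packets)
    return (padded - original) / original

sizes = [200, 344, 168, 432]          # illustrative VBR packet sizes in bits
print(pad_to_multiple(200, 128))      # -> 256
print(round(overhead(sizes, 128), 3)) # -> 0.231
```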
References
[1] Charles V. Wright, Lucas Ballard, Scott E. Coull, Fabian Monrose, and Gerald M. Masson. Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP '08), pages 35-49, Washington, DC, USA, 2008. IEEE Computer Society.
[2] Charles V. Wright, Lucas Ballard, Fabian Monrose, and Gerald M. Masson. Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob? In Proceedings of the 16th USENIX Security Symposium (SS '07), pages 1-12, Berkeley, CA, USA, 2007. USENIX Association.
[3] Wai C. Chu. Speech Coding Algorithms: Foundation and Evolution of Standardized Coders. John Wiley & Sons, Inc., New York, NY, USA, 2003.
[4] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[5] S. R. Eddy. Profile hidden Markov models (review). Bioinformatics, 14(9):755-763, 1998.
30 / 30