Incorporating Lexical and Prosodic Information at Different Levels for Meeting Summarization
1 Incorporating Lexical and Prosodic Information at Different Levels for Meeting Summarization Catherine Lai and Steve Renals Centre for Speech Technology Research University of Edinburgh 17 September 2014
2 Automatic Extractive Summarization The Extractive Summarization task is to select Dialogue Acts (DAs) to form a summary. Potentially useful for browsing/analysing dialogues Possible using only prosodic features [Maskey and Hirschberg, 2005; Murray, 2008; Xie et al., 2009; Jauhar et al., 2013]. Whether prosodic features perform better than lexical features depends on the types of features used and the evaluation metric [Murray, 2008; Xie et al., 2009].
3 Automatic Extractive Summarization Questions: Where should we incorporate prosodic information in extractive summarization? How do different aspects of prosody relate to what goes into meeting summaries? Current Work: Can prosodic features be used to augment lexical features for meeting summarization? How do these summaries differ from those based on utterance level prosody?
4 Prosodic Features in Extractive Summarization Direct modelling over dialogue acts: Aggregate stats for F0 and energy, with varying normalization. Treated as independent of lexical content. May compensate for ASR errors. Emphatic speech is important: e.g. higher mean and maximum F0 and energy [Murray, 2008]. Duration based features can make a big difference depending on the evaluation [Murray, 2008; Penn and Zhu, 2008; Xie et al., 2009; Riedhammer et al., 2010].
5 Prosodic lexical features? Prosody also marks specific words as important in information structure terms [Silipo and Crestani, 2000; Calhoun, 2012]. Combining word prosody and tf.idf scores has been shown to help word level tasks, e.g.: keyword extraction from voicemail [Koumpis and Renals, 2005], punctuation annotation [Christensen et al., 2001], topic tracking in broadcast news [Guinaudeau and Hirschberg, 2011]. Hypothesis: integrating prosodic information at the word level will improve extractive summarization performance over plain lexical features like tf.idf and DA level prosody.
6 Plan: Augmented Lexical Features Combine term frequency and prosodic features using an MLP to predict whether a word belongs to an Extracted Dialogue Act (EDA). Feed word level probabilities into the higher level DA extraction task as augmented lexical features. Look at results using precision/recall based measures: AUROC [Murray and Renals, 2007], ROUGE [Lin, 2004]. Look at variation in distribution of extracted DAs in the meeting timeline and summary redundancy.
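The two-level plan above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the feature values, network size, and the word-to-DA mapping are all invented placeholders. A word level MLP produces P(word in EDA), and these probabilities are summed per dialogue act to give an augmented lexical feature:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic word-level features standing in for [tf.idf, word F0 stats, word energy stats].
X_words = rng.normal(size=(200, 3))
y_words = (X_words[:, 0] + 0.5 * X_words[:, 1] > 0).astype(int)  # toy EDA labels

# The MLP learns how to combine term-frequency and prosodic evidence per word.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(X_words, y_words)
p_word = mlp.predict_proba(X_words)[:, 1]  # P(word belongs to an EDA)

# Aggregate word probabilities into a DA-level augmented lexical feature,
# mirroring the sum-over-words construction of the plain DA lexical features.
da_ids = rng.integers(0, 20, size=200)  # hypothetical word -> DA assignment
da_feature = np.bincount(da_ids, weights=p_word, minlength=20)
```

The DA-level classifier then consumes `da_feature` alongside (or instead of) the plain summed tf.idf scores.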
7 Data 140 AMI scenario meetings [Carletta, 2007] 4 speakers, 4 remote control design stages per group. Standard development and test sets, 20 meetings/5 groups each. 75 ICSI research meetings [Janin et al., 2003] 3-9 speakers, 8 topics, e.g. robustness test set as in Murray [2008], dev=6 randomly selected meetings.
9 Gold Standard Extractive Summaries Extracted Dialogue Acts (EDAs), drawn from the manual transcripts and DA annotation. Aimed at someone who is concerned about the state of the project, like a department head. No absolute limit on dialogue act selection (10% guideline). Use only DAs linked to human abstractive summary content.
10 Prosodic Features F0 and intensity extraction: via Praat at 10ms intervals; parameter settings automatically determined per spurt [Evanini and Lai, 2010]. Missing values filled via linear interpolation after octave jump removal. Speaker normalization over conversations: F0 in semitones relative to speaker mean F0 (Hz); intensity: subtract speaker mean. Downdrift correction: for words, subtract values predicted by linear regression over spurts before calculating statistics. Statistics: mean, sd, max, min over words and DAs; include slope for DAs.
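The speaker normalization and downdrift correction steps can be sketched as follows; this is a minimal illustration, and the Praat-based extraction and spurt segmentation themselves are not reproduced here:

```python
import numpy as np

def f0_to_semitones(f0_hz, speaker_mean_hz):
    """Express F0 in semitones relative to the speaker's mean F0 in Hz."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / speaker_mean_hz)

def remove_downdrift(t, f0_st):
    """Subtract a linear trend fit over a spurt, so word statistics are
    computed on the residual rather than on the declining baseline."""
    slope, intercept = np.polyfit(t, f0_st, 1)
    return np.asarray(f0_st) - (slope * np.asarray(t) + intercept)

# One octave above the speaker mean is +12 semitones, one below is -12.
st = f0_to_semitones([100.0, 200.0, 400.0], speaker_mean_hz=200.0)

# A perfectly linear downdrift leaves a (near-)zero residual.
resid = remove_downdrift([0.0, 1.0, 2.0, 3.0], [4.0, 3.0, 2.0, 1.0])
```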
11 Term-Frequency Based Lexical Features After applying the Porter Stemmer, calculate: tf.idf and su.idf [Murray and Renals, 2007] Inverse document frequency was calculated over combined AMI, ICSI, TDT-2 corpora. Pointwise Mutual Information (PMI) of words with EDA/non-EDA status on training set [Galley, 2006] DA features: sum individual features for words in the DA [Murray and Renals, 2007; Xie, 2010]
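A toy version of the tf.idf computation and the sum-over-words DA feature follows; the background document counts here are invented placeholders, not the actual AMI+ICSI+TDT-2 statistics:

```python
import math
from collections import Counter

def tfidf(term_counts, doc_freq, n_docs):
    """Per-word tf.idf for one meeting; idf comes from a background collection."""
    return {w: c * math.log(n_docs / doc_freq.get(w, 1)) for w, c in term_counts.items()}

# Invented background document frequencies (stand-ins for the combined corpora).
doc_freq = {"remote": 3, "battery": 5, "the": 100}
scores = tfidf(Counter("remote battery the the".split()), doc_freq, n_docs=100)

# DA-level lexical feature: sum the word scores over the dialogue act.
da_score = sum(scores[w] for w in ["remote", "battery"])
```

A word occurring in every background document ("the") gets zero weight, while rarer content words ("remote") dominate the DA score.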
12 Word Level Prediction Classify whether a word is in an EDA or not, using an MLP to learn feature combination weights. Table: Development set AUROC for word level EDA detection (ICSI and AMI). Rows: tf.idf, su.idf, pmi (logistic regression); tf.idf.pros, su.idf.pros, pmi.pros, pros, tsp, tsp.pros (MLP). [AUROC values lost in transcription.]
15 EDA Detection and Evaluation Classification of dialogue acts as EDAs. Multilevel logistic regression: annotators, meeting types, and corpora are group level effects [Gelman and Hill, 2007], using lme4 in R. AUROC over gold standard annotations. ROUGE-1 F-scores [Riedhammer et al., 2010] with DUC standard parameters [Xie and Liu, 2010], 15% word compression rate. Use additional annotations: 3-5 for ICSI, 2-3 for AMI. Focus on tf.idf.
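AUROC over the gold annotations can be computed directly from DA-level scores as the fraction of (EDA, non-EDA) pairs that the model ranks correctly. A minimal stdlib sketch with invented labels and scores:

```python
def auroc(y_true, y_score):
    """Pairwise AUROC: fraction of positive/negative pairs ranked correctly,
    counting ties as half-correct."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical gold EDA labels and model scores for eight dialogue acts.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.5, 0.6, 0.1, 0.8, 0.7, 0.3]
score = auroc(y_true, y_score)  # 13 of the 15 positive/negative pairs rank correctly
```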
16 DA Level Prediction: AUROC Adding DA prosody improves on bare lexical features. Augmented lexical features outperform DA level combinations.
17 Lexical Features vs Length Features: AUROC Features compared (AMI and ICSI): word-tf.idf.pros, word-tsp, word-tsp.pros, word-tsp+da-pros, DA-PMI, DA-dur, DA-nwords, Murray [2008] full. [AUROC values lost in transcription.] Augmented lexical features do (a bit) better than length features and the full feature set in Murray [2008]. DA-PMI is not so predictive.
18 Lexical Features as Weights Summing lexical features over a DA treats them as weights on how noteworthy each word is for summarization. Augmented lexical features do a lot better than tf.idf and somewhat better than uniform weighting (DA-nwords). DA length is usually reported as most predictive for AUROC [Penn and Zhu, 2008; Murray, 2008] but not for n-gram based ROUGE [Xie et al., 2009; Riedhammer et al., 2010]. ICSI ROUGE-1 results show the same pattern as AUROC, but...
19 ROUGE-1: AMI Best performance from DA-tsp+DA-pros (0.595), but differences are within bootstrap confidence intervals. DA-nwords=0.582
20 Summary Differences In what other ways do the resulting summaries differ? Redundancy, as in unsupervised approaches? [Zechner, 2002; Riedhammer et al., 2010]. Hold out each DA, measure its cosine distance to the rest of the summary, and sum the distances [Zechner, 2002]. Location of noteworthy parts of a dialogue? Look at the proportion of summed EDA time in the summaries relative to that of the gold standard, calculated over meeting quarters ('hot spots').
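The hold-one-out redundancy measure can be sketched as follows. The bag-of-words vectorization of each DA is an illustrative assumption here; Zechner [2002] describes the original formulation:

```python
import numpy as np

def holdout_distance_sum(da_vectors):
    """Hold out each DA vector, take its cosine similarity to the sum of the
    remaining summary DAs, and accumulate the cosine distances (1 - cos).
    A lower total indicates a more redundant summary."""
    V = np.asarray(da_vectors, dtype=float)
    total = 0.0
    for i in range(len(V)):
        rest = np.delete(V, i, axis=0).sum(axis=0)
        cos = V[i] @ rest / (np.linalg.norm(V[i]) * np.linalg.norm(rest))
        total += 1.0 - cos
    return total

# Three identical DAs are maximally redundant: every held-out distance is zero.
redundant = holdout_distance_sum([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

# More diverse DAs produce a larger summed distance.
diverse = holdout_distance_sum([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```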
21 Redundancy Figure: Summed redundancy: w-lex.pros summaries were significantly less redundant than those based on bare lexical features ± DA level prosody (Wilcoxon p < 0.01, Holm corrected).
22 Distribution of EDAs Figure: Ratio of EDA time to gold standard EDA time in meeting quarters: The highest average proportion arises from DA level prosody models, but differences are not significant.
23 Conclusions Incorporating multiple sources of prosodic and term-frequency information at the word level provides better performance than using DA level features in AUROC terms. Summaries derived from prosodically augmented lexical features exhibited less redundancy. While DA prosody generally performs worse, it could provide information for temporally locating larger regions of interest. Understanding how to weight DA level prosody features requires extrinsic, user based testing of how summaries are used in different tasks, e.g. browsing vs. audits.
24 Thanks! Questions?
25 Thanks! This work was supported by the European Union under the FP7 project inEvent (grant agreement ).
26 References
Calhoun, S. (2012). The theme/rheme distinction: Accent type or relative prominence? Journal of Phonetics, 40(2).
Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation, 41(2).
Christensen, H., Gotoh, Y., and Renals, S. (2001). Punctuation annotation using statistical prosody models. In Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding.
Evanini, K. and Lai, C. (2010). The importance of optimal parameter setting for pitch extraction. Journal of the Acoustical Society of America, 128(4):2291.
Galley, M. (2006). A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of EMNLP 06.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Guinaudeau, C. and Hirschberg, J. (2011). Accounting for prosodic information to improve ASR-based topic tracking for TV Broadcast News. In Interspeech 2011.
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., et al. (2003). The ICSI meeting corpus. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 03), volume 1.
Jauhar, S., Chen, Y., and Metze, F. (2013). Prosody-based unsupervised speech summarization with two-layer mutually reinforced random walk. In IJCNLP.
Koumpis, K. and Renals, S. (2005). Automatic summarization of voicemail messages using lexical and prosodic features. ACM Transactions on Speech and Language Processing, 2(1):1-24.
Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out.
Liu, F. and Liu, Y. (2013). Towards abstractive speech summarization: Exploring unsupervised and supervised approaches for spoken utterance compression. IEEE Transactions on Audio, Speech, and Language Processing, 21(7).
Maskey, S. and Hirschberg, J. (2005). Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Interspeech 2005.
Murray, G. (2008). Using Speech-Specific Characteristics for Automatic Speech Summarization. PhD thesis, University of Edinburgh.
Murray, G. and Renals, S. (2007). Term-weighting for summarization of multi-party spoken dialogues. In Machine Learning for Multimodal Interaction IV, volume 4892.
27 Penn, G. and Zhu, X. (2008). A critical reassessment of evaluation baselines for speech summarization. In ACL 2008.
Riedhammer, K., Favre, B., and Hakkani-Tür, D. (2010). Long story short: Global unsupervised models for keyphrase based meeting summarization. Speech Communication, 52(10).
Silipo, R. and Crestani, F. (2000). Prosodic stress and topic detection in spoken sentences. In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000). IEEE Computer Society.
Xie, S. (2010). Automatic Extractive Summarization on Meeting Corpus. PhD thesis, University of Texas at Dallas.
Xie, S., Hakkani-Tür, D., Favre, B., and Liu, Y. (2009). Integrating prosodic features in extractive meeting summarization. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE.
Xie, S. and Liu, Y. (2010). Improving supervised learning for meeting summarization using sampling and regression. Computer Speech & Language, 24(3).
Zechner, K. (2002). Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics, 28(4).
28 DA Level Prediction: AUROC Figure: AUROC for the ICSI test sets Same pattern as for AMI, but DA prosody improves on the augmented lexical features here.
29 ROUGE-1: ICSI ROUGE-1 scores mirror AUROC results. DA-nwords=0.663
30 Word and DA Prosody Features compared (AMI and ICSI): word-pros, DA-pros, DA-pros vs. Murray [2008] prosody. [Values lost in transcription.] DA prosody improves a lot with better feature extraction (cf. Murray [2008]). EDAs actually have significantly lower mean pitch than non-EDAs on average, but an expanded pitch range. Including prosodic delta features over ±4 DAs did not produce much of a change in performance.
33 Future work How do summarizer DA rankings affect user efficiency and satisfaction in browsing tasks? How do differences in ICSI and AMI meeting structure affect intrinsic measures? How do compression techniques [Liu and Liu, 2013] change ROUGE scores? Integrate prosodic features in unsupervised summarization methods that more closely fit ROUGE's objectives [Riedhammer et al., 2010]. Look at keyword identification as an objective for the generation of augmented lexical features (cf. Koumpis and Renals [2005]).
More informationMUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
More informationLINGUISTIC DISSECTION OF SWITCHBOARD-CORPUS AUTOMATIC SPEECH RECOGNITION SYSTEMS
LINGUISTIC DISSECTION OF SWITCHBOARD-CORPUS AUTOMATIC SPEECH RECOGNITION SYSTEMS Steven Greenberg and Shuangyu Chang International Computer Science Institute 1947 Center Street, Berkeley, CA 94704, USA
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationSummarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help?
Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help? Sumit Bhatia 1, Prakhar Biyani 2 and Prasenjit Mitra 2 1 IBM Almaden Research Centre, 650 Harry Road, San Jose, CA 95123,
More informationData Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationPronunciation in English
The Electronic Journal for English as a Second Language Pronunciation in English March 2013 Volume 16, Number 4 Title Level Publisher Type of product Minimum Hardware Requirements Software Requirements
More informationUnlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics
Unlocking Value from Patanjali V, Lead Data Scientist, Anand B, Director Analytics Consulting, EXECUTIVE SUMMARY Today a lot of unstructured data is being generated in the form of text, images, videos
More informationDublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
More informationStatistics Review PSY379
Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses
More informationHow to Improve the Sound Quality of Your Microphone
An Extension to the Sammon Mapping for the Robust Visualization of Speaker Dependencies Andreas Maier, Julian Exner, Stefan Steidl, Anton Batliner, Tino Haderlein, and Elmar Nöth Universität Erlangen-Nürnberg,
More informationSeparation and Classification of Harmonic Sounds for Singing Voice Detection
Separation and Classification of Harmonic Sounds for Singing Voice Detection Martín Rocamora and Alvaro Pardo Institute of Electrical Engineering - School of Engineering Universidad de la República, Uruguay
More informationThe effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
Forensic Science International 146S (2004) S95 S99 www.elsevier.com/locate/forsciint The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications A.
More informationThings to remember when transcribing speech
Notes and discussion Things to remember when transcribing speech David Crystal University of Reading Until the day comes when this journal is available in an audio or video format, we shall have to rely
More informationPREDICTING MARKET VOLATILITY FEDERAL RESERVE BOARD MEETING MINUTES FROM
PREDICTING MARKET VOLATILITY FROM FEDERAL RESERVE BOARD MEETING MINUTES Reza Bosagh Zadeh and Andreas Zollmann Lab Advisers: Noah Smith and Bryan Routledge GOALS Make Money! Not really. Find interesting
More informationEfficient diphone database creation for MBROLA, a multilingual speech synthesiser
Efficient diphone database creation for, a multilingual speech synthesiser Institute of Linguistics Adam Mickiewicz University Poznań OWD 2010 Wisła-Kopydło, Poland Why? useful for testing speech models
More informationOPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
More informationTrigonometric functions and sound
Trigonometric functions and sound The sounds we hear are caused by vibrations that send pressure waves through the air. Our ears respond to these pressure waves and signal the brain about their amplitude
More informationSpeech Transcription
TC-STAR Final Review Meeting Luxembourg, 29 May 2007 Speech Transcription Jean-Luc Gauvain LIMSI TC-STAR Final Review Luxembourg, 29-31 May 2007 1 What Is Speech Recognition? Def: Automatic conversion
More informationInteraction Mining: the new frontier of Call Center Analytics
Interaction Mining: the new frontier of Call Center Analytics Vincenzo Pallotta 1, Rodolfo Delmonte 1,2, Lammert Vrieling 1, David Walker 1 Interanalytics Rue des Savoises, 19 1205 Geneva, Switzerland
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationIFS-8000 V2.0 INFORMATION FUSION SYSTEM
IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 Overview IFS-8000 v2.0 is a flexible, scalable and modular IT system to support the processes of aggregation of information from intercepts to intelligence
More informationPerceptual experiments sir-skur-spur-stir
Perceptual experiments sir-skur-spur-stir Amy Beeston & Guy Brown 19 May 21 1 Introduction 2 Experiment 1: cutoff Set up Results 3 Experiment 2: reverse Set up Results 4 Discussion Introduction introduction
More informationCALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
More informationCorpus Design for a Unit Selection Database
Corpus Design for a Unit Selection Database Norbert Braunschweiler Institute for Natural Language Processing (IMS) Stuttgart 8 th 9 th October 2002 BITS Workshop, München Norbert Braunschweiler Corpus
More informationBLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION
BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium Peter.Vanroose@esat.kuleuven.ac.be
More information1. Introduction to Spoken Dialogue Systems
SoSe 2006 Projekt Sprachdialogsysteme 1. Introduction to Spoken Dialogue Systems Walther v. Hahn, Cristina Vertan {vhahn,vertan}@informatik.uni-hamburg.de Content What are Spoken dialogue systems? Types
More informationLIUM s Statistical Machine Translation System for IWSLT 2010
LIUM s Statistical Machine Translation System for IWSLT 2010 Anthony Rousseau, Loïc Barrault, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans,
More informationE-discovery Taking Predictive Coding Out of the Black Box
E-discovery Taking Predictive Coding Out of the Black Box Joseph H. Looby Senior Managing Director FTI TECHNOLOGY IN CASES OF COMMERCIAL LITIGATION, the process of discovery can place a huge burden on
More informationIntelligent Agents Serving Based On The Society Information
Intelligent Agents Serving Based On The Society Information Sanem SARIEL Istanbul Technical University, Computer Engineering Department, Istanbul, TURKEY sariel@cs.itu.edu.tr B. Tevfik AKGUN Yildiz Technical
More information