Benefits of HPC for NLP besides big data

Size: px

Start display at page:

Download "Benefits of HPC for NLP besides big data"

Jemimah Sims
10 years ago
Views:

1 Benefits of HPC for NLP besides big data Barbara Plank Center for Sprogteknologie (CST) University of Copenhagen, Denmark Web-Scale Natural Language Processing in Northern Europe Oslo, Nov 24, 2014

2 Motivation labeled data all available data

3 The problem: Training data is biased the CROSS-DOMAIN GULF accuracy WSJ POS tagger WSJ Twitter

4 The problem: Training data is scarce news bio domain... twitter... English Chinese language Dutch languages x domains Danish......

5 Goal: Robust processing Exploit unlabeled data to improve NLP across domains and across languages Possible methods: unsupervised domain adaptation (e.g. exploiting unlabeled data clustering/embeddings, importance weighting) cross-lingual learning (not today s talk, just started)

6 Traditional HPC use in NLP Parallelize data processing node1 node2 node3 node4 Distributed model training (e.g. McDonald et al.,2010; Gesmundo & Tomeh, 2012) initial w training data node1 node2 node3 node4 weighted average w

7 Additional benefits of HPC not only lots of unlabeled data unsupervised & semi-supervised algorithms models: many parameters evaluation: need robust results sharing: common data repositories

8 models Example study: Importance weighting

9 Does importance weighting work for unsupervised DA of POS taggers? SOURCE train TARGET test? assign instance-dependent weights (Shimodaira, 2001): unlabeled TARGET approximation, e.g.:!! domain classifier to discriminate between SOURCE & TARGET (Zadrozny et al., 2004; Bickel and Scheffer, 2007; Søgaard and Haulrich, 2011)

assign instance-dependent weights (Shimodaira, 2001): unlabeled TARGET

10 Domain classifier (Søgaard & Haulrich, 2011) representation n-gram size 10

11 (Plank, Johannsen, Søgaard, 2014) EMNLP Results Token-based domain classifier 96 baseline 1-gram 2-gram 3-gram 4-gram NONE IS SIGNIFICANTLY BETTER THAN BASELINE answers reviews s weblogs newsgroups on test sets; results were similar for other representations (Brown, Wiktionary) 11

BETTER THAN BASELINE 94 92 answers reviews emails weblogs newsgroups on

12 (Plank, Johannsen, Søgaard, 2014) EMNLP Random weighting uniform stdexp NONE IS SIGNIFICANTLY BETTER THAN BASELINE significance cutoff baseline 500 runs Zipfian Each plot: 500 POS tagging models; total: 1500 models; sequential: ~5m/model (7500m, 5 1/2 days) parallelized on HPC Gardar (in 3 batches) : 1.5 days 12

Zipfian Each plot: 500 POS tagging models; total: 1500 models; sequential:

13 (Plank, Johannsen, Søgaard, 2014) EMNLP Results Token-based domain classifier 96 baseline 1-gram 2-gram 3-gram 4-gram NONE IS SIGNIFICANTLY BETTER THAN BASELINE answers reviews s weblogs newsgroups avg tag ambiguity KL-div: OOV: low low high OOV! on test sets; results were similar for other representations (Brown, Wiktionary) 13

newsgroups avg tag ambiguity 1.09 1.07 1.07 1.05 1.05 KL-div: 0.05 0.04 0.03 0.01 0.01 OOV: 27.7 29.

14 Gardar we used the joint HPC cluster in Iceland for these experiments 288 nodes, 6 cores (= 3456 cores) 24GB each batch jobs submitted via TORQUE we have access since 6 months (end of April 2014): Gardar is very useful! we have locally only: 1 server with 8 cores, 384gb memory, 1.5TB disk space

we have access since 6 months (end of April 2014): Gardar is very useful!

15 evaluation How robust are our results?

16 Within sample bias Twitter POS tagger, large differences on different Twitter samples: Twitter data sets train/test Gimpel Ritter Gimpel Ritter Combined (Hovy et al., LREC 2014; Fromheide et al., 2014)

Gimpel Ritter Gimpel 90.46 82.29 Ritter 80.52 90.

17 What to do about this? Whenever possible evaluate: across several test data sets A B C on down-stream tasks Estimate significance cut-off Bootstrap-based evaluation

18 IAA-informed parsing: bootstrap learning curve average over! dev1 10 samples dev2 LAS Norwegian 20% 40% 60% 80% 100% amount training data baseline

19 IAA-informed parsing: bootstrap learning curve average over! 50 samples system LAS baseline Norwegian 20% 40% 60% 80% 100% amount training data

20 parallelization of NLP pipeline over 4 languages NO DA CA HR node1 node2 node3 node4

21 parallelization of NLP pipeline over 4 languages NO DA CA HR node1 node2 node3 node4 node5 node6 node7 node8 with 2 evaluation setups

22 sharing common data repository for Nordic countries

23 Summary: HPC for NLP besides parallel data processing and distributed training: models: parallelization over data sets, parameter search, negative results evaluation: significance cut-off, bootstrap samples sharing: common data repository

24 Thanks!

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised