Benefits of HPC for NLP besides big data




Benefits of HPC for NLP besides big data
Barbara Plank
Center for Sprogteknologie (CST), University of Copenhagen, Denmark
http://cst.dk/bplank
Web-Scale Natural Language Processing in Northern Europe, Oslo, Nov 24, 2014

Motivation. [Diagram: labeled data is only a small fraction of all available data.]

The problem: Training data is biased, the cross-domain gulf. [Bar chart: a WSJ-trained POS tagger reaches 96.99% accuracy on WSJ but only 78.96% on Twitter.]

The problem: Training data is scarce. [Matrix of languages x domains: domains such as news, bio, twitter, ... crossed with languages such as English, Chinese, Dutch, Danish, ...; annotated data covers only a few cells.]

Goal: Robust processing. Exploit unlabeled data to improve NLP across domains and across languages. Possible methods:
- unsupervised domain adaptation (e.g. exploiting unlabeled data via clustering/embeddings, importance weighting)
- cross-lingual learning (not today's talk, just started)

Traditional HPC use in NLP:
- Parallelize data processing across nodes (node1, node2, node3, node4).
- Distributed model training (e.g. McDonald et al., 2010; Gesmundo & Tomeh, 2012): starting from initial weights w, each node trains on its own shard of the training data, and the per-node weights are combined into a weighted average w (see the sketch below).
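To make the distributed-training idea concrete, here is a minimal Python sketch of iterative parameter mixing in the spirit of McDonald et al. (2010); the binary perceptron, the sharding, and the uniform average are simplifying assumptions, not the implementation used in the talk.

```python
# Iterative parameter mixing (sketch): each worker runs one perceptron
# epoch on its own data shard, then the weight vectors are averaged.
from multiprocessing import Pool

import numpy as np

def train_epoch(args):
    """One perceptron epoch over a single shard; returns updated weights."""
    w, shard = args
    w = w.copy()
    for x, y in shard:          # x: feature vector, y: label in {-1, +1}
        if y * (w @ x) <= 0:    # misclassified -> perceptron update
            w += y * x
    return w

def parameter_mixing(shards, dim, epochs=10):
    w = np.zeros(dim)
    with Pool(len(shards)) as pool:
        for _ in range(epochs):
            # All workers start each epoch from the same mixed weights ...
            local = pool.map(train_epoch, [(w, s) for s in shards])
            # ... which are then recombined (uniform mixing here).
            w = np.mean(local, axis=0)
    return w
```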

Additional benefits of HPC, beyond processing lots of unlabeled data with unsupervised & semi-supervised algorithms:
- models: many parameters
- evaluation: need robust results
- sharing: common data repositories

Models. Example study: importance weighting

Does importance weighting work for unsupervised DA of POS taggers? Train on SOURCE, test on TARGET. Assign instance-dependent weights (Shimodaira, 2001): w(x) = p_TARGET(x) / p_SOURCE(x), approximated from unlabeled TARGET data, e.g. with a domain classifier trained to discriminate between SOURCE and TARGET (Zadrozny et al., 2004; Bickel and Scheffer, 2007; Søgaard and Haulrich, 2011).

Domain classifier (Søgaard & Haulrich, 2011): representation = token n-grams of varying size (see the sketch below).
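A minimal sketch of such a domain classifier, assuming scikit-learn; the toy sentences, the logistic-regression choice, and the 1-2 gram range are illustrative assumptions, not the paper's exact setup.

```python
# Importance weighting with a domain classifier (sketch).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

source_sents = ["The stock market rose sharply today .",
                "Shares of Acme Corp. fell 2 % ."]       # labeled SOURCE (e.g. WSJ)
target_sents = ["omg this phone is sooo good lol",
                "cant wait 4 the weekend !!"]            # unlabeled TARGET (e.g. Twitter)

# Token n-gram representation (here 1- and 2-grams).
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(source_sents + target_sents)
y = np.array([0] * len(source_sents) + [1] * len(target_sents))  # 0=SOURCE, 1=TARGET

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Weight for each SOURCE instance: w(x) ~ p(TARGET|x) / p(SOURCE|x),
# to be plugged into the tagger's weighted training objective.
p = clf.predict_proba(X[: len(source_sents)])
weights = p[:, 1] / p[:, 0]
```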

Results, token-based domain classifier (Plank, Johannsen, Søgaard, 2014, EMNLP). [Bar chart: tagging accuracy (roughly 92-96) per test set (answers, reviews, emails, weblogs, newsgroups) for the baseline and 1-gram to 4-gram domain-classifier weighting.] NONE IS SIGNIFICANTLY BETTER THAN THE BASELINE; results were similar for other representations (Brown, Wiktionary).

Random weighting (Plank, Johannsen, Søgaard, 2014, EMNLP). [Plots: 500 runs each with weights drawn from uniform, standard exponential, and Zipfian distributions, against the baseline and a significance cutoff.] NONE IS SIGNIFICANTLY BETTER THAN THE BASELINE. Each plot: 500 POS tagging models, 1,500 models in total; sequentially at ~5 min/model that is 7,500 minutes (~5 1/2 days); parallelized on the HPC cluster Gardar (in 3 batches): 1.5 days.
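A sketch of how the random weight vectors for this sanity check might be drawn; the distribution parameters and the number of instances are assumptions, not the paper's exact settings.

```python
# Random instance weights for the sanity check: if randomly drawn weights
# perform as well as learned ones, the learned weights carry no signal.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of training instances (illustrative)

weight_schemes = {
    "uniform": rng.uniform(0.0, 1.0, n),
    "stdexp":  rng.standard_exponential(n),
    "zipfian": rng.zipf(2.0, n).astype(float),
}
# Each vector reweights the tagger's training instances; with 500 draws per
# scheme, the resulting accuracies form the null distribution that defines
# the significance cutoff.
```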

Results, token-based domain classifier (Plank, Johannsen, Søgaard, 2014, EMNLP). [Same bar chart: baseline vs. 1-gram to 4-gram weighting, accuracy roughly 92-96.] NONE IS SIGNIFICANTLY BETTER THAN THE BASELINE; results were similar for other representations (Brown, Wiktionary). Per-test-set statistics:

                    answers  reviews  emails  weblogs  newsgroups
avg tag ambiguity     1.09     1.07    1.07     1.05      1.05
KL-divergence         0.05     0.04    0.03     0.01      0.01
OOV rate (%)          27.7     29.5    29.9     22.1      23.1

Note: even where KL-divergence is low, OOV rates are high.

Gardar: we used this joint Nordic HPC cluster in Iceland for these experiments. 288 nodes with two 6-core CPUs each (= 3,456 cores), 24GB RAM per node; batch jobs are submitted via TORQUE (a sketch follows). We have had access for 6 months (since the end of April 2014): Gardar is very useful! Locally, we have only a single server with 8 cores, 384GB memory, and 1.5TB disk space.
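A hypothetical sketch of submitting the 3 x 500 runs as TORQUE batch jobs from Python; the job names, resource requests, and the train_tagger.py script are invented for illustration.

```python
# Submit one TORQUE batch job per (weighting scheme, seed) pair.
import subprocess

PBS_TEMPLATE = """#PBS -N pos-{scheme}-{i}
#PBS -l nodes=1:ppn=1,walltime=00:30:00
cd $PBS_O_WORKDIR
python train_tagger.py --scheme {scheme} --seed {i}
"""

for scheme in ("uniform", "stdexp", "zipfian"):
    for i in range(500):
        # qsub reads the job script from stdin and prints the job id.
        subprocess.run(["qsub"], input=PBS_TEMPLATE.format(scheme=scheme, i=i),
                       text=True, check=True)
```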

Evaluation. How robust are our results?

Within-sample bias: a Twitter POS tagger shows large accuracy differences on different Twitter samples (Hovy et al., LREC 2014; Fromheide et al., 2014):

train \ test   Gimpel   Ritter
Gimpel          90.46    82.29
Ritter          80.52    90.40
Combined        89.19    87.43

What to do about this? Whenever possible, evaluate:
- across several test data sets (A, B, C)
- on downstream tasks
Estimate a significance cut-off via bootstrap-based evaluation (see the sketch below).
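A minimal sketch of bootstrap-based evaluation, assuming a 0/1 correctness vector over test tokens; the resample count and confidence level are conventional defaults, not the talk's exact settings.

```python
# Bootstrap confidence interval on tagging accuracy: resample the test
# set with replacement and read off the empirical quantiles.
import numpy as np

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """correct: 0/1 array with one entry per test token (1 = correct tag)."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = np.array([correct[rng.integers(0, n, n)].mean()
                     for _ in range(n_boot)])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# A competing system counts as significantly better only if its accuracy
# clears the upper end of the baseline's interval (one simple cutoff).
acc, (lo, hi) = bootstrap_ci(np.array([1, 1, 0, 1, 1, 1, 0, 1]))
```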

IAA-informed parsing: bootstrap learning curve. [Plot: LAS for Norwegian vs. amount of training data (20%, 40%, 60%, 80%, 100%), system vs. baseline, averaged over 10 bootstrap samples, on dev1 and dev2.]

IAA-informed parsing: bootstrap learning curve. [Same setup, averaged over 50 bootstrap samples.]

Parallelization of the NLP pipeline over 4 languages (NO, DA, CA, HR): one node per language (node1-node4).

Parallelization of the NLP pipeline over 4 languages with 2 evaluation setups: one node per language-setup combination (node1-node8); a toy sketch follows.
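A toy sketch of this fan-out using Python's multiprocessing; on the cluster each combination would instead be its own batch job, and run_pipeline is a hypothetical stand-in for the real per-language pipeline.

```python
# One worker per (language, evaluation setup) combination, as on 8 nodes.
from itertools import product
from multiprocessing import Pool

LANGS = ["NO", "DA", "CA", "HR"]
SETUPS = ["setup1", "setup2"]

def run_pipeline(job):
    lang, setup = job
    # ... tag, parse, and evaluate this language under this setup ...
    return (lang, setup, "done")

if __name__ == "__main__":
    with Pool(len(LANGS) * len(SETUPS)) as pool:
        results = pool.map(run_pipeline, list(product(LANGS, SETUPS)))
```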

Sharing: a common data repository for the Nordic countries.

Summary: HPC for NLP, besides parallel data processing and distributed training:
- models: parallelization over data sets, parameter search, negative results
- evaluation: significance cut-off, bootstrap samples
- sharing: common data repository

Thanks!