Large-Scale Text Mining




Large-Scale Text Mining
SIAM Conference on Data Mining, Text Mining 2010
Alan Ratner, Northrop Grumman Information Systems
NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I

Aim

Identify the topic and language/script/coding of real-world informal text at the highest speed possible.

Informal text
- Blogs, posts, tweets
- Don't necessarily follow conventional rules of spelling & grammar
- Transliterated language is usually ad hoc
- Typical web documents contain far more bytes of HTML & JavaScript than content, making everything look like English. The markup can be parsed out, but that is time-consuming.
- Does not look like newswire (written by journalists, rich in named entities, summarized in the first paragraph)

High speed
- Want to process documents as quickly as possible
- Trillions of web pages
- Gigabytes-per-second speed desired

What is Text?

6 Fundamental Questions in Mining Text

Detection
1. Does a document contain text in any language? (Or is it audio, video, ...?)
2. If so, does the text have a topic? (41% of tweets are "pointless babble.")

Clustering
3. Which documents are in the same (but not necessarily known) language?
4. Which documents are on the same (but not necessarily known) topic?

Identification
5. What is the language? (Is it a language the system has been trained to recognize?)
6. What is the topic? (Is it a topic the system has been trained to recognize?)

Specific Goals

Identify the specific language(s) of documents and identify documents on specific topics.

Accuracy requirements
- False negatives are OK - users don't know what has been missed
- Precision needs to be high enough that we don't annoy users with lots of false alarms
- Extremely low false-positive rate for topic ID (<< 1 ppm)

Speed requirement
- Fast hardware or software; simple algorithms
- Ideally use the same algorithm for both language and topic ID

Language-neutral
- Work on all languages, including Asian languages that do not delimit words with spaces

Text Analysis Algorithms

Many algorithms have been used for topic ID:
- Bayesian
- Markov
- Orthogonal Sparse Bi-word
- Hyperspace
- Correlative Entropy (optimal compressor; longest string match)
- Minimum Description Length
- Term Frequency * Inverse Document Frequency
- Morphological
- Centroid-based
- Logistic Regression (similar to SVM and a single-layer NN)

Most can be expressed using additive weights of detected tokens. In the domain of interest (low false alarms on informal text), logistic regression worked best and is computationally efficient.

Additive Weight Algorithms

Training
- Define & select tokens (e.g., words, words with spaces before & after, phrases, N-grams)
- Assign weights (for LR, weights range from roughly -1 to +1)

Testing
- Detect tokens
- Add each detected token's weight to a summer (S)
- For each document, convert the weight sum to a likelihood score & compare to a threshold:
  P(on-topic or on-language) = 1/(1 + e^-S)
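As a concrete sketch of the testing step, the additive-weight scoring fits in a few lines of Python. The tokens and weights below are hypothetical illustrations, not values from the slides, and naive substring matching stands in for a real token detector:

```python
import math

# Hypothetical token weights for one topic model (illustrative only).
WEIGHTS = {"the": 0.4, "and": 0.3, "then": 0.2, "they": 0.2}

def likelihood(text, weights):
    """Sum the weights of detected tokens into S, then convert the
    sum to a likelihood: P(on-topic) = 1 / (1 + e^-S)."""
    s = sum(w for token, w in weights.items() if token in text)
    return 1.0 / (1.0 + math.exp(-s))

p = likelihood("then they left", WEIGHTS)  # compare p to a threshold
```

Note that with no detected tokens S = 0 and the score is exactly 0.5, so the decision threshold is chosen above that to keep the false-alarm rate low.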

Token Detection

In hardware
- Load tokens and ternary bit masks into a CAM (Content Addressable Memory)
- Stream data through the CAM to automatically identify tokens

In software
- The Aho-Corasick algorithm creates a large state machine (e.g., 50K tokens with 130K states)
- Terminal states indicate detection of a token
- For each byte of data:
  Next_state = TableLookup[Previous_state][New_byte]
  If Terminal_state[Next_state] == TRUE then retrieve weight and add to summer
- Execution time is relatively independent of the number of tokens or the average token length
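A minimal software sketch of this table-driven scan, using dicts in place of the flat lookup table on the slide; the token set and weights are illustrative, not the deployed model:

```python
from collections import deque

def build_automaton(tokens):
    """Build Aho-Corasick goto, failure, and output tables."""
    goto, output = [{}], [set()]
    for idx, tok in enumerate(tokens):
        state = 0
        for ch in tok:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(idx)           # terminal state for this token
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # breadth-first failure links
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]   # inherit shorter matches
    return goto, fail, output

def weight_sum(text, tokens, weights):
    """One state transition per input character; add the weight of
    every token whose terminal state is reached."""
    goto, fail, output = build_automaton(tokens)
    s, state = 0.0, 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            s += weights[idx]
    return s

s = weight_sum("then and", ["the", "then", "they", "and"],
               [0.4, 0.2, 0.2, 0.3])    # matches "the", "then", "and"
```

The inner loop mirrors the slide's lookup: one table step per input byte, with cost independent of how many tokens are loaded.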

4-Token State Machine

[Figure: a state machine that recognizes the space-bounded tokens "the", "then", "they", and "and". On a space, "t", or "a" the machine transitions into the highlighted token branches; on any other byte it transitions to the default state.]

Solving Text Analysis Problems using Hadoop

Hadoop is a framework for writing and running applications that process vast amounts of data in parallel on large clusters (up to thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Hadoop is free/open-source software that emulates Google's proprietary MapReduce.

The master node takes the input, chops it up into smaller sub-problems, and distributes those to slave nodes, where the Map tasks ingest and transform the input. The Reduce task(s) then aggregate or summarize the Map output to deliver the final output.

Hadoop Software

Hadoop will run on anything from a laptop to a vast cluster of computers.

Program stacks
- Windows/Cygwin/Hadoop
- Windows/VM/Linux/Hadoop
- Linux/Hadoop
- Linux/VM/Linux/Hadoop

Software packages work with Hadoop to provide:
- scalable distributed file systems and data warehouses (HBase, CloudBase)
- data summarization, ad hoc querying, scripting (Pig, Hive)
- massive matrix math, graph computation, machine learning, social network analysis (Hama, Mahout, X-Rime, Pegasus)

Hadoop Data Flow with 1 Reduce

[Figure: Input File 1 feeds Slave 1 (Map, then Combine) and Input File 2 feeds Slave 2 (Map, then Combine); the output of both Combines flows into a single Reduce.]

Hadoop/MapReduce

- Hadoop automatically distributes blocks of data to slave nodes and then lines of text to Maps
- Hadoop automatically sorts and groups the outputs of one node's Maps by key
- Hadoop automatically sorts and groups the outputs of all Combines by key
- One output file comes from each Reduce
- Your Main/Run code defines interfaces & loads globals into the distributed cache
- Your Map code transforms one line of text & outputs key-value pairs (KVPs)
- Your Combine code transforms the sorted & grouped output of all Maps for one node & outputs KVPs
- Your Reduce code summarizes the sorted & grouped output of the Combines & outputs KVPs
- Your Map: Configure code defines Map globals; your Combine: Configure code defines Combine globals; your Reduce: Configure code defines Reduce globals

KVP = Key-Value Pair
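The pipeline above can be illustrated with a toy, single-process word-count job in Python (a hypothetical stand-in for real Hadoop Java code, not the text analyzer itself):

```python
from itertools import groupby
from operator import itemgetter

# Toy sketch of the Map -> Combine -> sort/group -> Reduce pipeline,
# with word count standing in for the real job.

def map_fn(line):                      # one line of text in, KVPs out
    return [(word, 1) for word in line.split()]

def combine_fn(pairs):                 # pre-aggregate one node's Map output
    pairs.sort(key=itemgetter(0))      # Hadoop sorts & groups by key
    return [(k, sum(v for _, v in g))
            for k, g in groupby(pairs, key=itemgetter(0))]

def reduce_fn(pairs):                  # summarize all Combine output
    return combine_fn(pairs)           # for word count, same aggregation

node1 = [p for line in ["a b a", "b c"] for p in map_fn(line)]   # slave 1
node2 = [p for line in ["a c c"] for p in map_fn(line)]          # slave 2
result = dict(reduce_fn(combine_fn(node1) + combine_fn(node2)))
```

The Combine step cuts the data shipped to the Reduce, which matters once the Map output is gigabytes per node.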

Original Hardware-Based Text Analyzer

- High-speed solution with expensive special hardware
- Hardware limitations (number & length of tokens, wildcarding in the ternary CAM)
- Few people with FPGA/VHDL skills

New Improved Hadoop Text Analyzer

- High-speed software solution on generic hardware
- Enabled use of a very sophisticated detection algorithm (an Aho-Corasick trie, as in information retrieval); unlimited token length; speed relatively independent of the number of tokens
- Many people with Java skills

Cluster Configuration

[Figure: Master & Slave 1, Slave 2, Slave 3, ..., Slave N]

- The master finds slaves using IPs in the host table
- Each server may host more than one slave
- Each slave may run many Maps
- A cluster may have 1 or many Reduces

Our servers each have 2 quad-core Nehalem Xeons, 24-48 GB RAM, and 4 1-TB drives. Per rack: 328 cores, 1 TB RAM, 164 TB of drives, 30 kW, 1700 pounds.

Language Identification Results

- Performance varies - there is no standard data set or procedure for testing language identification
- Worked very well overall, except on documents with just a few words
- Mutually intelligible languages such as Dutch/Afrikaans, Indonesian/Malay, or Norwegian/Swedish are harder to distinguish than dissimilar languages
- Relatively few tokens (the most common words in informal language) are used for each language (7 for Hindi, 55 for English, 95 for Spanish), so it is possible to construct difficult documents
- Could not distinguish random words from real language - language is defined by words and grammar

Topic Identification Results

Based on the incremental content of newsgroup posts (quoted prior posts and metadata such as newsgroup, thread, and author removed).

The Bottom Line

- Text analysis is readily performed on a cluster of commodity computers using Hadoop
- Comparison between software and hardware solutions:
  - In hardware, achieved 2.5 Gb/s real-time throughput (but most such links operate at a fraction of their capacity)
  - In software, achieved 1.3 Gb/s off-line throughput on a cluster of 64 ancient servers (1.6 Gb/s on new but unoptimized servers)
- In a properly configured Hadoop cluster, performance scales linearly:
  Elapsed time = 15-20 seconds + C/(number of servers)
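The scaling rule above can be written as a small model. Here C (the job's aggregate single-server compute time) is a hypothetical input, and the 17.5 s default is simply the midpoint of the quoted 15-20 s startup range:

```python
def predicted_elapsed(c, num_servers, startup=17.5):
    """Linear-scaling model from the slide:
    elapsed = startup (15-20 s) + C / (number of servers)."""
    return startup + c / num_servers
```

Doubling the cluster roughly halves the compute term but never removes the fixed startup cost, so very short jobs see diminishing returns.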

Alan.Ratner@ngc.com