GPText: Greenplum Parallel Statistical Text Analysis Framework
Kun Li, E403 CSE Building, [email protected]
Christan Grant, E457 CSE Building, [email protected]
Daisy Zhe Wang, E456 CSE Building, [email protected]
Sunny Khatri, Greenplum, 1900 S Norfolk St #125, San Mateo, CA, USA, [email protected]
George Chitouras, Greenplum, 1900 S Norfolk St #125, San Mateo, CA, USA, [email protected]

ABSTRACT
Many companies keep large amounts of text data inside relational databases. Several challenges exist in using state-of-the-art systems to perform analysis on such datasets. First, an expensive big data transfer cost must be paid up front to move data between the database and the analytics system. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics that can be installed on PostgreSQL and Greenplum. In addition, GPText contributes a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an ediscovery application built on the GPText framework.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems, parallel databases; I.2.7 [Artificial Intelligence]: Natural Language Processing, text analysis

General Terms
Design, Performance, Management

Keywords
RDBMS, Massively parallel processing, Text analytics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DanaC '13, June 23, 2013, New York, NY, USA. Copyright 2013 ACM /13/6...$

1. INTRODUCTION
Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, government agencies, and hospitals, in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes. Traditional business intelligence pulls content from databases into separate massive data warehouses to analyze the data. The typical process moves information out of the database, analyzes it with external tools, and stores the final product back into the database. This movement is time consuming and prohibitive for interactive analytics. Minimizing the movement of data is a huge incentive for businesses and researchers; one way to achieve this is to colocate the datastore with the analytics engine.
While Hadoop has become a popular platform for large-scale data analytics, newer parallel-processing relational databases can also leverage more nodes and cores to handle large-scale datasets. The Greenplum database, built upon the open source database PostgreSQL, is a parallel database that adopts a shared-nothing massively parallel processing (MPP) architecture. Database researchers and vendors are capitalizing on the increase in database cores and nodes and investing in open source data analytics ventures such as the MADLib project [3, 7]. MADLib is an open source library for scalable in-database analytics on Greenplum and PostgreSQL. It provides parallel implementations of many machine learning algorithms. In this paper, we motivate in-database text analytics by presenting GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText inherits scalable indexing, keyword search, and faceted search functionality from an effective integration of the Solr [6] search engine. GPText uses the CRF package that was contributed to the MADLib open-source library. We show that we can use SQL and user-defined aggregates to implement conditional random field (CRF) methods for information extraction in parallel. Our experiments show sublinear improvement in runtime for both CRF learning and inference with a linear increase in the number of cores. As far as we know, GPText is the first toolkit for statistical text analysis in relational database management systems. Finally, we describe the needs and requirements of ediscovery applications and show that GPText is an effective platform for developing such sophisticated text analysis applications.

2. RELATED WORK
Researchers have created systems for large-scale text analytics, including GATE, PurpleSox, and SystemT [4, 2, 8]. While GATE and SystemT use a rule-based approach for information extraction (IE), PurpleSox uses statistical IE models. However, none of these systems is natively built for an MPP framework. Our parallel in-database CRF implementation follows the MAD methodology [7]. In a similar vein, researchers have shown that most machine learning algorithms can be expressed through a unified RDBMS architecture [5]. There are several implementations of conditional random fields, but only a few large-scale implementations for NLP tasks. One example is PCRFs [11], implemented over massively parallel processing systems supporting the Message Passing Interface (MPI), such as the Cray XT3, SGI Altix, and IBM SP. However, PCRFs is not implemented over an RDBMS.

3. GREENPLUM TEXT ANALYTICS
GPText runs on the Greenplum database (GP), a shared-nothing massively parallel processing (MPP) database. Greenplum adopts the widely used master-slave parallel computing model, where one master node orchestrates multiple slaves to work until the task is completed, with no direct communication between slaves. As shown in Figure 1, it is a collection of PostgreSQL instances, including one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, then divides the workload and sends sub-tasks to the segments.
Besides harnessing the power of a collection of compute nodes, each node can also be configured with multiple segments to fully utilize its multicore processor. To provide high availability, GP offers the option to deploy a redundant standby master and mirror segments in case the master or primary segments fail. The massively parallel processing capability of the Greenplum MPP framework lays the cornerstone that enables GPText to process production-sized text data. Beyond the underlying MPP framework, GPText also inherits the features a traditional database system provides, for example online expansion, data recovery, and performance monitoring, to name a few. On top of the MPP framework sit two building blocks, MADLib and Solr, as illustrated in Figure 1, which distinguish GPText from many existing text analysis tools. MADLib makes GPText capable of sophisticated text analysis tasks, such as part-of-speech tagging, named entity recognition, document classification, and topic modeling, with a vast amount of parallelism. Solr is a reliable and scalable text search platform from the Apache Lucene project that has been widely deployed in web servers. Its major features include powerful full-text search, faceted search, and near-real-time indexing. As shown in Figure 1, GPText uses Solr to create distributed indexes: for load balancing, each primary segment is associated with exactly one Solr instance, which stores the index of that segment's data. Since Solr is integrated into GPText seamlessly, GPText exposes all of Solr's features. The authors submitted the CRF statistical package for text analysis to MADLib (more details in Section 4).

Figure 1: The GPText architecture over the Greenplum database.

3.1 In-database document representation
In GPText, a document can be represented as a vector of counts against a token dictionary, which is a vector of the unique tokens in the dataset.
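To make the count-vector idea concrete, the run-length compression GPText applies to such vectors can be sketched in a few lines of plain Python. This is an illustration only, not the actual MADlib sparse-vector type:

```python
def rle_encode(dense):
    """Compress a dense vector into parallel (counts, values) arrays."""
    counts, values = [], []
    for x in dense:
        if values and values[-1] == x:
            counts[-1] += 1        # extend the current run
        else:
            counts.append(1)       # start a new run
            values.append(x)
    return counts, values

def rle_decode(counts, values):
    """Expand (counts, values) arrays back into a dense vector."""
    dense = []
    for c, v in zip(counts, values):
        dense.extend([v] * c)
    return dense

naive = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
counts, values = rle_encode(naive)
print(counts, values)  # [1, 3, 4, 6, 2] [1, 0, 1, 0, 1]
assert rle_decode(counts, values) == naive
```

Note that this encoder merges adjacent equal values maximally, so it can compress even further than a pair listing that keeps unit runs separate; in real documents the savings come mostly from the long runs of zeros.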
For efficient storage and memory use, GPText uses a sparse vector representation for each document instead of a naive vector representation. The following is an example of the two vector representations of a document. The dictionary contains all the unique terms (i.e., 1-grams) that exist in the corpus.

Dictionary: {am, before, being, bothered, corpus, document, in, is, me, never, now, one, really, second, the, third, this, until}.
Document: {i, am, second, document, in, the, corpus}.
Naive vector representation: {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1}.
Sparse vector representation: {(1,3,1,1,1,1,6,1,1):(1,0,1,1,1,1,0,1,1)}.

GPText adopts run-length encoding to compress the naive vector representation into pairs of count-value arrays: the first array contains run counts for the corresponding values in the second array. Although not apparent in this small example, the advantage of the sparse representation is dramatic for real-world documents, where most of the elements in the vector are zero.

3.2 ML-based Advanced Text Analysis
GPText relies on multiple machine learning modules in MADLib to perform statistical text analysis. Three commonly used modules are k-means for document clustering, multinomial naive Bayes (MultinomialNB) for document classification, and latent Dirichlet allocation (LDA) for topic modeling, dimensionality reduction, and feature extraction.

SELECT doc2.pos, doc2.doc_id, 'E.', ARRAY[doc1.label, doc2.label]
FROM segmenttbl doc1, segmenttbl doc2
WHERE doc1.doc_id = doc2.doc_id AND doc1.pos + 1 = doc2.pos
Figure 3: Query for extracting edge features

SELECT start_pos, doc_id, 'R_' || r.name, ARRAY[-1, label]
FROM regextbl r, segmenttbl s
WHERE s.seg_text ~ r.pattern
Figure 4: Query for extracting regex features

Figure 2: The MADLib CRF overall system architecture.

Performing k-means, MultinomialNB, or LDA in GPText follows the same pipeline:
1. Create a Solr index for the documents.
2. Configure and populate the Solr index.
3. Create a terms table for each document.
4. Create a dictionary of the unique terms across documents.
5. Construct a term vector using tf-idf for each document.
6. Run the MADLib k-means/MultinomialNB/LDA algorithm.

The following section details the implementation of the CRF learning and inference modules that we developed for GPText applications as part of MADLib.

4. CRF FOR IE OVER MPP DATABASES
A conditional random field (CRF) is a type of discriminative undirected probabilistic graphical model. Linear-chain CRFs are special CRFs that assume the next state depends only on the current state. Linear-chain CRFs achieve state-of-the-art accuracy in many real-world natural language processing (NLP) tasks such as part-of-speech tagging (POS) and named entity recognition (NER).

4.1 Implementation Overview
Figure 2 illustrates the detailed implementation of the CRF module developed for IE tasks in GPText. The top box shows the pipeline of the training phase; the bottom box shows the pipeline of the inference phase. We use declarative SQL statements to extract all features from text: all of the common features described in the literature, including those used in state-of-the-art packages, can each be extracted with a single SQL statement. The extracted features are stored in a relation for either single-state or two-state features.
For inference, we use user-defined functions to calculate the maximum a posteriori (MAP) configuration and probability for each document. Inference over all documents is embarrassingly parallel, since inference over one document is independent of the other documents. For learning, the computation of the gradient and log-likelihood over all training documents dominates the overall training time. Fortunately, it can run in parallel, since the overall gradient and log-likelihood are simply the summations of the gradient and log-likelihood of each document. This parallelization is expressed with user-defined aggregates, a parallel programming paradigm supported in parallel databases.

4.2 Feature Extraction Using SQL
Text feature extraction is a key step in most statistical text analysis methods. We are able to implement all seven types of features used in POS and NER using exactly seven SQL statements:
- Dictionary: does this token exist in a dictionary?
- Regex: does this token match a regular expression?
- Edge: is the label of this token correlated with the label of the previous token?
- Word: does this token appear in the training data?
- Unknown: does this token appear in the training data with frequency below some threshold?
- Start/End: is this token the first/last in the token sequence?

Extracting features in SQL has several advantages. First, the SQL statements hide much of the complexity of the underlying operation; each type of feature can be extracted with exactly one statement, making the feature extraction code extremely succinct. Second, SQL statements are embarrassingly parallel thanks to the set semantics supported by relational DBMSs. For example, we compute features for each distinct token and avoid re-computing features for repeated tokens. Figures 3 and 4 show how to extract edge and regex features, respectively. Figure 3 extracts adjacent labels from sentences and stores them in an array.
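The set-oriented edge-feature extraction of Figure 3 can be demonstrated end to end with SQLite as a stand-in for Greenplum. The table layout (segmenttbl with doc_id, pos, seg_text, label) follows the figure, while the three rows are invented for illustration; Greenplum would additionally emit the 'E.' feature name and pack the two labels into an ARRAY, which SQLite lacks:

```python
import sqlite3

# In-memory stand-in for the segmenttbl of Figure 3: one row per
# (document, token position, token text, label).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segmenttbl (doc_id INT, pos INT, seg_text TEXT, label INT)")
conn.executemany("INSERT INTO segmenttbl VALUES (?, ?, ?, ?)",
                 [(1, 0, "The", 0), (1, 1, "dog", 1), (1, 2, "barks", 2)])

# Edge features: self-join on position to pair each token's label
# with the label of the preceding token in the same document.
edges = conn.execute("""
    SELECT doc2.pos, doc2.doc_id, doc1.label, doc2.label
    FROM segmenttbl doc1, segmenttbl doc2
    WHERE doc1.doc_id = doc2.doc_id AND doc1.pos + 1 = doc2.pos
    ORDER BY doc2.pos
""").fetchall()
print(edges)  # [(1, 1, 0, 1), (2, 1, 1, 2)]
```

Because the join is a plain set operation, an MPP engine can evaluate it independently on each segment's slice of the data, which is exactly what makes one-statement feature extraction parallel for free.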
Figure 4 shows a query that selects all the sentences that satisfy any of the regular expressions present in the table regextbl.

4.3 Parallel Linear-chain CRF Training

Algorithm 1 CRF training(z_1:M)
Input: z_1:M: document set; Convergence(), Initialization(), Transition(), Finalization()
Output: coefficients w ∈ R^N
Initialization/Precondition: iteration = 0
 1: Initialization(state)
 2: repeat
 3:   state ← LoadState()
 4:   for all m ∈ 1..M do
 5:     state ← Transition(state, z_m)
 6:   state ← Finalization(state)
 7:   WriteState(state)
 8: until Convergence(state, iteration)
 9: return state.w
Programming Model. In Algorithm 1 we show the parallel CRF training strategy. The algorithm is expressed as a user-defined aggregate. User-defined aggregates are composed of three parts: a transition function (Algorithm 2), a merge function¹, and a finalization function (Algorithm 3). We describe these functions below. In line 1 of Algorithm 1, the Initialization function creates a state object in the database. This object contains the coefficients (w), gradient (∇), and log-likelihood (L). The state is loaded (line 3) and saved (line 7) between iterations. We compute the gradient and log-likelihood of each segment in parallel (line 4), much like a Map function; line 6 then computes the new coefficients, much like a Reduce function.

Figure 5: Linear-chain CRF training scalability

Transition strategies. Algorithm 2 contains the logic for computing the gradient and log-likelihood of each tuple using the forward-backward algorithm. The overall gradient and log-likelihood are simply the summations over all tuples. This algorithm is invoked in parallel over many segments, and the results are combined using the merge function.

Algorithm 2 transition-lbfgs(state, z_m)
Input: state: transition state; z_m: a document; Gradient()
Output: state
 1: {state.∇, state.L} ← Gradient(state, z_m)
 2: state.num_rows ← state.num_rows + 1
 3: return state

Finalization strategy. The finalization function invokes the L-BFGS convex solver to get a new coefficient vector. To avoid overfitting, we penalize the log-likelihood with a spherical Gaussian weight prior.

Algorithm 3 finalization-lbfgs(state)
Input: state; LBFGS(): convex optimization solver
Output: state
 1: {state.∇, state.L} ← penalty(state.∇, state.L)
 2: instance ← LBFGS.init(state)
 3: instance.lbfgs()            // invoke the L-BFGS solver
 4: return instance.state
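The transition/merge/finalization contract of Algorithms 1-3 can be sketched in plain Python. This is a hedged toy: a one-dimensional quadratic objective stands in for the CRF forward-backward Gradient(), and a single gradient-ascent step stands in for the L-BFGS update:

```python
def gradient(w, x):
    # Toy per-document objective -(w - x)^2: returns (gradient, log-likelihood).
    return 2.0 * (x - w), -(w - x) ** 2

def transition(state, doc):
    # Accumulate one document's gradient and log-likelihood into the state.
    g, ll = gradient(state["w"], doc)
    return {"w": state["w"], "grad": state["grad"] + g,
            "loglik": state["loglik"] + ll, "n": state["n"] + 1}

def merge(s1, s2):
    # Sum the gradients and log-likelihoods of two partial states.
    return {"w": s1["w"], "grad": s1["grad"] + s2["grad"],
            "loglik": s1["loglik"] + s2["loglik"], "n": s1["n"] + s2["n"]}

def finalize(state, step=0.1):
    # Stand-in for the L-BFGS solver: one gradient-ascent step on w.
    return {"w": state["w"] + step * state["grad"],
            "grad": 0.0, "loglik": 0.0, "n": 0}

def run_iteration(state, segments):
    # Each segment accumulates independently (the parallel part),
    # then partial states are merged and finalized on the master.
    partials = []
    for seg in segments:
        s = {"w": state["w"], "grad": 0.0, "loglik": 0.0, "n": 0}
        for doc in seg:
            s = transition(s, doc)
        partials.append(s)
    total = partials[0]
    for p in partials[1:]:
        total = merge(total, p)
    return finalize(total)

state = {"w": 0.0, "grad": 0.0, "loglik": 0.0, "n": 0}
segments = [[1.0, 2.0], [3.0]]        # documents split across two "segments"
for _ in range(100):                  # repeat ... until Convergence()
    state = run_iteration(state, segments)
print(round(state["w"], 3))  # 2.0 (the optimum of the summed toy objective)
```

Because merge only sums accumulators, the per-segment transitions can run in any order on any node, which is the property the database exploits to parallelize training.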
Limited-memory BFGS (L-BFGS), a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is a leading method for large-scale unconstrained convex optimization. We translated an in-memory Java implementation [10] into a C++ in-database implementation. Before each iteration of L-BFGS optimization, we initialize the solver with the current state object; at the end of each iteration, we write the updated variables back to the database state for the next iteration.

4.4 Parallel Linear-chain CRF Inference
The Viterbi algorithm is applied to find the top-k most likely labelings of a document under a CRF model. CRF inference is embarrassingly parallel, since inference over one document is independent of the other documents. We implemented a SQL clause to drive Viterbi inference over all documents; each function call labels one document. In a parallel database, e.g., Greenplum, Viterbi can run in parallel over different subsets of the documents on a single-node or multi-node cluster, where each node can have multiple cores.

¹The merge function takes two states and sums their gradient and log-likelihood.

Figure 6: Linear-chain CRF inference scalability

5. EXPERIMENTS AND RESULTS
To evaluate the performance and scalability of linear-chain CRF learning and inference on Greenplum, we conducted experiments over various data sizes on a 32-core machine with a 2TB hard drive and 64GB of memory. For learning we use the CoNLL-2000 dataset, containing 8,936 tagged sentences labeled with 45 POS tags. To evaluate inference performance, we extracted 1.2 million sentences from the New York Times dataset. Figures 5 and 6 show that runtime improves sublinearly with an increase in the number of segments. Our POS implementation achieves 0.9715 accuracy, consistent with the state of the art [9].

6.
GPTEXT APPLICATION
With the support of the Greenplum MPP data processing framework, the efficient Solr indexing/search engine, and parallelized statistical text analysis modules, GPText positions itself as a great platform for applications that need to apply scalable text analytics of varying sophistication over unstructured data in databases. One such application is ediscovery. ediscovery has become increasingly important in legal processes. Large enterprises keep terabytes of text data such as emails and internal documents in their databases. Traditional civil litigation often involves reviews of large amounts of documents on both the plaintiff's and the defendant's sides. ediscovery provides tools to pre-filter documents for review, providing a speedier, cheaper, and more accurate solution.

Figure 7: GPText application

Traditionally, simple keyword search is used for such document retrieval tasks in ediscovery. Recently, predictive coding based on more sophisticated statistical text analysis methods has been gaining attention in civil litigation, since it provides higher precision and recall in retrieval. In 2012, judges in several cases [1] approved the use of predictive coding based on the Federal Rules of Civil Procedure. We developed a prototype ediscovery tool on GPText using the Enron dataset. Figure 7 is a snapshot of the application. In the top pane, the keyword "illegal" is specified, and Solr is used to retrieve all relevant emails containing the keyword, displayed in the bottom pane. As shown in the middle pane on the right, the tool also supports topic discovery over the corpus using LDA and clustering using k-means; LDA and CRF are used to reduce the dimensionality of the features to speed up k-means. CRF is also used to extract named entities. The discovered topics are displayed in the Results panel. The tool also provides visualization of aggregate information and faceted search over the corpus.

7. CONCLUSION
We introduced GPText, a parallel statistical text analysis framework over an MPP database. With its seamless integration of Solr and MADLib, GPText is a framework with a powerful search engine and advanced statistical text analysis capabilities. We implemented a parallel CRF inference and training module for IE. The functionality and scalability of GPText position it as a great tool for sophisticated text analytics applications such as ediscovery. As future work, we plan to support learning-to-rank and active learning algorithms in GPText.

Acknowledgments
Christan Grant is funded by NSF under Grant No. DGE. This work was supported by a gift from EMC Greenplum.

8.
REFERENCES
[1] L. Aguiar and J. A. Friedman. Predictive coding: What it is and what you need to know about it.
[2] P. Bohannon, S. Merugu, C. Yu, V. Agarwal, P. DeRose, A. Iyer, A. Jain, V. Kakade, M. Muralidharan, R. Ramakrishnan, and W. Shen. Purple SOX extraction management system. SIGMOD Rec., 37(4):21-27, Mar.
[3] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. Proc. VLDB Endow., 2(2), Aug.
[4] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6).
[5] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, New York, NY, USA. ACM.
[6] The Apache Software Foundation. Apache Solr.
[7] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library: or MAD skills, the SQL. Proc. VLDB Endow., 5(12), Aug.
[8] Y. Li, F. R. Reiss, and L. Chiticariu. SystemT: a declarative information extraction system. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, HLT '11, Stroudsburg, PA, USA. Association for Computational Linguistics.
[9] C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part I, CICLing '11, Berlin, Heidelberg. Springer-Verlag.
[10] J. Morales and J. Nocedal. Automatic preconditioning by limited memory quasi-Newton updating. SIAM Journal on Optimization, 10(4).
[11] H. Phan and M. Le Nguyen. FlexCRFs: Flexible conditional random fields, 2004.
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data
INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
INTEROPERABILITY OF SAP BUSINESS OBJECTS 4.0 WITH GREENPLUM DATABASE - AN INTEGRATION GUIDE FOR WINDOWS USERS (64 BIT)
White Paper INTEROPERABILITY OF SAP BUSINESS OBJECTS 4.0 WITH - AN INTEGRATION GUIDE FOR WINDOWS USERS (64 BIT) Abstract This paper presents interoperability of SAP Business Objects 4.0 with Greenplum.
Natural Language Processing in the EHR Lifecycle
Insight Driven Health Natural Language Processing in the EHR Lifecycle Cecil O. Lynch, MD, MS [email protected] Health & Public Service Outline Medical Data Landscape Value Proposition of NLP
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
ANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
Universal PMML Plug-in for EMC Greenplum Database
Universal PMML Plug-in for EMC Greenplum Database Delivering Massively Parallel Predictions Zementis, Inc. [email protected] USA: 6125 Cornerstone Court East, Suite #250, San Diego, CA 92121 T +1(619)
What s next for the Berkeley Data Analytics Stack?
What s next for the Berkeley Data Analytics Stack? Michael Franklin June 30th 2014 Spark Summit San Francisco UC BERKELEY AMPLab: Collaborative Big Data Research 60+ Students, Postdocs, Faculty and Staff
Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
Client Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
See the Big Picture. Make Better Decisions. The Armanta Technology Advantage. Technology Whitepaper
See the Big Picture. Make Better Decisions. The Armanta Technology Advantage Technology Whitepaper The Armanta Technology Advantage Executive Overview Enterprises have accumulated vast volumes of structured
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
Arun Kumar. 4354 Computer Sciences 1210 W Dayton St Madison, WI-53706
Arun Kumar 4354 Computer Sciences 1210 W Dayton St Madison, WI-53706 Email: [email protected] Phone: (+1) 614-602-9734 Web: http://pages.cs.wisc.edu/~arun/ EDUCATION University of Wisconsin-Madison Ph.D.
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics
HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop
Big Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
Cray: Enabling Real-Time Discovery in Big Data
Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
OBIEE 11g Analytics Using EMC Greenplum Database
White Paper OBIEE 11g Analytics Using EMC Greenplum Database - An Integration guide for OBIEE 11g Windows Users Abstract This white paper explains how OBIEE Analytics Business Intelligence Tool can be
Research Statement Immanuel Trummer www.itrummer.org
Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data Can Drive the Business and IT to Evolve and Adapt
Big Data Can Drive the Business and IT to Evolve and Adapt Ralph Kimball Associates 2013 Ralph Kimball Brussels 2013 Big Data Itself is Being Monetized Executives see the short path from data insights
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo
Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System
Dell* In-Memory Appliance for Cloudera* Enterprise
Built with Intel Dell* In-Memory Appliance for Cloudera* Enterprise Find out what faster big data analytics can do for your business The need for speed in all things related to big data is an enormous
SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics
SAP Brief SAP HANA Objectives Transform Your Future with Better Business Insight Using Predictive Analytics Dealing with the new reality Dealing with the new reality Organizations like yours can identify
High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances
High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances Highlights IBM Netezza and SAS together provide appliances and analytic software solutions that help organizations improve
Performance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform. PENTAHO PERFORMANCE ENGINEERING
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources
Search and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
Harnessing the power of advanced analytics with IBM Netezza
IBM Software Information Management White Paper Harnessing the power of advanced analytics with IBM Netezza How an appliance approach simplifies the use of advanced analytics Harnessing the power of advanced
IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!
The Bloor Group IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS VENDOR PROFILE The IBM Big Data Landscape IBM can legitimately claim to have been involved in Big Data and to have a much broader
Machine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
Performance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and
CitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
Chapter 6. Foundations of Business Intelligence: Databases and Information Management
Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
