Sentimental Analysis using Hadoop Phase 2: Week 2




MARKET / INDUSTRY, FUTURE SCOPE
BY ANKUR UPRIT

The key-value type basically uses a hash table in which there exists a unique key and a pointer to a particular item of data. The key can be synthetic or auto-generated, while the value can be a String, JSON, BLOB (binary large object), etc. A key/value database allows clients to read and write values using a key as follows:

1. Get(key): returns the value associated with the provided key.
2. Put(key, value): associates the value with the key.
3. Multi-get(key1, key2, ..., keyN): returns the list of values associated with the list of keys.
4. Delete(key): removes the entry for the key from the data store.
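To make these four operations concrete, here is a minimal in-memory sketch in Python. This is a toy stand-in for a real key-value store, not the API of any particular product; the class and method names are illustrative only.

    class KeyValueStore:
        """Toy in-memory key-value store illustrating the four basic operations."""

        def __init__(self):
            self._data = {}  # the underlying hash table

        def get(self, key):
            # Returns the value associated with the provided key (None if absent).
            return self._data.get(key)

        def put(self, key, value):
            # Associates the value with the key, overwriting any previous value.
            self._data[key] = value

        def multi_get(self, *keys):
            # Returns the list of values associated with the list of keys.
            return [self._data.get(k) for k in keys]

        def delete(self, key):
            # Removes the entry for the key from the data store.
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("India", "B-25, Sector-58, Noida, India 201301")
    print(store.get("India"))
    print(store.multi_get("India", "US"))  # missing keys come back as None
    store.delete("India")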

While the key/value model is helpful in some cases, it has some weaknesses as well. One is that the model does not provide any of the traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously); such capabilities must be provided by the application itself. Secondly, as the volume of data increases, maintaining unique keys may become more difficult; addressing this issue requires introducing some complexity into generating character strings that will remain unique among an extremely large set of keys (see the UUID sketch below).

Example key/value pairs:

Key       Value
India     { B-25, Sector-58, Noida, India 201301 }
Romania   { IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606; City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara, 300011 }
US        { 3975 Fair Ridge Drive, Suite 200 South, Fairfax, VA 22033 }

Reference: http://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases
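One common way to keep keys unique at scale, as alluded to above, is to generate synthetic keys rather than deriving them from the data. A minimal sketch using Python's standard uuid module (the prefix scheme is made up for illustration):

    import uuid

    def make_key(prefix):
        # uuid4() gives a 122-bit random identifier, so collisions are
        # vanishingly unlikely even among an extremely large set of keys.
        return prefix + ":" + str(uuid.uuid4())

    key = make_key("customer")
    print(key)  # e.g. customer:3f2c9a1e-....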

MongoDB
MongoDB is a document-oriented database that natively supports the JSON format. It is extremely easy to use and operate, so it is very popular with developers, and it doesn't require a database administrator (DBA) to bootstrap.

Redis
Redis is one of the fastest data stores available today. An open source, in-memory NoSQL database known for its speed and performance, Redis has become popular with developers and has a growing and vibrant community. It features several data types that make implementing various functionalities and flows extremely simple.
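As a quick illustration of the key/value flavor of Redis, here is a short sketch using the redis-py client, assuming a Redis server running on localhost; the key names are made up for the example:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Plain string key/value
    r.set("user:42:name", "Ankur")
    print(r.get("user:42:name"))      # b'Ankur'

    # One of Redis's richer data types: a list used as a simple queue
    r.rpush("tweets:pending", "tweet-1", "tweet-2")
    print(r.lpop("tweets:pending"))   # b'tweet-1'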

Cassandra
Created at Facebook, Cassandra has emerged as a useful hybrid of a column-oriented database and a key-value store. Grouping columns into families gives the familiar feeling of tables and provides good replication and consistency for easy linear scaling. Cassandra is most effective when used for managing really big volumes of data (the kind that doesn't fit on a single server), such as Web/click analytics and measurements from the Internet of Things; writing to Cassandra is extremely fast.

CouchDB
CouchDB is accessed in JSON format over HTTP, which makes it simple to use for Web applications. Perhaps not surprisingly, CouchDB is best suited for the Web, with some interesting applications for offline mobile apps.
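Because CouchDB speaks JSON over HTTP, you can drive it with nothing more than an HTTP client. A minimal sketch using Python's requests library, assuming a local development CouchDB on localhost:5984 with no authentication required; the database and document names are made up:

    import requests

    base = "http://localhost:5984"

    # Create a database (an HTTP PUT on the database name)
    requests.put(base + "/tweets")

    # Store a JSON document under an explicit id
    doc = {"user": "ankur", "text": "hadoop is fun", "sentiment": "positive"}
    requests.put(base + "/tweets/tweet-1", json=doc)

    # Read the document back as JSON
    resp = requests.get(base + "/tweets/tweet-1")
    print(resp.json())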

HBase
Deep down in Hadoop is the powerful database HBase, which spreads data out among nodes using HDFS. It is perhaps most appropriate for managing huge tables consisting of billions of rows. Being part of Hadoop, it allows map/reduce processing of the data for complicated computational jobs, but it also provides real-time data processing capabilities. Both HBase and Cassandra follow the BigTable model. As such, HBase can be scaled linearly simply by adding more nodes to the setup. HBase is best suited for real-time querying of Big Data.

Reference: http://www.itbusinessedge.com/slideshows/top-five-nosql-databases-and-when-to-use-them.html
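For a feel of the real-time reads and writes mentioned above, here is a short sketch using the happybase Python client, assuming a running HBase Thrift server and a pre-created table; the table, row key, and column names are illustrative:

    import happybase

    connection = happybase.Connection("localhost")  # HBase Thrift server
    table = connection.table("tweets")

    # Write one row: a row key plus column-family:qualifier cells
    table.put(b"row-1", {b"data:text": b"hadoop is fun",
                         b"data:sentiment": b"positive"})

    # Real-time point read by row key
    row = table.row(b"row-1")
    print(row[b"data:sentiment"])  # b'positive'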

Most scripting languages, such as PHP, Python, Perl, Ruby, and Bash, are good choices. Any language that can read from stdin, write to stdout, and parse tab and newline characters will work: Hadoop Streaming simply pipes the string representations of key/value pairs, concatenated with a tab, to a program that must be executable on each task tracker node. On most Linux distros used to set up Hadoop clusters, Python, Bash, Ruby, Perl, etc. are already installed, but nothing prevents you from rolling your own execution environment for your favorite scripting or compiled programming language. One difference between Java and scripting languages, however, is that the heartbeats of child nodes are not sent to the parent nodes when scripting languages are used.
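For example, a classic Streaming word count can be written as two small Python scripts that read stdin and write tab-separated key/value pairs to stdout. This is a minimal sketch; the file names and HDFS paths are illustrative:

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts per word (Streaming sorts input by key)
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

These would then be submitted through the streaming jar, along these lines:

    hadoop jar hadoop-streaming.jar \
        -input /data/tweets -output /data/wordcount \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py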

Reference: 1. http://stackoverflow.com/questions/8572339/which-language-to-use-for-hadoop-map-reduce-programs-java-orphp?answertab=active#tab-top 2. http://www.slideshare.net/corleycloud/big-data-just-an-introduction-to-hadoop-and-scripting-languages

Clover provides the metrics you need to better balance the effort between writing code that does stuff and code that tests stuff. Clover runs in your IDE or your continuous integration system, and it includes test optimization to make your tests run faster and fail more quickly. Reference: https://www.atlassian.com/software/clover/overview

JUnit is a simple framework for writing repeatable tests. It is an instance of the xUnit architecture for unit testing frameworks. Reference: http://blog.cloudera.com/blog/2008/12/testing-hadoop/ How to create a test: https://github.com/junit-team/junit/wiki/getting-started

Hadoop Admin: A Hadoop admin should be able to work independently and have excellent communication skills, along with good knowledge of Linux (security, configuration, tuning, troubleshooting and monitoring). The admin must be able to deploy a Hadoop cluster, add and remove nodes, troubleshoot failed jobs, configure and tune the cluster, monitor its critical parts, etc.

Hadoop Developer: A Hadoop developer has many responsibilities, and the job responsibilities depend on your domain/sector; some of the following will apply and some might not:

- Hadoop development and implementation.
- Loading from disparate data sets.
- Pre-processing using Hive and Pig.
- Designing, building, installing, configuring and supporting Hadoop.
- Translating complex functional and technical requirements into detailed designs.
- Performing analysis of vast data stores and uncovering insights.
- Maintaining security and data privacy.
- Creating scalable and high-performance web services for data tracking.
- High-speed querying.
- Managing and deploying HBase.
- Being part of a POC effort to help build new Hadoop clusters.
- Testing prototypes and overseeing handover to operational teams.
- Proposing best practices/standards.

References:
http://www.edureka.co/blog/hadoop-developer-job-responsibilities-skills/
https://www.dice.com/jobs/detail/hadoop-solution-architect-%26%2347-information-and-bi-architect-%26%2347-big-data-Architect-%26%2347-Hadoop-Admin-IDC-Technologies-Sunnyvale-CA-94085/10114879/727691

Reference: http://hortonworks.com/training/certification/

You would not normally compare Hive with HBase, though the comparison commonly happens because of the SQL-like layer on Hive. HBase is a database, but Hive is not a database: Hive is a MapReduce-based analysis/summarisation tool running on top of Hadoop, so it depends on MapReduce (batch processing) plus HDFS. HBase is a NoSQL database used to store and retrieve data; querying (scanning) HBase does not require MapReduce, so HBase depends only on HDFS, not on MapReduce. HBase is therefore an online processing system. Reference: http://www.quora.com/hive-vs-hbase-which-one-wins-the-battle-which-is-used-in-which-scenario

Depending on where you work, you may need to simply use whatever standards your company has established. For example, Hive is commonly used at Facebook for analytical purposes; Facebook promotes the Hive language, and their employees frequently speak about Hive at Big Data and Hadoop conferences. Yahoo!, on the other hand, is a big advocate for Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world, and their data engineers use Pig for data processing on their Hadoop clusters. Alternatively, you may have a choice of Pig or Hive at your organization, especially if no standards have yet been established, or perhaps multiple standards have been set up.

Compared to Hive, Pig needs some mental adjustment for SQL users to learn. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!). Pig requires more verbose coding, although it's still a fraction of what straight Java MapReduce programs require. Pig also gives you more control and optimization over the flow of the data than Hive does.

If you are a data engineer, you'll likely feel that you have better control over the dataflow (ETL) processes when you use Pig Latin, especially if you come from a procedural language background. If you are a data analyst, however, you will likely find that you can ramp up on Hadoop faster by using Hive, especially if your previous experience was more with SQL than with a procedural programming language. If you really want to become a Hadoop expert, then you should learn both Pig Latin and Hive for the ultimate flexibility.