On the Analysis and Visualisation of Anonymised Call Detail Records
|
|
|
- Junior Heath
- 10 years ago
- Views:
Transcription
1 On the Analysis and Visualisation of Anonymised Call Detail Records Varun J, Sunil H.S, Vasisht R SJB Research Foundation Uttarahalli Road, Kengeri, Bangalore, India (varun.iyengari,hs.sunilrao,vasisht.raghavendra)@gmail.com Srinidhi Saragur, Abhijit Lele SJB Research Foundation Uttarahalli Road, Kengeri, Bangalore, India [email protected], [email protected] Abstract Global mobile traffic is estimated to grow at a compounded annual rate of 40%. In order to keep telecom operators profitable, network planning and optimisation, and personalized traffic plans are critical to success. In order to do this, telecom operators need to analyse the data generated from their telecom network and derive information and knowledge that will assist them in network planning and developing personalised traffic plans. With this as the background, in this paper we analyse the Call Detail Records (CDR) provided by Orange Telecom as a part of the Data for Development (D4D) challenge. The data analysis of the CDR records is carried out using Hadoop and Hive framework. This paper focuses on analysing the data from telecom network optimisation and socio-economic perspectives. We utilise a visualisation tool to represent the derived information in a human understandable format. Based on visual inspection of the derived knowledge from the CDRs, we provide our recommendations for network optimisation. I. INTRODUCTION Telecommunications operating companies generate a large number of data records from switching systems. Data items are produced for every telephone call through a telephone network. Call Detail Record (CDR) is the fingerprint of how many seconds and at what time a customer is using a telephone and the associated infrastructure in terms of Base Stations used to process the call. Therefore, CDR displays a map of telephone customers behavior. The knowledge of customer behaviour obtained by analyzing CDRs has multiple applications. In [1] the authors deduce social attributes from calling behaviour. Mobile customer clustering based on CDR records from the perspective of marketing campaigns is discussed in [2]. In [3], the authors propose a method to derive transportation patterns based on CDR records. Considerable research is being carried out not only on socially relevant data, but also patterns that can be used to enhance the overall service parameters of telecom operations. While methods that analyse CDR records and derive knowledge from it are important, it is equally important to visualise data and the knowledge derived from it in a human understandable format. This gives a quick reference point to the stakeholders in a manner that they can relate to. In this paper we share our experiences in visualization of the knowledge derived from CDR records and to some extent provide our observations on and interpretation of the derived knowledge. The rest of the paper is organized as follows. Section II describes the CDR data set. The framework of Hadoop [4] and Hive [5] along with mapreduce [6] are discussed in Section III. Our approach to analysing the data sets to derive information from them and then visualising the data sets is discussed in Section IV. We briefly attempt to provide analysis of the knowledge derived from these data sets in Section V. We finally conclude with future work in Section VI. II. CDR DATA In this paper we use the Orange Telecom [7] Data for Development D4D data set. D4D is an open data challenge, encouraging research teams around the world to use four datasets of anonymous call patterns of Orange Telecom s Ivory Coast subsidiary, to help address society development questions in novel ways. The data sets are based on anonymized Call Detail Records extracted from Oranges customer base, covering the months of December 2011 to April Four datasets are provided by Orange in Tabulation Separated Values (TSV) plain text format. 1) Antenna-to-Antenna: Antennas are uniquely identified by an antenna identifier (A id ) and a geographic location. The data set aggregate hour by hour the duration of calls between any pair of antennas. The data set is represented by the following tuple (date hour TIMESTAMP, originating antenna INTEGER, terminating antenna INTEGER, number voice calls INTEGER, duration voice calls INTEGER). 2) Individual Trajectories High Spatial Resolution Data: This dataset contains the trajectories of randomly sampled individuals for the entire observation period but with reduced spatial resolution. The data set is represented by the following tuple (antenna identifier INTEGER, longitude FLOAT, latitude FLOAT), where antenna identifier represents the antenna anonymized by an identifier and (user identifier INTEGER, connection datetime TIMESTAMP, antenna identifer INTEGER).
2 3) Individual Trajectories Long Term Data: In this data set, the trajectories of randomly selected individuals are provided for the entire observation period but with reduced spatial resolution. The data set is represented by the following tuple (subpref identifier INTEGER, longitude FLOAT, latitude FLOAT), where subpref identifier represents the subprefecture anonymized by an identifier and (user identifer INTEGER, connection datetime TIMESTAMP, subpref identifier INTEGER). 4) Communication Sub Graphs: The dataset contains the communication sub graphs for 5000 randomly selected individuals identified by user identifier. The data set is represented by the following tuple (Source user identifer INTEGER, destination user identifier INTEGER) III. APPROACH TO CDR DATA ANALYSIS As mentioned in section II, the CDR data is for a duration of 6 months with about twenty million records; hence, traditional methods of data analysis may not work. With this in mind we used the Apache Hadoop framework as a base line. A. Framework for CDR Data Analysis Apache Hadoop [4] is a framework for running applications on a large cluster built of commodity hardware. The Hadoop framework transparently provides applications for both reliability and data motion. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. In conjunction with Hadoop, we used Hive [5]. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. The overall framework of Hadoop in conjunction with Hive is shown in Figure 1. The major components of the framework are Fig. 1. Framework for CDR Data Analysis the Metastore that stores all the structure information of the various tables and partitions in the warehouse and the Execution Engine that executes the execution plan created by the compiler. Considering the large data set Mapreduce technique is used to manage the complexity. For the sake of brevity, details of Mapreduce technique are out of scope of this paper, but for the sake of completeness, the Mapreduce technique takes a set of data and converts it into another set of data, where individual elements are broken down into keyvalue pairs. Complex operations can then be performed on these distribted key-value pairs. B. Methodology for Analysis and Visualization The methodology used to analyse the data using Hadoop is shown in Figure 2. As a first step, the data is loaded to Hadoop. Fig. 2. Methodology used for Analysing Data Hadoop distributes the data across HDFS in all the slave nodes. Hive query is used to fetch the required data. Hive compiles the query, locates data and executes the query. It returns the output data in a table. The table is stored in the local system in the Hadoop cluster. To visualise this stored data, HTML 5 along with java scripts based on D3JS are used. These java scripts operate on the output of the query and provide the final visualisation. IV. STATISTICAL ANALYSIS AND VISUALISATION Based on the methodology discussed in Section III the CDR data was analysed and an appropiate visualisation provided. Since the focus of this paper is on visualization, we target four themes for data analysis. A. Geographical Understanding The CDR records are analysed to find the geographical location of antennas in terms of latitude and longiture. This information is correlated and overlayed with an actual Georaphical Information System using Openmaps [8]. The resulting visualisation is given in Figure 3. As a second step, the sub-prefecture data was then overlayed on the knowledge acquired from antenna locations, resulting in an approximate zonal coverage of a set of antennas as shown in Figure 4. B. Call Patterns The most important parameter when monitoring call patterns is the Call Density. Call Density is defined as the number of calls originating from a Base Station within a cellular network. From a telecom operator perspective, the knowledge about the call density and specifically the correlation of the call density to the geography is an important call pattern. This call pattern helps the telecom operator to tune the cellular network
3 Fig. 3. Geographical Spread of Antennas Fig. 5. Sub-prefecture based call density on Wed at 1:00 am Fig. 4. Sub-prefecture based zone coverage Fig. 6. Sub-prefecture based call density on Saturday at 4:00 pm to minimise call drop rates and thus maximise revenue. Figure 5 1 and Figure 6 give a snap shot of the call density per subprefecture. The darker the shade higher the call density. Such a visualisation gives the telecom operator a quick mental picture of the call patterns. Such a visualisation when combined with actual data can assist the telecom operator to optimise the cellular network for optimal performance. C. Market Analytics Operating a cellular network is a CAPEX and OPEX intensive business. Telecom operators are continuously striving to optimise both the CAPEX / OPEX cost. One of the factors that influences the opex cost and determines the return on investment (profits) the telecom company can make is the Active Air Time. Active Air Time (AAT) is the aggregate time in which the air-waves are used. In other words AAT is the total number of calls made in a day, and monitored across all days / months of the years. In order to give the telecom operator 1 We thankfully acknowledge Paradigma Labs for the Visualisation concept and code. a pictorial view of this data, CDR records were analysed and daily aggregated call information plotted as shown in Figure 7 and Figure 8 respectively. While the value of visualizing the same data in two different views might be debatable, from a telecom operators perspective it provides two different sets of information. Figure 7 gives a time series perspective of the data and assists the telecom operator to get a visual perception of the number of calls made across different days in a month. On the other hand Figure 8 gives the telecom operator an easy visual reference to compare the number of calls made on the same day across different months. These inputs are important to fine tune the market stratagy of the telecom operator. D. Sentiment Analysis One important business objective that every telecom operator tries to achieve is personalized services. The personalisation can be in terms of customized tariff plans, personalised greetings and such. Sentiment analysis plays an important role in personalisation. While a lot more can be derived from CDR records from a sentiment analysis perspective, as a first
4 Fig. 7. Snapshot of Call Density variation across months as a Time series Fig. 9. Aggregated Calls around festive period Fig. 8. Snapshot of Call Density variation across months as bar chart step we have tried to analyse the call patterns around festive periods. Figure 9 shows a pictorial view of the number of calls around the festive period. Observe that on New Year s day the total number of calls is more than on New Year s eve. One interpretation of this data point is possibly indicative of a positive outlook of people of Ivory Coast. This interpretation will be discussed in detail in the following section when we correlate these data sets. V. INTERPRETATION Section IV discussed the visual representation of some of the information derived from the CDR. In this section we attempt to provide an interpretation to the data with possible applications of the derived knowledge. While more work needs to be done to have concrete recommendations based on the derived knowledge, this is the first step in that direction. While there can be numerous interpretations of the derived knowledge, we focus on two key aspects viz., Telecom Operator perspective and Socio-Economic perspective. A. A Telecom Operator Perspective From Figure 3 observe that the antenna density is more towards the south east of Ivory Coast. Correlating this data with Figure 5 and Figure6 one would expect a higher call density, but a combination of light and dark shades indicates that this is not the case. Our experiments with the CDRs show a consistent low call density in the region. It cannot be conclusively said whether this observation is true all year long since the CDRs are only for five months. However, based on the available data, it can be inferred that some amount of network planning to optimise antenna density to match call density is advisable 2. Another interesting observation is the correlation between the antenna density on the north east of Ivory Coast (Figure 3) and the call density (Figure 6). While the antenna density is low, the call density is relatively high. This is likely to result in higher call drop rate, since there isn t enough network capacity available. Based on this inference, increasing the network capacity by increasing the number of antennas and in turn base stations is advisable. The other observation is the correlation between the subprefectures and the antenna density from Figure 3 and Figure 4 respectively. Observe that the sub-prefectures are smaller towards the southern region of Ivory Coast and grow to be larger towards the northern region. However, the antenna density decreases as we travel north. While we are still in the process of deriving the call drop rate across different sub-prefectures, such a decreasing antenna density as the sub-prefecture size grows larger is non-intutive. From Figure 8 if we compare the call density across different months, and days of the month, it is interesting to note that the call density towards the middle of the month is lower in January than in other months. On the same lines note that call density in the second week of December is higher than other months, but reduces in the following weeks. Based on 2 These are not recommendations, but only observations based on visual inspection
5 this observation, we are attempting to find the mobility pattern of mobile phone users and to determine whether the mobility pattern influences this call density pattern. B. Socio-Economic Perspective From Figure 9 observe that the call density is considerably higher on New Year day as compared to any other festival days. This is an indication of the importance of the New Year in the lifestyle of the people of Ivory Coast. What is worth noting is that aggregated call density on New Year s day is about 30% higher than any other normal day. On an average the call density on festival days is about 10-15% higher than normal. While it might be premature to infer conclusively, this seems to indicate a strong emotional bias in the people of Ivory Coast. We say this becuase the economic condition of the people of Ivory Coast is not very good, but at the same time during festive periods they do not mind spending on calls. We intend to correlate this data with region-wise socio-economic conditions of the people to infer the telecom spending index and then extrapolate this data to determine region-wise socio-economic conditions. For the sake of completeness and to provide a visual reference related to the geography of Ivory Coast, Figure 10 is a map of Ivory Coast. Based on our literature survey of the socio- business districts than around the capital. It will be interesting to determine the mobility profiles of users and check whether there is a distinct movement towards the business districts. Note that all the above observations were made possible only because of the sophisticated visualisation techniques used to represent the derived information. When correlated with the actual inferred data, this can potentially be a handy tool for the telecom operators to optimise their cellular networks. VI. CONCLUSION AND FUTURE WORK This paper analyses anonymised CDRs obtained from Orange Telecom for a duration of five months, one of the telecom operators in Ivory Coast 3. Four data sets as discussed in section II provided by Orange Telecom are analysed from a telecom operator and socio-economic perspective. The analysis has been implemented on a Hadoop and Hive framework. This paper proposed a visualisation tool using HTML 5 to visualise the information derived from CDR records. Information and knowledge related to antenna density, subprefectures and call density patterns are derived by analysing CDR records. In this paper we attempt to correlate the derived information and provide observations from a telecom operator perspective and socio-economic perspective. One key observation, from a telecom operator perspective, is that the antenna density does not match with the call density patterns, and there is conderable scope for improvement. Another key observation, from a socio-economic perspective, is that the call density is saturated more towards the business districts than towards the political capital of the country. While these are just preliminary investigations, in the future we intend to derive the mobility profile of the users and correlate it with the call density and antenna density patterns to provide concrete network optimisation recommendations to the telecom operator. In the process, our larger goal is to propose and develop a tool for CDR record analysis which is configurable and scalable. REFERENCES Fig. 10. Map of Ivory Coast economic conditions of the people of Ivory Coast [9], the southern part of Ivory Coast (Figure 10) is the the economic capital of the country, but the political capital of the country is Yamoussoukro which is in the central region. Comparing Figure 10 and Figure 3 observe that the antenna density is lower in the central region where the capital is located than in the business district of Abidjan towards the southern region. Also observe from Figure 5 and Figure 6 that the average call density is higher in the business districts than around the captial of the country. This is indicative of the fact that more of the socio-economic growth is focused in and around the [1] Chen Zhou et. all, Activity Recognition from Call Detail Record: Relation Between Mobile Behavior Pattern And Social Attribute Using Hierarchical Conditional Random Fields, 2010 IEEE International Conference on Green Computing and Communications & 2010 IEEE International Conference on Cyber, Physical and Social Computing, June [2] Qining LIN et. all, Mobile Customer Clustering Based On Call Detail Records For Marketing Campaigns, International Conference on Management and Service Science, [3] Huayong Wang, Francesco Calabrese, Giusy Di Lorenzo, Carlo Ratti, Transportation Mode Inference from Anonymized and Aggregated Mobile Phone Call Detail Records, 13 International Conference onintelligent Transportation Systems, 2010 [4] Tom While, Hadoop A Definitive Guide,O riely Press [5] Introduction to Hive, Cloudera Inc. [6] Mapreduce Tutorial, Apache Software Foundation Tutorial [7] [8] [9] 3 We wish to thank Orange Telecom for providing anonymised CDRs on their Ivory Coast subscriber base for a duration of five months as part of the D4D challenge
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Client Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Move Data from Oracle to Hadoop and Gain New Business Insights
Move Data from Oracle to Hadoop and Gain New Business Insights Written by Lenka Vanek, senior director of engineering, Dell Software Abstract Today, the majority of data for transaction processing resides
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
HadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo
Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn [email protected] Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
GPS Enabled Taxi Probe s Big Data Processing for Traffic Evaluation of Bangkok Using Apache Hadoop Distributed System
GPS Enabled Taxi Probe s Big Data Processing for Traffic Evaluation of Bangkok Using Apache Hadoop Distributed System Paper Identification number: AYRF14-060 Saurav RANJIT 1, Masahiko NAGAI 1, 2, Itti
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
BIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes
Contents Pentaho Corporation Version 5.1 Copyright Page New Features in Pentaho Data Integration 5.1 PDI Version 5.1 Minor Functionality Changes Legal Notices https://help.pentaho.com/template:pentaho/controls/pdftocfooter
Generic Log Analyzer Using Hadoop Mapreduce Framework
Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,
Big Data Analytics OverOnline Transactional Data Set
Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala
How To Analyze Network Traffic With Mapreduce On A Microsoft Server On A Linux Computer (Ahem) On A Network (Netflow) On An Ubuntu Server On An Ipad Or Ipad (Netflower) On Your Computer
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pig and Typical Mapreduce Anjali P P and Binu A Department of Information Technology, Rajagiri School of Engineering and Technology,
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Navigating Big Data business analytics
mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley
WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley Disclaimer: This material is protected under copyright act AnalytixLabs, 2011. Unauthorized use and/ or duplication of this material or
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics April 10, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
L1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Transforming the Telecoms Business using Big Data and Analytics
Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
NIST/ITL CSD Biometric Conformance Test Software on Apache Hadoop. September 2014. National Institute of Standards and Technology (NIST)
NIST/ITL CSD Biometric Conformance Test Software on Apache Hadoop September 2014 Dylan Yaga NIST/ITL CSD Lead Software Designer Fernando Podio NIST/ITL CSD Project Manager National Institute of Standards
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis
, 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying
Unified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
Oracle Big Data Essentials
Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 40291196 Oracle Big Data Essentials Duration: 3 Days What you will learn This Oracle Big Data Essentials training deep dives into using the
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Asian Institute of Technology, Pathum Thani, Thailand Email: [email protected], [email protected]
GPS Enabled Taxi Probe s Big Data Processing for Traffic Evaluation of Bangkok Using Apache Hadoop Distributed System Paper Identification number: AYRF14-060 Saurav RANJIT 1, Masahiko NAGAI 1, 2, Itti
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Ubuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
BIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Real Time Bus Monitoring System by Sharing the Location Using Google Cloud Server Messaging
Real Time Bus Monitoring System by Sharing the Location Using Google Cloud Server Messaging Aravind. P, Kalaiarasan.A 2, D. Rajini Girinath 3 PG Student, Dept. of CSE, Anand Institute of Higher Technology,
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
How To Turn Big Data Into An Insight
mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Big Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
TORNADO Solution for Telecom Vertical
BIG DATA ANALYTICS & REPORTING TORNADO Solution for Telecom Vertical Overview Last decade has see a rapid growth in wireless and mobile devices such as smart- phones, tablets and netbook is becoming very
Integrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
http://glennengstrand.info/analytics/fp
Functional Programming and Big Data by Glenn Engstrand (September 2014) http://glennengstrand.info/analytics/fp What is Functional Programming? It is a style of programming that emphasizes immutable state,
Data Mining in Telecommunication
Data Mining in Telecommunication Mohsin Nadaf & Vidya Kadam Department of IT, Trinity College of Engineering & Research, Pune, India E-mail : [email protected] Abstract Telecommunication is one of
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
Large Scale Mobility Analysis: Extracting Significant Places using Hadoop/Hive and Spatial Processing
Large Scale Mobility Analysis: Extracting Significant Places using Hadoop/Hive and Spatial Processing Apichon Witayangkurn 1, Teerayut Horanont 2, Masahiko Nagai 1, Ryosuke Shibasaki 1 1 Institute of Industrial
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at [email protected].
Chase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
3 PRINCIPLES OF MOBILE DEVICE DATA FOR POPULATION DISTIRBUTOIN AND MOTION ANALYSYS
Exploring Population Distribution and Motion Dynamics through Mobile Phone Device Data in Selected Cities Lessons Learned from the UrbanAPI Project Jan Peters-Anders, Wolfgang Loibl, Johann Züger, Zaheer
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 [email protected] www.rittmanmead.com @rittmanmead About the Speaker Mark
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
MapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
How To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
