Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
|
|
- Myron Ray
- 7 years ago
- Views:
Transcription
1 Volume 6, Issue 3, March 2016 ISSN: X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Special Issue on 5 th National Conference on Recent Trends in Information Technology 2016 Conference Held at P.V.P. Siddhartha Institute of Technology Kanuru, Vijayawada, India A Relative Study on Traditional ETL and ETL with Apache Hadoop 1 Y. Ramu, 2 C P Pavan Kumar Hota, 3 Dr. B. V. Subba Rao 1 Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Assistant Professor, Department of IT, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 3 Professor, Dept of IT, PVP Siddhartha Institute of Technology, Vijayawada, Andhra Pradesh, India Abstract ETL process is the backbone of data warehouse or data processing system, as it supplies data with necessary integration from heterogeneous and distributed data sources. Before processing the data to data processing systems, it extracts from multiple sources, then cleanses formats and loads the data into systems for analysis. Big data includes data sets with sizes beyond the ability of other software tools to capture, manage and process data within an elapsed time. In this paper, we present various ETL operations to handle big-data with Apache Hadoop; distinct hadoop components; and a brief comparative study between traditional ETL & ETL with hadoop sytems. Keywords ETL, Big data, Hadoop. I. INTRODUCTION Web based organizations collect a lot of data from multiple sources. Now a days, web-scale-companies are everywhere in the market. So, the amount of data organized by these web-scale-companies becomes huge over time. Developing applications on top of this huge data have their own challenges. As the data keeps on growing, then it becomes too complex to develop, too costly to operate and takes too much of time to execute. Big Data: Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable tasks like capture, storage, distribution, management, and analysis of the information. It is a computing infrastructure that can take in, validate and analyze high volume of data, and analyzing diverse data (structured/unstructured) from multiple sources. Figure 1: 3 V's of Big Data Big data are large-scaled heterogeneous data in terms of quantity, complexity, semantics, distribution, and processing costs in computer science, information science, cognitive informatics, web-based computing, cloud computing, and computational intelligence. Big data has become a popular phrase that can be understood by various definitions with respect to the Volume, Velocity, Variety (as shown in Figure:1). Later studies revealed out that the definition of 3V's is insufficient to explain the big data we face now. Thus, Veracity, Validity, Value, Variability were added too. Volume: Volume refers to the huge amount of data that is generated from various sources. Overall more than 3 exabyte's of data is generated every day and it keeps growing. Velocity: Velocity refers to the speed of data generation. i.e. how fast the data is generated and processed to meet the demands and the challenges ahead in the enterprise growth. Variety: Variety refers to the format of data. Now a days, data that is generated from various sources can be in three different formats: a. Structured Data: Structured data is data that is having a pre-defined schema, like RDBMS, XML etc. b. Semi Structured Data: Semi-Structured data is data that may have a pre-defined schema, it is often ignored, like XML, JSON etc. c. Un-Structured Data: Un-Structured data is data that does not have any pre-defined format, like Text Files, Images, Audio Files, Video Files, etc. 2016, IJARCSSE All Rights Reserved Page 74
2 Hadoop: Usually, the developers might use the databases such as Oracle, SQL Server etc.., of their choice to store the data. In this case, when user interacts with the application it works fine, if the data is less in volume. But, when it comes to deal with huge amount of data, then it becomes a problem with single database.to overcome this problem, Google provided a solution called "Map reduce". This map reduce algorithm divides the task into smaller components and assign those individual smaller tasks to many computers and collects the result by integrating individual results to from the final result set. Using the map reduce algorithm, Doug Cutting with his team developed open source software called Hadoop. In which, data is processed in parallel and is able to develop application that performs statistical analysis with huge amount of data. In other words, Hadoop allows distributed processing and storage of datasets to clusters of computers. Hadoop architecture mainly consists of two layers (as shown in Figure:2): 1. Processing layer (Map Reduce) and 2. Storage layer (HDFS) Figure 2: Major Layers of Hadoop II. TRADITIONAL ETL ETL is a popular architectural pattern for Data Warehouse[5]. ETL process extracts data from multiple sources, then cleanses, formats and loads it into a data warehouse for analysis (as shown in Figure:3 and Figure:4). When the source data sets are large, fast and loads unstructured, then it becomes too complex to develop, too expensive to operate and takes too long time to execute. Figure 3: Traditional ETL Extract phase refers to extracting data from various sources that produce them. Data warehousing systems integrate data from various logically related sources. Transformation phase refers to transform the source data into standardized form. Then, Load phase, which moves or uploads the data into target system. Figure 4: ETL Process III. ETL WITH HADOOP An ETL workflow with Hadoop process comprises various tasks as follows: Initially, input data from various data source, which contains data in different formats to HDFS Map the consolidated data in to a table to make it query table The target data is transformed with in a finalized format, and is mapped to destination source. Convert all input data sources information in to target format and make it available at central. Use the finalized data available at central for reporting or analytics. 2016, IJARCSSE All Rights Reserved Page 75
3 Extract, Transform, Load operations can be performed with Apache Hadoop[1]. It includes various components such as Map Reduce, HDFS, Apache Flume and Apache Sqoop (as shown in Figure: 5). Figure 5: ETL Process with Hadoop and its Components Map Reduce: Map-reduce technique contains two phases, i) Map-phase and ii) Reduce-phase. The overall mapreduce approach is can be described as various internal tasks as shown in the Figure:6. Map phase accepts a data set and converts it into another data set by breaking individual elements into (key, value) pairs. Reduce-phase performs after the map-phase. This phase accepts the output of map-phase as an input of reducephase, thus combines (key, value) pairs into smaller set of tuples. Combiner, combines similar data from the map phase into separate sets. It accepts intermediate keys as input from the map phase and aggregates values in a smaller scope. Shuffle and Sort phase, shuffles the individual (key, value) pairs based on their similarity and then (key, value) pairs are sorted by key into a larger data list. In Reducer phase, data can be aggregated, filtered and combined in a number of ways. It results to zero or more (key,value) pairs. Finally the output phase, translates the key value pairs from the reducer function and writes them onto a file using record writer. Figure 6: Map-Reduce Work Flow HDFS: Hadoop File System is used for storing the data [3] on Hadoop cluster. One can read the files from HDFS and write the files to HDFS. Data Files are split into blocks before storing on the cluster. The size of each data block is either 64MB (default size) or 128 MB. Hadoop file system architecture is based on Master/Slave architecture as shown in Figure: 7, which consists of Name node and Data nodes. Figure 7: Hadoop File System Name Node: It is the master node which stores meta data information such as file names, number of blocks, details of data node blocks containing information etc.. If Name node fails then the cluster becomes inaccessible. Due to this reason, Secondary Name Nodes are encouraged. 2016, IJARCSSE All Rights Reserved Page 76
4 Secondary Name Node: It performs periodic check points. Data node cannot connect to secondary Name node. It is just used for the recovery of name node. Data Node: It acts as a slave node that stores the blocks of data to local file system. Each data node periodically sends reports to the name node. Block report contains a list of all blocks that are available on a data node. Apache Flume: It is a distributed system used for collecting, aggregating and moving large amounts of data from multiple sources into Hadoop File System (HDFS). Flume is a distributed system that gets data from multiple sources and aggregates them for processing as shown in Figure: 8. It contains various components like: Source node is feeds the data to one or more channels. Channel is the location representing various data events. Sink node, acts as the transporter to move data events from a channel to destination. Figure 8: Apache Flume Figure 9: Channel Processor Flume, helps in integrating and aggregating various data sources. It facilitates data reading from internal and external data sources. It supports channels to hold various data events raised by data sources as per pre-configured approach. Initially, clients send various data events to agents. Sources operating with an agent will receive these events, and passes received events through interceptors. If not filtered then, put them on channels identified by the channel selector. Often, we will be having more than one channel for effective data-events and channelizing. At channel selector, by default, every channel will gets a copy or other option is channel picked based on header value. Interceptor is applied to source configuration element. One source can have many interceptors. Interceptor can be used for tagging, filtering and routing as shown in Figure:9. Apache Sqoop: "Sqoop in the name itself shows that "Sql to Hadoop and Hadoop to Sql". It means, it is a tool for transferring data between Hadoop and Transactional database. In other words, using Sqoop, one can import data from other data base storage systems into HDFS. Then apply Map-Reduce on that data,and then export the data back to the RDBMS. Sqoop automatically processes the data to import and export in parallel with fault tolerance. Apache Sqoop s import and export process is described as follows: i) Apache Sqoop import process: Step-1: Sqoop examines the databases to gather the necessary meta data for the data being imported. Step-2: Map only hadoop job submitted to cluster by sqoop.the mapped only job performs the data transfering using hadoop. Apache Sqoop imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in Hadoop File System. By default, these files contain delimited fields with new lines separating different records. These records are stored in text format either in text files or binary files. ii) Apache Sqoop export process: Step-1: Examines the database for metadata followed by second step of transferring the data. Step-2: Transfer the data 2016, IJARCSSE All Rights Reserved Page 77
5 Sqoop divides the input data set into splits. It uses individual map task to push the split to database. Each map task performs many transactions to ensure optimal throughput and minimum resource utilization. IV. COMPATIVE STUDY OF TRADTIOAN ETL & ETL WITH HADOOP Here, we have with a brief comparative study on ETL approach in traditional and hadoop systems. A traditional ETL approach focuses on extraction, transformation and loading of various OLTP (On-Line Transactional Processing) data sources in to a central data store. Before loading data, it transforms all the data in to a standardized format after performing various data cleaning tasks. Finalized format of data in the central store is generally in row-column fashion. This traditional ETL helps data professionals to query for various types of data and retrieve a valuable information. Data across all operational and analytical systems is in a unique format, so it facilitates data professionals for various data operations time to time. But, data professionals must accept overhead time at data staging. In recent times, hadoop systems with HDFS, facilitates data professionals to store all data sources in terms HDFS mapped tables which are very flexible for data queries. Source data can also be transformed to a desired format and can be stored at a required remote location. Data storing and querying become very easy and flexible. A brief comparative study with five different parameters between tradional ETL and ETL with hadoop is shown in the Tabel:1. Table 1: Traditional ETL Vs. ETL with Hadoop Comparative Parameter Traditional ETL with ETL Hadoop Possibility of handling unstructured data Is the system scalable to desired extent Is it facilitates at low cost Is the system provides high security Is the system supports dynamic real-time data analysis Building an ETL systems with HDFS will certainly improves the way of data handling. Here, we have following advantages of ETL with Hadoop over Traditional ETL An improved phenomenon of having a centralised data centre for all the data sources. No burden of data movements across multiple clusters Provision of data access as per source format Data transformations are expressed in terms of source platform features and can refer any one of the hadoop resident data sources as well. So, we believe that data professionals will be facilitated more to access any required data from any remote machine for their data processing or data analytics purpose. These ETL with hadoop systems will certainly help data professionals in the current big-data era. V. CONCLUSIONS Effectiveness of any decision support systems depends on ETL process. In the current big-data era, to cope with emerging data trends, ETL approach should always be refined from time to time. In this paper, our comparative study represents that adopting ETL with apache hadoop system will certainly help data professionals with its rich set of components and tools available to provides improved performance in data automation, schema updates and business intelligence applications with better scalability, security and real time features. REFERENCES [1] Jaswender Malik, Kavita Framework for ETL with hadoop map reduce, International Journal of technology enhancements and emerging Engineering Research, Vol 3, ISSUE 07 ISSN [2] Marcel Kornacker, Lenni Kuff, From Raw Data to Analytics with No ETL, Cloudera, Inc.! [3] M. Bala, O. Boussaid, and Z. Alimazighi, Big-ETL: Extracting,Transforming, Loading Approach for Bigdata, Int'l Conf. Par. and Dist. Proc. Tech. and Appl.PDPTA'15 [4] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp , [5] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos, Conceptual modeling for etl processes, in Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP. ACM, 2002, pp , IJARCSSE All Rights Reserved Page 78
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationR.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationInternational Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop
ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationBig Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
More informationFinding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2
More informationManaging Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationData Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
More informationKeywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
More informationImproving Data Processing Speed in Big Data Analytics Using. HDFS Method
Improving Data Processing Speed in Big Data Analytics Using HDFS Method M.R.Sundarakumar Assistant Professor, Department Of Computer Science and Engineering, R.V College of Engineering, Bangalore, India
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationQuery and Analysis of Data on Electric Consumption Based on Hadoop
, pp.153-160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang
More informationA Comparative Study on Operational Database, Data Warehouse and Hadoop File System T.Jalaja 1, M.Shailaja 2
RESEARCH ARTICLE A Comparative Study on Operational base, Warehouse Hadoop File System T.Jalaja 1, M.Shailaja 2 1,2 (Department of Computer Science, Osmania University/Vasavi College of Engineering, Hyderabad,
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationBig Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationNative Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy
Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationOptimization of ETL Work Flow in Data Warehouse
Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationMobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationReference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationTesting 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationBig Data: Study in Structured and Unstructured Data
Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 mail2motashim@gmail.com, khanwasim051@gmail.com Abstract With the overlay of digital world, Information is available
More informationKeywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop
Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning
More informationA Study of Data Management Technology for Handling Big Data
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,
More informationCertified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
More informationResearch on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationStorage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationManaging Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
More informationEnhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationBIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
More informationIndian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Perspective Big Data Framework for Healthcare using Hadoop
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationApproaches for parallel data loading and data querying
78 Approaches for parallel data loading and data querying Approaches for parallel data loading and data querying Vlad DIACONITA The Bucharest Academy of Economic Studies diaconita.vlad@ie.ase.ro This paper
More informationJournal of Environmental Science, Computer Science and Engineering & Technology
JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationW H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationTHE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS
THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS WHITE PAPER Successfully writing Fast Data applications to manage data generated from mobile, smart devices and social interactions, and the
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationHadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
More informationBig Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationBig Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
More informationWhat's New in SAS Data Management
Paper SAS034-2014 What's New in SAS Data Management Nancy Rausch, SAS Institute Inc., Cary, NC; Mike Frost, SAS Institute Inc., Cary, NC, Mike Ames, SAS Institute Inc., Cary ABSTRACT The latest releases
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationHow to Enhance Traditional BI Architecture to Leverage Big Data
B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...
More informationAddressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015
Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO Big Data Everywhere Conference, NYC November 2015 Agenda 1. Challenges with Risk Data Aggregation and Risk Reporting (RDARR) 2. How a
More informationAnalyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen
Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Anil G, 1* Aditya K Naik, 1 B C Puneet, 1 Gaurav V, 1 Supreeth S 1 Abstract: Log files which
More informationTrustworthiness of Big Data
Trustworthiness of Big Data International Journal of Computer Applications (0975 8887) Akhil Mittal Technical Test Lead Infosys Limited ABSTRACT Big data refers to large datasets that are challenging to
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationBIG DATA IN BUSINESS ENVIRONMENT
Scientific Bulletin Economic Sciences, Volume 14/ Issue 1 BIG DATA IN BUSINESS ENVIRONMENT Logica BANICA 1, Alina HAGIU 2 1 Faculty of Economics, University of Pitesti, Romania olga.banica@upit.ro 2 Faculty
More informationBig Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
More informationManifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
More informationTurkish Journal of Engineering, Science and Technology
Turkish Journal of Engineering, Science and Technology 03 (2014) 106-110 Turkish Journal of Engineering, Science and Technology journal homepage: www.tujest.com Integrating Data Warehouse with OLAP Server
More informationBig Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
More informationData Services Advisory
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationDYNAMIC QUERY FORMS WITH NoSQL
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationData Warehousing and Data Mining in Business Applications
133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business
More informationSurfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
More informationhttp://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationClient Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
More informationA Database Hadoop Hybrid Approach of Big Data
A Database Hadoop Hybrid Approach of Big Data Rupali Y. Behare #1, Prof. S.S.Dandge #2 M.E. (Student), Department of CSE, Department, PRMIT&R, Badnera, SGB Amravati University, India 1. Assistant Professor,
More informationBoarding to Big data
Database Systems Journal vol. VI, no. 4/2015 11 Boarding to Big data Oana Claudia BRATOSIN University of Economic Studies, Bucharest, Romania oc.bratosin@gmail.com Today Big data is an emerging topic,
More informationBIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS
BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS Megha Joshi Assistant Professor, ASM s Institute of Computer Studies, Pune, India Abstract: Industry is struggling to handle voluminous, complex, unstructured
More informationAnalysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
More informationAnalytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
More information