
Big Data Analysis Techniques and Challenges in Cloud Computing Environment

Pawan Kumar 1, Aditya Bhardwaj 2, Amit Doegar 3
Department of Computer Science & Engineering, NITTTR Chandigarh, India
Pawanyadav235@gmail.com, adityaform@gmail.com, amit@nitttrchd.ac.in
IJAFRSE and ICCICT 2015, All Rights Reserved

ABSTRACT

With the rapid growth of data in today's world of cloud computing, log analysis has become a necessary task for identifying people's behavior in order to improve product sales and advertising. Large datasets from domains such as medicine, retail Wi-Fi, banking, and online stores must be analyzed to extract useful information. Log files are generated rapidly, and the effective management and analysis of such large datasets is an interesting and critical challenge. Virtual databases combined with parallel processing systems are the most appropriate solution for analyzing these log files. Different techniques are available to process this data, each with its own challenges. Hadoop, for example, has been accepted as a data processing model that provides processing and storage through the MapReduce programming model and the Hadoop Distributed File System. Important aspects of cloud computing and big data, such as resource management and performance optimization of analysis tools, are introduced.

Keywords: Big Data, Cloud Computing, Hadoop, HDFS, MapReduce, Hive, Storm.

I. INTRODUCTION

Today, in the era of cloud computing, the Internet is becoming more complex as everything moves online [1]. Every field puts its applications on the Internet in its own way. We can shop, bank, and do office work while sitting at home. Online service providers such as Flipkart and Amazon are eager to know whether they are providing the best services in the market [2]. High energy physics experiments such as DZero [17], for example, generate huge amounts of data every day. Social sites such as Facebook and Twitter handle data on the exabyte scale. Companies such as Google, Facebook, and YouTube use a number of artificial intelligence techniques to make instant decisions. The American government has made Big Data Research and Development a national policy [18].

Nowadays, servers generate files in different formats. Web application data and logs are generated in different structured formats (HTML, XML, tables, spreadsheets, etc.) from heterogeneous data sources. To integrate and process data and logs from these heterogeneous sources, a virtual database with parallel processing is the best solution [2][3]. One common log format is the W3C (World Wide Web Consortium) extended log file format [4], which is the default, customizable file format for a single site. Another common text-based log file format is NCSA (National Center for Supercomputing Applications). The W3C format records fields such as the date and the time at which the activity occurred, the IP address of the client or system that issued the request, the number of bytes sent and received by the web server, and the type of browser used by the client [5].

Issues

In big data processing, the following issues should be considered:

i. Types of Data: Log files may be in any structured or unstructured format. To mine these files, they need to be converted into a structured format, because traditional formats have a predefined schema. The Trans-log algorithm is used to convert logs to a structured format [1].

ii. Data Distribution: Traditional computation over log files takes much processing power and is complex. Hadoop has a simplified programming model, MapReduce, which is efficient and automatically distributes the work across the machines of a cluster.

iii. Fault Tolerance: Hadoop provides fault tolerance by replicating the data on three or more machines. If a machine stops working, another machine holding a replica of the same data takes over the remaining processing [1].

iv. Data Locality: To process log files, HDFS (Hadoop Distributed File System) spreads the blocks of the files over different nodes. Which data a node operates on therefore depends on node locality.

Figure 1: System's work flow [1]

Cloud computing and grid computing are among the fastest growing technologies. Both are intended to give access to large amounts of data by offering a single system view and aggregating resources. These technologies tackle large datasets such as multimedia, medical, and high-dimensional datasets. Of the two, cloud computing is the faster growing technology, and it deals with big data using a range of technologies. Big data is defined as a dataset whose size is beyond the ability of current technology. According to Gartner, big data are high-velocity, high-volume, and high-variety datasets that require new forms of processing to enable enhanced decision making. The main purpose of this paper is to introduce different query processing techniques for big data and its management technologies, followed by an implementation of Hadoop for data processing and the challenges in big data.

II. QUERY PROCESSING TECHNIQUES

1. MapReduce

MapReduce is a programming model introduced by Google for processing large volumes of data, together with an execution framework for large-scale data processing on commodity servers. The data may be in any format, but the model is designed to process lists of data.
The main job of MapReduce is to transform lists of input data into lists of output data. Often the available data is not in a readable format. MapReduce mainly consists of the following tasks:

i) Map Task: The map function takes an input record and generates keys (k1, k2, ..., kn) with associated values, emitting the value 1 for each key. During the map pass, the input is interpreted as records and the map function is applied to every record [2][5].

Map: (k1, v1) -> [(k2, v2)]

ii) Reduce Task: The reduce task takes the output of the map task as input [5]. It reduces the list of values for each input key to a single value by combining them [1].

Reduce: (k2, [v2]) -> (k3, v3)

iii) Shuffle Task: The process of moving intermediate output from the map task to the reduce task is called shuffling. Nodes exchange the intermediate output of the map tasks so that it reaches the reduce tasks.

2. Hadoop

Hadoop is the most popular open source platform for handling big data. It is an implementation of the MapReduce technique. It can work with multiple datasets, for example by aggregating data from multiple sources for large-scale processing. The primary storage system used by Hadoop applications is the Hadoop Distributed File System (HDFS). HDFS divides the original data into blocks and distributes them across different nodes, keeping replicas of each block on three or more machines [5]. HDFS has a NameNode, or master node, that manages the file system and controls clients' access to files [6]. The default HDFS block size is 64 MB [1], and the block size can also be set to a custom value. Hadoop has many applications, for example processing social media, traffic, weather, and sensor data. The Hadoop processing architecture is shown in Figure 2.

Figure 2: Hadoop Distributed File System Architecture [6]

3. Hive

Hive provides a query language, HiveQL, that is syntactically similar to SQL and allows queries to be run on a Hadoop cluster [11]. It allows the creation of tables that can be accessed remotely through an ODBC connection. Installing the ODBC driver for Hive on a client system makes it possible to connect to an HDInsight cluster and submit HiveQL queries.
Hive looks like a traditional database with SQL access, but it has some key differences because it depends on Hadoop and MapReduce operations. It is also helpful when you want to experiment with different schemas for the table format of the output.

4. Mahout

Mahout is another data processing technique, useful when you want to extract a specific type of information. It consists of several machine learning algorithms. Mahout is used when the source files contain items of interest for the data processing solution. Mahout queries run as a separate process, on a schedule, to update results. The results are then stored in cluster storage for export to a database or other tools [13]. Mahout is useful for extracting user preferences on the basis of user behavior. In data mining, frequent operations on recent data are performed using Mahout.

5. Pig

Pig is another open source query processing tool, developed by Yahoo. Pig uses the Pig Latin language, rather than an SQL-like language, to execute queries over data on a Hadoop cluster [11]. Pig supports complex query processing that produces results useful for analysis and reporting, such as merging and filtering datasets, processing data as a sequence of steps, and restructuring source data, for example grouping values or grouping columns into rows [12].

6. HCatalog

With existing technologies such as Pig and Hive, data can be processed in an HDInsight cluster. Each time a result is needed, however, we must either process the data or write code that projects a schema onto data stored at a particular location and then applies transformations and filters. HCatalog offers an abstraction layer that provides a consistent way for data to be loaded and stored, regardless of the specific processing interface being used [13]. HCatalog helps abstract the data storage location, format, and schema from the code used to process the data. It also provides a way to write applications that perform multiple jobs by making the data available to them.

7. Storm

Hadoop was not able to perform real-time streaming analysis of data such as sensor readings and online transactions, so new technologies such as Storm were introduced to analyze real-time data.
Storm is a real-time, scalable, fault-tolerant, distributed computation system for processing large and fast streams of data [21]. It treats each message as an individual task and processes messages using a number of user-defined parallel tasks. Storm is also helpful when data must be pre-processed before being loaded into the solution space and when data must be examined in real time.

III. MANAGEMENT SYSTEM IN BIG DATA

As data grows rapidly, it can no longer be managed using traditional management systems such as a DBMS. Traditional management techniques suffer from limited scalability and high cost. D. Kossmann et al. have presented four different architectures: replication, partitioning, caching, and distributed control. Big data processing companies such as Google and Microsoft provide different levels of service [8]. Cloud service providers use different techniques for handling big data. Most of the data being generated is in unstructured or semi-structured format. Google uses its own file system, GFS (Google File System) [15], a distributed file system that works like HDFS. MapReduce is the programming technique introduced by Google to process big data. Hadoop uses HDFS, a distributed file system, to handle data on clusters. Amazon's S3 (Simple Storage Service) aims to provide scalability, high availability, and low latency at low cost. Other file systems include the Moose File System (MFS) and the Kosmos Distributed System. These file systems are useful for managing data in a distributed environment.

A further management issue is the storage of data in different formats. Web data, for example, is both semi-structured and unstructured and is growing very fast. A simple distributed file system does not satisfy service providers such as Microsoft and Google. Google uses its own Bigtable, a distributed structured data storage system, to store huge amounts of data [16].
Yahoo uses a massive-scale hosted database called PNUTS, which is also helpful for creating new applications [19]. Dynamo is another data storage system, used to support Amazon's internal applications [20]. Llama is a hybrid data management system that combines the features of row-wise and column-wise databases.
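The map, shuffle, and reduce tasks described in Section II can be illustrated with a minimal single-process Python sketch. This is not Hadoop code: the sample log lines, the assumption that the client IP is the first field of each line, and the names SAMPLE_LOG, map_task, shuffle, reduce_task, and run_job are all hypothetical and chosen only for illustration. The sketch counts requests per client IP, with the map step emitting the value 1 for each key.

```python
from collections import defaultdict

# Hypothetical log lines; the client IP is assumed to be the first field.
SAMPLE_LOG = [
    "10.0.0.1 GET /index.html 200 512",
    "10.0.0.2 GET /cart 200 1024",
    "10.0.0.1 GET /checkout 302 128",
]


def map_task(record):
    """Map: (k1, v1) -> [(k2, v2)]. Emit (client IP, 1) for one log line."""
    _offset, line = record
    fields = line.split()
    if not fields:
        return []  # skip blank lines
    return [(fields[0], 1)]


def shuffle(mapped_pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_task(key, values):
    """Reduce: (k2, [v2]) -> (k3, v3). Combine the values into one count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of log lines."""
    records = list(enumerate(lines))  # (line number, line) plays the role of (k1, v1)
    mapped = [pair for record in records for pair in map_task(record)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network. Here everything runs in one process so that only the data flow is visible.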

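The HDFS figures quoted in Section II (a 64 MB default block size and replicas on three or more machines) lend themselves to a quick back-of-the-envelope calculation. The sketch below is plain arithmetic, not an HDFS API. The function name hdfs_footprint and the choice of exactly three replicas are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted in the text [1]
REPLICATION = 3     # assumption: exactly three replicas per block


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how a file of the given size is split and replicated.

    The last block may be only partly filled, so raw storage is the
    file size times the replication factor, not blocks times block size.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    # A 1000 MB log file: 16 blocks, 48 block replicas, 3000 MB of raw storage.
    print(hdfs_footprint(1000))
```

Passing a different block_size_mb shows the effect of setting a custom block size, which the text notes is possible.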
IV. HADOOP IMPLEMENTATION

As discussed above, Hadoop is a framework for processing huge amounts of data in a distributed environment. To perform data processing, we first need to set up the environment for it. Hadoop allows new nodes to be added whenever needed. It is installed with the following steps:

1. First, install Java on the system, which may run Windows or Linux (here we use Ubuntu).

Figure 3: Java installation process

2. Next, set up SSH, which is required to access the Hadoop nodes in the cluster, using the command below.

    sudo apt-get install ssh

3. After installing SSH, create a dedicated Hadoop user and generate an SSH key pair using the commands

    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hduser
    su - hduser
    ssh-keygen -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

4. After executing the first two commands in step 3, we may be asked for a file name. Leave it blank and proceed.

5. After this, fetch and install Hadoop in the directory

    cd /usr/local
    wget common/current/hadoop tar.gz

6. When the package has been downloaded, extract it using the command

    tar xfz hadoop tar.gz

7. After that, edit and set up the configuration files in the directory, and set the JAVA_HOME environment variable in ~/.bashrc.

8. Edit the .bashrc file using the command

    sudo gedit ~/.bashrc

9. In the directory, modify the following files:

    /usr/local/hadoop/etc/hadoop/hadoop-env.sh
    /usr/local/hadoop/etc/hadoop/core-site.xml
    /usr/local/hadoop/etc/hadoop/yarn-site.xml
    /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
    /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Figure 4: public-private key generation

Figure 5: .bashrc after editing

10. After editing the files, save and close them. Then format the new Hadoop file system using the command

    hdfs namenode -format

11. Finally, start Hadoop as a single-node cluster using the command

    start-dfs.sh

Then answer yes when prompted and run the command below:

    start-yarn.sh

After executing these commands, Hadoop starts running. Verify this by running the command

    jps

Figure 6: Hadoop in running mode

V. CHALLENGES IN BIG DATA

With the introduction of new technologies for big data processing, some challenges are also introduced that need to be kept in mind.

Performance: In today's online world even a nanosecond may affect your business, so big data must move at high velocity under all workload conditions [22]. Visualization helps in performing analysis and making decisions, but it becomes more challenging as the degree of granularity increases. Possible solutions are more hardware, more memory, and powerful parallel processing. Grid computing is another way to answer queries and improve performance.

High Availability: When you rely on big data, it should be available 24 hours a day and should never go down [23]. A certain amount of downtime should be built in.

Scale: Data is generated very fast, so scalability is also a challenge for data processing companies that must process it in time. Big data systems should be able to scale whenever required, to any scale [22].

Data Security: With the growth of the Internet, data is also generated from more sensitive sources, such as credit card data and personal ID information, which require more security. Users of such data processing worry about whether their data is safe. We should ensure that an organization's data, network, partners, and customers are protected end-to-end [24].

Addressing Data Quality:

To make data more useful for decision making, we should be able to find and analyze data quickly and in a proper format for information consumers. Addressing data quality is a challenge for data analysts, given the volume of information involved in big data projects. We should ensure a proactive method for addressing data quality issues [25].

Management: Management is also a major challenge in big data, as data is growing much faster and in different formats. Introducing new technology to manage big data is the biggest challenge for data analysts. Traditional management systems such as an RDBMS are costly and time consuming, and using them for big data is often a futile endeavor [23].

Dealing with outliers: For showing the relationship between trends and outliers, a graphical representation of data through visualization is much faster and better than tables of numbers and text. With charts, issues can be understood easily by pointing at the chart. In larger datasets, however, representing outliers is not possible. A possible solution is then to remove the outliers from the data [24].

Big Data talent gap: The big data talent gap is real. According to one statistic, by 2018 the US alone could face a shortage of more than 140,000 people with deep analytical skills. There is a growing community of tool developers, such as that around the Hadoop ecosystem. Some experts gained their experience through tool development and focus on the programming model rather than on data management aspects.

VI. CONCLUSION

This paper presents a survey of big data processing techniques in the cloud computing environment. We discussed different processing techniques along with the management techniques required to store huge data. Big data faces challenges such as real-time processing, which requires new techniques. As data grows, big data will become more complex and introduce more challenges, which create more opportunities for scholars.
Research scholars and industry need to cooperate to face these challenges and make cloud computing and big data a success.

VII. REFERENCES

[1] Narkhede, S., & Baraskar, T. HMR log analyzer: analyze Web application logs over Hadoop MapReduce. International Journal of UbiComp. pp. 41-51.
[2] Pandit, A., Deshpande, A., & Karmarkar, P. Log Mining Based on Hadoop's Map and Reduce Technique. International Journal on Computer Science & Engineering.
[3] Wada, Y., Watanabe, Y., Syoubu, K., Sawamoto, J., & Katoh, T. Virtual database technology for distributed database. In Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on. IEEE.
[4] Bakariya, A. B., & Thakur, B. G. S. User Behavior Analysis from Web Log using Log Analyzer Tool. IJCSNS. pp. 41-52.

[5] Dhole Poonam, B., & Gunjal Baisa, L. Survey Paper on Traditional Hadoop and Pipelined Map Reduce. International Journal of Computational Engineering Research. pp. 32-36.
[6] Chavan, M. V., & Phursule, R. N. Survey Paper On Big Data. International Journal of Computer Science and Information Technologies (IJCSIT).
[7] "Big data: science in the petabyte era," Nature 455 (7209): 1.
[8] D. Kossmann, T. Kraska, and S. Loesing, "An evaluation of alternative architectures for transaction processing in the cloud," in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[9] Pal, A., & Agrawal, S. An experimental approach towards big data for analyzing memory utilization on a hadoop cluster using HDFS and MapReduce. In Networks & Soft Computing (ICNSC), First International Conference. IEEE.
[10] Shim, K. S., Lee, S. K., & Kim, M. S. Application traffic classification in Hadoop distributed computing environment. In Network Operations and Management Symposium (APNOMS), 2014, 16th Asia-Pacific. IEEE. pp. 1-4.
[11] Rathee, S. Big Data and Hadoop with components like Flume, Pig, Hive and Jaql. In International Conference on Cloud, Big Data and Trust.
[12] Fuad, A., Erwin, A., & Ipung, H. P. (2014, September). Processing performance on Apache Pig, Apache Hive and MySQL cluster. In Information, Communication Technology and System (ICTS), 2014 International Conference on. IEEE.
[13] Sethi, P., & Kumar, P. (2014, August). Leveraging hadoop framework to develop duplication detector and analysis using Mapreduce, Hive and Pig. In Contemporary Computing (IC3), 2014 Seventh International Conference on. IEEE.
[14] Rumi, G., Colella, C., & Ardagna, D. (2014, September). Optimization Techniques within the Hadoop Eco-system: A Survey. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), International Symposium on. IEEE.
[15] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003.
[16] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed structured data storage system," in 7th OSDI, 2006.
[17]
[18]
[19] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2.
[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007.
[21] Jitkajornwanich, K., Gupta, U., Shanmuganathan, S. K., Elmasri, R., Fegaras, L., & McEnery, J. (2013, October). Complete storm identification algorithms from big raw rainfall data using MapReduce framework. In Big Data, 2013 IEEE International Conference on. IEEE.
[22] Challenges-of-Big-Data
[23]
[24] Mariyah, S. (2014, September). Identification of big data opportunities and challenges in statistics Indonesia. In ICT For Smart Society (ICISS), 2014 International Conference on. IEEE.
[25] Kaisler, S., Armour, F., & Espinosa, J. A. (2014, January). Introduction to Big Data: Challenges, Opportunities, and Realities Minitrack. In System Sciences (HICSS), Hawaii International Conference on. IEEE.


More information

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Enhancing Massive Data Analytics with the Hadoop Ecosystem www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and

More information

Loose Coupling between Cloud Computing Applications and Databases: A Challenge to be Hit

Loose Coupling between Cloud Computing Applications and Databases: A Challenge to be Hit International Journal of Computer Systems (ISSN: 2394-1065), Volume 2 Issue 3, March, 2015 Available at http://www.ijcsonline.com/ Loose Coupling between Cloud Computing Applications and Databases: A Challenge

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required. What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

BIG DATA IN BUSINESS ENVIRONMENT

BIG DATA IN BUSINESS ENVIRONMENT Scientific Bulletin Economic Sciences, Volume 14/ Issue 1 BIG DATA IN BUSINESS ENVIRONMENT Logica BANICA 1, Alina HAGIU 2 1 Faculty of Economics, University of Pitesti, Romania olga.banica@upit.ro 2 Faculty

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen

Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Anil G, 1* Aditya K Naik, 1 B C Puneet, 1 Gaurav V, 1 Supreeth S 1 Abstract: Log files which

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc. Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE

HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE Sayalee Narkhede 1 and Tripti Baraskar 2 Department of Information Technology, MIT-Pune,University of Pune, Pune sayleenarkhede@gmail.com

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Play with Big Data on the Shoulders of Open Source

Play with Big Data on the Shoulders of Open Source OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19

More information

Approaches for parallel data loading and data querying

Approaches for parallel data loading and data querying 78 Approaches for parallel data loading and data querying Approaches for parallel data loading and data querying Vlad DIACONITA The Bucharest Academy of Economic Studies diaconita.vlad@ie.ase.ro This paper

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

An Integrated Framework for Cloud Data Management in Educational Institutes

An Integrated Framework for Cloud Data Management in Educational Institutes An Integrated Framework for Cloud Data Management in Educational Institutes Indu Arora Department of Computer Science and Applications MCM DAV College for Women Chandigarh, India indarora@yahoo.co.in Dr.

More information

Keywords: Big Data, Hadoop, cluster, heterogeneous, HDFS, MapReduce

Keywords: Big Data, Hadoop, cluster, heterogeneous, HDFS, MapReduce Volume 5, Issue 9, September 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Study of

More information

Big Data on Cloud Computing- Security Issues

Big Data on Cloud Computing- Security Issues Big Data on Cloud Computing- Security Issues K Subashini, K Srivaishnavi UG Student, Department of CSE, University College of Engineering, Kanchipuram, Tamilnadu, India ABSTRACT: Cloud computing is now

More information

A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications

A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications Li-Yan Yuan Department of Computing Science University of Alberta yuan@cs.ualberta.ca Lengdong

More information

Big Data Racing and parallel Database Technology

Big Data Racing and parallel Database Technology EFFICIENT DATA ANALYSIS SCHEME FOR INCREASING PERFORMANCE IN BIG DATA Mr. V. Vivekanandan Computer Science and Engineering, SriGuru Institute of Technology, Coimbatore, Tamilnadu, India. Abstract Big data

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com.

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data

More information

Leveraging SAP HANA & Hortonworks Data Platform to analyze Wikipedia Page Hit Data

Leveraging SAP HANA & Hortonworks Data Platform to analyze Wikipedia Page Hit Data Leveraging SAP HANA & Hortonworks Data Platform to analyze Wikipedia Page Hit Data 1 Introduction SAP HANA is the leading OLTP and OLAP platform delivering instant access and critical business insight

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information