A Survey on Issues and Challenges in Handling Big Data

Sandeep K N, Usha R G
Dept of Information Science and Engineering, JSS Academy of Technical Education, Bangalore, India.
knsandeep7@gmail.com, usha.r.g1218@gmail.com

Abstract: Data is central to today's technology, and its volume now ranges from gigabytes to terabytes, petabytes and exabytes. Big Data techniques allow such large pools of data to be brought together and analyzed. Big Data is a collection of large data sets generated from every phone, website and application across the Internet. Because of its huge volume and the speed at which it is generated, it is very difficult for a single machine to maintain and process Big Data. Hence Hadoop is used to manage it. The technologies used for Big Data include Hadoop, MapReduce, Hive and NoSQL databases. This paper covers the features, functionalities and challenges of Big Data, Hadoop, HDFS and MapReduce.

Keywords- Big data, Hadoop, HDFS, Map Reduce.

1. INTRODUCTION

In ancient days, when there was no technology, people recorded their data by writing on wood with charcoal and by carving on stone. As time passed, they stored data on paper and cloth. Later inventions and discoveries made it possible to store data in vacuum tubes, magnetic tapes, floppy disks, CD-ROMs, hard disks, pen drives, memory cards, Blu-ray discs and so on. Continuing this trend, technology has changed drastically in order to accumulate huge amounts of data, which has led to Big Data[3]. With the immense growth of technology, production and services, large amounts of data are generated from different sources in different domains, and this data can be structured, semi-structured or unstructured. In their daily routines, people store large amounts of data on Facebook, Twitter, Google Drive, mail services, YouTube and similar platforms.
These companies have to provide storage for huge amounts of data. This massive use of storage is what brought Big Data into existence. Big Data is a collection of large data sets that are continuously being generated. In the 1990s, people typically used hard disks of 1 GB to 20 GB capacity. The size of Big Data is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big Data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex and of massive scale. In future years we may reach a situation where thousands of zettabytes of disk space are needed to store the data. The need for Big Data arose in large companies such as Facebook, Yahoo, YouTube and Google, which analyze enormous amounts of data in unstructured as well as structured form.

Figure 1: Big Data[1]

New skills are needed to fully harness the power of big data. Though courses are being offered to prepare a new generation of big data experts, it will take some time to get them into the workforce. Leading organizations are developing new roles, focusing on key challenges and creating new business models to gain the most from big data [4].

Big data includes data produced by different devices. The different sources of Big Data are given below [5]:

Black Box Data - The black box is a component used in airplanes, jets and helicopters. It records the voices of the flight crew.
Social Media Data - Social media such as WhatsApp, Facebook and Twitter store the data and views posted by people all around the globe.
Stock Exchange Data - Stock exchanges hold information about the buy and sell decisions made by various companies.
Power Grid Data - Power grid data holds information on the power consumed by a particular node with respect to a base station.
Transport Data - It stores information about the model, capacity, distance and availability of vehicles.
Search Engine Data - Search engines retrieve large amounts of data from various databases.

There are various technologies in the market from different vendors, including Amazon, IBM and Microsoft, to handle big data.

2. CHARACTERISTICS OF BIG DATA

The seven V's of Big Data are:

Volume- With the advancement of technology, the amount of data that is generated and collected is rapidly increasing. If the volume is in gigabytes it is probably not Big Data, but at the terabyte and petabyte level and beyond it may very well be. Volume is a key reason why traditional relational database management systems (RDBMS) fail to handle Big Data. Volume refers to the actual quantity of data.

Velocity- Velocity refers to the increasing speed at which data is created, and the speed at which it must be processed, stored and analyzed. It covers both data at rest and data in motion. Velocity in big data primarily refers to how fast the data is generated. It also incorporates timeliness or latency: is the data being captured at a rate, or with a lag time, that makes it useful?

Variety- Variety refers to the different types of data generated and how the data is stored. The data can be structured, semi-structured or unstructured. Legal records and data in an RDBMS are structured data. Blogs and log files are good examples of semi-structured data. Unstructured data is stored in the form of audio, video, images, text, graphs and the output of all types of machine-generated data from sensors, devices, cell phone GPS signals, DNA analysis devices and so on.

Variability- As data changes from time to time, it becomes inconsistent. This is particularly the case when gathering data relies on language processing. Such inconsistency makes the data hard to manage and handle efficiently.

Veracity- Veracity refers to the biases, noise and abnormality in the data. The question is whether the data being stored and mined is meaningful to the problem being analyzed. Veracity is the biggest challenge in data analysis when compared to volume and velocity. The quality of data varies greatly from one source to another, and the precision of data analysis depends on the veracity of the source data.

Visualization- Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Data visualizations are everywhere and more important than ever. From creating a visual representation of data points as part of an executive presentation, to showcasing progress, or even visualizing concepts for customer segments, data visualizations are a critical and valuable tool in many different situations.

Value- Value starts and ends with the business use case. The business must define the analytic application of the data and its potential associated value to the business. The potential value of big data is huge. The value lies in rigorous analysis of accurate data and in the information and insights this provides.

Figure 2: Characteristics of Big Data

Big Data cannot be stored on a single machine. It is normally stored across multiple machines. Internally there must be a structure that lets the machines combine their data and provide it to the end user.

3. ISSUES OF BIG DATA

Data access and connectivity can be a hindrance.
Processing time increases as the data size increases. Hence immediate retrieval of important information may be impossible.
Incomplete data also creates uncertainty, and correcting such data is difficult. Incomplete data refers to missing values, and algorithms are used to compensate for them.
Storing and managing huge amounts of data is quite difficult, and retrieval is also a major challenge.

Difficulties arise from the heterogeneous mixture of data because the data formats and patterns vary greatly. Data can be structured, semi-structured or unstructured. Converting unstructured data to a structured format is a major challenge.

4. CHALLENGES OF BIG DATA

In today's business environment, besides storing and finding the relevant data, access must also be quick. As huge amounts of data are stored, access speed may decrease. Hence reliable hardware must be used.
Even if we can find and analyze the data quickly, the major challenge is to have accurate and valuable data. Hence data quality must be assured.
Understanding the data takes a lot of time. Hence we need people with domain expertise and a good understanding of the data.
The data collected must be identified and the right solution implemented to analyze it accurately.
Security threats to big data environments, or to data stored within a cluster, must be addressed.

5. HADOOP AND HDFS

Since we store huge amounts of data, the processing time must be reduced in order to achieve efficiency. The best solution for this is Hadoop. Hadoop was created by Doug Cutting and Mike Cafarella. Hadoop is an open source, Java-based programming framework from the Apache Software Foundation for storing and processing huge data sets on clusters of commodity hardware.

Hadoop is designed to store huge data sets and is not recommended for small data sets. Hadoop has five services:

Name node
Secondary name node
Job tracker
Data node
Task tracker

The first three services are called master services or master nodes. The last two services are called slave services or slave nodes. Master services can talk to each other, and slave services can talk to each other. If the name node is the master service, then the data node is the corresponding slave service. If the job tracker is the master service, then the task tracker is the corresponding slave service.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The Hadoop framework includes the following four modules[6]:

Hadoop Common
Hadoop YARN
Hadoop Distributed File System (HDFS)
Hadoop MapReduce

Hadoop Distributed File System (HDFS)- It is a technique for storing huge amounts of data with a streaming access pattern on a cluster of commodity hardware. A streaming access pattern means writing once and reading any number of times. HDFS has a default block size of 64 MB (in Hadoop 1.x; later releases default to 128 MB). The block size can also be increased. In a normal operating system, if we store 2 KB of data in a 4 KB block, the remaining space is wasted. In HDFS the space is not wasted: a block that is not full occupies only as much disk space as the data it holds.

Figure 3: Master Slave Architecture of Hadoop[2]

The client machine uses the name node service to store huge amounts of data. The name node maintains the metadata that keeps track of all information about storage. The data itself is stored in the data nodes. HDFS guards against data loss by storing multiple copies of the data. The name node is called a single point of failure because if the name node is lost, nothing can be accessed.

If a program needs to access the data stored in a data node, the job tracker requests the name node for access to the data. The name node responds by giving the metadata to the job tracker. The job tracker assigns the task to a task tracker. The task tracker chooses the nearest system (i.e., the nearest among the 3 replicas) and processes the data there. This process is called map. The pieces into which a file is divided for storage in HDFS are called input splits. The number of input splits is equal to the number of mappers.

6. MAP REDUCE

MapReduce is a programming model for processing large-scale datasets on computer clusters. MapReduce is a core component of the Apache Hadoop software framework. A MapReduce operation includes:

Specifying the computation in terms of a map and a reduce function.
Parallel computation across large-scale clusters of machines.
Handling machine failures and performance issues.
Ensuring efficient communication between the nodes.

The main reason to perform mapping and reducing is to speed up the execution of a specific process by splitting it into a number of tasks, thus enabling parallel work. The MapReduce programming model consists of two functions: a Map() method that performs filtering and sorting, and a Reduce() method that performs a summary operation. Hadoop runs MapReduce on data in the form of (key, value) pairs. A MapReduce cluster employs a master-slave architecture. This model is used to reduce network communication cost. Optimizing the communication cost is essential to a good MapReduce algorithm.

The following are the MapReduce components:

1. Name Node - It manages the HDFS metadata.
2. Data Node - It stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker - It manages jobs and resources in a cluster.
4. Task Tracker - It runs MapReduce operations.

Figure 4: Components of MapReduce

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Word count example of MapReduce

Figure 5: MapReduce word count[7]

MapReduce execution consists of the following phases:

Figure 6: Execution flow of data

Map: Reads the input splits from HDFS. Parses the input into records ((key, value) pairs). Applies the map function to each record. Informs the master node of its completion.

Partition: Each mapper must determine which reducer will receive each of its outputs. For any given key, the destination partition is the same. Number of partitions = number of reducers.

Shuffle: Fetches input data from all map tasks for the portion corresponding to the reduce task.

Sort: Merge-sorts all map outputs into a single run.

Reduce: Applies the user-defined reduce function to the merged run. Arguments: a key and the corresponding list of values. Writes the output to a file in HDFS.
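The phases above can be illustrated with a small in-memory sketch of word count. This is not Hadoop code: the function names (map_phase, partition, shuffle, sort_and_reduce, word_count), the byte-sum partitioner and the sample input are assumptions chosen for illustration, and each "input split" is simply a list of text lines.

```python
from itertools import groupby


def map_phase(split):
    """Map: parse one input split into records and emit (word, 1) pairs."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def partition(key, num_reducers):
    """Partition: a given key is always sent to the same reducer.

    A byte-sum hash is used (an illustrative choice) so the result is
    identical on every run.
    """
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Shuffle: each reducer collects its partition from every map output."""
    buckets = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def sort_and_reduce(bucket):
    """Sort: order the pairs into a single run. Reduce: sum the values per key."""
    run = sorted(bucket, key=lambda pair: pair[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(run, key=lambda pair: pair[0])
    }


def word_count(splits, num_reducers=2):
    """Run all phases: one mapper per input split, one result per reducer."""
    map_outputs = [map_phase(split) for split in splits]
    buckets = shuffle(map_outputs, num_reducers)
    result = {}
    for bucket in buckets:
        result.update(sort_and_reduce(bucket))
    return result


# Three hypothetical input splits, so three mappers run.
splits = [["deer bear river"], ["car car river"], ["deer car bear"]]
counts = word_count(splits)
print(sorted(counts.items()))
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Because the partitioner sends every occurrence of a word to the same reducer, each reducer can sum its keys independently and the per-reducer results never overlap.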

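Returning to the HDFS storage model of Section 5, the block, replica and mapper arithmetic can be worked through in a short sketch. The function name hdfs_layout and the 200 MB example file are hypothetical; the 64 MB block size and 3 replicas are the values stated in this paper, and the sketch assumes one input split per block.

```python
MB = 1024 * 1024


def hdfs_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Estimate how a file is laid out in HDFS (illustrative arithmetic only)."""
    if file_size_bytes <= 0:
        raise ValueError("file size must be positive")
    # Ceiling division: a partly filled last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # The last block holds only the remaining bytes, so no space is wasted.
    last_block_bytes = file_size_bytes - (num_blocks - 1) * block_size_bytes
    return {
        "blocks": num_blocks,
        "last_block_bytes": last_block_bytes,
        "mappers": num_blocks,  # assumption: one input split per block
        "block_replicas": num_blocks * replication,
        "raw_storage_bytes": file_size_bytes * replication,
    }


layout = hdfs_layout(200 * MB)
print(layout["blocks"], layout["last_block_bytes"] // MB, layout["block_replicas"])
# 4 8 12
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block, is processed by four mappers, and consumes 600 MB of raw cluster storage once replicated three times.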
7. CONCLUSION

Today, big data is no longer an experimental tool. Since data is growing exponentially all over the world, Big Data is becoming a new area for research and business applications. The analysis of big data helps business people make better decisions. Many companies have begun to achieve results with this approach. Big data technologies like Hadoop and MapReduce provide many advantages.

REFERENCES

[1]
[2]
[3] Sudha P. R, Assistant Professor, JSSATE, Bangalore, A Survey on MapReduce, Hadoop and YARN in Handing Big Data. International Journal of Advanced Research in Computer Science and Software Engineering, Volume 6, Issue 1, January 2016, ISSN: X
[4]
[5]
[6]
[7]

GUIDED BY
Sudha P. R, Assistant Professor, Department of ISE, JSS Academy of Technical Education, Bangalore, India.

AUTHORS' PROFILE
SANDEEP K N
USHA R G