Distributed data processing in heterogeneous cloud environments

R.K. Uskenbayeva 1, A.A. Kuandykov 2, Zh.B. Kalpeyeva 3, D.K. Kozhamzharova 4, N.K. Mukhazhanov 5

1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com, 3 zhuldyz.kalpeeva@gmail.com, 4 dinara887@gmail.com, 5 mukazhan@mail.ru

1, 2, 5 International University of Information Technologies, Almaty, Kazakhstan
3, 4 Kazakh National Technical University named after K.I. Satpayev, Almaty, Kazakhstan

Abstract. This paper focuses on the characteristics of distributed computing in heterogeneous cloud environments using Hadoop MapReduce technology. A practical example of data processing and analysis using these technologies is given.

Key words: cloud computing, BigData, MapReduce, Hadoop

Introduction

At present, with the development of the cloud computing paradigm, which involves using a large number of processors working in parallel to solve computational problems, technologies for managing large amounts of data have also developed. One such instrument for distributed data processing is MapReduce (MR). MR is attractive to many programmers as a simple model on top of which users can build relatively sophisticated distributed programs. The present work focuses on the implementation features of distributed computing in heterogeneous cloud environments using Hadoop MapReduce technology.

The scope of MapReduce and Hadoop technologies is diverse and covers almost all sectors of industry and business that need access to large, often unstructured, data. In such situations conventional relational DBMSs cannot cope with processing and analysing large amounts of data, while the ability to perform the necessary computations quickly and to scale is a necessary condition for successful research. For efficient processing of large amounts of data, Google developed a distributed computing model called MapReduce in 2004 [1]. Examples of successful applications of this technology are described in detail in [2, 3].

1. The concept of MapReduce (cloud computing)

The MapReduce programming model is intended for distributed processing of tasks on a cluster of servers. It was created by Google [1], and the first implementation of this model, based on the distributed file system GFS (Google File System), was made there [2]. This implementation is widely used in software products, mostly Google's own, but it is proprietary and not available for external use [4]. MapReduce (MR) is thus a paradigm for performing distributed computations over large amounts of data [5]. According to this concept, the problem of handling large amounts of data is decomposed into two phases: map and reduce.

The map(f, j) phase takes a function f and a list j. It returns the list obtained by applying the function f to each element of the input list j. Map processes run on subsets of the input data and are executed independently of each other (Fig. 1).
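The following fragment is a minimal functional sketch of the map phase as just defined; it is illustrative, not taken from the paper, and the class and method names are assumptions.

// Functional view of the map phase: map(f, j) applies f independently to every
// element of the input list j, so the applications could run on different servers.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class MapPhaseSketch {
    // Returns the list [f(j1), f(j2), ..., f(jn)].
    static <J, R> List<R> map(Function<J, R> f, List<J> j) {
        List<R> out = new ArrayList<>();
        for (J element : j) {
            out.add(f.apply(element));   // each application is independent of the others
        }
        return out;
    }
}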

Figure 1. The Map phase (the input list j1, ..., jn is transformed element by element by f into the output list f(j1), ..., f(jn))

The reduce(f, j) phase takes a function f and a list j. It returns an object formed by aggregating the input data j through the function f. Reduce processes work on the output of the map phase, which is partitioned by key values into non-overlapping blocks, so they can also be executed independently (Fig. 2). Thus, each of the phases can run simultaneously on an arbitrary, pre-defined number of servers.

Figure 2. The Reduce phase (the input list j1, ..., jn is aggregated by f into the reduce output)

2. The architecture of Apache Hadoop

MapReduce technology gained real popularity through an open and accessible (open source) implementation created in the Hadoop project [6] by the Apache community. The widespread use of Hadoop MapReduce in various research and scientific projects demonstrates the undoubted benefits of this system and stimulates developers to improve it continuously. Hadoop MapReduce is a programming model (framework) for performing distributed computations over large amounts of data within the map/reduce paradigm; it is a set of Java classes and executable utilities for creating and running parallel processing jobs [5]. Hadoop also allows the map and reduce implementations to be arbitrary programs; interaction between Hadoop and such a program can be implemented through standard input and output streams.

The Hadoop platform consists of several elements. At the base of the Hadoop architecture is the Hadoop Distributed File System (HDFS), which distributes files across multiple storage nodes in the Hadoop cluster (Fig. 3). Above the HDFS file system sits the MapReduce engine, consisting of nodes of the JobTracker and TaskTracker types. To explain how Hadoop operates, this section gives a brief description of each of these elements.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to store very large amounts of data (terabytes or even petabytes) and to provide high-speed access to this information [7]. All files stored in HDFS are divided into a series of fixed-size blocks, 64 MB by default. To ensure reliability, copies of each block (replicas) are stored on multiple servers, three by default. The block size and the number of replicas (the replication factor) can be set individually for each file.
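As an illustration of these per-file settings, a minimal sketch using the Hadoop FileSystem Java API could look as follows; it is not from the paper, and the path, buffer size and values are assumptions.

// Creating a file with an explicit replication factor and block size, and changing
// the replication factor of an existing file; picks up core-site.xml / hdfs-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/applicants/2013.txt");   // hypothetical path
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("example record");
        out.close();
        // The replication factor can also be changed after the file has been written.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}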

HDFS follows a "master-slave" architecture very similar to GFS. The main server is called the NameNode, and the slave servers are DataNodes [3].

Figure 3. Architecture of Hadoop (the MapReduce engine, consisting of a JobTracker and several TaskTrackers, runs above an HDFS cluster consisting of a NameNode and several DataNodes)

A node of the NameNode type exists in a single copy and acts as the metadata service of HDFS, while nodes of the DataNode type serve as its storage units. A Hadoop cluster contains a single NameNode and hundreds or thousands of DataNodes. Actual input/output operations do not pass through the NameNode; it handles only the metadata that maps file blocks to DataNodes. When an external client sends a request to create a file, the NameNode responds with the identification data of the file block and the IP address of the DataNode that will hold the first copy of the block. The NameNode also informs the DataNodes that will receive the other copies of the file block.

The NameNode receives periodic status messages (so-called heartbeat messages) from each DataNode. If a DataNode cannot send a status message, the NameNode can take corrective action and replicate the blocks located on the failed node to other nodes in the cluster. Similar actions are taken in the event of a drive failure on a DataNode server, damage to individual replicas, or an increase in a file's replication factor. In the current implementation of HDFS the master node is a "weak point" of the system: when the NameNode fails, the system becomes inoperable until manual intervention restores it; automatic restart of the NameNode and its migration to another machine are not implemented yet.

Computations in Hadoop also use a "master-worker" architecture. Similar to Google MapReduce, the system has a dedicated control process (the JobTracker) and many worker processes (TaskTrackers), which execute all user tasks. The JobTracker accepts jobs from applications, splits them into map and reduce tasks, allocates the tasks to worker processes, tracks their execution and restarts them when necessary. A TaskTracker requests tasks from the control process, uploads the code and executes the task, notifies the control process about the status of its tasks and provides access to the intermediate data of map tasks. Processes interact via RPC calls, and all calls go from the worker to the control process in order to reduce the control process's dependence on the state of the workers.
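To illustrate that the NameNode serves only metadata, a client can ask it for the block-to-DataNode mapping of a file without reading any data. The following sketch is illustrative and not part of the paper; the path is hypothetical.

// Querying file metadata: the NameNode returns the locations of each block,
// while the block contents themselves would be read directly from DataNodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/applicants/2013.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}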

3. Practical implementation of distributed data processing in the Hadoop environment

This section describes practical experience of handling a large amount of data using the MapReduce paradigm. For distributed computing we organized a cluster of five machines, each of which runs two virtual machines with Apache Hadoop pre-installed. As an experiment we took the task of processing unstructured data about the applicants of the university; in this article we consider the problem of counting the number of grants allocated per major.

The algorithm consists of several steps:
1. As an initial step, the Map function is applied to each element of the source collection. The duty of the Map function is to convert each element of the original collection into zero or more instances of Key/Value objects.
2. In the next step, the algorithm sorts all Key/Value pairs and creates new instances in which all values are grouped by key.
3. In the final step, the Reduce function is executed for each grouped Key/Value instance. The Reduce function returns a new object to be included in the resulting collection.

Figure 4. The scheme of the MapReduce programming model

Listing 1 provides an implementation of the Map function, and Listing 2 an implementation of the Reduce function.

Listing 1. Implementation of the Map function
Listing 2. Implementation of the Reduce function
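The original listings appear in the source only as images. The sketches below show what Listing 1 (the Map function) and Listing 2 (the Reduce function) might look like for the grant-counting task; the record layout (tab-separated fields with the major code and a "grant" funding flag at assumed positions) and all class names are assumptions, not the authors' code.

// Sketch of Listing 1: a Mapper that emits (major, 1) for every applicant record
// that represents an allocated grant; field positions are assumed.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrantMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text major = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 3 && "grant".equalsIgnoreCase(fields[3])) {
            major.set(fields[2]);        // key: code of the major (specialty)
            context.write(major, ONE);   // value: one allocated grant
        }
    }
}

// Sketch of Listing 2: a Reducer that sums the ones emitted for each major,
// producing the total number of grants allocated per major.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GrantReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text major, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(major, new IntWritable(sum));
    }
}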

Finally, the results are collected (see Listing 3).

Listing 3. Collecting the results

The results obtained from processing the input data during the experiment are shown in Figure 5.

Figure 5. The processing result of the MapReduce functions
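Listing 3 is likewise not reproduced in the source. A minimal sketch of a driver that submits the job and collects the per-major counts into an HDFS output directory is given below; the class names, paths and the use of the reducer as a combiner are assumptions.

// Sketch of Listing 3: configures the job, wires the mapper and reducer from the
// previous sketches, and writes one (major, number of grants) pair per output line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrantCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "grants per major");    // Job.getInstance(conf, ...) in newer Hadoop
        job.setJarByClass(GrantCountDriver.class);
        job.setMapperClass(GrantMapper.class);
        job.setCombinerClass(GrantReducer.class);        // local pre-aggregation of the ones
        job.setReducerClass(GrantReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/applicants
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /results/grants-per-major
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a driver could be launched, for example, with "hadoop jar grants.jar GrantCountDriver /data/applicants /results/grants-per-major" (paths illustrative); each line of the output files then contains a major code and the number of grants allocated to it.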

Conclusion

In this article we have given only a small example of data analysis and briefly touched upon Hadoop's possibilities, without delving into all the benefits of its infrastructure. Even from this small case study one can see that Hadoop greatly simplifies data analysis, allowing the work to be spread over a distributed set of cluster nodes. Although the original implementation of this technology is a proprietary development, its public counterparts are actively evolving thanks to open-source projects. Thanks to Hadoop, distributed processing and analysis of data have become available not only to giants like Google and Yahoo, but also to ordinary users. These technologies, which originated in business, are also beginning to be used in the academic world, since modern scientific and research problems often place the same demands on computing resources as the problems of large companies. In the future we plan to fully explore these technologies and apply their capabilities to the needs of the academic community.

References
1. Dean J., Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc., 2004.
2. Stonebraker M., Abadi D., DeWitt D. J., et al. MapReduce and parallel DBMSs: friends or foes? // Commun. ACM. Vol. 53, no. 1.
3. Lam C. Hadoop in Action. Moscow: DMK Press.
4. Kuznecov S. MapReduce: inside, outside, or from the side of parallel databases?
5. Petukhov D. Hadoop MapReduce. Basic concept and architecture.
6. Apache Hadoop home page: https://hadoop.apache.org/
7. Sukhoroslov O.V. New technologies for distributed storage and processing of large data sets // All-Russian competitive selection of overview and analytical articles on the priority area "Information and telecommunication systems".
