SPECIALISTS TRAINING IN BIG DATA USING DISTRIBUTED ARCHITECTURAL SOLUTIONS SERVICES. Проректор по учебной и воспитательной работе

Similar documents
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

ANALYTICS CENTER LEARNING PROGRAM

HDP Hadoop From concept to deployment.

Comprehensive Analytics on the Hortonworks Data Platform

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

HiBench Introduction. Carson Wang Software & Services Group

BIG DATA ANALYTICS For REAL TIME SYSTEM

Apache Hadoop. Alexandru Costan

Big Data - Business, Math, Technology Best combination for big data 商 业 理 解, 数 据 科 学, 技 术 实 践 之 完 美 结 合

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Use Case: Business Analytics

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Client Overview. Engagement Situation. Key Requirements

Upcoming Announcements

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

[Hadoop, Storm and Couchbase: Faster Big Data]

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

BIG DATA & DATA SCIENCE

Challenges for Data Driven Systems

Reference Architecture, Requirements, Gaps, Roles

Big Data and Analytics: Challenges and Opportunities

COMP9321 Web Application Engineering

TRAINING PROGRAM ON BIGDATA/HADOOP

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

From Spark to Ignition:

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

Transforming the Telecoms Business using Big Data and Analytics

Improving Data Processing Speed in Big Data Analytics Using. HDFS Method

Apache Hama Design Document v0.6

Moving From Hadoop to Spark

How To Create A Data Visualization With Apache Spark And Zeppelin

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Databricks. A Primer

Chase Wu New Jersey Ins0tute of Technology

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data and Industrial Internet

Big Data and Analytics (Fall 2015)

HDP Enabling the Modern Data Architecture

CAREER OPPORTUNITIES

Big Data Analytics - Accelerated. stream-horizon.com

HADOOP. Revised 10/19/2015

Dominik Wagenknecht Accenture

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Workshop on Hadoop with Big Data

Performance and Scalability Overview

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Databricks. A Primer

Big Data and Data Science. The globally recognised training program

Big Data Success Step 1: Get the Technology Right

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Hadoop Ecosystem B Y R A H I M A.

Blazent IT Data Intelligence Technology:

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

locuz.com Big Data Services

The 4 Pillars of Technosoft s Big Data Practice

Data Challenges in Telecommunications Networks and a Big Data Solution

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

In-Memory Analytics for Big Data

Certified Big Data and Apache Hadoop Developer VS-1221

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Hadoop IST 734 SS CHUNG

Big Data Explained. An introduction to Big Data Science.

Streaming items through a cluster with Spark Streaming

Intro to Big Data and Business Intelligence

Cloud Computing based on the Hadoop Platform

Data processing goes big

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

The Internet of Things and Big Data: Intro

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

SURVEY REPORT DATA SCIENCE SOCIETY 2014

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

Data Mining and Analytics in Realizeit

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

UPS battery remote monitoring system in cloud computing

MapReduce Evaluator: User Guide

How Companies are! Using Spark

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Big Data Storage Architecture Design in Cloud Computing

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Hadoop & Spark Using Amazon EMR

Integrating a Big Data Platform into Government:

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Transcription:

SPECIALISTS TRAINING IN BIG DATA USING DISTRIBUTED ARCHITECTURAL SOLUTIONS SERVICES M. Batura 1 S. Dzik 2 I. Tsyrelchuk 3 Ректор Белорусского государственного университета информатики и радиоэлектроники (БГУИР), доктор технических наук, профессор, академик Международной академии наук высшей школы, заслуженный работник образования Республики Беларусь Проректор по учебной и воспитательной работе БГУИР, кандидат физикоматематических наук, доцент, академик Белорусской инженерной академии, Республика Беларусь V. Komlitchenko 4 E. Unuchek 5 Заведующий кафедрой экономической информатики БГУИР, кандидат технических наук, доцент, Республика Беларусь 196 Старший преподаватель кафедры экономической информатики БГУИР, Республика Беларусь Заведующий кафедрой проектирования информационнокомпьютерных систем БГУИР, кандидат технических наук, доцент, Республика Беларусь 1 Belarusian State University of Informatics and Radioelectronics, rector@bsuir.by, 2 Belarusian State University of Informatics and Radioelectronics, sdick@bsuir.by, 3 Belarusian State University of Informatics and Radioelectronics, tsyrelchuk@bsuir.by, 4 Belarusian State University of Informatics and Radioelectronics, v.komlitchenko@gmail.com, 5 Belarusian State University of Informatics and Radioelectronics, e.unuchek@gmail.com Application integration and business process improvement connected with Distributed Computation Services and Predictive Analytics can bring new concurrent capabilities. This article considers organization of Big Data learning and Big Data experience exchange in Belarus. The experience of the organizing and the use of virtual distributed computing infrastructure for Big Data trainings in BSUIR is proposed. Application integration and optimization of distributed business processes require analysis of transferred data streams and Big Data opportunities to obtain new competitive advantages, revealing hidden knowledge using special analytics. The problem is reduced

to the integration of different interfaces for connecting external systems based on different protocols, as well as predefined routes messages. Such integrated transactions will allow to extract and provide key information in the field of complex computations, interpreting large amounts of information and decision management solutions in business problems. At the same time, Big Data analytics is not self-sufficient. Using Big Data methods and technologies could lead to the minimization of human intervention in the business processes in the future within the areas where natural limit of human memory capacity and speed interferes with the actions performance, and the actions themselves are based on large amounts of data and computations, where decision making is based on special analytics and new knowledge elements derived from Big Data. Business Process Management (BPM) and Big Data combination will not only increase the efficiency of business processes of organizations, but also will improve their quality, reduce expenses, increase flexibility, build new business models. The appeal of this subject area takes one of the first most popular places among the areas of information technology development. It is noteworthy that recently there is a gradual increase of interest to Big Data technologies and Advanced analytics in Belarus. These technologies are often seen as an alternative to the traditional. The main generator of this evolution is due to the highlight of this particular subject area within the framework of specialized conferences (e.g., XXII International Forum TIBO-2015 with the theme Databases and data, call-centers, Big Data ) [1], to highly specialized training and to special public communities, supported by the leading companies-residents of the High Technologies Park. It is notable that there are several communities connecting people with the interests in the field of Big Data in Belarus now. The most famous are Belarus Big Data User Group and Data Talks Belarus. Both communities include several hundred stakeholders and offer opportunities for professional dialogue with experts, information and knowledge about the latest experience in applying methods of analytics and tools to solve practical problems. Among the regular participants of the events organized by these communities one can find analysts, researchers, project managers, and all those who use or is going to use in their work analysis or complex mathematical calculations on large amounts of data either for reporting and decision-making or for information systems creation. Two most numerous groups among the community members are clearly defined. The first group consists of people with an interest to analytics, methods and tools of data mining and machine learning algorithms. The second group of participants shows a particular interest to the engineering and technical component of technologies providing Big Data functioning. Their research interests are the methods, technologies and tools of distributed batch, real-time and stream- 197

ing data processing, high-performance, scalable data storage and access technologies, etc. Both in Belarus and in the world there is an acute shortage of professionals in this particular area. This is primarily due to the high qualification requirements and an extremely wide range of competences, which such a specialist should have. This is the knowledge in databases, statistics, data visualization techniques, machine learning algorithms and artificial intelligence methods, algorithmization, design and programming using unix-like systems and computer networks, English language and other skills and knowledge. The beginning of the implementation of analytical and Big Data solutions into universities educational process were activities related to establishment of the relevant specialties for MA courses to train specialists within the second stage of higher education at BSUIR and BSU and conduct special training courses for specialists at the same universities. One of the first courses on data analysis in Belarus were the courses within the project Yandex School of Data Analysis [2] at the Department of Applied Mathematics and Informatics, BSU (now the courses are also held at the Faculty of Computer Systems and Networks, BSUIR), Data Vita Introduction to Data Science [3] and Big Data and Predictive Analytics were organized by BSUIR and BEZNext company, USA. Particular interest is in organizing training in areas related to processing large amount of data, held at BSUIR. So in October 2013 in collaboration with the American company BEZNext BSUIR organized recruitment and training as part of the training on Big Data and Predictive Analytics. The training was in English in evenings, it was conducted by the leading US experts in the field of Big Data using remote networking communications. As a result of the six-month training 6 people defended their final projects with honors. In the autumn semester a special training on Java-technology and work with large amount of data ( Introduction into Big Data ) was held as part of the activity of the joint laboratory BSUIR-IBA. During the courses, students had the opportunity to deal with IBM advanced technologies in the field of Big Data IBM Info Sphere Streams and IBM Info Sphere Big Insight [4]. In February 2015, BSUIR in collaboration with BEZNext organized iterative course on Big Data and Predictive Analytics in Russian language. The first open lecture was held as a videoconference for BSUIR professors, graduate students and undergraduates gathered more than 150 people. But the structure and content of the course have been significant changed due to the appearance and development of new solutions on the Big Data market affected by demands and technological requirements of BEZNext customers. This course includes both lectures and laboratory classes on the following issues: statistical methods in the data analysis; 198

correlation and regression analysis; time series analysis; data preparation and cleaning; machine learning algorithms; Map / Reduce; data processing with Spark NoSQL databases; ETL tools; data visualization tools and techniques. One of the challenges that the participants of the training faces in the practical tasks performance was the infrastructure for applications development, executing and debugging. Partially this problem was solved by running virtual machines on the client s workstations. However, this method just simulates a real environment and contains a number of restrictions referred to the scaling and capability of large amounts of data processing. To solve that problem there was a solution to create a virtual lab and provide access to the trainees of the courses on Big Data and Predictive Analytics. Virtual Lab is based on Amazon Web Services (AWS). It is a set of computer web services that make computing cloud platform provided by Amazon. Virtual Laboratory (Big Data Lab) simulates most of the systems of Apache Big Data projects stack [5] starting from messages processing (Kafka) and ending with a batch processing (Map / Reduce) and real time processing (Storm, Spark, Tez) deployed and configured to be run in a distributed environment to achieve parallelism. Additionally, a distributed NoSQL database (Cassandra) is configured and works. Big Data Lab architecture includes five infrastructure nodes and one control node (Fig. 1). BigData Control Node BigDataSys3 Master 1. Hadoop Resource Manager 2. Name Node, Secondary Name Node 3. Spark Master 4. Tez 1.Single Node Kafka 2. Storm Nimbus 3. Zookeeper Master Big Data Labm1 BigDataSys4 Slave BigDataSys2 Slave 1. Storm Supervisor Slave 1. Hadoop Node Manager 2. Spark Worker 3. Tez Services 4. Data Node 5. Cassandra BigDataSys1 1. Hadoop Node Manager 2. Spark Worker 3. Tez Services 4. Data Node 5. Cassandra Fig. 1. Architecture of a virtual laboratory Big Data Lab 199

Three nodes contain Hadoop, Spark (as a separate installation), Cassandra and Tez, the remaining two nodes contain Storm and Kafka. Kafka can run a set of brokers on a single machine. Storm is configured as master/slave to the Zookeeper service which is responsible for the interaction between them. BigDataSys3 and BigDataSys4. Both these nodes are the nodes for messages processing. Apache Kafka is installed on BigDataSys3 with a set of brokers working simultaneously. Test topics were created. Storm daemons Nimbus and UI are also launched on the BigDataSys3 node. Topology of examples Word Count and Exclamation are launched. BigDataSys4 node is a supervisor for distributed processes environment Storm with 10 processes (workers) configured for the topologies. BigDataLabm1. As shown on the Pic. 1, this node is the master of the cluster. YARN is deployed on it. BigDataSys1 and BigDataSys2 nodes are slaves. We can run either standard Hadoop benchmarks like Word Count and Terasort or at the same time standard Spark-tests such as Spark PI and Spark word count from shell on this machine. To run standard tests for Map Reduce and Spark one need to be connected to this node. Basic Hadoop and Spark commands can be executed from this node, for example, copy data to HDFS, retrieve a list of Hadoop jobs etc. BigDataSys1 and BigDataSys2. BigDataSys1 and BigDataSys2 act as slave-nodes of BigDatLabm1. Also, Cassandra is deployed on both these nodes. Since there is no master/slave in Cassandra clusters, both of the nodes are the slave-nodes of Cassandra cluster named BigDataLab. All Cassandra-cluster nodes are located in one data-center. Only one server executes seed-node functions. BigDataControlNode. This server is the only way to get into a virtual lab and to run test applications and applications which are being developed. Applying this approach to organizing of the environment for debugging and execution of either individual laboratory works and final project allowed to focus on practical problems solution (adjustment and installation of infrastructure requires deep knowledge and understanding of unix-like systems and computer networks administration), and also provided a high-performance, scalable computation infrastructure. The necessity of a separate rate per hour for the cluster work and high qualification requirements for the specialist who can set up and maintain performance of such a training computing environment are the difficulties and restrictions for the its usage. A list of additional positive aspects should be mentioned. The experience in creating a distributed virtual training laboratory shown on the example of training courses on Big Data at BSUIR reveals that there is no need in high initial investments in the network and server infrastructure, exclusion of incompatibility problems of equipment from different manufacturers, and system availability. Moreover, it can be mentioned that such a virtual laboratory is beyond the boundaries of the university, making a continuing learning process for students. Anywhere in the world where 200

there is an access to the Internet, trainees can distantly and continuously interact with a virtual lab environment that fully meets the concept of continuing distance learning. Literature 1. Subject of the XXII forum Tibo [digital resource]: http://www.park.by/post-888/?lng=ru access date: 18.05.2015. 2. Yandex School of Data Analysis [electronic resource]: https://yandexdataschool.ru/about/branches/minsk access date: 18.05.2015. 3. Data Vita: the science of data in theory and practice [digital resource]: http://www.datavita.net/ access date: 18.05.2015. 4. Presentation of IBM certificates was held at BSUIR [digital resource] режим доступа: http://www.bsuir.by/online/tnj2/one_article.jsp?pageid=94597&resid=100229&lang=ru&tnj_type=2&rid=102243&tnj_id= 11875&pid=100229 access date: 18.05.2015. 5. Apache Big Data Projects [digital resource]: http://projects.apache.org/indexes/category.html#big-data access date: 18.05.2015. 201