Big Data Processing Model from Mining Perspective




Big Data Processing Model from Mining Perspective

Swathi Sama 1, D. Venkateshwarlu 2, Prof. Ravi Mathey 3
1 Department of Computer Science and Engineering, JNT University, Hyderabad
2 Department of Computer Science and Engineering, JNT University, Hyderabad
3 Head of Department of Computer Science and Engineering, JNT University, Hyderabad

ABSTRACT

In the Internet era, the volume of data we deal with has grown to terabytes and petabytes. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to capture, store, search, share, analyze, and visualize data. Traditional data modeling focuses on resolving the complexity of relationships among schema-enabled data. However, these considerations do not apply to nonrelational, schema-less databases. As a result, old ways of data modeling no longer apply, and we need a new methodology to manage big data for maximum business value. This paper discusses the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords - autonomous sources, big data, data mining, evaluation of complex data, processing model.

I. INTRODUCTION

The mantra of the moment, in every field from retail to healthcare, is Big Data, defined as data sets that are too large and complex to manipulate with standard methods or tools. Analyzing these data sets is quickly becoming the basis for competition, productivity, and innovation; in fact, some predict Big Data will be as important to business and society as the Internet has become. It is being used to predict where and when crimes will occur, flu outbreaks will strike, and traffic will snarl, all very useful for deploying limited resources like police forces, health care professionals, or traffic lights.

II. CHARACTERISTICS OF BIG DATA

When do we say we are dealing with Big Data? For some people 1 TB might seem big, for others 10 TB might be big, for others 100 GB might be big, and something else for others. The term is qualitative and cannot really be quantified. Hence we identify Big Data by a few characteristics which are specific to Big Data, popularly known as Volume, Velocity, and Variety, as shown in Fig. 1 below.

Fig 1: 3 V's of Big Data

Volume refers to the size of data that we are working with. With the advancement of technology and with the invention of social media, the amount of data is growing very rapidly. This data is spread across different places, in different formats, in large volumes ranging from gigabytes to terabytes, petabytes, and even more. Today, data is not only generated by humans; large amounts of data are also generated by machines, surpassing human-generated data. This size aspect of data is referred to as Volume in the Big Data world.

Velocity refers to the speed at which the data is being generated. Different applications have different latency requirements, and in today's competitive world decision makers want the necessary data/information in the least amount of time possible, generally in near real time

or real time in certain scenarios. In different fields and different areas of technology, we see data being generated at different speeds. A few examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many others. This speed aspect of data generation is referred to as Velocity in the Big Data world.

Variety refers to the different formats in which the data is being generated and stored. Different applications generate and store data in different formats. In today's world, large volumes of unstructured data are being generated apart from the structured data generated in enterprises. Until the advancements in Big Data technologies, the industry did not have powerful and reliable tools that could work with the voluminous unstructured data we see today. Organizations can no longer rely only on the structured data from enterprise databases and warehouses; to stay competitive they are also forced to consume lots of data generated both inside and outside the enterprise, such as clickstream data and social media. Apart from traditional flat files, spreadsheets, and relational databases, we have a lot of unstructured data stored in the form of images, audio files, video files, web logs, sensor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.

III. REALITY OF BIG DATA

Our capability for data generation has never been so powerful and enormous since the invention of information technology in the mid-nineteenth century. As an example, on 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours [3].
Among all these tweets, the specific moments that generated the most discussions actually revealed the public interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means to sense public interests and generate feedback in real time, and are appealing compared to generic media such as radio or TV broadcasting.

IV. PROBLEM STATEMENT

Given the emerging growth of the Big Data trend, it is very important to manage huge volumes of data with mining techniques, whereas existing mining algorithms have only been tested or used with medium-sized data. In this paper, Big Data mining is therefore proposed to handle big data processing operations.

V. EXISTING APPROACHES

At present, Big Data processing mainly depends on parallel programming models like MapReduce, as well as providing a cloud computing platform of Big Data services for the public. MapReduce is a batch-oriented parallel computing model. There is still a certain gap in performance compared with relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics required to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently. To improve the efficiency of algorithms, Chu et al. proposed a general-purpose parallel programming framework, applicable to a large number of machine learning algorithms, based on the simple MapReduce programming model on multicore processors.
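The key idea behind such frameworks is that the statistics many learning algorithms need can be written as summations over data subsets, which Mappers compute independently and a Reducer combines. A minimal single-machine sketch (the toy data, the partition layout, and the one-dimensional least-squares example are illustrative assumptions, not taken from the paper):

```python
from functools import reduce

# Toy data set split into "Mapper" partitions. In a real cluster each
# partition would live on a different node; here they are plain lists.
partitions = [
    [(1.0, 2.1), (2.0, 3.9)],               # (x, y) pairs on Mapper 1
    [(3.0, 6.2), (4.0, 8.1), (5.0, 9.8)],   # (x, y) pairs on Mapper 2
]

def map_partial_sums(part):
    """Mapper: emit summation-form statistics for one data subset.
    To fit y = w*x by least squares we only need sum(x*x) and sum(x*y)."""
    sxx = sum(x * x for x, _ in part)
    sxy = sum(x * y for x, y in part)
    return (sxx, sxy)

def reduce_sums(a, b):
    """Reducer: partial sums from different subsets simply add up."""
    return (a[0] + b[0], a[1] + b[1])

sxx, sxy = reduce(reduce_sums, map(map_partial_sums, partitions))
w = sxy / sxx  # closed-form solution of the 1-D least-squares problem
print(round(w, 3))
```

Because addition is associative, the Reducer may combine the Mapper outputs in any order, which is exactly what lets MapReduce parallelize this class of algorithms.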
Ten classical data mining algorithms are implemented in the framework, including locally weighted linear regression, k-means, logistic regression, naive Bayes, linear support vector machines, independent component analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks [1]. From the analysis of these classical machine learning algorithms, we argue that the computational operations in the algorithm learning process can be transformed into summation operations over a number of training data sets. Summation operations can be performed on different subsets independently and executed efficiently on the MapReduce programming platform [1]. In this way, a large-scale data set can be divided into several subsets and assigned to multiple Mapper nodes. Then, various summation operations can be performed on the Mapper nodes to collect intermediate results. Finally, learning algorithms are

executed in parallel by combining the summations on Reducer nodes.

VI. RESEARCH INITIATIVES

To tackle the Big Data challenges and seize the opportunities afforded by the new, data-driven revolution, the US National Science Foundation (NSF), under the Obama Administration's Big Data initiative, announced the BIGDATA solicitation in 2012. This federal initiative has resulted in a number of winning projects to investigate the foundations for Big Data management (led by the University of Washington), analytical approaches for genomics-based massive data computation (led by Brown University), large-scale machine learning techniques for high-dimensional data sets that may be as large as 500,000 dimensions (led by Carnegie Mellon University), social analytics for large-scale scientific literature (led by Rutgers University), and several others. These projects seek to develop methods, algorithms, frameworks, and research infrastructures that allow us to bring the massive amounts of data down to a human-manageable and interpretable scale. Funding agencies in other countries, such as the National Natural Science Foundation of China (NSFC), are also catching up with national grants on Big Data research.

VII. PROPOSED SOLUTION

For an intelligent learning database system [2] to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics featured by the aforementioned HACE theorem. A conceptual view of the Big Data processing framework includes three tiers, from inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III). The challenges at Tier I focus on data accessing and arithmetic computing procedures.
Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform has to take distributed large-scale data storage into consideration. For example, typical data mining algorithms require all data to be loaded into main memory; this, however, is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other I/O costs), even if we do have a super-large main memory to hold all data for computing. The challenges at Tier II center on semantics and domain knowledge for different Big Data applications. Such knowledge can provide additional benefits to the mining process, but can also add technical barriers to the Big Data access (Tier I) and mining algorithms (Tier III). For example, depending on the domain application, the data privacy and information sharing mechanisms [7] between data producers and data consumers can differ significantly. Sharing sensor network data for applications like water quality monitoring may not be discouraged, whereas releasing and sharing mobile users' location information is clearly unacceptable for the majority of, if not all, applications. In addition to these privacy issues, the application domains can also provide extra information to benefit or guide Big Data mining algorithm designs. For instance, in market basket transaction data, each transaction is considered independent, and the discovered knowledge is typically represented by finding highly correlated items, possibly with respect to different temporal and/or spatial restrictions. In a social network, on the other hand, users are linked and share dependency structures.
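The market basket setting just described, finding highly correlated items across independent transactions, is what the frequent item set mining of Section VIII formalizes. A minimal pair-counting sketch (the transactions and the support threshold are made up for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market basket transactions; each basket is assumed
# independent of the others, as the text notes.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 3  # a pair is "frequent" if it occurs in at least 3 baskets

# Count every item pair per basket, then keep the highly correlated ones.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```

This brute-force pass enumerates all pairs; the Apriori algorithm discussed later avoids that by only extending item sets whose subsets are already known to be frequent.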
The knowledge is then represented by user communities, leaders in each group [6], social influence modeling, and so on. Therefore, understanding semantics and application knowledge is important both for low-level data access and for high-level mining algorithm designs. At Tier III, the data mining challenges concentrate on algorithm designs for handling the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed [4] by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global

knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. Then, the model and parameters are adjusted according to the feedback. In the whole process, information sharing is not only a guarantee of the smooth development of each stage, but also a purpose of Big Data processing.

VIII. DATA MINING ALGORITHMS USED

In this paper, two popular data mining algorithms, Apriori and FP-growth, are used to manage big data analysis operations. Apriori is an algorithm (Fig. 2) for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.

Fig 2: Apriori Algorithm

FP-growth is a program to find frequent item sets (also closed and maximal ones, as well as generators) with the FP-growth algorithm [5], which represents the transaction database as a prefix tree enhanced with links that organize the nodes into lists referring to the same item. The search is carried out by projecting the prefix tree, working recursively on the result, and pruning the original tree. The implementation also supports filtering for closed and maximal item sets with conditional item set repositories as suggested, although the approach used in the program differs insofar as it uses top-down prefix trees rather than FP-trees. It does not cover the clever implementation of FP-trees with two integer arrays as suggested.

Fig 3: FP Growth

Fig 4: Algorithm Comparison

IX. CONCLUSION

To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.
At the data level, the autonomous information sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and fuse decisions from multiple sources to gain a best model out of the Big Data. At the system level, the essential challenge is that a Big Data

mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors.

REFERENCES

[1] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.
[2] X. Wu, "Building Intelligent Learning Database Systems," AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.
[3] Twitter Blog, "Dispatch from the Denver Debate," http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
[4] D. Luo, C. Ding, and H. Huang, "Parallelization with Multiplicative Algorithms for Big Data Mining," Proc. IEEE 12th Int'l Conf. Data Mining, pp. 489-498, 2012.
[5] X. Wu and X. Zhu, "Mining with Noise Knowledge: Error-Aware Data Mining," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 38, no. 4, pp. 917-932, July 2008.
[6] R. Chen, K. Sivakumar, and H. Kargupta, "Collective Mining of Bayesian Networks from Distributed Heterogeneous Data," Knowledge and Information Systems, vol. 6, no. 2, pp. 164-187, 2004.
[7] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.