International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015




RESEARCH ARTICLE OPEN ACCESS
Data Mining Approach To Big Data
Jyothiprasanna Jaladi [1], B.V.Kiranmayee [2], S.Nagini [3]
Student of M.Tech (SE) [1], Associate Professor [2] & [3], Department of Computer Science and Engineering, VNRVJIET, Hyderabad, Telangana, India.
ABSTRACT
The main motivation for extracting knowledge from massive data is to improve the efficiency of single-source mining methods. Building on the steady improvement of computer hardware, researchers continue to look for ways to strengthen knowledge-discovery algorithms so that they scale to big data. Because big data are typically gathered from multiple data sources, knowledge discovery over big data must be carried out with a multi-source mining mechanism. As real-world data often arrive as a data stream or a characteristic flow, a well-established mechanism is needed to discover knowledge and track how that knowledge evolves in the dynamic data source. Consequently, the large, heterogeneous, and real-time characteristics of multi-source data are the most important differences between single-source knowledge discovery and multi-source data mining.
Keywords:- Single-source Mining, Multi-source Mining, Big Data.
I. INTRODUCTION
Knowledge discovery over big data must be performed using a multi-source mining mechanism. Because real-world data mainly arrive as a data stream or a characteristic flow, a well-based mechanism is needed to detect knowledge and follow its evolution in the dynamic data source. Consequently, the large, heterogeneous, and real-time characteristics of multi-source data are the essential difference between single-source knowledge discovery and multi-source data mining. The goals of big data mining methods go beyond fetching the requested information: they also aim to uncover hidden relationships and patterns among numerous parameters.
Inspecting rapidly moving stream data can yield new, valuable insights and theoretical concepts. Compared with the results of mining conventional datasets, unveiling the massive volume of interconnected, heterogeneous big data has the potential to maximize our knowledge and insight in the target domain; at the same time, it brings a series of new challenges to the research community.
II. BIG DATA
Big data [1] is a broad term for data sets so large or complex that common data-processing applications are inadequate for them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and sometimes to a particular size of data set. Greater accuracy in big data can lead to more confident decision making, and better decisions can mean higher operational efficiency, cost reductions, and reduced risk.
III. CHARACTERISTICS OF BIG DATA
Big data is characterized by five V's, namely Volume, Velocity, Variety, Veracity, and Value [2].
Volume: the amount of data being generated in a given period of time; it ranges from terabytes to petabytes.
Velocity: the speed at which data is being generated every second.
Variety: the different types of data being generated by the sources.
ISSN: 2347-8578 www.ijcstjournal.org Page 258

Veracity: the reliability of the data.
Value: the worth of the data.
IV. BIG DATA TYPES
Big data includes every kind of data, from dollar transactions to tweets to pictures to audio. Taking advantage of big data therefore requires that all of this information be integrated for analysis and data management, which is harder than it appears. Two main varieties of data are involved: structured and unstructured. Structured data resembles a data warehouse, where data is tagged and sortable, whereas unstructured data is random and difficult to analyze [4].
HACE THEOREM
The HACE theorem [5] is used to model the characteristics of big data: big data starts with huge, heterogeneous, autonomous sources under decentralized control and seeks to explore the complex and evolving relationships among the data. These characteristics make it an extreme challenge to discover useful knowledge from big data. Intuitively, one can picture a blind man trying to size up a giant elephant, the elephant being the big data in this context. While the term big data literally concerns data volumes, the HACE theorem states that its key characteristics are:
A. Huge with heterogeneous and diverse data sources: The main characteristic of big data is the large volume of data represented by varied sources. This enormous quantity of data comes from websites such as Twitter, Myspace, Orkut, LinkedIn, and so on. Different data collectors prefer their own representation or protocol for data recording, and the nature of different applications also results in varied data representations.
B. Autonomous sources with distributed and decentralized control: Autonomous sources with distributed and decentralized control are an essential characteristic of big data applications.
C. Complex and evolving relationships: In the early stage of centralized information systems, the focus was on finding the best feature values to represent each observation. This kind of sample-feature representation inherently treats each individual as an independent entity, without considering the social connections that are among the most important factors of human society.
V. BIG DATA MINING CHALLENGES
It is essential for a framework [6] to manage the enormous quantities of data and to provide the necessary solutions to the issues described by the HACE theorem. The figure shows a conceptual view of the big data processing framework, which includes three tiers, from inside out, with concerns on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and big data mining algorithms (Tier III).
Tier I: Big Data Accessing and Computing. Previously, enormous quantities of data had to be extracted depending on the current set of data; hence there is a need for a distinct set of data to extract the useful knowledge within the big data.

There is a chance that this extraction is more delicate and that the data loses some of its attributes; the data should therefore be convertible into a larger set that can easily be transformed and applied to the given normal data set. Such big data procedures thus help both hardware and software performance.
Tier II: Data Privacy and Semantic Knowledge. In big data it is essential to understand the semantics of the data and the concepts and policies relating to a particular application. With respect to the rules and policies of the data, a technical barrier known as data privacy must be considered. It is important to protect the privacy of data in sensitive applications such as banking and business transactions, and simple data-access controls do not resolve privacy issues. To protect privacy, access to the data can be restricted and sensitive data fields can be removed. Semantic knowledge helps identify the right features for modeling the underlying data, and domain and application knowledge can help design feasible business objectives for big data analytical techniques.
Tier III: Big Data Mining Algorithms. Big data applications are characterized by multiple sources and decentralized control. Moving all of this data to a single centralized site for mining is prohibitive because of privacy concerns. Under these circumstances, a big data mining system has to enable an information exchange so that all distributed sites can work together to achieve a global optimization goal. Global mining can then be carried out in steps at the data, model, and knowledge levels.
At the data level, each local site can compute statistics based on its local data sources and exchange these statistics among sites to obtain a global view of the data distribution. At the model or pattern level, each site can carry out local mining, with respect to the localized data, to discover local patterns. By exchanging patterns among multiple sources, new global patterns can be synthesized by aggregating patterns across all sites. At the knowledge level, model-correlation analysis measures the relevance among models generated from different data sources, to determine how strongly the data sources are correlated with each other and how to make accurate decisions based on models built from autonomous sources.
VI. BIG DATA MINING
The aim of big data mining is to fetch the desired knowledge or to find the undiscovered patterns in the data. Since big data are usually gathered from distinct data sources, knowledge discovery over big data must be carried out using a multi-source mining mechanism. As real-world data most commonly arrive as a data stream or a characteristic flow, a well-founded mechanism is needed to discover knowledge and track its evolution in the dynamic data source. Consequently, the large, heterogeneous, and real-time characteristics of multi-source data are the essential differences between single-source knowledge discovery and multi-source data mining. The theory of local pattern analysis, proposed and established in this setting, has laid a foundation for global knowledge discovery in multi-source data mining. Local pattern analysis of data processing can avoid bringing different data sources together to carry out centralized computing.
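The data-level and pattern-level exchange described above can be sketched in a few lines of plain Python. This is a minimal illustration only: the site names and datasets are made up, and a real deployment would exchange these summaries over a network rather than in one process.

```python
# Hypothetical local datasets held at three autonomous sites
# (names and values are illustrative only).
sites = {
    "site_a": [2.0, 4.0, 6.0],
    "site_b": [1.0, 3.0],
    "site_c": [5.0, 7.0, 9.0, 11.0],
}

# Data level: each site computes local statistics; only these
# summaries are exchanged, never the raw records.
local_stats = {
    name: {"count": len(data), "total": sum(data)}
    for name, data in sites.items()
}

# Aggregating the exchanged summaries yields a global view of the
# data distribution without centralizing the data itself.
global_count = sum(s["count"] for s in local_stats.values())
global_mean = sum(s["total"] for s in local_stats.values()) / global_count

# Pattern level: each site mines local patterns (here, simply the
# values above the global mean), and the union across sites is
# aggregated into a candidate global pattern.
local_patterns = {
    name: [x for x in data if x > global_mean]
    for name, data in sites.items()
}
global_pattern = sorted(x for p in local_patterns.values() for x in p)
print(global_count, round(global_mean, 3), global_pattern)
```

The key design point is that each site ships a constant-size summary (count and total) instead of its records, which is what allows the global view to be built while respecting the decentralized control described in the HACE theorem.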
Data streams are widely used in financial analysis, online trading, medical testing, and many other areas. Extracting knowledge from fast, massive stream data may lead to new useful concepts. Big data mining must deal with heterogeneity, extreme scale, velocity, privacy, accuracy, trust, and interactiveness at levels that existing mining techniques and algorithms cannot handle. The need to design and implement very-large-scale parallel machine learning and data mining algorithms (ML-DM) has grown remarkably, accompanying the emergence of powerful parallel and very-large-scale data-processing platforms, e.g., Hadoop MapReduce. NIMBLE is a portable infrastructure that has been specifically designed to enable rapid implementation of parallel ML-DM algorithms, running on top of Hadoop.
VII. PROPOSED SYSTEM
Big data is a collection of complex and large data sets that are difficult to process and mine for patterns and knowledge using traditional database-management tools or data-processing and mining techniques.

The primary motivation for discovering knowledge from big data is to improve the efficiency of single-source mining approaches. Building on the gradual improvement of computer hardware, methods are sought to improve the efficiency of knowledge-discovery algorithms and make them better suited to big data. Since big data are usually collected from distinct data sources, knowledge discovery over big data must be performed using a multi-source mining mechanism. As real-world data often arrive as a data stream or characteristic flow, a well-established mechanism is required to detect knowledge and track its evolution within the dynamic data source. Consequently, the large, heterogeneous, and real-time characteristics of multi-source data are the main differences between single-source knowledge discovery and multi-source data mining.
Dataset Selection: A data set is a collection of data. The term can also be used more loosely to refer to the data in a set of closely related tables corresponding to a particular experiment or event.
Pre-Processing: Pre-processing involves removing contaminated data, null values, and unformatted values.
Clustering: Clustering is a key concept in data mining. A clustering process aims to analyze the similarities between data objects and build groups of them; the grouped objects can then be used to navigate easily through a very large collection of data sets. Dataset clustering aims to automatically divide data sets into groups based on similarities in their contents. Each group consists of data sets that are similar among themselves (high intra-cluster similarity) and dissimilar to the data sets of other groups (low inter-cluster similarity). K-means is a simple but well-known algorithm for grouping objects by clustering.
All objects must be represented as a set of numerical features, and the user has to specify the number of groups (referred to as k) to be identified. Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to be clustered. The algorithm randomly chooses k points in that vector space to serve as the initial centres of the clusters. All objects are then each assigned to the centre they are closest to; the distance measure is usually chosen by the user and determined by the learning task. After the assignment is computed, a new centre is computed for each cluster by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centres is repeated until the procedure converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the choice of initial centres, and the computation of new average centres have been explored, as well as the estimation of the number of clusters k.
Experimental Results: The experimental results show that the system can act as a big data mining platform that extracts useful data from the data source.
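The K-means procedure described above (choose k initial centres, assign each object to its nearest centre, recompute centres as cluster means, repeat until convergence) can be sketched in plain Python. This is a minimal illustration with made-up two-dimensional points, not the authors' implementation; a production system would run this at scale, e.g. over MapReduce.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means: pick k random initial centres, then alternate
    between assigning points to their nearest centre and recomputing
    each centre as the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of the closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            clusters[nearest].append(p)
        # Update step: recompute each centre by averaging its cluster
        # (an empty cluster keeps its previous centre).
        new_centres = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:  # converged: assignments stopped changing
            break
        centres = new_centres
    return centres, clusters

# Two well-separated groups of 2-D points (illustrative data only).
data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
        (8.0, 8.0), (8.5, 9.0), (7.8, 8.2)]
centres, clusters = kmeans(data, k=2)
print(sorted(centres))
```

On this toy data the two recovered centres settle at the means of the two groups regardless of which points are drawn as initial centres, illustrating the finite-iteration convergence noted above; with less separated data, the initial-centre choice matters, which is why the tweaks mentioned in the text exist.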

VIII. CONCLUSION
The main motivation for discovering knowledge from massive data, improving the efficiency of single-source mining methods, is addressed here by developing a big data mining platform, since there is no ready-made layout for extracting massive data from the data source. Because massive data are typically collected from different data sources, the knowledge discovery of the massive data must be performed using a multi-source mining mechanism. As real-world data often come as a data stream or a characteristic flow, a well-established mechanism is needed to discover knowledge and master the evolution of knowledge in the dynamic data source.
REFERENCES
[1] http://en.wikipedia.org/wiki/big-data.
[2] Dunren Che, Mejdl Safran, and Zhiyong Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities," Department of Computer Science, Southern Illinois University, Carbondale, Illinois 62901, USA.
[3] Jaya Singh and Ajay Rana, "Exploring the Big Data Spectrum," International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 4, April 2013.
[4] Hiba Jasim Hadi, Ammar Hameed Shnain, Sarah Hadi Shaheed, and Azizah Bt Haji Ahmad, "Big Data and Five V's Characteristics," Ministry of Education, Islamic University College.
[5] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding, "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 1, January 2014.
[6] Deepak S. Tamhane and Sultana N. Sayyad, "Big Data Analysis Using HACE Theorem," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 1, January 2015.
[7] B. R. Prakash and M. Hanumanthappa, "Issues and Challenges in the Era of Big Data Mining," International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 4, July-August 2014.
[8] Galen Gruman, "Making Sense of Big Data," PricewaterhouseCoopers Technology Forecast, Issue 3, 2010.
[9] Phil Shelly and Doug Cutting, "Big Data Spectrum," Infosys white paper.