Big Data Processing Model from Mining Prospective

Swathi Sama 1, D. Venkateshwarlu 2, Prof. Ravi Mathey 3
1 Department of Computer Science and Engineering, JNT University, Hyderabad
2 Department of Computer Science and Engineering, JNT University, Hyderabad
3 Head of Department of Computer Science and Engineering, JNT University, Hyderabad

ABSTRACT

In the Internet era, the volume of data we deal with has grown to terabytes and petabytes. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to capture, store, search, share, analyze, and visualize data. Traditional data modeling focuses on resolving the complexity of relationships among schema-enabled data. These considerations do not apply to non-relational, schema-less databases, so the old ways of data modeling no longer apply. We need a new methodology to manage big data for maximum business value. This paper discusses the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords - autonomous sources, big data, data mining, evaluation of complex data, processing model.

I. INTRODUCTION

The mantra of the moment, in every field from retail to healthcare, is Big Data, defined as data sets that are too large and complex to manipulate with standard methods or tools. Analyzing these data sets is quickly becoming the basis for competition, productivity, and innovation. Some predict that Big Data will be as important to business and society as the Internet has become. It is already being used to predict where and when crimes will occur, where flu will strike, and where traffic will snarl, all of which is very useful for deploying limited resources such as police forces, health care professionals, or traffic lights.

II. CHARACTERISTICS OF BIG DATA

When do we say we are dealing with Big Data? For some people 1TB might seem big, for others 10TB might be big, for others 100GB might be big, and something else for others. The term is qualitative and cannot really be quantified. Hence we identify Big Data by a few characteristics that are specific to it. These characteristics are popularly known as Volume, Velocity, and Variety, as shown in Fig 1 below.

Fig 1: 3 V's of Big Data

Volume refers to the size of the data that we are working with. With the advancement of technology and the rise of social media, the amount of data is growing very rapidly. This data is spread across different places, in different formats, in volumes ranging from gigabytes to terabytes, petabytes, and even more. Today, data is generated not only by humans; large amounts of data are generated by machines, and this surpasses human-generated data. This size aspect of data is referred to as Volume in the Big Data world.

Velocity refers to the speed at which the data is being generated. Different applications have different latency requirements, and in today's competitive world decision makers want the necessary data and information in the least amount of time possible, generally in near real time, or in real time in certain scenarios. In different fields and areas of technology, we see data being generated at different speeds. A few examples include trading and stock exchange data, tweets on Twitter, and status updates, likes, and shares on Facebook. This speed aspect of data generation is referred to as Velocity in the Big Data world.

Variety refers to the different formats in which the data is generated and stored. Different applications generate and store data in different formats. Today, large volumes of unstructured data are being generated in addition to the structured data produced in enterprises. Until the recent advances in Big Data technologies, the industry did not have powerful and reliable tools that could work with unstructured data at the volumes we see today. Organizations can no longer rely only on the structured data in enterprise databases and warehouses. To stay competitive, they are also forced to consume large amounts of data generated both inside and outside the enterprise, such as click stream data and social media. Apart from traditional flat files, spreadsheets, and relational databases, a lot of unstructured data is stored in the form of images, audio files, video files, web logs, sensor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.

III. REALITY OF BIG DATA

Our capability for data generation has never been so powerful and enormous since the invention of information technology in the mid-nineteenth century. As one example, on 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours [3].
Among all these tweets, the specific moments that generated the most discussion actually revealed the public interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means to sense the public interests and generate feedback in real time, and are mostly appealing compared to generic media, such as radio or TV broadcasting.

IV. PROBLEM STATEMENT

Given the rapid growth of Big Data described above, it is important to manage very large data sets with mining techniques. Existing mining algorithms, however, have been tested and used only on medium-sized data. This paper therefore proposes a Big Data mining approach to handle big data processing operations.

V. EXISTING APPROACHES

Currently, Big Data processing mainly depends on parallel programming models like MapReduce, as well as on cloud computing platforms that provide Big Data services to the public. MapReduce is a batch-oriented parallel computing model, and there is still a certain gap in performance compared with relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics needed to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently. To improve the efficiency of algorithms, Chu et al. proposed a general-purpose parallel programming framework, which is applicable to a large number of machine learning algorithms and is based on the simple MapReduce programming model on multicore processors.
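To make the MapReduce pattern concrete, the following is a minimal single-process sketch in Python, not the implementation from [1]. It assumes ordinary one-variable least-squares regression as the learning task, and the names map_partial_sums, reduce_sums, and fit_line are hypothetical. Each "mapper" computes partial sums over its own subset of the data, and a "reducer" adds the partial sums together before the model parameters are solved for.

```python
from functools import reduce

def map_partial_sums(shard):
    """Mapper (hypothetical name): summarize one data subset as partial sums."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return (n, sx, sy, sxx, sxy)

def reduce_sums(a, b):
    """Reducer (hypothetical name): add two tuples of partial sums element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(shards):
    """Fit y = slope * x + intercept using only the merged sums."""
    n, sx, sy, sxx, sxy = reduce(reduce_sums, map(map_partial_sums, shards))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data lying exactly on y = 2x + 1, split into three unequal subsets.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
shards = [points[0:3], points[3:7], points[7:10]]
slope, intercept = fit_line(shards)
print(slope, intercept)
```

Because only the five sums leave each subset, the raw data never has to be gathered in one place. In a real deployment the map calls would run on separate nodes rather than in one process.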
Ten classical data mining algorithms are implemented in the framework, including locally weighted linear regression, k-means, logistic regression, naive Bayes, linear support vector machines, independent variable analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks [1]. From the analysis of these classical machine learning algorithms, we argue that the computational operations in the learning process can be transformed into summation operations over a number of training data subsets. Summation operations can be performed on different subsets independently, so they are easily parallelized on the MapReduce programming platform [1]. Therefore, a large-scale data set can be divided into several subsets and assigned to multiple Mapper nodes. Various summation operations are then performed on the Mapper nodes to collect intermediate results. Finally, learning algorithms are executed in parallel by merging the summations on Reduce nodes.

VI. RESEARCH INITIATIVES

To tackle the Big Data challenges and seize the opportunities afforded by the new, data-driven revolution, the US National Science Foundation (NSF), under the Obama Administration's Big Data initiative, announced the BIGDATA solicitation. This federal initiative has resulted in a number of winning projects to investigate the foundations for Big Data management (led by the University of Washington), analytical approaches for genomics-based massive data computation (led by Brown University), large-scale machine learning techniques for high-dimensional data sets that may be as large as 500,000 dimensions (led by Carnegie Mellon University), social analytics for large-scale scientific literature (led by Rutgers University), and several others. These projects seek to develop methods, algorithms, frameworks, and research infrastructures that allow us to bring the massive amounts of data down to a human-manageable and interpretable scale. Funding bodies in other countries, such as the National Natural Science Foundation of China (NSFC), are also catching up with national grants for Big Data research.

VII. PROPOSED SOLUTION

For an intelligent learning database system [2] to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and to provide treatments for the characteristics featured by the aforementioned HACE theorem. A conceptual view of the Big Data processing framework includes three tiers from inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

The challenges at Tier I focus on data accessing and arithmetic computing procedures.
Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform has to take distributed large-scale data storage into consideration for computing. For example, typical data mining algorithms require all data to be loaded into main memory. This is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other IO costs), even if we do have a very large main memory to hold all the data for computing.

The challenges at Tier II center on semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, but it can also add technical barriers to Big Data access (Tier I) and mining algorithms (Tier III). For example, depending on the application domain, the data privacy and information sharing mechanisms [7] between data producers and data consumers can be significantly different. Sharing sensor network data for applications like water quality monitoring may not be discouraged, whereas releasing and sharing mobile users' location information is clearly not acceptable for the majority of applications, if not all. In addition to these privacy issues, the application domains can also provide additional information to benefit or guide Big Data mining algorithm designs. For example, in market basket transaction data, each transaction is considered independent, and the discovered knowledge is typically represented by finding highly correlated items, possibly with respect to different temporal and/or spatial restrictions. In a social network, on the other hand, users are linked and share dependency structures.
The knowledge is then represented by user communities, leaders in each group [6], social influence modeling, and so on. Therefore, understanding semantics and application knowledge is important both for low-level data access and for high-level mining algorithm designs.

At Tier III, the data mining challenges concentrate on algorithm designs that tackle the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed [4] by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. The model and parameters are then adjusted according to the feedback. In the whole process, information sharing is not only a guarantee of the smooth development of each stage, but also a purpose of Big Data processing.

VIII. DATA MINING ALGORITHMS USED

In this paper, two popular data mining algorithms, Apriori and FP-growth, are used to manage big data analysis operations.

Apriori (Fig 2) is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.

Fig 2: Apriori Algorithm

FP-growth is a program to find frequent item sets (also closed and maximal item sets, as well as generators) with the FP-growth algorithm [5], which represents the transaction database as a prefix tree that is enhanced with links organizing the nodes into lists referring to the same item. The search is carried out by projecting the prefix tree, working recursively on the result, and pruning the original tree. The implementation also supports filtering for closed and maximal item sets with conditional item set repositories as suggested in earlier work, although the approach used in the program differs in that it uses top-down prefix trees rather than FP-trees. It does not cover the implementation of FP-trees with two integer arrays that has also been suggested.

Fig 3: FP Growth

Fig 4: Algorithm Comparison

IX. CONCLUSION

To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data.
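As a concrete illustration of the Apriori procedure described in Section VIII, the following is a minimal in-memory sketch in Python. It is not the program evaluated in this paper: the function name apriori, the min_support parameter (an absolute transaction count), and the toy baskets are assumptions made for illustration only.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(items): count} for every frequent item set.

    min_support is an absolute count of transactions (an assumption of
    this sketch; many tools take a fraction instead).
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join: merge pairs of frequent (k-1)-item sets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count: scan the database once for this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy market-basket data (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_sets = apriori(baskets, min_support=3)
for items, count in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), count)
```

The sketch scans the whole transaction list once per item set size, which is the cost that FP-growth's prefix tree is designed to avoid.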
At the data level, the autonomous information sources and the variety of the data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites, and to fuse decisions from multiple sources to gain the best model out of the Big Data. At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes with time and other possible factors.

REFERENCES

[1] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp ,
[2] X. Wu, "Building Intelligent Learning Database Systems," AI Magazine, vol. 21, no. 3, pp ,
[3] Twitter Blog, "Dispatch from the Denver Debate,"
[4] D. Luo, C. Ding, and H. Huang, "Parallelization with Multiplicative Algorithms for Big Data Mining," Proc. IEEE 12th Int'l Conf. Data Mining, pp ,
[5] X. Wu and X. Zhu, "Mining with Noise Knowledge: Error-Aware Data Mining," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 38, no. 4, pp , July
[6] R. Chen, K. Sivakumar, and H. Kargupta, "Collective Mining of Bayesian Networks from Distributed Heterogeneous Data," Knowledge and Information Systems, vol. 6, no. 2, pp ,
[7] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp , 2000.