A Novel Big Data Approach to Classify Bank Customers - Solution by Combining PIG, R and Hadoop
I.J. Information Technology and Computer Science, 2016, 9. Published Online September 2016 in MECS. DOI: /ijitcs

Lija Mohan
Division of Computer Science, Cochin University of Science & Technology, Kochi, Kerala, India. [email protected]

Sudheep Elayidom M.
Division of Computer Science, Cochin University of Science & Technology, Kochi, Kerala, India. [email protected]

Abstract - A large amount of data characterized by its volume, velocity, veracity, value and variety is termed Big Data. Extracting hidden patterns, customer preferences, market trends, unknown correlations, or any other useful business information from large collections of structured or unstructured data is called Big Data analysis. This article explores the scope of analyzing bank transaction data to categorize customers, which could help a bank achieve efficient marketing, improved customer service, better operational efficiency, increased profit and many other hidden benefits. Instead of relying on a single technology to process large-scale data, we use a combination of strategies such as Hadoop, PIG and R for efficient analysis. RHadoop is an emerging research trend for Big Data analysis, as R is an efficient and easy-to-code data analysis and visualization tool compared to a traditional Map Reduce program. K-Means is chosen as the clustering algorithm for classification.

Index Terms - Big Data Analysis, Bank customer classification, Hadoop, PIG, R.

I. INTRODUCTION

The emergence of the Internet and the World Wide Web (WWW) caused a flood of data. This data, characterized by its volume, velocity, veracity, value and variety, is termed Big Data [41]. Big Data comprises terabytes or petabytes of data and can be structured or unstructured. Big Data analysis is crucial because it leads to more accurate analysis; accurate analysis is the backbone of accurate decision making, and accurate decision making leads to operational efficiency and to cost and risk reduction in industries. The traditional client-server architecture is inefficient for tackling Big Data. Hadoop comes to the rescue here: Hadoop is a distributed processing framework from Apache which can store and process data in a parallel manner.

A. Motivation

This article identifies an application of Big Data analytics to classify bank customers. Every bank maintains a confidential customer profile to cross-check the risk perception of each customer. This customer classification is necessary for many reasons:

- To identify whether the customer furnishes wrong information about himself.
- To identify whether the customer does not repay loans on time.
- To restrict loans and credit card issue to high-risk customers.
- To correctly predict apt customers for the bank's products and policies, leading to effective targeted marketing; productive marketing in turn increases the bank's profit.

Classification based only on the current salary, the assets a customer owns, etc. may not be an actual reflection of customer behavior: people who possess valuable assets may still be reluctant to pay back loans on time. Hence this article classifies customers based on their previous transactions, including information such as how many loans a customer already holds and how many have been repaid. The challenge is to mine valuable information from transactions which range from terabytes to petabytes of data.
In this article, we consider bank transactions as Big Data and provide a Big Data approach to solve the problem of customer classification. To the best of our knowledge, this is the first time in the literature that the customer classification problem is solved using a Big Data approach. We combine PIG and R on the Hadoop platform to perform the classification. Advantages of solving the problem in a Big Data environment:
- All hardware-related tasks are handled by the Hadoop framework itself, e.g. checking whether all nodes are up, transferring a job in case of failure, selecting a node with less overhead, etc. The programmer can concentrate on programming.
- Tasks can be completed in minimum time.
- Scalability is ensured.
- Accuracy of classification increases because the training data is large.
- Efficient system development: PIG, R, etc. can be programmed by a person who does not even know Map Reduce programming principles.

II. BACKGROUND AND LITERATURE REVIEW

Several classification models have been developed, both in the literature and in industry, to categorize banking customers based on the transactions they perform. Techniques such as support vector machines [15, 16], decision trees [30, 31], neural network based models [17, 18], abnormality detection [26, 27], optimization approaches [28], Bayesian models [24], rule based models [22, 23], nearest neighbor approaches [21], pattern recognition algorithms [25] and hybrid approaches [29] have been proposed, but all of these methods depend on the actual input data available to classify the customers accurately. As the amount of training data increases, the accuracy of classification also increases. But if the amount of data grows beyond a limit, training and optimization become very difficult and time consuming. Instead of the traditional client-server architecture, a distributed processing system capable of handling large amounts of data can come to the rescue here. Hadoop [32] is a distributed processing framework provided as open source by Apache. Data processing tools such as Mahout, PIG, HIVE and R can be integrated with Hadoop to handle Big Data. Hadoop follows the Map Reduce [33] programming model and is widely used in applications where large-scale data processing is necessary. Customer feedback analysis [36], social media analysis [38], ad targeting [39], data warehouse modernization [34], operational analysis [35] and fraud detection [37] are widespread applications of Big Data analysis. Beyond the literature, Hadoop is gaining popularity in industry for market basket analysis and pricing optimization, credit risk, compliance and regulatory reporting, merchandizing and market analysis, behaviour-based targeting, supply-chain management and analytics, market and consumer segmentation, fraud detection and security analytics, scoring and analysis, customer segmentation, risk analysis and management, medical insurance fraud, CRM, trade surveillance and abnormal trading pattern analysis, games development, clinical trials, drug development analysis, patient care quality analysis, disease pattern analysis, etc.

This article proposes a method to classify bank customers using the Hadoop framework. Classification is done based on the transactions performed by the customers. Transaction analysis is tedious because the input data is very large: according to RBI statistics [40], in a single month a bank has to handle approximately 8,00,00,000 (80 million) transactions, which is not possible with traditional client-server processing. Thus we approach this problem as a Big Data problem and propose a solution based on Hadoop, PIG and R. The next section gives an overview of Hadoop, PIG and R. The clustering method adopted is K-Means clustering; depending on the application requirements it can be replaced with SVMs, neural networks or decision trees.
For simplicity, we adopt K-Means clustering here.

A. Hadoop - An Architecture to Process Big Data

Hadoop [13, 14] is an open source distributed processing framework from Apache for storing and processing large amounts of data on large numbers of commodity machines. Key features that make Hadoop suitable for processing large-scale data are its flexibility, cost effectiveness, fault tolerance, reliability, scalability, robustness and real time processing. The major components of Hadoop are the Hadoop Distributed File System (HDFS), the NameNode and the DataNodes.

Hadoop Distributed File System (HDFS): HDFS is a distributed file system which stores data in blocks of a predetermined size. Since the data is split across multiple blocks, which are spread across multiple machines, data access and operations can be performed in parallel, enabling faster processing. Data replication is done automatically to provide fault tolerance. HDFS follows a master/slave architecture, where the master is called the NameNode and the slaves are called DataNodes.

NameNode: The NameNode functions as the master node. It should be a high-end system which can survive hardware faults. In Hadoop 1.x the master node also runs the JobTracker daemon, which schedules Map Reduce jobs. The functions of the NameNode include:

- Maintaining the namespace of stored files, i.e. keeping the metadata of file blocks and their locations.
- Maintaining an index of the cluster configuration.
- Directing DataNodes to execute low-level operations.
- Recording the changes that take place in the cluster.
- Managing the replication of data blocks.
- Receiving the heartbeat of each DataNode to check whether it is alive; in case of a DataNode failure, the NameNode reassigns the task to another DataNode depending on data block availability, location, overhead, etc.

DataNode: The DataNode serves as a slave node. A DataNode can be any commodity hardware, and its failure does not create a problem because replication avoids any damage associated with a DataNode crash. The slave nodes also run the TaskTracker daemon, which executes Map Reduce tasks. The functions of a DataNode include:

- Performing low-level read/write operations on blocks.
- Implementing replication: forwarding data to other DataNodes and pipelining the data.
- Sending heartbeats to the NameNode.

Fig.1. Nodes in Hadoop
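In practice, the cleaned bank files are simply copied into HDFS, which then splits and replicates the blocks across the DataNodes. Below is a minimal sketch of doing this from R by wrapping the standard hadoop fs command line; the HDFS directory and file names are assumptions, not the paper's layout.

```r
# Copy the cleaned, comma-separated bank files into HDFS by calling the
# standard "hadoop fs" CLI from R; HDFS splits and replicates the blocks
# across the DataNodes automatically. Paths and file names are illustrative.
files <- c("clients.csv", "transactions.csv", "loans.csv", "cards.csv")

system2("hadoop", c("fs", "-mkdir", "-p", "/user/bank/berka"))
for (f in files) {
  system2("hadoop", c("fs", "-put", "-f", f, "/user/bank/berka/"))
}
system2("hadoop", c("fs", "-ls", "/user/bank/berka"))   # verify the upload
```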
B. PIG for Big Data

PIG [1, 2] was developed at Yahoo to process large data sets stored in the Hadoop distributed file system. It was observed that some programmers found it difficult to transform problems into the Map Reduce way of programming, so more time was spent on writing programs than on analysis. PIG was developed to solve these issues and is designed to handle any type of data. PIG is built on two components:

PIG Latin: PIG Latin is a data flow language, unlike declarative or procedural languages. Programs operate on files in HDFS, and user code and binaries can be included anywhere. Frequently used operations like join, group, filter and sort can be reused directly, so a programmer who is not familiar with Java can also write PIG programs. Programmers can develop their own functions, called PIG UDFs (User Defined Functions).

PIG Run Time Environment / PIG Engine: The PIG Engine runs over Hadoop, so programmers are completely isolated from the Hadoop ecosystem; even changes in the Hadoop cluster are handled by the PIG Engine. It is the responsibility of the PIG Engine to parse, optimize, generate the execution plan and monitor jobs. The figure below illustrates the working of PIG with a sample use case.

Fig.2. Working of PIG

C. R for Data Analysis

R [3, 4] is an open-source data analysis software widely used by statisticians, data scientists and analysts for analysis, visualization and predictive modeling. R allows scripts and functions to be written in a complete, interactive, object-oriented way. Graphical visualization is a unique capability of R: scatter plots, box plots, lattice charts, etc. are all supported. A majority of data mining algorithms, such as regression analysis, classification, clustering, association rule mining and text mining, can use R for efficient and easy implementation. R and Hadoop can go hand in hand to analyze and visualize large amounts of data, and several projects have emerged that combine them for different applications: RHadoop, RHIPE, ORCH and Hadoop Streaming with R are different ways in which R and Hadoop are combined. In RHadoop [5], the Hadoop distributed file system stores the large amount of data, the analytic code is written in R, and the data is processed in a Map Reduce manner.
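As an illustration of this style of processing, the following is a minimal RHadoop (rmr2) sketch that counts transactions per account in a Map Reduce manner. It is not the paper's code: the synthetic sample data and the account/amount fields are assumptions, and it presumes the RHadoop environment variables are already configured (see the setup in Section V-B).

```r
# Minimal RHadoop sketch using the rmr2 package: count transactions per account.
# The sample data below is synthetic; in the paper's setting the input would be
# the transaction records already stored in HDFS.
library(rmr2)

# Write (account_id, amount) pairs to HDFS in rmr2's native format.
tx <- to.dfs(keyval(sample(1:500, 10000, replace = TRUE),
                    runif(10000, 100, 50000)))

# Map: emit (account_id, 1) for every transaction.
# Reduce: sum the ones to obtain the number of transactions per account.
tx.count <- mapreduce(
  input  = tx,
  map    = function(k, v) keyval(k, 1),
  reduce = function(k, v) keyval(k, sum(v))
)

out <- from.dfs(tx.count)        # bring the result back into the R session
head(data.frame(account = keys(out), n.tx = values(out)))
```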
III. RESEARCH DESIGN AND METHODOLOGY

A. Problem Statement

Categorize the customers of a bank into low, medium and high risk based on their previous transactions. This could help banks market their products by understanding their customers. A bank has to analyze terabytes or petabytes of transaction data for accurate classification, hence we use the Hadoop distributed processing framework to store and process this large amount of data. Instead of relying on a single tool such as R for the analysis, we also use PIG to ease the pre-processing step. First we propose a solution based solely on PIG; then we illustrate how the classification can be improved using K-Means, implemented and visualized in R.

B. Data Pre-Processing using PIG

Assume that the bank possesses a large dataset containing customer details, transactions, loans, deposits and credit card details. Clean the data by removing not-applicable and null values and save the files in comma-separated format to HDFS. PIG is then used to pre-process the data: extract all useful information from the different tables and group it by customer id. Join the tables to extract the loan amount taken, duration to repay the loan, status of the loan, type of loan, amount involved in each transaction by the customer, date, type of owner, name of the district to which the customer belongs, average salary, rate of unemployment in that district, number of entrepreneurs in that district, type of card held and birth date.

C. PIG Algorithm to Classify Customers

PIG can easily classify the data based on a set of conditions. Here only three rules are needed to categorize the customers into three groups, which saves a lot of coding time. We categorize the customers based on the loans they have already taken, how much they have repaid, the unemployment rate in their district, etc. The criteria were chosen somewhat arbitrarily and can be changed according to the bank's requirements. The data obtained above can be used to classify the bank customers by following Algorithm 1.

Algorithm 1: Classification of Bank Customers using PIG

Input: loan amount taken, duration to repay the loan, status of loan, type of loan, amount involved in a transaction by the customer, date, type of owner, name of the district to which the customer belongs, average salary, rate of unemployment in that district, number of entrepreneurs in that district, type of card held and birth date.

Output: Category of each customer (Low Risk / Medium Risk / High Risk)

1. Extract all transaction details within 1 year.
2. If (transaction_amount > 10 lakh, i.e. 1,000,000) and (avg_sal > 10k) and (loan_status == 'A') and (age between 25 and 65), then categorize the customer as low risk; card status is "card can be upgraded".
3. If (transaction_amount > 10 lakh) and (avg_sal > 6k) and (loan_status == 'A' or loan_status == 'C') and (age between 25 and 55) and (unemployment_rate < 0.80), then categorize the customer as low risk, but card status is "card can be upgraded only after loan repayment".
4. If (avg_sal > 6k) and (loan_status == 'B' or loan_status == 'D') and (age > 35) and (no_of_entrepreneur > 100), then categorize the customer as risky; the card status can be downgraded.
5. Write the category of customers and the card status to separate files.

Algorithm 1 shows how to classify bank customers based on several heuristically chosen criteria such as loans taken and repayment status.
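For illustration, the three rules of Algorithm 1 can also be expressed directly in R over the pre-processed, joined records. This is only a sketch: the data frame cust and its column names (transaction_amount, avg_sal, loan_status, age, unemployment_rate, no_of_entrepreneur) are assumptions standing in for the fields extracted by PIG, and it assumes null values were already removed during pre-processing.

```r
# Sketch of Algorithm 1 in R over a pre-joined data frame of customer records.
# The 'cust' object and its column names are illustrative assumptions.
classify_customers <- function(cust) {
  low1 <- with(cust, transaction_amount > 1e6 & avg_sal > 10000 &
                     loan_status == "A" & age >= 25 & age <= 65)
  low2 <- with(cust, transaction_amount > 1e6 & avg_sal > 6000 &
                     loan_status %in% c("A", "C") & age >= 25 & age <= 55 &
                     unemployment_rate < 0.80)
  high <- with(cust, avg_sal > 6000 & loan_status %in% c("B", "D") &
                     age > 35 & no_of_entrepreneur > 100)

  cust$risk        <- "medium risk"            # default when no rule fires
  cust$card_status <- "no change"
  cust$risk[low2]        <- "low risk"
  cust$card_status[low2] <- "card can be upgraded only after loan repayment"
  cust$risk[low1]        <- "low risk"
  cust$card_status[low1] <- "card can be upgraded"
  cust$risk[high]        <- "high risk"
  cust$card_status[high] <- "card can be downgraded"
  cust
}

# Step 5 of Algorithm 1 could then be, for example:
# write.csv(classify_customers(cust), "customer_category.csv", row.names = FALSE)
```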
The implementation of Algorithm 1 using PIG is discussed in Section IV. PIG allows us to implement all SQL-like operations, and a little programming capability can be added by writing PIG UDFs. However, PIG is not suitable for applications that need complex algorithms, optimization or machine learning. Hence, to classify the data more effectively without writing complicated Map Reduce code, we use another data analysis language, R. Thus, in the proposed method, PIG is used for pre-processing (cleaning the data and extracting useful information from the original data) and R is used for classification.

IV. CLASSIFICATION BY R ON HADOOP

A clustering algorithm [6] should be devised to obtain a more accurate classification. Here we use the simple and elegant K-Means [7, 8] unsupervised clustering algorithm to classify the customers; R has built-in support for K-Means clustering. Before feeding data to K-Means we check for outliers [10]. Outliers are eliminated from salary, unemployment rate, etc., but no change is made to the loan amount sanctioned, as it is more sensitive than the other fields. Box plots [11] are used to detect outliers, and R has built-in functions for all of these steps. Since different data fields have different value ranges, we perform a scaling operation to bring the attributes to a uniform scale. We then compute the variance of each attribute and feed the prepared data as input to K-Means clustering.
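A minimal sketch of this preparation step in R, assuming the pre-processed records are already available in a data frame cust with illustrative column names:

```r
# Outlier inspection and scaling before K-Means; 'cust' and its column names
# are illustrative assumptions for the pre-processed bank records.
num.cols <- c("avg_sal", "unemployment_rate", "loan_amount", "age",
              "transaction_amount")

boxplot(cust[, num.cols], main = "Box plot of all attributes")   # spot outliers

# Drop rows whose salary or unemployment rate fall outside the box plot
# whiskers; loan_amount is deliberately left untouched (it is more sensitive).
keep <- !(cust$avg_sal %in% boxplot.stats(cust$avg_sal)$out) &
        !(cust$unemployment_rate %in% boxplot.stats(cust$unemployment_rate)$out)
cust.clean <- cust[keep, ]

# Bring all attributes to a comparable range before clustering.
cust.scaled <- scale(cust.clean[, num.cols])
apply(cust.scaled, 2, var)    # per-attribute variance of the scaled data
```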
A. K-Means Clustering for Classification

The basic goal of k-means clustering is to assign n objects to k clusters based on their similarity. Lloyd's algorithm [9] is widely accepted for implementing k-means clustering and is particularly suited for Big Data applications. In the context of banking customer classification we set k = 3, as we need to classify the customers into low, medium and high risk. After executing the K-Means algorithm in R we obtain a set of results such as the cluster assignments (1 to k), the cluster centroids, the within-cluster and between-cluster variation, the number of iterations taken to converge, etc.

Algorithm 2: Lloyd's Algorithm for implementing K-Means Clustering

Input: n data points and a value for k
Output: k clusters

1. Initialize the cluster centers $\mu_i$ for $i = 1,\dots,k$ (e.g., to k randomly chosen data points).
2. Assign each data point to the closest cluster: $c_i = \{\, j : d(x_j,\mu_i) \le d(x_j,\mu_l)\ \forall\, l \ne i,\ j = 1,\dots,n \,\}$, where $d(x,\mu_i) = \lVert x - \mu_i \rVert$.
3. Reset the position of each cluster center to the mean of the data points currently assigned to that cluster: $\mu_i = \frac{1}{|c_i|}\sum_{j \in c_i} x_j$ for all $i$.
4. Repeat steps 2-3 until the cluster centers remain the same over successive iterations.

V. RESULTS AND DISCUSSION

This section gives an overview of the implementation constructs adopted in our approach.

A. Dataset

We use the Berka data set [12], created by Petr Berka and Marta Sochorova, as the source of banking details. The Berka data set contains the details of about 5,300 bank customers with approximately 1,000,000 transactions, 700 loan records and 900 credit card records, collected from a Czech bank. The different tables in the data set are:

- ACCOUNTS: details of each account
- CLIENTS: personal information of each customer
- ORDER: details of each payment order
- TRANSACTIONS: details of each transaction made by the customer
- LOANS: details of each loan sanctioned by the bank
- CREDIT CARDS: details of credit cards issued by the bank
- DEMOGRAPHIC DATA: details about the districts

The entity relationships between the tables are shown in the figure below.

Fig.3. Entity Relationship in Berka Dataset

B. Setting Up RHadoop

After installing Hadoop, set the environment variables needed to integrate R, load the RHadoop packages, and read the input data from the Hadoop distributed file system (for details refer to Appendix 1).
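What that setup typically looks like is sketched below; the installation paths and the HDFS directory are assumptions that must be adapted to the local cluster (the paper's exact steps are in its Appendix 1).

```r
# RHadoop environment setup (paths are illustrative and installation-specific).
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
             "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")

library(rhdfs)   # HDFS access from R
library(rmr2)    # Map Reduce jobs written in R
hdfs.init()

hdfs.ls("/user/bank/berka")   # list the cleaned CSV files (hypothetical path)

# Read one comma-separated file from HDFS into the R session via an identity
# Map Reduce job with a CSV input format.
csv.fmt <- make.input.format("csv", sep = ",")
clients <- values(from.dfs(mapreduce(input        = "/user/bank/berka/clients.csv",
                                     input.format = csv.fmt)))
head(clients)
```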
C. Data Preprocessing

After data cleaning, the structure of the fetched data is as shown in Fig. 4.

Fig.4. Structure of data after cleaning stage
Fig.5. Summary of fetched data
Fig.6. Box plot of all attributes
Fig.7. Box plot of average salary, unemployment rate and loan amount
Fig.8. Scatter plots of a) average salary and b) loan amount
The box plot of all attributes is shown in Fig. 6. From this box plot it is clear that outliers exist in avg_sal, loan_amount and unemp_rate. The box plot applied to each of these attributes separately is given in Fig. 7. Scatter plots of average salary and loan amount are shown in Fig. 8; scatter plots give more insight into the data distribution of sensitive attributes like avg_sal and loan_amt. From Fig. 8.a it is clear that many points lie in the outlier range for avg_sal and loan_amt, hence it is better to keep these values unchanged.

K-Means Clustering

After the data pre-processing stage, the cleaned data is fed as input to the K-Means clustering algorithm. The elbow graph, obtained by plotting the within-cluster variance for 15 successive values of k, is shown in Fig. 9. From this figure it is clear that the slope changes most noticeably at k = 3, hence the optimal number of clusters can be taken as 3.

Fig.9. Elbow Graph of 15 successive iterations
Fig.10. K-Means Summary

The K-Means summary of the input data set is given in Fig. 10. The K-Means plot of age vs. average salary is illustrated in Fig. 11.a, where three colors denote the data points of the three clusters. Fig. 11.b plots the data points labelled with their cluster numbers 1, 2 and 3 so that the spread of the data points is clearly visible. Fig. 12 illustrates how much the data points vary within each cluster, i.e. it projects the variation of each cluster point from its mean (centroid).
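A sketch of how such an elbow graph and the final k = 3 fit can be produced with R's built-in kmeans(); cust.scaled and cust.clean refer to the objects assumed in the earlier preprocessing sketch.

```r
# Elbow graph: total within-cluster sum of squares for k = 1..15.
# 'cust.scaled' is the scaled attribute matrix from the preprocessing sketch.
set.seed(42)
wss <- sapply(1:15, function(k)
  kmeans(cust.scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster variance", main = "Elbow graph")

# Final model with k = 3 (low / medium / high risk).
fit <- kmeans(cust.scaled, centers = 3, nstart = 25)
fit$centers                       # cluster means (cf. Fig. 13)
table(fit$cluster)                # number of customers per cluster
plot(cust.clean$age, cust.clean$avg_sal, col = fit$cluster,
     xlab = "age", ylab = "avg_sal")          # cf. Fig. 11.a
```

The nstart argument restarts Lloyd's algorithm from several random initializations and keeps the best result, which makes the clustering less sensitive to the initial choice of centers.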
Fig.11.a. K-Means plot of age vs. avg_sal
Fig.11.b. Data points plotted with cluster numbers 1, 2 and 3
Fig.12. Variation of data points from the cluster mean
Fig.13. Mean of each cluster
The cluster means obtained from the Berka data set for the low, medium and high risk clusters are given in Fig. 13. (Details of how to obtain the graphs and plots in R are given in Appendix 2.)

VI. CONCLUSION

Analysis of Big Data helps organizations obtain valuable insights from the structured or unstructured data they possess. This article identifies an application of Big Data in the banking domain: categorizing customers based on their transactions. We use Hadoop as the software platform for storing the data, and data analysis tools like R and PIG are used to extract useful information from it. K-Means clustering is used to classify the customers, but it could be replaced with SVMs or neural networks for better accuracy. Since we have addressed the problem from a Big Data perspective, the learning time and heavy processing requirements are not a problem.

ACKNOWLEDGEMENT

I sincerely thank the Inspire Fellowship provided by the Department of Science & Technology, India for supporting my research in the form of a fellowship.

REFERENCES

[1] Alan Gates (2011), Programming Pig, O'Reilly Media, New York.
[2] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar and Andrew Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD '08, June 9-12, 2008, Vancouver, BC, Canada.
[3] Emmanuel Paradis, R for Beginners.
[4] W. N. Venables, D. M. Smith and the R Core Team, An Introduction to R.
[5] Vignesh Prajapati (2013), Big Data Analytics with R and Hadoop, Packt Publishing, UK.
[6] Jiawei Han, Micheline Kamber (2006), Data Mining: Concepts & Techniques, Morgan Kaufmann Publishers, Canada.
[7] J. B. MacQueen (1967), "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, vol. 1.
[8] Brian T. Luke, "K-Means Clustering".
[9] S. Lloyd, "Least squares quantization in PCM", IEEE Trans. Information Theory, vol. 28, 1982.
[10] Charu C. Aggarwal (2013), Outlier Analysis, Kluwer Academic Publishers, Boston.
[11] Purple Math, "Box Plot and 5 Number Summary".
[12] Berka, P. (2000), "Guide to the Financial Data Set", The ECML/PKDD 2000 Discovery Challenge.
[13] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko and Suresh Srinivas, "The Hadoop Distributed File System".
[14] Jeffrey Shafer, Scott Rixner and Alan L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance", ISPASS 2010, New York, USA.
[15] Y. Wang, S. Wang and K.K. Lai, "A New Fuzzy Support Vector Machine to Evaluate Credit Risk", IEEE Trans. Fuzzy Systems, vol. 13, no. 6, Dec.
[16] T. Joachims, "Making Large-Scale SVM Learning Practical", Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges and A. Smola, eds., MIT Press.
[17] L. Yu, S. Wang and K. Lai, "Credit Risk Assessment with a Multistage Neural Network Ensemble Learning Approach", Expert Systems with Applications, vol. 34, no. 2.
[18] H. Guo and S.B. Gelfand, "Classification Trees with Neural Network Feature Extraction", IEEE Trans. Neural Networks, vol. 3.
[19] R. Rymon, "An SE-Tree Based Characterization of the Induction Problem", Proc. Int'l Conf. Machine Learning.
[20] J.R. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1.
[21] Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, B.V. Dasarathy, ed., IEEE Computer Society Press.
[22] D. Martens, M. De Backer, R. Haesen, J. Vanthienen, M. Snoeck and B. Baesens, "Classification with Ant Colony Optimization", IEEE Trans. Evolutionary Computation, vol. 11, no. 5, Oct.
[23] D. Martens, B.B. Baesens and T. Van Gestel, "Decomposition Rule Extraction from Support Vector Machines by Active Learning", IEEE Trans. Knowledge and Data Eng., vol. 21, no. 2, Dec.
[24] D. Pedro and M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Machine Learning, vol. 29.
[25] G. Guo and C.R. Dyer, "Learning from Examples in the Small Sample Case: Face Expression Recognition", IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 35, no. 3, June.
[26] W.A. Chaovalitwongse, Y.-J. Fan and R.C. Sachdeo, "Support Feature Machine for Classification of Abnormal Brain Activity", Proc. ACM SIGMOD.
[27] B.P. Rachel, T. Shlomo, R. Alex, L. Anna and K. Patrick, "Multiplex Assessment of Serum Biomarker Concentrations in Well-Appearing Children with Inflicted Traumatic Brain Injury", Pediatric Research, vol. 65, no. 1.
[28] S. Olafsson, X. Li and S. Wu, "Operations Research and Data Mining", European J. Operational Research, vol. 187, no. 3.
[29] A. Benos and G. Papanastasopoulos, "Extending the Merton Model: A Hybrid Approach to Assessing Credit Quality", Math and Computer Modelling, vol. 48.
[30] R. Rymon, "An SE-Tree Based Characterization of the Induction Problem", Proc. Int'l Conf. Machine Learning.
[31] J.R. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1.
[32] Apache Hadoop Architecture.
[33] Google's Technical Paper on Map Reduce.
[34] Kumagai, J., "Mission Impossible? [FBI computer network]", IEEE Spectrum, vol. 40, no. 4, April 2003.
[35] J. Cohen, "Graph Twiddling in a MapReduce World", Computing in Science & Engineering, vol. 11, no. 4, June.
[36] Shunmei Meng, Wanchun Dou, Xuyun Zhang and Jinjun Chen, "KASR: A Keyword-Aware Service Recommendation Method on MapReduce for Big Data Applications", IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, Dec.
[37] Hormozi, H., Akbari, M.K., Hormozi, E. and Javan, M.S., "Credit Cards Fraud Detection by Negative Selection Algorithm on Hadoop (To Reduce the Training Time)", IEEE 5th Conference on Information and Knowledge Technology (IKT), May.
[38] Conejero, J., Burnap, P., Rana, O. and Morgan, J., "Scaling Archived Social Media Data Analysis Using a Hadoop Cloud", IEEE Sixth International Conference on Cloud Computing (CLOUD), June.
[39] Xu, J., Yu, Y., Chen, Z., Cao, B., Dong, W., Guo, Y. and Cao, J., "MobSafe: Cloud Computing Based Forensic Analysis for Massive Mobile Applications Using Data Mining", Tsinghua Science and Technology, vol. 18, no. 4, June.
[40] RBI Bank Transaction Statistics.
[41] What is Big Data?

Authors' Profiles

Lija Mohan (born April 4, 1988) is an academician and researcher interested in the high performance computing domain, particularly Big Data analytics, and has completed several projects in that area. She is a university rank holder for both her Masters and Bachelors degrees in Computer Science Engineering. She has published several research works and handles sessions during Big Data workshops. She has experience as a software engineer and teacher, and has won several fellowships and research grants, including Windows Azure and AWS research grants.

Dr. Sudheep Elayidom M. has been working as Associate Professor in the Computer Science Division of the School of Engineering for the past fifteen years. He secured first rank in the university for both his Masters and Bachelors degrees. He has several journal and conference publications to his credit and is the author of the book Data Mining and Warehousing, published by Cengage. He obtained his Ph.D. from CUSAT and is a pioneer in data mining. He is actively involved in current research trends including cloud computing, Big Data analytics and sentiment analytics.

How to cite this paper: Lija Mohan, Sudheep Elayidom M., "A Novel Big Data Approach to Classify Bank Customers - Solution by Combining PIG, R and Hadoop", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.9, pp.81-90, DOI: /ijitcs