Contemporary Techniques for Data Mining Social Media


Stephen Cutting

1 Introduction

Social media websites such as Facebook, Twitter and Google+ allow millions of users to communicate with one another in a range of dynamic ways [Zafarani et al., 2014, p.18] and also offer new ways for data about these users to be collected, processed and analysed [Adedoyin-Olowe et al., 2013]. The unique nature of social networks means that the data they produce presents distinct challenges for those seeking to use it for data mining [Zafarani et al., 2014, p.15]. This report analyses the challenges that data gathered from social media websites poses to the knowledge discovery in databases (KDD) process, of which data mining forms a part. The report also examines the main stages of the KDD process in the context of mining social media data.

2 Challenges Presented by Social Media Data

Social media can generate a wide range of different types of data, including user profile information, text, video and images. For many tasks related to data mining social media to be successful, these different forms of data have to be processed and analysed effectively [Witten and Frank, 2005]. This can create unique challenges [Adedoyin-Olowe et al., 2013, p.1]. The often complex relationships and interactions between users (social atoms), communities (social molecules) and content must be analysed, which requires the development of new methodologies and algorithms [Zafarani et al., 2014, pp.16-22].

Because well-known social networks can have tens of millions of users [Rossi, 2010], they can produce vast amounts of data (see Figure 1). Those trying to mine data gathered from social media often face challenges caused by the sheer amount of generated data that needs to be processed and analysed [Adedoyin-Olowe et al., 2013, p.1], [Zafarani et al., 2014, p.15].
The challenges of mining data from social media include data processing and data storage. Processing large volumes of data can be time consuming and very memory intensive [Yan et al., 2015]. Some data processing techniques that can be used on smaller data sets, including data replication and data migration, are not practical when dealing with very large amounts of data [Lawrence et al., 2013, p.70]. Physically storing enormous amounts of data may be prohibitively time consuming or expensive for some individuals and organisations.

Although social media generates an enormous amount of data, there can be a lack of information about specific entities. This is known as the big data paradox [Zafarani et al., 2014, p.17]. It could cause problems if data were needed on an individual user.

The unstructured, noisy nature of data produced by social media can make data mining more difficult [Zafarani et al., 2014, p.15], [Adedoyin-Olowe et al., 2013, p.1]. One reason that data collected from such sources is frequently noisy and unstructured is that it is generated by a large number of individual users [Zafarani et al., 2014, p.15]. Noise in data can cause a range of issues in the data mining process. Erroneous data can have a negative impact on data mining algorithms simply because the data that they are processing contains a large number of errors [Wu and Zhu, 2008, p.917]. Misclassification can often occur, even in data that is being utilised while training the classifier [Witten and Frank, 2005, p.5]. Errors, or noise, in data can also make it more difficult for machine learning algorithms to identify an optimal concept description for a specific set of data, because an optimal description may be discarded if it contains attributes that are identified as errors [Witten and Frank, 2005, pp.26-29].

Figure 1: Diagram of estimated data generated on social network sites every minute. Source: [Zafarani et al., 2014, p.36]

3 Harvesting Social Media Data

Application program interfaces (APIs) are one of the most common ways that data is gathered from social media [Zafarani et al., 2014, p.17], [Abdulrahman et al., 2013]. APIs enable software applications to communicate with one another securely and accurately over the Internet [Abdulrahman et al., 2013, p.89] and often take the form of libraries. Facebook, Twitter and Google+ all have APIs that developers can utilise when carrying out tasks such as data mining [Russell, 2013, p.48]. Facebook's Graph API has an SQL-like language called Facebook Query Language (FQL), which can be used to develop queries for the API, although FQL is being phased out [Facebook, 2014a].

Facebook's Graph API is a good example of an API that is used in the data mining of social media. It is the primary way that people request and submit data to Facebook's social graph [Facebook, 2014b]. Social graphs are a representation of the information held on social networks [Sharad and Danezis, 2014]. The two main tables that make up Facebook's social graph are called node and link [Facebook, 2014b]. The node table contains information about nodes, which represent objects such as users, photos and videos [Qu and Dessloch, 2014, p.139]. The link table contains information about the links between different objects, for example whether two users are friends or whether a user likes a photograph [Qu and Dessloch, 2014, p.139].

4 Storing Social Media Data

Some KDD techniques require data to be stored in some way [Adedoyin-Olowe et al., 2013]. This section explores how social media data is stored in preparation for such analysis. As has been discussed, storing the massive amounts of data that can be produced by social media websites for use in KDD processes can present challenges. As those running such websites require more storage space, scalability becomes an important consideration [Tran, 2013, p.1]. There are two primary methods for addressing this scalability issue: horizontal scaling and vertical scaling [Tran, 2013, p.1]. Vertical scaling is the addition of more hardware resources to servers [Pokorny, 2013]. Horizontal scaling involves spreading the workload across new commodity servers [Pokorny, 2013]. Horizontal scaling is utilised by most social media websites today as it avoids bottleneck issues and is more cost effective than vertical scaling [Tran, 2013, p.1].

Although it is well known that duplicate data can cause issues in distributed database systems [Witten and Frank, 2005, p.54], it can be useful in the context of social media websites. This is because replicated data can be used to minimise the disruption caused by a system failure [Tran, 2013, p.1].
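The combination of horizontal scaling and replication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any particular social network stores its data: the server names, the `REPLICAS` value and the helper functions `servers_for` and `readable_copies` are all hypothetical.

```python
import hashlib

# Hypothetical pool of commodity servers; horizontal scaling adds names here.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]
REPLICAS = 2  # assumed number of copies kept of each user's data


def servers_for(user_id: str, servers=SERVERS, replicas=REPLICAS):
    """Return the servers that should hold copies of a user's data."""
    # A stable hash keeps the placement identical across program runs.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(servers)
    # The first copy goes on `start`; further copies go on the servers after it.
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


def readable_copies(user_id: str, failed: set):
    """Servers that can still serve a user's data after some have failed."""
    return [s for s in servers_for(user_id) if s not in failed]
```

Because every user's data is placed on two different servers, the failure of any single server still leaves one readable copy. The simple modulo placement used here would move many users when a server is added, so it should be read as a teaching aid rather than a production scheme.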
It has also been shown that it can be more efficient to store data about connected individuals, sometimes in duplicate, on servers that are more closely connected [Tran, 2012, p.2]. A storage scheme that adheres to this principle is known as being socially aware [Tran, 2012].

5 Data Cleansing

Data cleansing is an important part of the KDD process in general [Witten and Frank, 2005, pp.48-54], but it is especially important and potentially more difficult when the data comes from a social media website, as is discussed in section 2 of this report. Commonly used KDD data cleansing techniques are utilised in this process. They aim to remove noise and deal with outliers, missing values and duplicate data [Zafarani et al., 2014, p.143].

As has been discussed, data from social media websites can be very noisy. Filtering algorithms are often effective at reducing the amount of noise in data during data cleansing [Zafarani et al., 2014, p.141]. However, the reduction of noise can lead to data being lost [Wu and Zhu, 2008]. Although reducing class noise in a training dataset can frequently help a learner be more accurate [Zhu and Wu, 2004], simply deleting an instance that contains erroneous attribute values can cause issues [Wu and Zhu, 2008]. This is because the correct attribute values of the instance may still aid the learning process [Zafarani et al., 2014, p.141]. There is no evidence to suggest that data cleansing can improve data mining results on data sets that contain erroneous or missing data [Zafarani et al., 2014, p.141].

6 Pre-processing Social Media Data

In social media, a great deal of information is represented as networks, which can be sampled by selecting a subset of their edges and nodes using various sampling techniques [Adedoyin-Olowe et al., 2013]. Networks can additionally be sampled starting from a small set of nodes. This is achieved by sampling the edges and nodes that are connected or closely related to those nodes, together with their connected components [Adedoyin-Olowe et al., 2013].

Aggregation is one technique that is used in the pre-processing of data in the KDD process. It involves the merging of multiple objects into a single object [Tan et al., 2006, pp.45-47], [Witten and Frank, 2005, p.49]. Aggregation is used in pre-processing data gathered from social media websites [Zafarani et al., 2014, p.142]. Combining multiple features can save storage space and reduce data variance, which can give the data a higher resistance to noise and distortion [Zafarani et al., 2014, p.142].

Discretisation is the process of converting a continuous attribute into a categorical attribute, which may be necessary if a data mining algorithm needs data to be in the form of categorical attributes [Tan et al., 2006, p.57]. Discretisation is also useful as it allows data from social media that has been placed in a group to be compared with other information and analysed as a whole [Zafarani et al., 2014, p.142].

The selection of subsets of features helps reduce the number of redundant and irrelevant features in a data set [Tan et al., 2006, p.52] and it is used in the pre-processing of data from social media [Zafarani et al., 2014, p.142]. This is important because irrelevant or redundant features can reduce the quality of the clusters that are identified and the accuracy of classification [Tan et al., 2006, p.52].

Sampling is selecting and processing a small, random section of the data, which avoids processing the entire data set [Riondato, 2014]. This is especially important when mining information gathered from social media, as processing an entire data set can be prohibitively computationally expensive [Zafarani et al., 2014, p.143].
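Three of the pre-processing steps described above (aggregation, discretisation and sampling) can be illustrated with a short Python sketch. The records, the cut-off values and the function names are invented for the example and are not taken from any of the cited sources.

```python
import random
from collections import defaultdict

# Hypothetical records: (user, number of likes on one post).
posts = [
    ("ann", 3), ("ann", 120), ("bob", 0), ("bob", 7),
    ("cat", 45), ("cat", 2), ("dan", 900), ("dan", 15),
]


def aggregate_likes(records):
    """Aggregation: merge each user's posts into a single total."""
    totals = defaultdict(int)
    for user, likes in records:
        totals[user] += likes
    return dict(totals)


def discretise(likes, low=10, high=100):
    """Discretisation: map a continuous count to a categorical label.

    The cut-off values are arbitrary assumptions chosen for the example.
    """
    if likes < low:
        return "low"
    if likes < high:
        return "medium"
    return "high"


def sample(records, k, seed=0):
    """Sampling: pick k records uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(records, k)


totals = aggregate_likes(posts)
labels = {user: discretise(total) for user, total in totals.items()}
subset = sample(posts, 3)
```

With the data above, aggregation reduces eight post records to four per-user totals (ann 123, bob 7, cat 47, dan 915), and discretisation labels them high, low, medium and high respectively. The sampled subset contains three of the original records, so later stages can work on a fraction of the data.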
7 Mining Social Media Data

Data mining is a stage in the KDD process in which pre-processed data is analysed to discover relationships and patterns [Zafarani et al., 2014, p.135]. Various techniques are utilised in the mining of data from social media websites and some of the most relevant of these are discussed in this section. The nature of the data that is being analysed often determines the best data mining techniques to deploy when analysing it [Zafarani et al., 2014, pp ].

Figure 2: Diagram of KDD process. Source: [Zafarani et al., 2014, p.36]

There are several different categories of data mining algorithms. Supervised learning and unsupervised learning are two important categories in the mining of data from social media websites. In supervised learning, the class attribute exists, and the objective is to predict the class attribute value [Adedoyin-Olowe et al., 2013]. In unsupervised learning, the dataset has no class attribute, and the objective is to identify similar instances in the data and place them in groups [Adedoyin-Olowe et al., 2013]. The grouping of instances that share similarities allows the identification of important patterns in a dataset [Cabanes et al., 2010].

Despite the use of filtering algorithms, like those discussed in section 5 of this report, any data mining effort is likely to suffer if there are many errors in the data [Wu and Zhu, 2008, p.918]. This means that it is important that data mining algorithms are error tolerant and are not severely affected by a relatively small amount of noise in the data being analysed [Wu and Zhu, 2008, p.918].

Graph theory, the study of graphs, is very important in mining information from social media. This is because it allows the identification of important features of the data and it is particularly effective at dealing with large datasets [Adedoyin-Olowe et al., 2013]. Community detection is one graph theory technique that is used to analyse social media data. This entails the utilisation of methodologies including vertex clustering, a form of hierarchical clustering, to try to identify the social molecules, or communities, that social atoms, or users, belong to [Adedoyin-Olowe et al., 2013].

8 Conclusions

As social media websites continue to grow, the amount of data that they produce will increase. Fortunately, the combination of the ever increasing power and storage capacity of computer systems and the development of algorithms and methodologies like those discussed in this report means that it may be possible to use this data in KDD processes. The data mining stage of these processes may well be able to reveal a great deal about the individuals that make up these vast social networks and the dynamic connections between them.
References

[Abdulrahman et al., 2013] Abdulrahman, R., Neagu, D., Holton, D. R. W., Ridley, M. J., and Lan, Y. (2013). Data Extraction from Online Social Networks Using Application Programming Interface in a Multi Agent System Approach. T. Computational Collective Intelligence, 11.

[Adedoyin-Olowe et al., 2013] Adedoyin-Olowe, M., Gaber, M. M., and Stahl, F. (2013). Survey of Data Mining Techniques for Social Media Analysis. CoRR, abs/

[Cabanes et al., 2010] Cabanes, G., Bennani, Y., and Fresneau, D. (2010). Mining RFID behavior data using unsupervised learning. IJAL, 1(1).

[Facebook, 2014a] Facebook (2014a). Facebook Query Language (FQL) Overview. https://developers.facebook.com/docs/technical-guides/fql/. [Online; Accessed February 2015].

[Facebook, 2014b] Facebook (2014b). Quickstart for Graph API. facebook.com/docs/graph-api/. [Online; Accessed February 2015].

[Lawrence et al., 2013] Lawrence, B. N., Bennett, V. L., Churchill, J., Juckes, M., Kershaw, P., Pascoe, S., Pepler, S., Pritchard, M., and Stephens, A. (2013). Storing and manipulating environmental big data with JASMIN. In Hu, X., Lin, T. Y., Raghavan, V., Wah, B. W., Baeza-Yates, R. A., Fox, G., Shahabi, C., Smith, M., Yang, Q., Ghani, R., Fan, W., Lempel, R., and Nambiar, R., editors, Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October 2013, Santa Clara, CA, USA. IEEE.

[Pokorny, 2013] Pokorny, J. (2013). NoSQL databases: a step to database scalability in web environment. International Journal of Web Information Systems, 9(1).

[Qu and Dessloch, 2014] Qu, W. and Dessloch, S. (2014). A Demand-Driven Bulk Loading Scheme for Large-Scale Social Graphs. In Manolopoulos, Y., Trajcevski, G., and Kon-Popovska, M., editors, Advances in Databases and Information Systems - 18th East European Conference, ADBIS 2014, Ohrid, Macedonia, September 7-10, Proceedings, volume 8716 of Lecture Notes in Computer Science. Springer.

[Riondato, 2014] Riondato, M. (2014). Sampling-based data mining algorithms: Modern techniques and case studies. In Calders, T., Esposito, F., Hüllermeier, E., and Meo, R., editors, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, Proceedings, Part III, volume 8726 of Lecture Notes in Computer Science. Springer.

[Rossi, 2010] Rossi, L. (2010). Playing your network: gaming in social network sites. Available at SSRN.

[Russell, 2013] Russell, M. (2013). Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O'Reilly Media.

[Sharad and Danezis, 2014] Sharad, K. and Danezis, G. (2014). An Automated Social Graph De-anonymization Technique. CoRR, abs/

[Tan et al., 2006] Tan, P.-N., Steinbach, M., Kumar, V., et al. (2006). Introduction to data mining, volume 1. Pearson Addison Wesley, Boston.

[Tran, 2012] Tran, A. (2012). Data Storage for Social Networks: A Socially Aware Approach. SpringerBriefs in Optimization. Springer.

[Tran, 2013] Tran, D. (2013). Data Mining Social Networks: A Socially Aware Approach. Springer.

[Witten and Frank, 2005] Witten, I. H. and Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

[Wu and Zhu, 2008] Wu, X. and Zhu, X. (2008). Mining With Noise Knowledge: Error-Aware Data Mining. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 38(4).

[Yan et al., 2015] Yan, D., Yin, X., Lian, C., Zhong, X., Zhou, X., and Wu, G. (2015). Using Memory in the Right Way to Accelerate Big Data Processing. J. Comput. Sci. Technol., 30(1).

[Zafarani et al., 2014] Zafarani, R., Abbasi, M. A., and Liu, H. (2014). Social Media Mining: An Introduction. Cambridge University Press.

[Zhu and Wu, 2004] Zhu, X. and Wu, X. (2004). Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review, 22(3).

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

TIM 50 - Business Information Systems

TIM 50 - Business Information Systems TIM 50 - Business Information Systems Lecture 15 UC Santa Cruz March 1, 2015 The Database Approach to Data Management Database: Collection of related files containing records on people, places, or things.

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2 Class Announcements TIM 50 - Business Information Systems Lecture 15 Database Assignment 2 posted Due Tuesday 5/26 UC Santa Cruz May 19, 2015 Database: Collection of related files containing records on

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Introduction. Chapter 1

Introduction. Chapter 1 This chapter is from Social Media Mining: An Introduction. By Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. Cambridge University Press, 2014. Draft version: April 20, 2014. Complete Draft and Slides

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113 CSE 450 Web Mining Seminar Spring 2008 MWF 11:10 12:00pm Maginnes 113 Instructor: Dr. Brian D. Davison Dept. of Computer Science & Engineering Lehigh University davison@cse.lehigh.edu http://www.cse.lehigh.edu/~brian/course/webmining/

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5. Warehousing, Data Acquisition, Data. Visualization Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization 5-1 Learning Objectives

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

More information

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com

More information

Data Modeling for Big Data

Data Modeling for Big Data Data Modeling for Big Data by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies In the Internet era, the volume of data we deal with has grown to terabytes

More information

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

More information

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information
