What is Big Data
Outline: What is big data, and where does it come from? How do we deal with big data?
Big Data Everywhere! As humans, we generate a lot of data in our everyday activities. When you buy something, you generate a transaction record for your purchase; when you go online or message a friend on your phone, that too generates tons of data. In the past, most of this data was simply thrown away, but in recent years people have started realizing that we can find many interesting things in it. For example, a store can use the data to learn your purchasing behavior and sell you more things, and biologists can use it to trace how a disease propagates across different places. In an environment like the IoT, where everything is connected to the Internet, we will generate even more data.
How much data? Bill Gates is often (perhaps apocryphally) quoted as saying that 640K of memory ought to be enough for a computer. Today, we count big data in TB and PB: a TB is 1,000 GB and a PB is 1,000 TB. To give some examples, Google processes about 20 PB per day, and Facebook and eBay each generate roughly 10-50 TB per day. Assuming we used a 4G link at 100 MB/s to send this data, it would take more than a day to transmit one day's worth of data generated by Facebook users.
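The transfer-time claim above is easy to check with back-of-the-envelope arithmetic, sketched here using the 10-50 TB/day and 100 MB/s figures from the text (decimal units assumed):

```python
# Back-of-the-envelope transfer time for one day of generated data.
TB = 1000 ** 4          # 1 TB in bytes (decimal units)
MB = 1000 ** 2          # 1 MB in bytes

def transfer_days(data_bytes, rate_bytes_per_sec):
    """How many days it takes to send the data at the given rate."""
    return data_bytes / rate_bytes_per_sec / 86400  # 86400 seconds per day

low = transfer_days(10 * TB, 100 * MB)    # ~1.16 days
high = transfer_days(50 * TB, 100 * MB)   # ~5.79 days
print(f"{low:.2f} to {high:.2f} days")
```

Even at the low end of 10 TB/day, the transfer takes more than a full day, which is why moving computation to the data (rather than data to the computation) matters so much for big data systems.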
Some scientific projects generate even more data than these online services. The LHC (Large Hadron Collider), used for high-energy physics, generates more than 15 PB per year, and EarthScope generates 67 TB per day. Without a supercomputer, it would be impossible to analyze this data.
The EarthScope EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
Types of Data This data can be generated in different forms: structured, semi-structured, graph, and text. It can also be real-time or non-real-time.
What to do with this data? What can we do with the data? You can use it to generate statistics about the past; for example, you can use Amazon's data to find out the current most popular book people have bought. Or, given a question, you can use the data to find the answer; for example, the FBI can use your facial image to find out everything about you. Or you can discover something new in the data, which is what many scientists do every day; for example, biologists use biological data to figure out how to make people live longer.
Warehouse Architecture The first type of usage is so-called data warehousing. We normally collect data from various places, integrate it, and put it together in a central server, so that people can access the central server to do the analysis they want. During integration and analysis, we also generate some intermediate data, so-called metadata.
Aggregates For example, you can query how many products were sold on day 1 by adding up the amounts for day 1 with a simple SQL statement:

SELECT sum(amt) FROM sale WHERE date = 1

sale:
  prodid  storeid  date  amt
  p1      c1       1     12
  p2      c1       1     11
  p1      c3       1     50
  p2      c2       1     8
  p1      c1       2     44
  p1      c2       2     4

The query returns 12 + 11 + 50 + 8 = 81.
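The same aggregate can be sketched in plain Python over the sale table from the slide, which makes explicit what the SQL engine is doing:

```python
# The sale table from the slide: (prodid, storeid, date, amt) rows.
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Equivalent of: SELECT sum(amt) FROM sale WHERE date = 1
total = sum(amt for prodid, storeid, date, amt in sale if date == 1)
print(total)  # 81
```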
What is Data Mining? Data mining is generally different from the first two usages of the data: here, we are trying to discover something unexpected or unknown in the data. In the previous examples, by contrast, we already knew what we stored and what we would get back from the database.
Data Mining Tasks There are many different techniques that can be used for data mining; here we briefly describe some common ones.
Classification: Definition Classification is one of the most common tasks in data mining. The idea is to use the data to train a model based on some features of the data. For example, suppose we want to divide people into two classes, one of which has a healthy lifestyle: eating well, sleeping well, and exercising regularly. We first collect a dataset of such people and use it to train a model based on their sleeping time, diet, and exercise hours. In the future, when we have a new person's data, we can use this model to tell whether that person is living healthily or not. When building a model, we usually need to test its accuracy; the general practice is to use half of the data as training data to build the model and the remaining half to validate it.
Decision Trees A decision tree is one mechanism for classification. For example, from the training set below we can find that if a person is from SF, or drives a van, they are more likely to buy a new car.

training set:
  custid  car     age  city  newcar
  c1      taurus  27   sf    yes
  c2      van     35   la    yes
  c3      van     40   sf    yes
  c4      taurus  22   sf    yes
  c5      merc    50   la    no
  c6      taurus  25   la    no
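The rule read off this table ("from SF, or drives a van, likely buys a new car") can be written out as a tiny hand-built decision tree and checked against the same six rows:

```python
# Training set from the slide: (custid, car, age, city, newcar).
training_set = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]

def predict(car, city):
    # Root split on city; the "la" branch splits again on car type.
    if city == "sf":
        return "yes"
    return "yes" if car == "van" else "no"

# This hand-built tree classifies every training row correctly.
assert all(predict(car, city) == newcar
           for _, car, _, city, newcar in training_set)
print("all 6 rows classified correctly")
```

A real decision-tree learner would choose these splits automatically, by picking at each node the feature that best separates the classes.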
Clustering Clustering is a way to divide data into different groups. For example, if you have people's age, education, and income data, you may see that people who are older and have more education generally have a higher income.
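Grouping like this can be done automatically. Below is a minimal k-means sketch (k = 2) on made-up (age, income) points; the data and initial centers are illustrative only:

```python
def kmeans(points, centers, rounds=10):
    """Plain k-means: assign points to nearest center, then move centers."""
    for _ in range(rounds):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

points = [(25, 30), (28, 35), (30, 32),    # younger, lower income
          (55, 90), (60, 95), (58, 88)]    # older, higher income
centers, clusters = kmeans(points, centers=[(25, 30), (60, 95)])
print(clusters[0])  # the younger / lower-income group
print(clusters[1])  # the older / higher-income group
```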
Association Rule Mining Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The most commonly used example is analyzing sales records.
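Two of the standard interestingness measures, support and confidence, can be sketched on a made-up set of sales transactions. The rule checked here, {bread} -> {butter}, is purely illustrative:

```python
# Made-up sales transactions, each a set of items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "bread", "butter"},
    {"beer", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))       # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}))  # 3/4 = 0.75
```

A "strong" rule is one whose support and confidence both exceed chosen thresholds; algorithms such as Apriori search for all such rules efficiently.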
Other Types of Mining The techniques above assume you have a structured database, with structured columns and rows for the analysis. There is also unstructured data, handled by techniques such as text mining and graph mining. Text mining is most often used to mine information from web pages, for example to find which web pages are more related to each other. Graph mining works on a special kind of structured data in which entities are stored in graph form: the nodes are features and the links are the relationships between features.
Data Streams Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many data stream mining applications, can be read only once or a small number of times using limited computing and storage capabilities. We normally look at only a subset of the data at a time, using a so-called window technique; there are different ways of defining the window. Examples of data streams include computer network traffic, phone conversations, ATM transactions, and web searches.
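The window technique can be sketched as follows: keep only the most recent N stream items and recompute statistics as each new item arrives (a sliding window; the stream values here are made up):

```python
from collections import deque

WINDOW = 3
window = deque(maxlen=WINDOW)   # oldest item falls off automatically

stream = [5, 8, 6, 100, 7, 9]   # e.g. per-second request counts
for value in stream:
    window.append(value)        # read each item once, as a stream requires
    avg = sum(window) / len(window)
    print(f"value={value:3d}  window={list(window)}  avg={avg:.1f}")
```

Because only the last N items are stored, memory stays constant no matter how long the stream runs, which is exactly the constraint described above.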
Challenges in Handling Big Data The issue with big data is that, because it is big, you need big storage and big processing power to handle it. You also need fast algorithms and architectures to process it.
Big Data Landscape Many technologies have been proposed to handle big data. In this course, we will focus on Hadoop, but we will also briefly mention some of the other technologies.
Big Data Technology (#1) Current big data technologies generally focus on three aspects: 1. how to reduce the running time of computing over big data; 2. how to make big data analysis tools more and more effective; 3. how to get more and more insight out of the data and use it for business. Another trend we can foresee is that data will become bigger and bigger, from terabytes to even zettabytes! It should be understood that there are at least three significant aspects of big data that make it unique, beyond just "an order of magnitude more data beyond what you have now." First, we need to recognize that traditional methods for moving, processing, and querying data were not sufficient; the big data industry has created an entirely new set of techniques (and adapted some that existed) so that organizations can actually process the full universe of information they possess in enough time to get inside the windows of key business processes and critical decision trees. Thus, Fast Data techniques provide the ability to 'see' all (or at least enough) of what you know in a short enough time to actually do something with what you've learned.
Big Data Technology (#2) Second, there are qualitative differences between traditional business databases and big data. While Fast Data is about new techniques to process and transform raw information considerably faster than ever before, we need Big Analytics to turn information into knowledge using a combination of existing and new approaches. As you can see from the slides, some of the classic players in analytics are in use here, including MATLAB, SAS, and R. But some of the most interesting aspects of big data can be found in relatively new entrants such as Apache Hive and Mahout, the latter of which brings automated machine learning to bear to find hidden trends and otherwise unthought-of or unconsidered ideas. In fact, an entire industry is growing up in smart information management systems that will "not rely on users dreaming up smart questions to ask computers; rather, they will automatically determine if new observations reveal something of sufficient interest to warrant some reaction, e.g., sending an automatic notification to a user or a system about an opportunity or risk."
Big Data Technology (#3) Finally, the powerful yet unfocused tools of Big Analytics are not by themselves sufficient to reap the rewards of big data. That requires taking the sum of the information at hand, applying analytic processes to it, and finally generating new knowledge and insights using a specific, situated method. Insight must be in the domain of the business to be useful, and this part of big data is where the technology is connected to ground truth in a feedback loop. That is, the tools of Big Analytics are just tools by themselves; they are not actually useful in a business context until they are directed at deriving a particular type of result. Insights must also be connected to specific objectives (examples depicted in the moving parts visual above) in order to have high levels of impact.