Modelling and Big Data. Leslie Smith. ITNPBD4, October 10 2015. Updated 9 October 2015.
Big data and Models: content. What is a model in this context (and why the context matters); explicit models; mathematical models; statistical models; implicit models; neural networks; data models; models and parameters; constraining models; creating models: directly from the data, or using explicit knowledge?; using neural networks.
ITNPD4: Applications of Big Data
Models. "Model" is a word that means many different things in different scientific contexts, and has even more meanings in Computing (never mind elsewhere). In Biology: a model organism. Also in Biology: a simplified version of a complex system that can be used to make predictions. In Physics: a set of equations (etc.) that explains (up to a point) the behaviour of a system, again often for making predictions. In data analysis: a set of equations, or a set of computer code, that describes a complex set of data, plus different meanings in a Computing/data processing context. It is one of the most used words in science, with many confusingly different meanings.
Different types of model in experimental/empirical science. Explicit model: a model that can be described precisely, for example a set of coupled differential equations describing how different aspects of a dataset interact with each other. Implicit model: a model that is described in a set of computer code, generally created from a set of data. It is implicit in the sense that, although an explicit description may be possible, the model is generally used to make predictions directly from a set of data, rather than being written out and inspected directly. Note that models may or may not be deterministic.
Models in Computing. In Computing: a data model. "A data model organizes data elements and standardizes how the data elements relate to one another. Since data elements document real life people, places and things and the events between them, the data model represents reality, for example a house has many windows or a cat has two eyes" (Wikipedia). (Note: even though this is a Computing Science Department, Computing is generally not an experimental or empirical subject.)
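
The "a house has many windows" relationship in the quote can be sketched as a small data model. This is only an illustration: the class names `House` and `Window`, their fields, and the sample values are all assumptions made for the example, not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Window:
    # Hypothetical attributes, chosen only for illustration.
    width_cm: float
    height_cm: float


@dataclass
class House:
    address: str
    # One-to-many relationship: a house has many windows.
    windows: List[Window] = field(default_factory=list)


house = House(address="1 Example Street")
house.windows.append(Window(width_cm=120.0, height_cm=90.0))
house.windows.append(Window(width_cm=60.0, height_cm=60.0))
print(len(house.windows), "windows at", house.address)
```

The point is that the data model fixes which elements exist and how they relate, independently of any particular data values.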
Data models. See the Big Databases and NoSQL course, ITNPD3. Data models provide a framework for storing data. At one end, one has an SQL database: structured data. At the other end, one has completely unstructured data (actually, even unstructured data usually has some structure: without structural metadata, data is not usable at all). In fact, Data Modelling has many forms: try the Wikipedia page on data models!
Data driven business models (DDBM). A DDBM is a model of how the business uses data and what the business uses data for. It is useful for an overview of the whole Big Data system in an organisation.
Explicit and implicit models. We saw that we needed models to allow us to understand causation. Without a model we can only have correlations: causation implies mechanism. We use models to make sense of data. Such models can take many forms. Simple linear models: $y = ax + b$, with $a$ and $b$ constants, is a model connecting $y$ and $x$. Like most models it has parameters, $a$ and $b$, and we can use existing data to set these. This is clearly an explicit model.
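
As a minimal sketch of using existing data to set the parameters of y = ax + b, the code below fits a and b by ordinary least squares. The choice of least squares is an assumption (the slide does not say which fitting criterion is meant), and the function name `fit_line` and the sample data are purely illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared errors for y = a*x + b."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x about its mean; zero means all x values are identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Co-variation of x and y about their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Illustrative data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

With noisy real data the fitted a and b would only approximate the underlying relationship, which is why the quality of the parameters matters for prediction.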
More explicit mathematical models. Or a polynomial model of degree $n$: $y = a_n x^n + a_{n-1} x^{n-1} + \dots + a_0$, which has $n+1$ parameters. Explicit models are often expressed in differential equation terms: $dy/dx = 1\,y + s(t)$.
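
To make the parameter count concrete, here is a small sketch that stores a degree-n polynomial as a list of n+1 coefficients and evaluates it with Horner's rule. The function name `eval_poly`, the coefficient ordering (highest power first), and the example polynomial are assumptions chosen for illustration.

```python
def eval_poly(coeffs, x):
    """Evaluate a_n*x^n + ... + a_0, with coeffs = [a_n, ..., a_0]."""
    result = 0.0
    # Horner's rule: fold in one coefficient per step, highest power first.
    for a in coeffs:
        result = result * x + a
    return result


# Illustrative degree-2 polynomial: y = 2x^2 + 0x - 1.
coeffs = [2.0, 0.0, -1.0]
degree = len(coeffs) - 1
print("degree:", degree, "parameters:", len(coeffs))
print("y(3) =", eval_poly(coeffs, 3.0))
```

A model of degree n therefore always carries n+1 numbers that have to be set from data or from domain knowledge.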
Using explicit models. We often want to make predictions from models. For explicit models this means constraining the parameters of the model: giving them values. The quality of the prediction depends on the appropriateness of the model and the accuracy of the parameters. One can argue that model selection is itself a parameter selection problem: which functions to use, how many to use, etc. In general, one uses a mixture of the actual data available and knowledge about the system to choose the model. The parameters are then set using the data, sometimes after first initialising them to ballpark-correct values using domain knowledge.
Simple linear interpolation
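
A minimal sketch of linear interpolation between two known data points is given below. The function name `lerp` and the sample points are assumptions for illustration; the idea is simply to predict y at an intermediate x by assuming a straight line between the two neighbouring measurements.

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x, given the points (x0, y0) and (x1, y1)."""
    if x1 == x0:
        raise ValueError("x0 and x1 must differ")
    # Fraction of the way from x0 to x1 (0 at x0, 1 at x1).
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)


# Illustrative points: y = 10 at x = 2 and y = 20 at x = 3.
print(lerp(2.5, 2.0, 10.0, 3.0, 20.0))
```

Values of x outside the interval between x0 and x1 give an extrapolation rather than an interpolation, and are correspondingly less trustworthy.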
Implicit models. Implicit models (generally) learn from the data. The idea is that the model learns directly and is unbiased by the designer of the model. Neural networks are the best known type of implicit model. These generally need to be used in conjunction with some kind of (possibly informal) model of the system. Idea: use existing data to train the network, then use the trained network to make predictions.
Neural network. [Figure: a network with an input layer (Input #1, Input #2, Input #3, Input #4), a hidden layer, and an output layer (Output).]
Training a neural network
1. Initialise network architecture.
2. Initialise weights.
3. For each training input:output pair, adjust the weights.
4. If the overall error exceeds some delta, go to step 3.
5. Test on validation set. If the result is not good enough, go to step 1.
6. Finished (i.e. use trained neural network).
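
The loop in steps 2 to 5 can be sketched as follows. To keep the example short it trains a single linear unit with the delta rule rather than a multi-layer network, so it is a simplification of the procedure, not the full thing. The function name `train_linear_unit`, the learning rate, the value of delta, and the data are all assumptions chosen for illustration.

```python
import random


def train_linear_unit(pairs, learning_rate=0.05, delta=1e-6,
                      max_epochs=10_000, seed=0):
    """Train output = w*x + b on (input, target) pairs with the delta rule."""
    rng = random.Random(seed)
    # Step 2: initialise weights (small random values).
    w = rng.uniform(-0.1, 0.1)
    b = rng.uniform(-0.1, 0.1)
    total_error = float("inf")
    for _ in range(max_epochs):
        # Step 3: adjust the weights for each training input:output pair.
        for x, target in pairs:
            error = target - (w * x + b)
            w += learning_rate * error * x
            b += learning_rate * error
        # Step 4: keep going while the overall error exceeds delta.
        total_error = sum((t - (w * x + b)) ** 2 for x, t in pairs)
        if total_error <= delta:
            break
    return w, b, total_error


# Illustrative training data drawn from y = 2x + 1.
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, total_error = train_linear_unit(training_pairs)

# Step 5: test on a validation pair that was not used in training.
validation_pairs = [(4.0, 9.0)]
validation_error = sum((t - (w * x + b)) ** 2 for x, t in validation_pairs)
print("w =", w, "b =", b, "validation error =", validation_error)
```

In a real network, step 1 (choosing the architecture) and the "go to step 1" branch of step 5 would wrap this loop in an outer search over architectures.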
Neural networks for prediction. What are the dangers here? "However, there are specific aspects of appraisal work which pose specific problems for the utilization of MRA in these types of contexts. In this regard, small sample size as well as the difficulty in obtaining sales information due to Texas being a non-disclosure state where tax payers are not required by law to reveal what they paid for their property are major obstacles to the typical larger samples needed for MRA. I have heard that ANN (artificial neural networks) are not encumbered by these factors." The quote is from an email I received asking for my advice.
Prediction and NNs. Neural networks will always make a prediction, and the prediction may look quite sensible. But: is it the right answer? Has the NN been appropriately trained? Is it the right NN? Is it the right type of NN? Generally, one breaks up the available data into three disjoint sets: a training set, a cross-validation set, and a test set. One trains up the system repeatedly, and checks each network with the cross-validation set. Then one tests all the networks with the test set.
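
A minimal sketch of splitting data into three disjoint sets is shown below. The 60/20/20 proportions, the function name `three_way_split`, and the fixed random seed are assumptions for illustration; the slides do not prescribe particular proportions.

```python
import random


def three_way_split(items, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle items and cut them into training, cross-validation and test sets."""
    shuffled = list(items)
    # Shuffle a copy so that any ordering in the original data is removed.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # Everything left over becomes the test set.
    test = shuffled[n_train + n_validation:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

Because the three slices come from one shuffled list, no item can appear in more than one set, which is what keeps the final test honest.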
Big Data and Models. Data sets are used to constrain models. For explicit mathematical models, this means adjusting parameters so that the model fits the data. The fit will never be exact, so some form of approximation or error minimisation is required, e.g. minimising the sum of the squares of the error. For other types of model, there may be specific techniques: error correction in neural networks is a good example of this.
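
As a worked illustration of minimising the sum of the squares of the error, take the simple linear model y = ax + b. The notation here is an assumption introduced for the example: N data points (x_i, y_i), the error function E, and the means \bar{x} and \bar{y}.

```latex
% Sum of squared errors for the linear model y = a x + b
E(a, b) = \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)^2

% Setting both partial derivatives to zero
\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{N} x_i \bigl( y_i - a x_i - b \bigr) = 0,
\qquad
\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{N} \bigl( y_i - a x_i - b \bigr) = 0

% Solving the two equations gives the least-squares parameters
a = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a \bar{x}
```

For this model the minimum can be written down in closed form; for most other models, including neural networks, the error has to be reduced iteratively.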
In Conclusion. Be careful when using the word model, because it has many meanings. Data models describe the structure of data in general, and in Big Data applications this can be quite complex. Implicit and explicit models describe systems, and can be constrained (adapted, trained) by data. Getting the model right (or at least not too wrong) can make a big difference to predictions from data.