EMERGING FRONTIERS AND FUTURE DIRECTIONS FOR PREDICTIVE ANALYTICS, VERSION 4.0

ELINOR L. VELASQUEZ

Dedicated to the children and the young people.

Abstract. This is an outline of a new field in predictive analytics: topological-geometrical-analytic-algebraic predictive analytics, a cellular-based data analytics field. The notion here is an old one: cellular in the biological sense, an essential biological theme that biologists often refer to as form and function. Mathematics intertwines with bio-mimicry to reform the foundation of what has long been a simple, yet elegant prediction theory: the theory encapsulated by the deceptively modest Central Limit Theorem. Later papers will focus on specifics and applications of this rather simple notion.

Date: June 2, 2015.

Thank you to all the library staff and administrative personnel of the University of California, The San Francisco State University, The City College of San Francisco, The Public Libraries of San Francisco, and the civil workers of San Francisco for providing such a welcoming and pleasant environment to create such innovative technical research.

Thank you to all the members of the Bioinformatics Department, University of California, Santa Cruz for providing such ongoing stimulating conversation and encouragement.

1. Introduction

Predictive analytics, a subfield of data analytics, estimates outcomes of future events using tools based on probability and statistics. While algorithm design is developing rapidly in information science as an outcome of machine learning, mathematical modeling, and data mining, the actual foundation for making a prediction remains the same: the premise of prediction rests essentially on the Central Limit Theorem.

What is typically predicted using the Central Limit Theorem? Here is a toy example. Consider predicting if Patient X will have the flu given their body temperature. The standard method is to obtain a list of patient body temperatures via a sampling of flu patients' temperatures, to estimate a sample mean $T$ of body temperature, and to measure Patient X's temperature, $T_X$. The Central Limit Theorem gives us a confidence or probability as to how likely it is that Patient X has the flu:

\[ P(\mu - \mu_0 < T_X < \mu + \mu_0) = P(a < T < b) = p, \]

meaning there exists a probability, $p$, that Patient X will have the flu, given temperature $T_X$. To make this prediction, one needs a dataset of flu patients, each with their own unique health pathology, as well as a population mean, $\mu_0$, of flu patient body temperature, derived from another dataset. Even if a quantity of data is bootstrapped in order to create a theoretical population flu patient temperature dataset, the fact remains that the prediction really rests on the sample mean, $T$, simply a single numerical value. In fact, the whole population of flu patients may be viewed as a large universe equipped with a data invariant, namely the canonical mean defined as the flu temperature. If the set of data invariants is enlarged to more than one value or estimate, an increase and improvement in the possible methods of prediction ought to occur.

The key message here is to examine a given dataset, a subspace of the universe of all information, denoted as $\Omega$, and to look for other useful data invariants, to improve predictive estimates. Thus, one roadmap for creating novel predictive analytics methodologies is quite clear:

Step 1. Consider novel data invariants for a given dataset.
Step 2. Consider how we use these data invariants, both individually and collectively, to create new predictions for a scenario.
Step 3. Consider novel ways to estimate the accuracy of these predictions that we have made via our novel collection of data invariants for any given subset.

2. Introduction Of A New Field: Topological-Geometrical-Analytical-Algebraic Predictive Analytics

While the Central Limit Theorem has worked reasonably well for most applications using small data sets, the current and future needs of data analytics applications that apply information theory concepts require more complex rationales because of Big Data constraints. Big Data has unusual data landscapes, often not well understood until supercomputer algorithms have been slowed or halted by nuances in the data that were not previously apparent, or even considered, prior to the data processing. Big Data has hills, mountain ranges, gullies and valleys, not just the data meadows common to small data; this is a well recognized fact. What is not well understood is how to manage such unusual geographical territory.

In the days of small data and static or deterministic dynamics assumptions, the computer processing power and capability required for standard technical applications were indeed trivial. Supercomputing algorithms and Big Data together create a new world: The Big World. It is necessary to focus significant attention on expanding what have long been accepted mathematical foundations in data analysis, meaning it is important to create novel themes in statistical and mathematical modeling, in addition to enlarging current work in well established areas, such as nonlinear (both deterministic and stochastic) dynamics. For instance, applying stochastic dynamics in areas such as climate theory to study climate data has been done innumerable times. What is most needed, however, is a better understanding of how Big Data will affect not just possible outcomes arising via standard stochastic dynamics modeling, but also the novel ways of thinking which naturally result when predictive analytics is applied to what is really an old problem: What happens when the climate moves in completely chaotic directions? Simple regression applied to practically any historical climate data predicts that the earth is warming and global change is happening, but neither simple nor even nonlinear regression can realistically predict how global water movement will change as time increases, especially when regression is confronted with Big Data. For such practical applications, innovative, artful ways of creating non-standard modeling are simply de rigueur.
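For reference, the Central Limit Theorem baseline behind the toy flu example of the Introduction can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not a method taken from this outline: the function names `clt_mean_interval` and `prob_sample_mean_in`, the 95% confidence level, and the temperature values are all hypothetical, chosen purely for illustration. The second function evaluates the quantity $P(a < T < b) = p$ for a sample mean $T$ under the normal approximation.

```python
import math
from statistics import NormalDist, mean, stdev


def clt_mean_interval(temps, confidence=0.95):
    """Approximate confidence interval (a, b) for the mean temperature.

    Uses the CLT normal approximation: the sample mean is treated as
    normally distributed with standard error s / sqrt(n).
    """
    n = len(temps)
    if n < 2:
        raise ValueError("need at least two temperature readings")
    t_bar = mean(temps)
    std_err = stdev(temps) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return t_bar - z * std_err, t_bar + z * std_err


def prob_sample_mean_in(a, b, mu0, sigma, n):
    """P(a < T < b) for the mean T of n readings, under the CLT.

    mu0 and sigma are the assumed population mean and standard deviation.
    """
    sampling_dist = NormalDist(mu0, sigma / math.sqrt(n))
    return sampling_dist.cdf(b) - sampling_dist.cdf(a)


if __name__ == "__main__":
    # Hypothetical flu-patient temperatures in degrees Celsius (made up).
    flu_temps = [38.1, 38.6, 39.0, 38.4, 38.8, 38.3, 38.9, 38.5]
    a, b = clt_mean_interval(flu_temps)
    t_x = 38.7  # hypothetical reading for Patient X
    print(f"interval for the mean: ({a:.2f}, {b:.2f})")
    print("T_X inside interval:", a < t_x < b)
```

Whatever the sample looks like, the whole prediction is driven by one summary value, the sample mean, together with its standard error. This is the single data invariant that the roadmap proposes to enlarge.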
Thus, it is necessary to reconsider all the standard ways of prediction, in order to extrapolate and move beyond what used to be simply called forecasting. It is necessary to reconsider what exactly the Central Limit Theorem is, and why it in particular became the foundation for basic hypothesis testing. Certainly there are other ways of arriving at a prediction. The opening up of an entirely new world in prediction theory becomes apparent when approached via this mindset.

Cellular predictive analytics, at first glance, may be modeled using a topological-geometrical-analytical-algebraic predictive analytics approach, for envisioning new ways of predicting outcomes, in hopes of deriving more optimal solutions to the most complicated of problems. Ideally, the journey does not sit still on this particular path only, but easily segues in directions not in any presently imagined universe. This offering of a prediction theory based on a theme so common in bio-mimicry, that of modeling how Nature behaves, can feed forward and feed back to other more desirable, sophisticated ways of thinking, e.g. what is presently considered the basis of knowledge.
2.1. Rationale for This New Theory: Topological-Geometrical-Analytical-Algebraic Predictive Analytics Equals Cellular Predictive Analytics - Form and Function.

2.2. Justification for Why Such a New Theory is Needed: Precision Medicine Predictive Analytics.

3. The Foundational Theory

Recall that $\Omega$ is the universe of all information. Equip the universe, $\Omega$, with the laws of physics, allowing the concept that certain laws will be discovered as time evolves, once time begins. Note that the universe could easily have been conceived or equipped using other laws or other concepts. This is just one representation.

Definition 1. The Physics Laws Representation of the Universe of All Information, $\Omega$. Let time be unidirectional, so that time evolves only in the forward direction. Denote the Canonical Ghost-child map, or GC map, by $Q$; the Present Theory of Everything Entropy, or TOE Entropy, by $S$; and the canonical universal measure, or UM, by $\mu$. The universe, $\Omega$, as represented by the laws of physics, has the following definition:

\[ [Q, \hat{Q}] = iS, \]

with $\hat{Q}$ denoting the Fourier transform of $Q$, and the laws of physics encoded by

\[ \int_C [Q, \hat{Q}] \, d\mu = 0. \]

What is key here is that information can be parsed into components, and it is the components and their interactions that are to be studied in this theory. Each component has an exceptionally well defined structure, which allows the data to be reformulated in a manner more tractable and amenable to prediction. Each of the canonical ingredients is then reformulated in an explicit way in each specific component.

Remark: In later versions of this outline, a discussion will be provided so that the foundational theory becomes explicit regarding the structure of these components and how and when they are used. At present, the foundational theory is completely esoteric.
4. Description of Cellular Predictive Analytics: The Cellular Components

Currently, two fields helpful for data analysis are topological data analysis and information geometry. These two fields are also being adapted to work on predictive analytics problems.

4.1. Current field: Topological Data Analysis.

4.2. Emerging field: Topological Predictive Analytics.

4.2.1. Current Topological Invariant for Topological Data Analysis: The Persistence Diagram.

4.2.2. Proposal for Canonical Topological Invariant for Predictive Analytics: The Canonical Knot Polynomial. Consider the Kauffman bracket as the best canonical knot polynomial at present. The question is: can we do better?

4.3. Current field: Information Geometry.

4.4. Emerging field: Geometric Predictive Analytics.

4.4.1. Proposal for Canonical Geometric Invariant for Geometric Predictive Analytics: The Canonical Curvature. Open question: Is there a coordinate-free definition for the canonical curvature, or are coordinates needed for the best representation?

4.5. New Field: Analytic Predictive Analytics.

4.5.1. Proposal for Canonical Analytic Invariant for Analytic Predictive Analytics: The Canonical Automorphic Form.

4.6. New Field: Algebraic Predictive Analytics.

4.6.1. Proposal for Canonical Algebraic Invariant for Algebraic Predictive Analytics: The Canonical Determinant.
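Of the invariants listed in this section, the persistence diagram of Section 4.2.1 can be made concrete with very little machinery. The following is a minimal sketch, assuming a finite point cloud in Euclidean space and a Vietoris-Rips-style filtration indexed by pairwise distance; both are illustrative assumptions, not choices made in this outline. It computes only the 0-dimensional diagram (connected components), and the function name `zero_dim_persistence` is hypothetical.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the 0-dimensional persistence diagram as (birth, death) pairs.

    Every point is born at scale 0. A connected component dies at the
    pairwise distance at which it merges into another component. The one
    component that never merges is assigned death = infinity.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one dies
            diagram.append((0.0, length))
    if n > 0:
        diagram.append((0.0, math.inf))  # the surviving component
    return diagram


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]  # hypothetical points
    for birth, death in zero_dim_persistence(cloud):
        print(birth, death)
```

In the spirit of Step 1 of the roadmap, the list of death scales is a multi-valued summary of the dataset, in contrast to the single sample mean used by the Central Limit Theorem baseline. Higher-dimensional diagrams require a full boundary-matrix reduction, which this sketch does not attempt.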
4.7. Open Questions.

1. Does there yet exist a discrete version of geometric topology surgery theory?
2. Let $D_G : G \to S_G$ be the deformation map from the group $G$ to the semigroup $S_G$; this map is quite advantageous. The open question is: what representation of $D_G$ is optimal for predictive analytics?