The Big Data Integration and Analytics Revolution in Agricultural Finance, Risk, and Insurance
Joshua D. Woodard
Assistant Professor and Zaitz Faculty Fellow in Agribusiness and Finance
Dyson School of Applied Economics and Management, Cornell University
IARFIC Keynote, June 8th, 2015
Introduction
  Overview, society and data
  The data integration problem
  Ongoing system development efforts
  Challenges and considerations
  Purpose is to provide a high-level overview relevant to the ag finance, risk, and insurance field
The Data Integration Problem
  Analysts typically source data from many different government and nongovernment sources, at different temporal and spatial resolutions
  Relevant data are spread over a wide variety of operational/transaction-based databases, data marts, unstructured text files, etc.
  Sources all have different data storage and formatting protocols, APIs, levels of temporal and spatial aggregation, etc.
  Existing infrastructures cannot be queried jointly, and some cannot be queried at all
  Data are not processed to scales appropriate for most uses
  Typical approach is a one-off for every study:
    At a point in time, download slices of data from several different sources
    Format individual data sets (often by hand or copy/paste) and mash them together (may take days or weeks; not automated, replicable, or documented)
    Perform a one-off analysis
    To expand or update the analysis, the entire process must be repeated by a human
A Fairly Small Sampling
The Data Integration Problem
  This makes it impossible for most to conduct analysis of today's programs, and imposes large costs on agencies and others
  Results in duplication of effort, redundancy, error, and difficulty/waste in sourcing data
  Limits use and usability of data
  Pushes out many potential users
  Makes analyses very difficult to recreate and update
  Renders research and analysis less credible (see recent blunders)
  Data analysis versus management/integration/processing
  Yet, very little focus in the community on building such systems to date
Advantages of Data Warehousing and Integration Systems
  Acts as a clearinghouse for data to support policy makers, oversight, research & development, and business intelligence
  Not a transactional database, but rather used for informational/reporting/research purposes
  Integrated and centralized
  Subject oriented and optimized to give answers to diverse questions
  Data are processed in various ways for a variety of uses; flexible
  Non-volatile, meaning data are never deleted and the warehouse is always growing
  Consistent data storage and formatting protocols within the warehouse (reconciles source data)
AgDB Data Warehousing Overview
  [Diagram: data flow from external sources into the AgDB warehouse and out to clients; labels listed in order]
  Data Chunks
  External Databases, Datastores, Datastreams: RMA, USGS, NRCS, AMS, ERS, PRISM, CME, NASA, NASS, FSA, FAS, etc.
  OLTP Data
  Scheduled Jobs to Download and Extract from Source over Web
  Integration Services: Filter/clean; Preprocessing/Aggregation/Interpolation/Transformation; Load
  Data Auditing & Validation
  AgDB Data Warehouse
  Web Data Services, OLAP, Data Marts
  External Clients
  Web Decision Tools
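The flow in the diagram (scheduled extract, filter/clean, aggregate, load, audit) can be illustrated with a minimal sketch. This is not the AgDB implementation, which the slides say runs on Microsoft SQL Server with SSIS; here an in-memory SQLite database stands in for the warehouse, and the table name, function names, and sample records are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw records as they might arrive from a source download:
# (county FIPS code, date, daily precipitation in mm). Values are made up.
RAW_ROWS = [
    ("17113", "2014-06-01", "12.5"),
    ("17113", "2014-06-02", ""),       # missing reading
    ("17113", "2014-07-01", "3.0"),
    ("36109", "2014-06-01", "7.25"),
    ("36109", "2014-06-01", "7.25"),   # verbatim duplicate
]


def extract():
    """Stand-in for a scheduled download job; returns raw string records."""
    return list(RAW_ROWS)


def clean(rows):
    """Drop rows with missing values and duplicate keys; parse numbers."""
    seen, out = set(), []
    for fips, date, value in rows:
        key = (fips, date)
        if value == "" or key in seen:
            continue
        seen.add(key)
        out.append((fips, date, float(value)))
    return out


def aggregate_monthly(rows):
    """Aggregate daily values to county-month totals."""
    totals = defaultdict(float)
    for fips, date, value in rows:
        totals[(fips, date[:7])] += value
    return sorted((f, m, v) for (f, m), v in totals.items())


def load(conn, rows):
    """Load processed rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS precip_monthly "
        "(fips TEXT, month TEXT, total_mm REAL, PRIMARY KEY (fips, month))"
    )
    conn.executemany("INSERT INTO precip_monthly VALUES (?, ?, ?)", rows)
    conn.commit()


def validate(conn, expected_count):
    """Basic audit: row count matches and no negative totals were loaded."""
    count = conn.execute("SELECT COUNT(*) FROM precip_monthly").fetchone()[0]
    negatives = conn.execute(
        "SELECT COUNT(*) FROM precip_monthly WHERE total_mm < 0"
    ).fetchone()[0]
    return count == expected_count and negatives == 0


def run_pipeline():
    """Run extract -> clean -> aggregate -> load -> validate once."""
    conn = sqlite3.connect(":memory:")
    processed = aggregate_monthly(clean(extract()))
    load(conn, processed)
    return conn, processed, validate(conn, len(processed))
```

Because every step is a function, the whole run can be scheduled and repeated without manual copy/paste, which is the contrast with the one-off approach described earlier.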
Advantages of Data Warehousing and Integration Systems
  Resilient to change, additions, updates
  Data from different sources can be joined, integrated, and queried with low effort
  Wide degree of control and consistency in aggregating, interpolating, and cross-referencing among and between different types of data
  Improved data integrity (auditing, cleaning, validation)
  Increases utility, usability, and access to data
  Results in lower costs and more reliability for analysts and policymakers who use the data
  Overall: save time, save money, increase capabilities
  Facilitates user tools and access to data that farmers, researchers, and policy makers want
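As a rough illustration of joining data from different sources once they share consistent keys, the sketch below uses an in-memory SQLite database. The three tables, their column names, and all values are hypothetical stand-ins for yield, weather, and insurance data keyed by county FIPS code and year; they are not the AgDB schema.

```python
import sqlite3


def build_demo_warehouse():
    """Create three small hypothetical tables keyed by county FIPS and year."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE county_yield (fips TEXT, year INTEGER, corn_yield_bu_ac REAL);
        CREATE TABLE county_precip (fips TEXT, year INTEGER, season_precip_mm REAL);
        CREATE TABLE county_insurance (fips TEXT, year INTEGER, insured_acres REAL);
        """
    )
    # All values below are made up for illustration.
    conn.executemany(
        "INSERT INTO county_yield VALUES (?, ?, ?)",
        [("17113", 2012, 100.0), ("17113", 2013, 180.0), ("19169", 2013, 170.0)],
    )
    conn.executemany(
        "INSERT INTO county_precip VALUES (?, ?, ?)",
        [("17113", 2012, 250.0), ("17113", 2013, 520.0), ("19169", 2013, 480.0)],
    )
    conn.executemany(
        "INSERT INTO county_insurance VALUES (?, ?, ?)",
        [("17113", 2012, 300000.0), ("17113", 2013, 310000.0)],
    )
    return conn


def joined_rows(conn):
    """Join yield, weather, and insurance data on the shared (fips, year) key.

    The LEFT JOIN keeps counties that have no insurance record; their
    insured_acres value comes back as None.
    """
    query = """
        SELECT y.fips, y.year, y.corn_yield_bu_ac,
               p.season_precip_mm, i.insured_acres
        FROM county_yield AS y
        JOIN county_precip AS p
          ON p.fips = y.fips AND p.year = y.year
        LEFT JOIN county_insurance AS i
          ON i.fips = y.fips AND i.year = y.year
        ORDER BY y.fips, y.year
    """
    return conn.execute(query).fetchall()


rows = joined_rows(build_demo_warehouse())
```

The join is a single query only because the warehouse has already reconciled the keys and aggregation levels; against the raw sources the same result would require separate downloads and manual matching.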
Current Efforts, AgDB Data Warehousing System at Cornell
  Genesis of system & motivation for Open Data Warehouse
  Pulls in data from disparate sources and consolidates it in a single repository (primarily various USDA data, but also others)
  Basic ETL:
    Data extraction and/or sourcing
    Extraction, preprocessing, and transformation before loading: filter, transform, integrate, classify, aggregate, summarize
  DBMS: Microsoft SQL Server
    SSIS and other programs for transformation (Python, Matlab, ArcGIS, ArcPy, etc.)
    Built-in spatial libraries, SSDT, etc.
    Other candidates: Oracle, PostgreSQL, MySQL, MongoDB, other custom
    Deployed on CIT servers at Cornell (moving to Azure this summer)
  Data Access:
    Web API: virtually any language or stat program (e.g., Matlab, Python, Excel, STATA, Java)
    Point-and-click interfaces (which also generate API calls for replicability)
    Others: RMA rating calculator API, spot/basis interpolation, Dairy Margin Tool, Grape Vine Cost tool, etc.
    See Forum for code samples
    Web interfaces in development, test site at: http://agfinance.dyson.cornell.edu/agriskmanagement/
  Some qualifiers
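The point about interfaces generating API calls for replicability can be sketched as follows: a query is reduced to a URL that anyone can re-issue later. The slides do not document the AgDB Web API, so the endpoint, dataset name, and parameter names below are purely hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only; NOT the actual AgDB service URL.
BASE_URL = "https://example.org/agdb/api/v1/query"


def build_query_url(dataset, **filters):
    """Build a reproducible query URL from a dataset name and filters.

    Filters are sorted by name so that the same request always yields
    the same URL, which makes it easy to log and re-run.
    """
    params = [("dataset", dataset)] + sorted(filters.items())
    return BASE_URL + "?" + urlencode(params)


# Hypothetical dataset and parameter names.
url = build_query_url(
    "prism_monthly", fips="17113", start_year=1981, end_year=2014, fmt="csv"
)
```

The resulting string could be fetched from any language or statistics package with an HTTP client, and storing it alongside an analysis records exactly which slice of data was used.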
Abridged/Partial Summaries of Major Datasets/Sources Currently in AgDB
Data Source and Item: Description
  IPCC Climate Change Projections: Future temperature and precipitation projections across different emission scenarios and percentiles of the 16 General Circulation Models (GCMs). Raw or spatially processed data.
  National Climatic Data Center Drought Data: Monthly PDSI drought index data available at the climate district level of aggregation. Data are available from 1895 to present, by NCDC District.
  PRISM Climate Group: Monthly and daily historical temperature and precipitation data, as well as GDD/HDD processed data. Monthly data are available from 1895 to present. Daily weather data are available from 1981 to present. 800 meter resolution (raw) and processed by FIPS, Township, and in certain cases CLU (pre-2008) available.
  Chicago Mercantile Exchange: Daily historical futures and options data for agricultural commodities from the Chicago Mercantile Exchange (CME), Chicago Board of Trade (CBOT), and Kansas City Board of Trade (KCBOT). Data are available from 1959 to present, updated daily.
  Risk Management Agency (RMA): Agricultural insurance price and participation data available at the county level of aggregation. Data are available from 1989 to present from Summary of Business. Other data are also loaded from various unstructured text files (including historical discovery prices, GRIP yields, etc.).
  US Census Bureau: County-level and township-level geographical coordinates, land area size, water area size, and population data.
  USDA Economic Research Service (ERS): Annual farm structural and financial data available at state-level aggregation for the 15 Agricultural Resource Management Survey (ARMS) states. Data are available from 1996 to present. Various other datasets are also sourced from the ad hoc ERS tools and APIs.
  USDA Agricultural Marketing Service (AMS): Monthly data on the volume, pricing, and utilization of raw milk received by handlers regulated under Federal milk orders from dairy farmers. All tables in the Public MMO database.
  USDA National Agricultural Statistics Service (NASS): Census and survey data available at regional, state, and county level aggregation. The broad categories of data available are crops, animals and products, economics, demographics, and environmental. Data are available from 1926 to present. Obtained via FTP bulk download from QuickStats. CDL data processed against ready-to-map gSSURGO NRCS data by crop also available (raw and county processed).
  USDA Foreign Agricultural Service: Data on production, supply, and distribution of agricultural commodities for the U.S. and key producing and consuming countries.
  USDA Natural Resources Conservation Service (NRCS): Soil data for the continental US from gSSURGO, raw and processed, available at various levels of aggregation.
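The "GDD processed data" mentioned for the PRISM source refers to growing degree days derived from daily temperatures. The slides do not say which formula or base temperature is used, so the sketch below assumes the simple averaging method with a 10 °C base and no upper cap; the function names and sample temperatures are illustrative only.

```python
def growing_degree_days(tmin_c, tmax_c, base_c=10.0):
    """Daily GDD by the simple averaging method (an assumed variant).

    GDD = max(0, (Tmin + Tmax) / 2 - Tbase), all in degrees Celsius.
    """
    return max(0.0, (tmin_c + tmax_c) / 2.0 - base_c)


def accumulate_gdd(daily_min_max, base_c=10.0):
    """Sum daily GDD over a sequence of (tmin, tmax) pairs."""
    return sum(growing_degree_days(lo, hi, base_c) for lo, hi in daily_min_max)


# Made-up daily (tmin, tmax) readings in degrees Celsius.
sample_days = [(8.0, 20.0), (12.0, 26.0), (2.0, 10.0)]
season_total = accumulate_gdd(sample_days)
```

Other conventions exist, such as capping the daily maximum or using a different base per crop, so a processed GDD series is only comparable across studies when the method is documented alongside the data.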
Applications & Accessing Data
  Applications: virtually anything
    Insurance
    Conservation and Climate Change
    Policy Analysis and oversight
    Farm Bill Program Analysis
    Product Development
  Different tools for different users
    1) Direct DB connect (or bulk download for external users): Matlab, R, Python, or web apps, using standard SQL connections (ODBC, BCP, etc.)
    2) Web API and data services for analysts
    3) Interactive web tools for farmers, consumers of research
  Workflows for web tools
  Enables extension of research and tool development
Soil Rating/Insurance
  [Figure: Soil Productivity Index kernel density among Common Land Units (CLUs), McLean County, Illinois (SSURGO data). Horizontal axis: Soil Productivity Index (IL 810 Circular). Vertical axis: density, 0.00 to 0.07.]
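For readers unfamiliar with the kind of plot in the figure, a kernel density estimate can be sketched in a few lines. The slides do not state which kernel or bandwidth was used, so this assumes a Gaussian kernel with a fixed bandwidth, and the index values are made up rather than taken from SSURGO.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated with a Gaussian kernel.

    f(x) = 1 / (n * h) * sum_i phi((x - x_i) / h), where phi is the
    standard normal density and h is the bandwidth.
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up productivity index values for a handful of hypothetical CLUs.
spi_values = [112.0, 118.0, 121.0, 125.0, 127.0, 130.0, 131.0, 134.0, 138.0, 141.0]
density = gaussian_kde(spi_values, bandwidth=5.0)

# Evaluate on a grid from 90 to 160 and check that the area is close to 1.
step = 0.5
grid = [90.0 + step * i for i in range(141)]
curve = [density(x) for x in grid]
area = sum(v * step for v in curve)
```

With field-level soil data in the warehouse, the same calculation could be repeated for any county, which is what makes this kind of within-county soil comparison practical for rating work.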
Ongoing Efforts and Priorities
  Recently received a Microsoft Azure Research Grant for use of the Azure cloud platform; conversion to the cloud platform is in progress (early to mid-summer)
  Additional datasets, APIs, tools (ongoing)
  Open Source launch (mid-summer)
  Upgraded data portal interface (late summer) for faster and more flexible cataloging/access
  Movement towards and incorporation of NoSQL platforms
  Identify various user needs, partners, and collaborators
Challenges, Policy Considerations, and Opportunities
  Technical and training
  Some degree of learning curve, but frankly minimal
  Technical limitations are eroding quickly; political ones are not
  Expanding purview
  Still inherently a public good, so without intervention will be underprovisioned
  Marginal cost curse
  Coordination within the community
  Improving access to government data (incentives and bandwidth)
  What data are made available or opened up
  How data are made available
  Privacy concerns
  Field is at an interesting vantage point compared to many others given its mix of market, business, environmental, and other natural systems data
Thank you Questions?