White Paper

A Real-time Data Hub for Smarter City Applications

Intelligent Transportation: Innovation for Real-time Traffic Flow Analytics with Dynamic Congestion Management
Understanding traffic flow and reducing congestion is an ongoing challenge for the transportation industry. Adoption of new technology is helping, for example by providing more frequent and detailed data from cheaper GPS devices and networks of wireless sensors. With approximately 800 million vehicles on the world's roads today, estimated to increase to four billion by 2050, reducing congestion will require operational systems and web application architectures not only with the performance to process a vast volume of sensor data, but also with the capability to automate responses to changing conditions in real time. This paper discusses the emergence of new Big Data technologies, and how these platforms can be harnessed to deliver real-time, actionable operational intelligence from streaming sensor, GPS and smartphone data.

SENSOR BIG DATA STREAM PROCESSING: A DRIVER FOR CHANGE

Streaming analytics with real-time data integration can improve an organization's ability to transform raw sensor data into actionable information. The challenge is to deliver a scalable real-time architecture that integrates sensor data with existing systems, and to deliver real-time information that can be acted on with confidence. Continuous, real-time integration also enables duplicated systems, databases, and data warehouses to be rationalized; with fewer systems, organizations can operate at lower cost while running a more effective, real-time IT platform. Detecting an event such as a traffic incident as it happens is important, but it is better to predict congestion in advance and to deliver the right information so that avoiding action can be taken. This could include dynamic updates to speed signs and automatic adjustment of traffic light phasing.
Today's traffic management systems have the capacity to monitor some aspects of the road network in real time, but are unable to scale to complete network coverage, or to deliver accurate avoidance information and real-time predictive analytics. This next stage requires a step change in both approach and the underlying software technology.
REAL-TIME SMART SERVICES

Smart Services go beyond the kinds of upkeep and upgrades a company may practice internally and towards its customers. To create them, intelligence is needed: awareness, connectivity and real-time analytics alongside the products and services offered. Most importantly, action needs to be taken according to what the production/buying cycle reveals about itself: Smart Services are the result of systems taking intelligent actions in real time.

WHAT MAKES SMART SERVICES SMART?

Smart Services are completely different from offerings based on traditional stored-data technology:

Predictive, rather than reactive. The flurry of data coming from different sources and integrated in real time with historic data provides actual evidence that a piece of equipment is about to fail, that inventories are too low for the weekend, or that traffic will delay a delivery.

Reliable. For customers and users, Smart Services add the value of removing unpleasant surprises. Companies can now calculate product performance and customer behaviour in real time, and focus their targeting strategies with unprecedented accuracy.

Efficient. In a Smart Services environment, machines and devices do what they are very good at doing: producing and digesting billions of data points, talking to one another about the data, and controlling one another based upon the state of the data, all in a matter of milliseconds. Humans cannot do this, nor should they; this continuous stream of business information is the feed for today's stream processing technologies.

Provided this wealth of information is exploited when it is produced and, more importantly, when it is needed, managers and decision makers can gain much more visibility into a business's assets, costs, and liabilities, and decide on the right triggers for automation.
Big Data Technologies

The term Big Data is used to describe datasets whose volume, velocity, and variety are beyond the ability of traditional database and data management systems to capture, store, manage, and analyze. The emergence of streaming Big Data technology is focused on the challenge of managing high-volume, high-velocity streams of data, transforming them into actionable information, and responding in a timely, predictable, and reliable manner. The emergence of Big Data technology has also been made possible by other factors, such as Cloud computing and the availability of much cheaper server hardware and storage platforms.

Big Data is not just about volume and velocity. In fact, the volume of data created each year exceeds the world's global storage capacity, and the rate of increase in data creation is faster than the rate at which storage capacity is expanding.

The transportation industry is at the forefront of the next generation of sensor network management and the exploitation of streaming Big Data analytics. This raises two questions. First, how can the transportation industry exploit its Big Data? Second, if more data is being created than can be processed by existing, traditional database-oriented systems, how can a Big Data asset be exploited?

BIG DATA is defined by Volume, Velocity and Variety.
The data management technologies underpinning Big Data and streaming data processing are not new. As illustrated in Figure 1, the initial model for data storage was the sequential data model, where data was stored as a sequence of records with indexed access. The sequential model evolved into the hierarchical model for record databases, where complexity was managed by storing data in hierarchies, for example IMS from IBM. Next came the major step forward in the form of the relational model, which originated from IBM's System R project. This project also produced the SQL language, a high-level declarative model that has become the standard for data management. SQL is a specification of the problem to be solved that lets the underlying execution platform determine the most efficient way to compute the answers, including automatic optimization and management of distributed processing.

So how has the history of data storage and management influenced Big Data technology? First, the original indexed sequential file model has returned in the form of name:value storage systems that underpin static Big Data storage platforms. Second, and more importantly, the SQL model is now being adopted as the de facto query language for Big Data management.

Figure 1: The Evolution of Big Data Technology

Big Data storage platforms are based on an open source software ecosystem called Hadoop, managed by the Apache Software Foundation. Hadoop was designed to overcome the two main limitations of traditional RDBMS technology for processing Internet data: the ability to manage data with different formats and structures on the same platform, and the ability to scale out over multiple servers for massively high performance. The core Hadoop parallel processing infrastructure was the enabler for new types of NoSQL ("Not Only SQL") databases that allow data management to be distributed across many hundreds or even thousands of commodity servers, all processing in parallel.
However, as Big Data storage technology matures, SQL is now increasingly being added to the Big Data management portfolio for its querying power and ease of application building.
The evolution of real-time, data-in-motion technologies has paralleled that of static data management. The first effective data communication mechanism between systems was based on the concept of sockets, where each socket is essentially a logical address to which applications can send data and from which other applications can read it. This evolved into messaging middleware: software responsible for the real-time, reliable delivery of data between multiple communicating applications. Prior to the emergence of Big Data storage technology and Hadoop, stream processing technology was already emerging to address the requirement for processing real-time data. Stream processing also provides the capability to analyze and aggregate data on the fly, as the data are created and before they are stored. As with Big Data storage platforms, SQL also emerged as the data management language for streaming data processing, that is, SQL as a continuous query language over streaming data. The Big Data industry has evolved to maximize the capability of existing technologies for both streaming and static data analysis.

The volume, velocity and variety of sensor and other data in the transportation industry is sufficient to cause significant business and operational issues. Typical metrics given for Big Data problems include:

Volume. The volume of data created in 2009 was 800 Exabytes, forecast to grow by 40% per year.

Velocity. Industry and sensor technology are at the forefront of Big Data velocity requirements. Large-scale GPS applications now exceed 1 million events per second; telecommunications and IP-based (Internet Protocol) service monitoring applications require the capacity to process many tens of millions of events per second.

Variety. It is estimated that approximately 80% of all data in an organization is unstructured. This includes emails and documents.
Big Data in the context of the transportation industry

It is interesting to compare the scale of data processing capacities across different industries in order to better understand where Big Data technologies come into play. A Big Data system has come to be classified in terms of its ability to address the 3Vs of volume, velocity and variety, a definition originally attributed to Gartner Research.

Data Volume. The total of GPS sensor data, fixed road sensor data, social media feeds, weather data, telematics and other location-based traveller information may exceed many terabytes of data per day, which must both be processed in real time and stored for historical analysis.

Data Velocity. Large-scale telematics applications, for example, can deliver Vehicle-to-Infrastructure (V2I) events at rates of many millions of records per second. The average car generates between 5 and 250 gigabytes of sensor data an hour. Even if only a small percentage of that data is transmitted back through local access technologies, the core data processing platform must operate at the required speed for the complete network. To put this in perspective, Twitter messages during major sporting events peak at approximately 10,000 tweets per second.

Data Variety. Unstructured and semi-structured data from sensors and from social media feeds such as Twitter are increasing, and pose a significantly greater challenge for data processing and integration than conventional structured and sensor data.
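A back-of-envelope calculation makes these velocity figures concrete. The fleet size, reporting interval and record size below are illustrative assumptions for the sketch, not figures from a real deployment:

```python
# Illustrative sizing for a GPS probe-vehicle feed.
# Fleet size, reporting interval and record size are assumptions.
fleet_size = 2_000_000          # probe vehicles reporting position
report_interval_s = 10          # one GPS report per vehicle every 10 seconds
event_size_bytes = 200          # rough size of one position record

events_per_second = fleet_size / report_interval_s
bytes_per_day = events_per_second * event_size_bytes * 86_400

print(f"{events_per_second:,.0f} events/s")        # 200,000 events/s
print(f"{bytes_per_day / 1e12:.1f} TB/day")        # 3.5 TB/day
```

Even at a modest one report per vehicle every ten seconds, a city-scale fleet sustains hundreds of thousands of events per second, which is why the ingest layer, not storage, is usually the first bottleneck.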
A BIG DATA STREAM PROCESSOR FOR REAL-TIME TRAFFIC MANAGEMENT

Real-time traffic management from streaming Big Data requires real-time monitoring, traffic analytics and automation, as well as the integration of traditional data sources:

Per-segment historical speed/travel-time comparisons, for example comparing current conditions in real time with the same period yesterday, last week or even last year

Combining GPS sensor data streams in real time with roadside camera and signal data, to achieve improved accuracy for congestion and travel time predictions

Integration with roadside variable speed signs, providing dynamic adjustment of per-segment speed limits in order to respond in real time to changing traffic conditions

Driver and user access to the application, providing, for example, real-time travel times through smartphones and GPS devices

Extending the travel time application to provide end-to-end journey times across multiple transportation modes, including heavy vehicle, rail, bus and ferry networks.

However, the issues with operational traffic management platforms can be summarized as a lack of business integration across operational silos, an inability to manage the increasing volume and velocity of data, and a lack of real-time analysis and forecasting over the live sensor data feeds. The function of a streaming data management platform is to address these issues by offloading the real-time analysis and continuous integration while retaining the existing operational systems.
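The per-segment historical comparison described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory baseline table keyed by segment and hour of day; the segment IDs, threshold and readings are all hypothetical:

```python
# Sketch: flag congestion by comparing live segment speeds against a
# historical baseline for the same hour. All names and data are illustrative.
HISTORICAL_BASELINE = {            # (segment_id, hour_of_day) -> typical km/h
    ("M1-seg42", 8): 95.0,
    ("M1-seg43", 8): 60.0,
}

def congestion_alerts(live_readings, threshold=0.7):
    """Yield an alert for each segment running below threshold * baseline."""
    for segment_id, hour, speed_kmh in live_readings:
        baseline = HISTORICAL_BASELINE.get((segment_id, hour))
        if baseline and speed_kmh < threshold * baseline:
            yield (segment_id, speed_kmh, baseline)

live = [("M1-seg42", 8, 40.0),     # well below the 95 km/h baseline: alert
        ("M1-seg43", 8, 55.0)]     # within 70% of baseline: no alert
alerts = list(congestion_alerts(live))
```

In a production platform the baseline would itself be streamed out of the historical store and joined with the live feed, rather than held in a static dictionary.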
Stream processing is a paradigm for the continuous processing and transformation of real-time, dynamic data. Sensors are the most common source of streaming, real-time data, but any static data source, such as a log file or database, can be instrumented using Change Data Capture (CDC) adapters to transform new updates into real-time streams. Change Data Capture refers to the ability to detect that source data has changed and to capture the new data in real time, as it is created; for example, capturing new records appended to a log file or new rows written to a database table. The resulting streams of processed data and analytics are output to real-time dashboards in the operations center, traveller smartphone applications, and transportation agency website applications, and pushed simultaneously to the existing database and data warehouse systems.

The traditional approach to loading data into databases and data warehouses is referred to as Extract, Transform and Load (ETL). With a streaming data platform, data are aggregated on the fly (that is, analyzed as they arrive, before being persisted to a database or Big Data storage platform) and delivered in real time using continuous ETL operations. This eliminates the high-latency, slow-update overhead that is commonplace with traditional batch-based ETL solutions.

As illustrated in Figure 2, a streaming data management platform enables multiple applications to be deployed on a single core platform. Applications include data cleansing, monitoring and alerting, integration and visualization. Each application has access to any or all of the arriving data streams simultaneously. The platform assembles the streams that each application has requested, effectively sharing all the streams across any number of applications. Applications process the data streams using relational Views, where each View is generated by a continuously executing SQL query.
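The Change Data Capture idea for a log file can be illustrated with a minimal sketch: remember the offset of the last read, and emit only the records appended since then. A production CDC adapter would also handle file rotation and database transaction logs; this sketch assumes a simple append-only file:

```python
# Minimal file-based Change Data Capture sketch: remember the last read
# offset and emit only the records appended since then.
import os
import tempfile

def capture_new_lines(path, offset):
    """Return (new_records, new_offset) for lines appended after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        records = [line.decode().rstrip("\n") for line in f]
        return records, f.tell()

# Usage sketch with a temporary log file standing in for a sensor log.
log = tempfile.NamedTemporaryFile("w", delete=False, suffix=".log")
log.write("event-1\n"); log.flush()
records, pos = capture_new_lines(log.name, 0)      # -> ["event-1"]
log.write("event-2\n"); log.flush()
records, pos = capture_new_lines(log.name, pos)    # -> ["event-2"], old rows skipped
log.close()
os.unlink(log.name)
```

Each call returns only the delta, which is exactly what gets published onto the stream; the rest of the file is never re-read.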
Streaming SQL queries are active queries that execute over the live streaming data while the data are still in flight, without having to store the data first. Data and information are pushed out continuously to external systems as streams of results and processed data.

Figure 2: Data Stream Management Platform as a Real-time Data Hub for streaming applications
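The behaviour of such a continuous query can be mimicked in ordinary code. The sketch below is a hand-rolled tumbling-window average, not a real streaming SQL engine: it consumes an unbounded stream and pushes one result downstream each time a window closes, rather than returning a single result set and exiting:

```python
# Sketch of a continuous query: a tumbling-window average over a stream,
# emitting one result per window instead of one result per query.
def windowed_average(stream, window_size=3):
    """Yield the mean of each consecutive window of `window_size` readings."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size
            window = []

speeds = iter([60, 62, 64, 40, 42, 44])     # illustrative speed readings
results = list(windowed_average(speeds))    # -> [62.0, 42.0]
```

A streaming SQL engine expresses the same logic declaratively (a windowed GROUP BY over a stream) and never terminates; here the generator ends only because the example stream is finite.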
Real-time Platform for Traffic Flow Analysis with Dynamic Congestion Management

The architectural paradigm for streaming data management has both similarities with and significant differences from traditional Relational Database Management Systems (RDBMS). Both RDBMS and streaming data management platforms are based on the industry-standard SQL language; however, a traditional RDBMS must first store the data (data are persisted) before the data can be queried and analyzed. The primary differences between the two paradigms are described in Table 1.

                   Relational Streaming Platform            RDBMS
Query Duration     Queries execute continuously             Queries complete and exit
Query Scope        Queries over arriving data               Ad-hoc queries over stored data
Query Federation   Stream processing distributed            Processing executed centrally over a single
                   in-memory over many server nodes         in-memory or disk-based repository

Table 1: Streaming and traditional RDBMS SQL query comparison

In summary, in a streaming data management platform, the arriving data streams are processed in memory before the data are stored. Processing data entirely in main memory is significantly faster than processing stored data held on disk, as it eliminates the performance bottleneck of data retrieval. The same RDBMS SQL queries can be deployed as streaming queries on a streaming data management platform. However, unlike an RDBMS query that always completes and returns a fixed data set, a streaming data SQL query is continuous and executes forever.

Figure 3: Real-time traffic management solutions from streaming GPS data with Twitter overlay

THE ADOPTION OF BIG DATA AND REAL-TIME STREAM PROCESSING IS THE FUTURE TECHNOLOGY FOR TRANSPORTATION NETWORK MANAGEMENT.
Streaming data management is a complement to traditional RDBMS-based solutions. Both share the concept of a data model centered on processing relational rows, queries, and views. Both share common data manipulation and definition languages standardized as SQL. They are able to share a common security model, application programming interfaces such as JDBC (Java DataBase Connectivity), and a common representation of metadata. A streaming data processor is based on predetermined queries executing continuously over arriving data, while an RDBMS is used for ad hoc queries over historical, stored data, processing each query until it terminates. Transactional processing is supported in both: in an RDBMS, transactions mark the start and end of an update operation; in a relational streaming platform, transactions delineate the arrival and delivery of data.

The next generation of data processing architectures must utilize the strengths of the different data management technologies: traditional RDBMS for data warehousing and Master Data Management (MDM), Big Data and NoSQL technologies for offline, batch-based pre-processing, and streaming data integration with in-memory analytics as the real-time, intelligent integration fabric across all operational systems.

STREAMING BIG DATA IS THE COMPLEMENT TO TRADITIONAL SYSTEMS.
Planning an ideal architecture is not just about applying new technology. It is also important to understand how to integrate new technology with existing platforms and systems. This includes augmenting what is already working while offloading the real-time performance and streaming Big Data management bottleneck. The core of a real-time enterprise architecture is a streaming data processing platform capable of high-volume, high-velocity data acquisition and continuous integration of data across all existing operational silos. The stream processing platform operates on the arriving sensor data before the data are stored, providing fast in-memory analytics, geospatial analysis, and predictive alerts over the data as it streams past into the offline systems. In summary:

Low-latency, real-time traffic information. Eliminate latency at all stages: data acquisition, real-time analysis, real-time forecasting, and predictive analytics, in order to deliver real-time actionable intelligence that can be utilized by transportation agencies to address congestion and incidents as they occur, and by commuters using smartphone apps for real-time traffic and route information.

Eliminate data silos through continuous, real-time integration. Streaming integration (continuous ETL) of data from sensors, historic data, applications and databases enables wider visibility and better automation of key operational processes.

Combine real-time and historical trend information for optimum real-time decision-making. Databases, data warehouses, and data historians contain mined data and trend information that can be streamed out and joined with the real-time arriving data in order to separate business-as-usual events from real business exceptions.

Collect all data sources. Augment existing information by streaming and joining data of all types, including GPS, Bluetooth, in-road sensors, Twitter and social media, weather data and traveler location information.
Solution architecture for real-time operations. It is currently difficult to share data sources across different applications in real time when the data is held in many different vertical business and operational silos. A real-time data hub built on a stream processing platform eliminates this redundancy through continuous ETL and supports the real-time applications needed for reducing congestion and delivering a better traveler experience.