A data pipeline for PHM data-driven analytics in large-scale smart manufacturing facilities

P. O'Donovan 1,2, K. Leahy 1,2, D. Og Cusack 2, K. Bruton 1,2, D. T. J. O'Sullivan 1,2

1 IERG, University College Cork, Ireland
2 Department of Civil and Environmental Engineering, University College

peter_odonovan@umail.ucc.ie

ABSTRACT

The term smart manufacturing refers to a future state of manufacturing, in which the real-time transmission and processing of information across the factory will be used to produce advanced manufacturing intelligence that can optimize every aspect of its operation. In recent years, initiatives and groups such as the Smart Manufacturing Leadership Coalition (SMLC), Industry 4.0, and the Industrial Internet Consortium (IIC) have led the way by bringing together industry, academia and government to establish policies, roadmaps and platforms to support smart manufacturing efforts. Although there are many characteristics associated with smart manufacturing, a common thread between these initiatives is the emphasis on the need for operations to transition from reactive and responsive, to predictive and preventative. The research presented in this paper focuses on the development of a data pipeline that supports the development of data-driven Prognostics and Health Management (PHM) applications throughout the factory. In the context of smart manufacturing, PHM enables facilities to transition from preventative and reactive maintenance strategies for machinery to highly optimized predictive, preventative and condition-based maintenance. The benefits of PHM are closely aligned with those identified by smart manufacturing leaders, which include the opportunity to decrease maintenance costs, increase machine availability and uptime, reduce energy consumption, and improve production yield.
However, the process of ingesting, cleaning and transforming the multiple real-time data sources needed for data-driven PHM systems is a difficult, complex and time-consuming task, with estimates from business intelligence projects ranging from 80% to 90% of overall project effort. This effort is only exacerbated in manufacturing environments due to additional technology challenges, such as low levels of standardization, disparate protocols and interfaces, and a lack of data contextualization. While emerging embedded technologies such as Cyber Physical Systems (CPS) and the Internet of Things (IoT) can be used to overcome many of these challenges by providing an open platform for transmitting data, existing large-scale manufacturing facilities that are subject to compliance, regulation and stringent quality assurance policies will not be able to undertake a wholesale adoption of these technologies in the short to medium term, due to the prohibitive cost, risk and effort associated with new technology adoption. Therefore, to realize PHM systems that provide a high level of coverage across the factory, a highly transparent data integration strategy that does not discriminate between new and legacy technologies must be employed. In this paper, we present a scalable, robust and fault-tolerant data pipeline for ingesting, cleaning, transforming, processing and contextualizing temporal data in near real-time from a wide range of data sources across the factory floor.

1. INTRODUCTION

The term smart manufacturing refers to a paradigm that involves the transmission and sharing of real-time information across pervasive networks, with the aim of creating manufacturing intelligence throughout every aspect of the factory (Davis, Edgar, Porter, Bernaden, & Sarli, 2012a; Lee, Lapira, Bagheri, & Kao, 2013; Lee, 2014; Wright, 2014). Experts predict that smart manufacturing will become a reality over the next 10 to 20 years.
The objective of smart manufacturing is similar to manufacturing intelligence in the sense that it focuses on the transformation of large amounts of manufacturing data into knowledge that can positively affect decision-making, which in turn can have a positive impact across the organisation. However, smart manufacturing supersedes manufacturing intelligence in its emphasis on the real-time collection, aggregation and sharing of knowledge across physical and computational processes to facilitate seamless operating intelligence (Manufacturing et al., 2011). In general terms, smart manufacturing can be considered an intensified application of manufacturing intelligence, where every aspect of the factory is monitored, optimised and visualised (Davis et al., 2012a) - and as these real-time and pervasive technologies permeate the factory floor, the facility can be transformed into a smart enterprise.
While smart technologies facilitate the creation of knowledge, there is a dependency on the workforce to use this knowledge in a manner that has a positive impact. However, just as a technological transformation must take place within each facility and organisation, smart manufacturing will also require a transformation across the workforce. The demands placed on the workforce in a smart manufacturing facility will not be confined to vertical operations, and will therefore require a strong multi-disciplinary perspective. Many of the technologies and systems are visible at enterprise level, where the potential impact of a decision can be evaluated in the context of the entire organisation, in contrast to the vertical and department-level decision-making that is synonymous with traditional manufacturing. Without adopting a holistic approach to decision-making, it is difficult to envisage how challenges such as demand-driven and intelligent production, real-time data management, system interoperability, and cyber security can be addressed for smart manufacturing (Manufacturing et al., 2011). As a result, many individuals in the workforce will need to develop a high-level knowledge of multiple disciplines related to their main area of expertise, including engineering, computing, analytics, design, planning, automation, and production, to name a few (Meziane, Vadera, Kobbacy, & Proudlove, 2000; Sharma & Sharma, 2014).

1.1. Groups and initiatives focused on smart manufacturing

There are a number of government, academic and industry groups that are promoting an awareness of smart manufacturing. These initiatives include the Smart Manufacturing Leadership Coalition (SMLC) (Manufacturing et al., 2011), Technology Initiative SmartFactory (Zuehlke, 2010), Industry 4.0 (Lee, Kao, & Yang, 2014), and the Industrial Internet Consortium (IIC), to name a few.
These initiatives have formed from the realisation that the scope of smart manufacturing is simply too large for any single organisation to manage, and while the terminology used by some of the initiatives may differ slightly, they share an overarching vision of smart factories that utilise intelligent real-time pervasive networking and computing to produce significant amounts of manufacturing intelligence and efficiencies. At present, the two most prominent smart manufacturing initiatives found in research are the SMLC and Industry 4.0, each loosely related to its geographical origins, the US and the EU respectively. The SMLC working group differs from other initiatives in a couple of ways. Firstly, it comprises numerous academic institutions, government agencies and industry partners. This blend of partners enables the SMLC to identify real problems by consensus, which may insulate its roadmap and recommendations from hidden agendas and biases that would not benefit the wider manufacturing community. Secondly, the SMLC has not only developed technology roadmaps, recommendations and guidelines for smart manufacturing, it is also actively involved in the development of a smart manufacturing platform that facilitates the implementation of these ideas. Another prominent smart manufacturing initiative is Industry 4.0, a high-tech strategy developed by the German government to promote an awareness of smart manufacturing and its future economic impact. The term Industry 4.0 is a simple naming convention that serves to partition each industrial revolution, with 4.0 referring to an anticipated fourth revolution that will occur in the future. At present, expert opinions differ regarding a realistic timeline for the realisation of Industry 4.0, with estimates ranging from 10 to 20 years. Using Industry 4.0 terminology, previous industrial revolutions are predictably labelled 1.0, 2.0 and 3.0.
Industry 1.0 was brought about by the introduction of mechanical production using water and steam power, with the first mechanical loom used in 1784. Industry 2.0 was brought about by the division of labour and the realisation of mass production, which were largely facilitated by the use of electrical energy; the first assembly line was introduced in a Cincinnati slaughterhouse in 1870. Finally, Industry 3.0 was brought about by advances in electronics and IT systems, which enabled greater automation of production using control networks, with the first programmable logic controller (PLC), the Modicon 084, introduced in 1969.

1.2. Benefits of smart manufacturing

In contrast with traditional manufacturing, smart manufacturing is based on pervasive networking and intelligent data-driven applications that are highly integrated and flexible. These technologies are used to facilitate demand-driven production, from the ingestion of raw materials to the delivery of a finished product through an optimised supply chain. Smart manufacturing also addresses many business and operating challenges, such as increasing global competition and rising energy costs, while also creating shorter production cycles that can respond quickly to customer demand (Manufacturing et al., 2011; Sharma & Sharma, 2014). In addition to these well-known efficiencies, more quantifiable measures are also available. For example, the SMLC provide numerous performance targets for different facets of smart manufacturing; some high-level targets include (1) a 30% reduction in capital intensity, (2) up to a 40% reduction in product cycle times, and (3) an overarching positive impact across energy, emissions, throughput, yield, waste, and productivity, to name a few. Another report, produced by the Fraunhofer Institute and Bitkom, highlights the potential economic benefit of Industry 4.0 to the German
economy. In particular, it is suggested that the transformation of traditional factories to Industry 4.0 factories could be worth a cumulative figure of 267 billion euros to the German economy alone by 2025 (Heng, 2014).

1.3. Impediments to smart manufacturing adoption

While the potential high-level benefits of smart manufacturing are clear, there are numerous challenges and issues that must be overcome. In particular, factories must develop the infrastructure and network-intensive real-time technologies that provide the foundations for smart manufacturing, as well as cultivating a multi-disciplinary workforce and next-generation IT department that are capable of leveraging smart technologies to realise efficiencies across the enterprise (Manufacturing et al., 2011). Of course, the degree to which these challenges are evident in any particular facility will vary. In particular, there is an obvious difference between the implementation challenges encountered by greenfield and brownfield sites (Davis, Edgar, Porter, Bernaden, & Sarli, 2012b). Apart from obvious challenges, such as budgetary constraints, technology availability and the presence of a skilled workforce, greenfield sites can adopt smart manufacturing without significant impediments. In contrast, brownfield sites can be significantly more restricted by the existence of legacy devices, information systems and protocols, which in some cases may also include proprietary technologies. More to the point, many of these legacy technologies are from a time when low-latency distributed real-time networks, or large-scale data storage and processing, were simply not a concern. In some scenarios, these legacy technologies can be replaced with smarter equivalents, but there are numerous reasons why substitution may not be a viable option:

Historical investment in IT and automation. Many manufacturing facilities have invested in IT systems and automation networks over the past 40 years.
Therefore, there may be a reluctance to replace technologies that have received significant investment and continue to function at an appropriate level.

Regulatory and quality constraints. In certain industries, such as pharmaceuticals and medical devices, internal or external constraints may exist in the form of regulatory or quality standards. In such cases, the existence of exhaustive processes and procedures may negate the enthusiasm for replacing legacy technologies.

Dependency on proprietary systems or protocols. Although numerous open standards exist for manufacturing information systems and automation networks, such as ISA95 for system interoperability and OPC for device-level communication, they are not adopted by all facilities. Therefore, when proprietary and closed technologies are used in place of open standards, the adoption of new (i.e. smart) technologies is somewhat limited by the vendor's offerings.

Weak vision and insufficient commitment. The transition to smart manufacturing is a significant undertaking, and one that requires strong leadership and a shared vision of the short- and long-term benefits to the organisation. More to the point, it stands to reason that organisations that do not have a clear vision of how smart manufacturing will improve their business are less likely to have an appetite to replace equipment and technologies that are currently in use.

High risk and disruption. The implementation of new technologies and systems is considered a high-risk undertaking. More specifically, new technologies and systems can negatively impact operations while personnel are developing a competency with them. Therefore, the appetite to undertake large-scale IT projects, such as smart manufacturing, may remain weak until lost opportunities affect the organisation's competitiveness.

Skills and technology awareness.
Traditional IT, automation and facilities departments are entrenched in mature computing, automation and networking methods, which have been in existence for decades. However, smart manufacturing technologies represent a significant shift from these approaches. Indeed, if internal departments do not possess an understanding of these new technologies, yet actively contribute to the organisation's technology roadmap, their lack of technology awareness may impede, or at least slow down, implementation efforts.

In summary, there are a number of potential impediments surrounding the introduction of smart manufacturing technologies, with the majority relating to brownfield sites where technology replacement is an issue. In these situations, the challenge will be to encapsulate and integrate legacy technologies in a manner that enables them to be used with newer smart manufacturing technologies, methodologies and roadmaps. If this cannot be realised, organisations will be restricted to incomplete implementations of smart manufacturing, which will significantly restrict their ability to realise many of the smart manufacturing performance targets. In this paper we focus on a subset of smart manufacturing, namely Prognostics and Health Management (PHM),
which comprises methods for detecting and predicting equipment faults with the intention of promoting equipment uptime and availability (Bruton et al., 2014; Bruton, Coakley, O'Donovan, Keane, & O'Sullivan, 2013; Lee, Bagheri, & Kao, 2015; Lee et al., 2013; O'Donovan, Leahy, Bruton, & O'Sullivan, 2015; Wright, 2014). In particular, we highlight the operational requirements and system architecture that are needed to support data-driven PHM applications in highly regulated industrial environments, where the need to incorporate legacy devices and systems can represent a genuine barrier to the creation of next-generation smart manufacturing tools and applications.

2. RESEARCH METHODOLOGY

The methodology employed in this research is that of an embedded study, undertaken in DePuy Ireland - a large-scale manufacturing facility that is part of the Johnson & Johnson family of companies. The aim of the research was to identify the requirements and an architecture that could support the development of PHM applications in the facility, with an emphasis on reducing expensive and time-consuming activities, such as ad hoc data integration, while also improving data accessibility and reusability for the applications consuming the data.

2.1. Establishing scope

As there are many potential applications of PHM and industrial analytics, the first priority was to establish boundaries for the research. After initial discussions between the research team and the automation team in DePuy Ireland, we narrowed the focus of the project by agreeing to the following constraints.

2.1.1. Type of PHM applications

We agreed that the research should focus on data-driven PHM applications that deal with predictive and intelligent equipment maintenance. Given the nature of manufacturing, equipment uptime and availability can have an obvious impact on operations.
Therefore, the development of a solution capable of streaming data directly to applications that are focused on promoting uptime and availability was considered a worthwhile pursuit.

2.1.2. Regulation and compliance

Due to the research being undertaken in a highly regulated and quality-focused large-scale manufacturing facility, the scope of the research was also narrowed in this regard. As the adoption of emerging technologies is not easily achieved in such environments, it was agreed that an emphasis should be placed on enabling legacy technologies, as well as supporting the amalgamation of legacy and emerging technologies. This was seen as a positive decision, as it enabled the research to focus on solving a real challenge for facilities of this type, rather than producing a solution that was only applicable to facilities that could ensure a number of pre-requisites were met (e.g. that a facility has 100% coverage of smart sensors).

2.1.3. Time-series data

Given the focus of this research on data-driven PHM applications for equipment maintenance, we agreed that data ingestion, processing and accessibility would be limited to time-series data/measurements. Based on the research team's experience, as well as input from the automation team in DePuy Ireland, time-series data are the main type of data used in decision-making related to equipment maintenance. Limiting and characterising the type of data enabled us to use the predictable structure of time-series data (e.g. time/value pairs) to limit the number of cleaning and transformation permutations.

2.1.4. Data flow and direction

The main focus of the research is to provide a mechanism for collecting and transmitting data from the factory floor. Therefore, it was agreed that this research would only consider scenarios where data is transmitted one-way (i.e. to the consuming PHM application) and is immutable.
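To make the time-series scoping decision concrete, the sketch below shows the kind of canonical time/value normalisation that this predictable structure permits. It is an illustrative sketch only; the field names and accepted formats are assumptions for demonstration, not an inventory of the facility's actual sources.

```python
# Illustrative sketch: normalising heterogeneous raw readings into the
# canonical time/value pairs assumed by the scoping decision above.
# Field names ("timestamp", "ts", "val", ...) are hypothetical examples.
from datetime import datetime, timezone

def normalise(record: dict) -> tuple:
    """Map a raw reading (e.g. from a PLC log or OPC tag) to an (ISO 8601, float) pair."""
    ts = next(record[k] for k in ("timestamp", "time", "ts") if k in record)
    if isinstance(ts, (int, float)):          # Unix epoch seconds
        ts = datetime.fromtimestamp(ts, tz=timezone.utc)
    else:                                     # assume an ISO 8601 string
        ts = datetime.fromisoformat(ts)
    value = next(record[k] for k in ("value", "val") if k in record)
    return ts.isoformat(), float(value)

print(normalise({"ts": 0, "val": "21.5"}))
# -> ('1970-01-01T00:00:00+00:00', 21.5)
```

Restricting ingestion to this one target shape is what bounds the number of cleaning and transformation permutations: only the input representations vary, never the output.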
Although this research does not consider providing a two-way channel for consuming PHM applications to send data back to the factory to initiate actions or controls, these applications could implement their own methods to send instructions back to the factory if they chose to do so.

2.1.5. Industrial data integration

While legacy integration is a central part of this research, due to the number of ad hoc and proprietary systems that exist in industry, as well as the diverse number of standards-based protocols, we agreed that our initial integration efforts would be limited to log files from Programmable Logic Controllers (PLC) and Manufacturing/Building Systems, OLE for Process Control (OPC), Modbus, and BACnet. These data sources should provide an adequate level of coverage across modern large-scale manufacturing facilities.

2.1.6. Industry collaboration

To understand the specificities of the technologies in DePuy Ireland, and to gain a greater appreciation of the manufacturing operations in the facility, we participated in internal team meetings for automation, energy, big data
and smart manufacturing. These meetings were used to identify data sources and industrial protocols that were relevant to enabling PHM applications in the factory.

Automation - the automation team consisted of eight staff, with skills covering control, production, energy and information technology. The automation team was particularly useful for identifying the infrastructure surrounding production and the characteristics of these systems, as well as gaining an insight into current scheduling and maintenance strategies.

Energy - the energy team comprised five staff, with skills in the fields of engineering and energy. The energy team provided us with an understanding of how equipment energy use is monitored, and helped us identify scenarios where energy fluctuations may be linked to malfunctioning equipment.

Big Data and Smart Manufacturing - there is no single team currently responsible for big data and smart manufacturing. Therefore, we interacted with multiple teams and personnel to ascertain how emerging technologies, such as Internet of Things (IoT) and big data technologies, were being used in the factory, and to identify plans to integrate them in the future.

2.2. Research questions

We identified two research questions to direct our research efforts. The purpose of the first question (RQ1) was to establish the real-world data integration requirements for large-scale industrial environments, with a specific emphasis on establishing requirements that are not considered by traditional data integration (e.g. business intelligence tools). The second question (RQ2) used these requirements to inform an open and accessible architecture for data ingestion, processing and management.

2.2.1. RQ1 What requirements and characteristics are important to large-scale manufacturing facilities when it comes to data integration methods?
This question focuses on establishing the requirements and characteristics that are needed to support the development of data-driven PHM applications in real-world large-scale manufacturing environments, with a particular emphasis on facilitating transparent data flows across the entire factory, which is aligned with the vision of smart manufacturing.

2.2.2. RQ2 How can a data pipeline serve data-driven PHM applications using legacy and emerging technologies in an indiscriminate manner?

This question considers whether a data pipeline could be used to provide a framework for PHM applications for equipment maintenance. In particular, it builds on the results from RQ1 to develop an architecture that could meet these requirements without being over-dependent on a single factory-level protocol, or indeed, on smart technologies.

3. RESULTS AND DISCUSSION

3.1. RQ1 Requirements and characteristics

The following requirements and characteristics were identified in response to RQ1, as described in our methodology. While these characteristics were derived specifically through discussions focusing on the practicalities of data management for PHM applications, they should also be broadly representative of the challenges associated with industrial data integration in general.

3.1.1. Legacy integration

Not all facilities will be in a position to adopt emerging and smart technologies to avail of the intelligent systems promised by smart manufacturing. Based on our observations, it is likely that many organisations have invested time and resources in building sophisticated control and automation networks, and it may not prove beneficial in the short term to replace these with smarter equivalents.
Similarly, while many organisations may be aware of the high-level benefits of emerging technologies and paradigms, such as big data analytics, they may not necessarily have a definitive strategy for integrating them within their organisation, or indeed for the practicalities of obtaining the skillsets and technical understanding needed to derive benefits from them. Therefore, as organisations will most likely want to leverage existing investments, skills, knowledge, vocabulary and systems while transitioning to and integrating smarter technologies, being able to support legacy systems is a key requirement.

3.1.2. Cross-network communication

While smart manufacturing promotes seamless data transmission across pervasive networks, the reality is that existing networks in manufacturing environments are far from pervasive. Most notably, we discovered that many of the devices and much of the equipment to which we needed access were separated by corporate firewalls and other security measures. In addition, we also encountered equipment (e.g. wind turbines) to which we had limited access due to
managed maintenance and support agreements with external vendors. These measures may make sense from a security or business perspective, but they certainly represent a challenge to the vision of data-driven smart manufacturing. Therefore, to provide coverage for data integration across a facility, it is important to be able to communicate across secure networks. However, we recognise that this requirement is subject to additional investigation to ensure security protocols are not violated, a topic that is beyond the scope of this paper.

3.1.3. Fault tolerance

It is typical for high demands to be placed on information systems and technologies that play a role in production, automation or maintenance, as these systems can directly impact a facility's performance in terms of product yield and operational efficiency. Based on our observations, it is reasonable to assume that systems deployed in industrial environments must be highly available, while also being reasonably fault tolerant. Therefore, this level of performance and fault tolerance should be considered an important requirement for all systems that are connected with decision-making within the factory.

3.1.4. Extensibility

The presence of proprietary or ad hoc technologies and systems in existing large-scale manufacturing facilities is somewhat unavoidable. While in recent years facilities have become much more savvy in relation to technology integration and consolidation, there is still a certain amount of duplication and disparity across systems. Therefore, a key requirement stemming from our observations, and from discussions with staff who deal with cross-system data integration and consolidation, is the ability to extend the system to incorporate new types of data and protocols.

3.1.5. Scalability

As the digitisation of the factory accelerates, the ability for systems to scale seamlessly is becoming increasingly important. Although technology is a central component of existing facilities, the real-time and data-rich environment described by smart manufacturing will quickly expose the limitations of existing systems, especially those that are unable to scale across additional computing resources (e.g. cloud and distributed solutions). For example, the normal interval for data measurements that we observed during this research was 15 minutes. Even without the addition of new sensors, reducing the measurement interval of the existing facility from 15 minutes to 1 second would increase data production by a factor of 900. Therefore, the ability to scale seamlessly based on demand is an important requirement.

3.1.6. Data accessibility

Traditional large-scale manufacturing facilities already produce quite a lot of data. However, based on our observations, it appears to be difficult to consume this data due to its diverse structure and inaccessibility. To overcome these issues, expensive data discovery and integration must be undertaken, which can result in data-intensive projects being overlooked due to the known cost associated with data management. Those projects that are undertaken tend to develop integration routines that are highly proprietary, which restricts others from using the data in the future, and ultimately leads to the same expensive integration work being repeated in the next project. Therefore, eliminating low-level data integration by providing a consistent means of accessing data from sources across the factory is an important requirement.

3.2. RQ2 Data pipeline architecture

Given the requirements identified in RQ1, we constructed a high-level architecture of a data pipeline that could be used to supply data-driven PHM applications with temporal data from equipment across the facility. Figure 1 outlines the architecture of the data pipeline, with each stage of a factory-to-cloud workflow numbered and highlighted. The overarching objective of the data pipeline is to provide a low-cost and turnkey solution for data integration, built on a highly scalable and fault-tolerant infrastructure that can be used by maintenance and monitoring applications across the factory. In the following sections, the purpose and function of each component and stage in the data pipeline is described.
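The interval arithmetic behind the scalability requirement above can be checked directly, using the figures from the example (a 15-minute versus a 1-second measurement interval):

```python
# Worked version of the scalability example: moving from one reading
# every 15 minutes to one reading per second.
old_interval_s = 15 * 60   # 900 seconds between readings
new_interval_s = 1
factor = old_interval_s // new_interval_s

readings_per_day_old = 24 * 3600 // old_interval_s
readings_per_day_new = 24 * 3600 // new_interval_s
print(factor, readings_per_day_old, readings_per_day_new)  # 900 96 86400
```

A 15-minute interval contains 900 seconds, so per-sensor data production grows 900-fold, from 96 to 86,400 readings per day, before any new sensors are even added.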
Figure 1. Data pipeline architecture and workflow
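The decoupling at the heart of Figure 1 - ingestion engines publishing to a message queue that independent processing components consume asynchronously - can be sketched minimally as follows. This is an illustrative in-process simulation using a thread-safe queue, not the cloud-based services themselves; the sample readings are invented.

```python
# Illustrative sketch of the Figure 1 decoupling: an ingestion engine (Stage 2)
# pushes readings onto a message queue (Stage 3), while a processing component
# (Stage 5) consumes them independently of the ingestion rate.
import queue
import threading

message_queue = queue.Queue()
processed = []

def ingestion_engine():
    # Stage 2: collect readings from a local source and transmit them.
    for reading in [("2015-06-01T00:00:00", 21.5), ("2015-06-01T00:15:00", 21.7)]:
        message_queue.put(reading)

def processing_component():
    # Stage 5: blocking get() lets this side run at its own pace.
    for _ in range(2):
        processed.append(message_queue.get())

t1 = threading.Thread(target=ingestion_engine)
t2 = threading.Thread(target=processing_component)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(processed))  # 2
```

Because neither side calls the other directly, a slow consumer never blocks ingestion, which is the resilience property the pipeline relies on under peak demand.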
3.2.1. Stage 1 - Site manager

Purpose: The site manager resides on a cloud server that acts as a central information repository for all components deployed on the site. Its purpose within the architecture is to persist essential site information in a central location, such as information relating to the data sources in the factory that should be ingested.

Functions: The site manager has multiple functions that are directly related to the factory: (1) store details relating to the site, such as the type and location of the local data sources to be ingested, (2) schedule and assign jobs to ingestion engines in the factory based on the availability and location of each node, and (3) derive a suitable amount of data to ingest for each node based on its current location, bandwidth, CPU and memory availability.

3.2.2. Stage 2 - Ingestion process

Purpose: The ingestion engines are distributed software agents that are deployed autonomously across multiple networks in the factory to collect temporal data from the sources that are necessary to support the factory's data-driven PHM applications (e.g. HVAC, chillers, boilers). These engines run as background applications deployed on local servers; they continually communicate their status to the site manager (Stage 1) and, when instructed, collect data from local sources to transmit to the cloud. As illustrated in Figure 1, the ingestion engines' distributed and autonomous nature enables them to be deployed in isolation on different networks that are separated by firewalls and/or geographical boundaries, while also providing the option to scale the process to handle more data, or to improve latency, by adding new ingestion engines.
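A hypothetical sketch of the Stage 1/Stage 2 interaction follows: engines report their capacity, and the site manager assigns each data source to the least-loaded engine on the same network. All field names, figures, and the selection rule here are illustrative assumptions, not the paper's actual scheduling algorithm.

```python
# Hypothetical sketch: the site manager assigning an ingestion task based on
# the status reports (network location and free capacity) of each engine.
engines = {
    "engine-a": {"network": "plant-floor", "cpu_free": 0.7, "mem_free_mb": 512},
    "engine-b": {"network": "plant-floor", "cpu_free": 0.2, "mem_free_mb": 256},
}
data_sources = [{"name": "chiller-1", "network": "plant-floor"}]

def assign(source):
    # Only engines on the source's network (i.e. inside the same firewall)
    # are candidates; pick the one reporting the most free CPU.
    candidates = {name: status for name, status in engines.items()
                  if status["network"] == source["network"]}
    return max(candidates, key=lambda name: candidates[name]["cpu_free"])

print(assign(data_sources[0]))  # engine-a
```

Restricting candidates to the source's own network reflects the cross-network requirement: an engine behind the relevant firewall does the collecting, so only its outbound connection to the cloud must traverse network boundaries.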
Functions: The ingestion engine has multiple functions: (1) communicate location, bandwidth, CPU and memory availability to the site manager so that an appropriate ingestion task can be assigned, (2) interpret ingestion tasks sent by the site manager and automatically extract temporal data from the relevant sources in accordance with the task parameters (e.g. data within a particular date range), and (3) transmit the collected temporal data to the cloud-based message queue. A novel aspect of the ingestion engine is an expert ruleset that automatically maps and extracts common representations of temporal data, thereby limiting the manual task of mapping data sources.

3.2.3. Stage 3 - Message queue

Purpose: The cloud-based message queue is a highly available and distributed service that accepts temporal data from the ingestion engines deployed in the factory. Its main purpose is to act as an intermediary storage area between the factory floor and the processing components in the data pipeline. Decoupling the data ingestion process from the data processing components embeds a level of resilience in the pipeline by removing the dependencies of synchronous processes, which enables these components to operate asynchronously when the pipeline is under peak demand.

Functions: The message queue has two main functions: (1) notify the subscription service when new data has been ingested from the factory, and (2) add the received data to a queue so that it may be used by components later in the pipeline.

3.2.4. Stage 4 - Subscription service

Purpose: The subscription service operates as a notification service between the endpoint for data ingestion (i.e. the message queue) and the data processing components that transform raw data into a state suitable for turnkey analysis (e.g. creating a new data set by extracting certain features).
It serves to separate the message queue from the components that process the data, which enables these bandwidth- and compute-intensive operations to run independently.

Functions: The subscription service's functions are relatively limited, but they are extremely important in linking the chain of events undertaken in the pipeline: (1) listen to the message queue for new data, and (2) notify subscribers that new data are available for processing.

3.2.5. Stage 5: Data processing

Purpose: The processing components are responsible for transforming the temporal data into a form that is useful for analysis. The processing undertaken at this stage is intended to remove the onus on individual PHM applications to perform expensive processing and aggregation of high-resolution data. At the most basic level, this processing includes the aggregation of data to provide different levels of granularity, such as hourly, daily, monthly and annual averages for all parsed data, while more sophisticated processing may build new data sets using deductive logic. Examples of more involved processing include the execution of expert rules to build a new data set of faults, or the encoding of temporal data in a semantic format (e.g. Project Haystack) to promote interoperability with supported applications. Each processing component in the architecture is responsible for a single use case, such as those previously mentioned. Therefore, as new processing requirements and standards arise that cannot be fulfilled by existing components, a new component can be added by simply deploying it and subscribing to the subscription service.

Functions: The functions associated with data processing are diverse. This is the one area of the architecture that enables custom functionality, given that the processing and preparation of data are likely to vary somewhat from
application-to-application, and factory-to-factory. However, it is our intention to add to the default processing components as we discover common use cases. By default, the data pipeline performs the following aggregations on temporal data: (1) daily average, (2) monthly average, and (3) annual average.

3.2.6. Stage 6: Data access

Purpose: The purpose of the data access stage is to expose a consistent and open method for PHM applications to consume data from different equipment in the factory. More specifically, the data access stage simply serves the output of the data processing stage to the PHM applications that need the data. To enforce consistency, a simple naming convention is used to give context to the temporal data collected from across the facility. This convention is simply an encoded URL that returns the appropriate data based on the context of the URL request. The convention is illustrated in Figure 1 and explained in more detail in Table 1:

Table 1. URL convention for data requests

  Data:    This parameter refers to a particular data set (e.g. energy).
  Object:  An identifying name or code that exists within the data set (e.g. machine number).
  Year:    The year for the data request.
  Month:   The month for the data request.
  Day:     The day for the data request.

Functions: The functions of the data access components are to (1) ensure data are stored in the appropriate location/context, and (2) respond to requests for data that adhere to the aforementioned convention.

3.3. Alignment of architecture with requirements

This section provides a brief summary that aligns the requirements discussed in RQ1 with the proposed architecture described in RQ2.

Table 2. Relationship between requirements and architecture

Legacy integration: Stages 1 and 2 in the architecture are responsible for legacy integration. The site manager is used to create meta-data for each data source in the factory. This information is then used by the expert rules in the ingestion engine to collect data independent of the underlying source.

Cross-network communication: Stages 2 and 3 in the architecture make it possible to communicate across networks. The ingestion engines can be considered remote autonomous agents, which can operate in isolation on different networks once they can communicate with the cloud queue service.

Extensibility: Stages 2, 5 and 6 illustrate the main points of extensibility in the architecture. First, the data ingestion instructions are dynamically disseminated from the site manager, which can be modified to support new types of data sources. Second, the data processing components are modular in nature, allowing new processing routines to be added easily. Finally, the data interface returns the results of processing tasks, which means the type of data served by the interface can be changed by updating the appropriate processing routines.

Scalability: Stages 1-6 all embrace the requirement of scalability. First, the distributed and autonomous design of the ingestion engines enables operations to run in parallel and scale with the number of data points in the factory. Second, the cloud-based queue, notification and storage services are inherently scalable. Finally, all cloud-based components (e.g. data processing) avail of the load balancing and auto-scaling features of the cloud to dynamically distribute and scale workloads based on compute demand.

Data accessibility: Stage 6 is responsible for promoting data accessibility. The architecture utilises a cloud storage repository to serve pre-processed views of data, which provides geographically distributed users and applications with low-latency access to time-series data. Furthermore, the adoption of open standards and protocols (e.g. HTTP) promotes interoperability with 3rd party applications, which can use an intuitive and descriptive naming convention for data requests.

4. CONCLUSIONS AND FUTURE WORK

In this paper, we have addressed the main challenges and desirable characteristics associated with data collection in industry, such as automating and simplifying data ingestion, embedding fault-tolerant behaviour within the system, promoting scalability to handle large quantities
of data at a low cost/effort, embracing the need to extend and adapt the system based on emerging requirements, and harmonising data access to promote ease of access for consuming applications.

There is no doubt that emerging technologies that enable the pervasive and seamless transmission of data, such as IoT, will eventually eliminate the need for legacy integration. However, given that 20-year-old PLCs are still operating in modern manufacturing facilities today, we would suggest that the research community should be conservative when estimating the speed at which large-scale industrial facilities will be fully equipped with smart technologies. Therefore, to enable these facilities to begin their smart manufacturing journey, and to derive the benefits of predictive and optimised operations, they must first address the issue of transparent data integration across the factory.

ACKNOWLEDGEMENTS

The authors would like to thank the Irish Research Council, DePuy Ireland and Amazon Web Services for their funding of this research, which is being undertaken as part of the Enterprise Partnership Scheme (EPSPG/2013/578).

REFERENCES

Bruton, K., Coakley, D., O'Donovan, P., Keane, M. M., & O'Sullivan, D. (2013). Development of an Online Expert Rule based Automated Fault Detection and Diagnostic (AFDD) tool for Air Handling Units: Beta Test Results. In ICEBO – International Conference for Enhanced Building Operations. Montréal, Canada.

Bruton, K., Raftery, P., O'Donovan, P., Aughney, N., Keane, M. M., & O'Sullivan, D. T. J. (2014). Development and alpha testing of a cloud based automated fault detection and diagnosis tool for Air Handling Units. Automation in Construction, 39, 70–83. doi:10.1016/j.autcon.2013.12.006

Davis, J., Edgar, T., Porter, J., Bernaden, J., & Sarli, M. (2012a). Smart manufacturing, manufacturing intelligence and demand-dynamic performance. Computers & Chemical Engineering, 47, 145–156.
doi:10.1016/j.compchemeng.2012.06.037

Davis, J., Edgar, T., Porter, J., Bernaden, J., & Sarli, M. (2012b). Smart manufacturing, manufacturing intelligence and demand-dynamic performance. Computers and Chemical Engineering, 47, 145–156. doi:10.1016/j.compchemeng.2012.06.037

Heng, S. (2014). Industry 4.0: Huge potential for value creation waiting to be tapped. Deutsche Bank Research. Retrieved from http://www.dbresearch.com/servlet/reweb2.reweb?rwsite=dbr_internet_en-PROD&rwobj=ReDisplay.Start.class&document=PROD0000000000335628

Lee, J. (2014). Recent Advances and Transformation Direction of PHM, 1–31.

Lee, J., Bagheri, B., & Kao, H. (2015). A Cyber-Physical Systems architecture for Industry 4.0-based manufacturing systems. Manufacturing Letters, 3, 18–23. doi:10.1016/j.mfglet.2014.12.001

Lee, J., Kao, H.-A., & Yang, S. (2014). Service Innovation and Smart Analytics for Industry 4.0 and Big Data Environment. Procedia CIRP, 16, 3–8. doi:10.1016/j.procir.2014.02.001

Lee, J., Lapira, E., Bagheri, B., & Kao, H. (2013). Recent advances and trends in predictive manufacturing systems in big data environment. Manufacturing Letters, 1(1), 38–41. doi:10.1016/j.mfglet.2013.09.005

Smart Manufacturing Leadership Coalition. (2011). About the Smart Manufacturing Leadership Coalition.

Meziane, F., Vadera, S., Kobbacy, K., & Proudlove, N. (2000). Intelligent systems in manufacturing: current developments and future prospects. Integrated Manufacturing Systems, 11(4), 218–238. doi:10.1108/09576060010326221

O'Donovan, P., Leahy, K., Bruton, K., & O'Sullivan, D. T. J. (2015). Big data in manufacturing: a systematic mapping study. Journal of Big Data, 2(1), 20. doi:10.1186/s40537-015-0028-x

Sharma, P., & Sharma, M. (2014). Artificial Intelligence in Advance Manufacturing Technology: A Review Paper on Current Application, (1), 4–7.

Wright, P. (2014). Cyber-physical product manufacturing. Manufacturing Letters, 2(2), 49–53. doi:10.1016/j.mfglet.2013.10.001

Zuehlke, D. (2010). SmartFactory: Towards a factory-of-things. Annual Reviews in Control, 34(1), 129–138. doi:10.1016/j.arcontrol.2010.02.008