Green Big Data: A Green IT / Green IS Perspective on Big Data
Agenda: 1. Starting Point and Research Question 2. Subject of Analysis 3. Research Methodology 4. Results 5. Conclusion
Starting Point and Research Question: Increasing use of the notion of Big Data in science and in practice. Inconsistent understanding, resulting from the extensive use of the notion in marketing contexts. Relevance of Big Data differs among scientific disciplines. Research questions: (1) Are resource-efficient processes for Big Data applications discussed in recent publications? (2) To what extent can aspects of Big Data be applied to EMIS?
The Big Data Concept: Improving the understanding of the notion of Big Data by identifying its characterizing dimensions using a deductive approach.
Selection of Methodology. Problem: Which method can identify whether, and to what extent, a new concept (Big Data) whose characteristics are not yet consistently defined already exists in a given field of research? Traditional literature analysis: manual identification of recent fields of research within the current EMIS / Green IT / Green IS literature. Generative literature analysis: automated identification of recent fields of research within the current EMIS / Green IT / Green IS literature.
Data Basis. Data source: Scopus. Keywords: EMIS, Green IT, Green IS (searched in title, abstract, and keywords). Period under consideration: 2007-2012. Number of resulting documents: 1055. Processed data: abstracts.
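The selection criteria above could be expressed roughly as follows. This is a minimal sketch, not the authors' retrieval pipeline: the record layout (`year`, `title`, `abstract`, `keywords`) and all function names are hypothetical, and combining the three keywords with OR is an assumption, since the slide does not state how they were combined. The query string follows the Scopus advanced-search style (`TITLE-ABS-KEY`, `PUBYEAR`).

```python
import re

# Hypothetical sketch: record layout and helper names are assumptions.
KEYWORDS = ("EMIS", "Green IT", "Green IS")
FIRST_YEAR, LAST_YEAR = 2007, 2012

def build_query(keywords=KEYWORDS, first=FIRST_YEAR, last=LAST_YEAR):
    """Build a Scopus-style advanced search string (keywords joined by OR)."""
    terms = " OR ".join(f'"{k}"' for k in keywords)
    return f"TITLE-ABS-KEY({terms}) AND PUBYEAR > {first - 1} AND PUBYEAR < {last + 1}"

def matches(record, keywords=KEYWORDS, first=FIRST_YEAR, last=LAST_YEAR):
    """Check a locally stored record against the same criteria."""
    if not first <= record["year"] <= last:
        return False
    text = " ".join([record["title"], record["abstract"], " ".join(record["keywords"])])
    # Word boundaries keep "EMIS" from matching inside words such as "emissions".
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def select_abstracts(records):
    """Keep only the abstracts of matching records (the processed data)."""
    return [r["abstract"] for r in records if matches(r)]
```

Calling `build_query()` yields a string that could be pasted into an advanced search form; `select_abstracts` applies the same filter to records that have already been exported.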
Underlying Assumptions of Topic Models. Topics are probability distributions over words. For each document, a probability distribution over the topics it contains can be defined. Each document is represented by a list of words (word vector).
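The three assumptions can be made concrete with a toy example. The sketch below is illustrative only: the vocabulary, the two topics, and the document's topic mixture are invented values, not results from the analysed corpus. It turns a text into a word-count vector and combines a document's topic distribution with the topics' word distributions via p(w|d) = sum over k of p(k|d) * p(w|k).

```python
from collections import Counter

def word_vector(text, vocabulary):
    """Represent a document as word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def word_probabilities(doc_topics, topics):
    """Mix the topics' word distributions using the document's topic weights."""
    n_words = len(topics[0])
    return [
        sum(theta * topic[w] for theta, topic in zip(doc_topics, topics))
        for w in range(n_words)
    ]

# Invented toy values (assumptions for illustration only).
vocabulary = ["energy", "cluster", "emission", "water"]
topics = [
    [0.5, 0.4, 0.05, 0.05],  # hypothetical "computing" topic
    [0.1, 0.0, 0.5, 0.4],    # hypothetical "environment" topic
]
doc_topics = [0.75, 0.25]    # this document is mostly about "computing"

doc_word_probs = word_probabilities(doc_topics, topics)
```

Because each topic and the document's topic mixture sum to one, the resulting `doc_word_probs` is again a probability distribution over the vocabulary.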
Application of Topic Models. Parameter calculation: 1. Defining an a priori distribution over topics. 2. Defining an a priori distribution over words for each topic. 3. Applying Latent Dirichlet Allocation to calculate the latent variables based on a corpus. Corpus: abstracts of the identified publications.
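The three steps above can be sketched in plain Python. The slides do not say which inference procedure or software was used, so this example swaps in collapsed Gibbs sampling, one common way to fit Latent Dirichlet Allocation. The symmetric Dirichlet priors `alpha` (over topics, step 1) and `beta` (over words per topic, step 2), their default values, and the function name `lda_gibbs` are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens per doc and topic
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per topic and word
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    # Step 3: resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = assignments[d][i], index[w]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Point estimates of the latent document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi
```

Here `theta` holds one topic distribution per document and `phi` one word distribution per topic. For a corpus of about a thousand abstracts, an established library implementation would normally be preferable to this didactic version.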
Results and Discussion of the Topic Models. The period 2008-2010 is marked by topics related to the natural sciences. Since 2011, an application-oriented perspective has been on the rise, which implicitly contains the first aspects of the Big Data concept.
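One way to observe such shifts over time is to average the per-document topic distributions by publication year. The slides do not describe how the temporal trend was derived, so the following is only a plausible sketch; the function name and its inputs (`years`, plus a document-topic matrix `theta` as produced by a topic model) are assumptions.

```python
from collections import defaultdict

def topic_share_by_year(years, theta):
    """Average the document-topic distributions of all documents per year."""
    sums = {}
    counts = defaultdict(int)
    for year, row in zip(years, theta):
        acc = sums.setdefault(year, [0.0] * len(row))
        for k, p in enumerate(row):
            acc[k] += p
        counts[year] += 1
    return {
        year: [s / counts[year] for s in acc]
        for year, acc in sorted(sums.items())
    }
```

A topic whose average share grows from one year to the next would then indicate a rising research theme, which is the kind of pattern described above.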
Hadoop: A Green IT Perspective on Big Data
Literature: Goiri et al. (2012), GreenHadoop: Leveraging green energy in data-processing frameworks. Mao et al. (2012), GreenPipe: A Hadoop Based Workflow System on Energy-efficient Clouds.
Hadoop: Based on MapReduce, a framework for distributed computation developed by Google with a focus on scalability. Contains, among other components, a file system (HDFS) and a column-oriented database (HBase), and runs on commodity hardware. Basis for numerous Big Data products.
Application of Hadoop in the field of Green IT: Energy-efficient control of Hadoop clusters. Scheduling of MapReduce jobs according to the availability of green energy.
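The idea of scheduling jobs according to green energy availability can be illustrated with a small greedy heuristic. This is not the algorithm of GreenHadoop or GreenPipe; it is a simplified sketch under stated assumptions: time is split into slots, a forecast of green energy per slot is available, each job runs within a single slot and has a deadline slot, and any number of jobs may share a slot. All names and numbers are hypothetical.

```python
def schedule_jobs(jobs, green_forecast):
    """Place each job in the slot (up to its deadline) with the most green energy left.

    jobs: list of (name, energy_needed, deadline_slot) tuples.
    green_forecast: forecast green energy per time slot.
    Returns (plan, brown_energy): slot per job and energy not covered by green supply.
    """
    remaining = list(green_forecast)
    plan = {}
    brown_energy = 0.0
    # Handle the most urgent jobs first (earliest deadline first).
    for name, energy, deadline in sorted(jobs, key=lambda job: job[2]):
        last_slot = min(deadline, len(remaining) - 1)
        slot = max(range(last_slot + 1), key=lambda s: remaining[s])
        used = min(energy, remaining[slot])
        remaining[slot] -= used
        brown_energy += energy - used  # shortfall is drawn from the regular grid
        plan[name] = slot
    return plan, brown_energy

# Invented example values (assumptions for illustration only).
forecast = [0.0, 2.0, 5.0, 1.0]
jobs = [("index", 3.0, 3), ("report", 2.0, 1), ("backup", 2.0, 2)]
plan, brown = schedule_jobs(jobs, forecast)
# plan == {"report": 1, "backup": 2, "index": 2}, brown == 0.0
```

In this toy run all three jobs are covered by green energy because the flexible jobs are shifted to the slot with the highest forecast. A real scheduler would also have to account for cluster capacity, job duration, and forecast uncertainty.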
Possible Areas of Application for Big Data in the Field of Green IS
Aspects of Big Data cannot be found in Green IS publications so far.
Outlook, possible area of application: Exploitation and utilization of new data sources for calculating environmental impact.
Company-internal: Potential data sources for the development of Petri nets are event logs from ERP systems and sensor data from the production environment.
Company-external: Identification of the environmental impact of upstream supply chain members by using databases such as Ecoinvent. Application of text mining / ontologies for the analysis of unstructured data. Closing the semantic gaps that result from inconsistent denotation standards across different product databases. Incorporating publicly available data sources (Open Data Initiative). Data collection of the Sustainability Consortium within the Open IO projects.
Results and Outlook
Conclusion: The generative approach has proven useful for the analysis of emerging research fields. Big Data cannot be found explicitly in the field of Green IT / Green IS, but has already arrived in the form of Hadoop for Green IT applications, with a focus on the energy-efficient control of Hadoop clusters.
Outlook, data basis: Utilizing further data sources. Discipline- and dimension-specific data gathering (infrastructure, method, etc.).
Outlook, method: Validation of the results using an intrusion approach.
Thank you very much for your attention.