Exploratory Data Analysis: Applying Visualization Tools

Introduction
An economic boom, though inspiring, is often associated with unsustainable development. Because of this, people tend to view economic indicators in isolation from other development indicators. In this assignment, we explore the interconnections between economic indicators and other factors that affect the well-being of each individual. We discuss four hypotheses that start from gross national income (GNI), the dollar value of a country's final income in a year, and then bring in other indicators of human activity and climate. We also report three interesting results from the dataset involving GNI, education expenditure, and agriculture. The analysis was carried out in Tableau.

Description of Dataset
The data from the World Bank covers several economic, climate, and educational indicators. It includes information on 214 countries (autonomous regions are counted as countries) from 1980 to 2010, along with data aggregated by continent and by income level. Since some values are unavailable for some countries, we avoid such indicators where possible to keep worldwide comparisons fair.

Hypothesis
According to the January 2013 issue of Popular Science, 27,631 high-temperature records were either broken or tied in the US last year. The greenhouse effect is a known main cause of global warming, and carbon dioxide (CO2) is the primary greenhouse gas emitted through human activities. Human activity alters the natural carbon cycle by adding more CO2 and by destroying natural forest. GNI is a useful indicator of a country's economic strength; in other words, it reflects the degree of human activity. Our hypotheses focus on the relations among GNI, energy consumption, and climate change: they start from GNI and then bring in other indicators of human activity and climate.

1.
GNI and Electric Power Consumption (EPC)
We assume a positive correlation between GNI and electric power consumption across countries, since electric power consumption indicates the intensity of human activity in a given country. The following 2-D graphs, with GNI on the x-axis and EPC on the y-axis, show an increasing trend in 2000 (left) and 2010 (right). Each colored point represents a country. The graphs support the assumption: almost every country increased both values over the decade. Notably,
China shifted markedly, with a significant increase in both GNI and EPC over the decade. Japan tells a different story: its EPC decreased slightly while its GNI grew strongly, which may indicate an effective energy policy or an industrial transformation.

2. Electric Power Consumption, Population, and Carbon Dioxide Emissions (CO2E)
We narrow our focus to the US, the largest economy with the greatest EPC. We assume another positive correlation among the three values, since a rising population indicates vigorous human activity. The following graph, with population on the x-axis, EPC on the y-axis, and CO2E encoded as point size, shows a clear relation. Each point represents a single year from 1980 to 2008. Since the population increases every year, the points are ordered in time from left to right. CO2E grew continuously from 1980 to 2004, as did EPC except in 1982 and 2001. After 2004, the growth of CO2E flattened and even turned negative in 2006, 2008, and 2009, while EPC continued to grow steadily. This may reflect greater use of renewable energy or more efficient combustion.
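The positive-correlation assumptions in hypotheses 1 and 2 can also be checked numerically rather than by eye. The following is a minimal sketch, not the method used in this assignment (which relied on Tableau charts): it computes a Pearson correlation coefficient in plain Python. The function name `pearson` and the sample values are hypothetical placeholders, not World Bank figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (used in the denominator).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT taken from the dataset.
gni = [1.0, 2.0, 3.0, 4.0]
epc = [2.0, 4.1, 5.9, 8.2]
r = pearson(gni, epc)
print(f"correlation between GNI and EPC (placeholder data): {r:.3f}")
```

A coefficient close to +1 would be consistent with the increasing trend seen in the graphs, although it says nothing about causation.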
3. Carbon Dioxide Emissions, Population, Region
Our next interest is comparing population and CO2E across regions (roughly corresponding to continents). We expected CO2E to be proportional to regional population. In the following two area charts, with time on the x-axis and CO2E and population on their respective y-axes, East Asia & Pacific, Europe & Central Asia, and North America contribute the largest shares of CO2E, in that order; the most populous regions, however, are East Asia & Pacific, South Asia, and Europe & Central Asia, in that order. This indicates that carbon dioxide emissions per capita differ considerably across regions. North America contributes almost the same amount of CO2E as Europe & Central Asia, yet its population is approximately one third of that of Europe & Central Asia. The result for South Asia is unexpected and deserves a closer look at its industrial structure. Moreover, a notable increase in CO2E for East Asia & Pacific beginning in 2002 may reflect the rapid rise of China, which can also be seen in the graphs for the first hypothesis.
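The per-capita reasoning above is simple arithmetic: two regions with equal emissions, one with a third of the other's population, differ by a factor of three in emissions per person. A minimal sketch follows, assuming emissions are given in kilotonnes (an assumption about the unit, to be checked against the dataset). The region names and numbers are hypothetical placeholders chosen only to reproduce the one-third ratio described in the text.

```python
def per_capita(emissions_kt, population):
    """Return tonnes of CO2 per person, given emissions in kilotonnes (assumed unit)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return emissions_kt * 1000 / population


# Placeholder figures, NOT World Bank data: equal emissions, and
# "Region A" has one third of the population of "Region B".
regions = {
    "Region A": {"emissions_kt": 6000, "population": 2_000_000},
    "Region B": {"emissions_kt": 6000, "population": 6_000_000},
}

tonnes_per_person = {
    name: per_capita(v["emissions_kt"], v["population"])
    for name, v in regions.items()
}

# Rank regions from highest to lowest emissions per person.
ranking = sorted(tonnes_per_person, key=tonnes_per_person.get, reverse=True)
for name in ranking:
    print(f"{name}: {tonnes_per_person[name]:.1f} t per person")
```

Normalizing by population in this way is what makes regions of very different sizes comparable.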
Following this finding, we explore the carbon dioxide emissions per capita of each region in 2009. North America is far ahead of the other regions, which suggests that it bears a large share of the responsibility for global warming.

4. Carbon Dioxide Emissions, Population, Electric Power Consumption
Based on the above analysis, we try to forecast the future trend in carbon dioxide emissions by comparing the population and electric power consumption of the top three economies: the United States, China, and the European Union. The following graph, with time on the x-axis, CO2E and population on their respective y-axes, and EPC encoded as curve width, implies a worrying future. CO2E from China rose dramatically, surpassing the European Union in 2003 and the United States in 2005. Note that EPC in China is much smaller than in the United States and the European Union, while the Chinese population is much greater than that of the other two economies and continues to grow. According to the previous analysis, CO2E is positively correlated with both EPC and population. This suggests alarming growth of CO2E from China in the next decade if the Chinese government does not address the problem, in which case we would face serious climate change.
Other Interesting Findings with the Dataset:

1. Is economic development always accompanied by a declining agricultural sector?
In this visualization we explore the relationship between GNI per capita and the proportion of agriculture in GDP. Besides the general correlation between these variables, we also examine how well this correlation applies to individual countries. To do so, we present each country's data as a line and observe the correlation within each country individually. For most countries with GNI per capita above 10,000 US dollars, agriculture accounts for less than 10% of total GDP. Iceland, represented by yellow dots in the graph, is an exception. Although figure 1 reveals an obvious general negative correlation between GNI per capita and the proportion of agriculture in GDP, no obvious correlation between these two variables can be found in Argentina's case (purple line in figure 2).
Figure 1 Figure 2
2. Which country has the largest proportion of arable land?
Arable land indicates a community's level of self-sufficiency. Increasing pressure on food supply from population growth, together with the loss of arable land that accompanies urbanization, makes the arable land issue important. This visualization presents the proportion of arable land for each country in 2009, the latest year available in our dataset. The data is presented with a green gradient: the intensity of green increases as the proportion of arable land increases. This simple visualization reveals some interesting facts. For example, India and Ukraine, rather than the US, have the highest proportion of arable land. In terms of regions, Western and Eastern Europe generally have a high proportion of arable land.
3. Do rich countries invest more in education?
This visualization explores the correlation between GNI per capita and expenditure in 3 different education sectors (as a percentage of GDP per capita). To better observe and verify the correlation, we use small multiples and present the results for five consecutive years, from 2005 to 2009. Different colors represent data from different countries. Some countries, perhaps because of their low GDP per capita, have extremely high education expenditure as a percentage of GDP per capita, which prevents us from observing the general correlation. We therefore include only the countries with a GNI per capita higher than $10,000 in each year. There is a consistent positive correlation between GNI per capita and expenditure on tertiary education across these five years. A similar correlation can be found in the secondary sector, but it is less pronounced, especially in the graphs for 2008 and 2009.
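The filtering step described above (keeping only countries with GNI per capita above $10,000 and then drawing one panel per year) was done with Tableau filters in this assignment. As a rough illustration of the same logic outside Tableau, here is a minimal sketch in plain Python. The record layout, the field names (`gni_per_capita`, `tertiary_pct`), and all values are hypothetical placeholders.

```python
GNI_THRESHOLD = 10_000  # US dollars per capita, the cut-off stated in the text


def filter_by_gni(records, threshold=GNI_THRESHOLD):
    """Keep only records whose GNI per capita is known and exceeds the threshold."""
    return [
        r for r in records
        if r["gni_per_capita"] is not None and r["gni_per_capita"] > threshold
    ]


def group_by_year(records):
    """Group records by year, one group per small-multiple panel."""
    panels = {}
    for r in records:
        panels.setdefault(r["year"], []).append(r)
    return panels


# Placeholder records for illustration only; NOT taken from the dataset.
records = [
    {"country": "A", "year": 2005, "gni_per_capita": 45_000, "tertiary_pct": 30.0},
    {"country": "B", "year": 2005, "gni_per_capita": 900, "tertiary_pct": 250.0},
    {"country": "C", "year": 2006, "gni_per_capita": 12_000, "tertiary_pct": 20.0},
    {"country": "D", "year": 2006, "gni_per_capita": None, "tertiary_pct": 15.0},
]

panels = group_by_year(filter_by_gni(records))
for year in sorted(panels):
    print(year, [r["country"] for r in panels[year]])
```

Records with a missing GNI value are dropped along with the low-income outliers, so each panel contains only countries that can be placed on both axes.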
Comments on Tableau

1. There is no way to apply two different filters to the same measure. For example, GNI of the EU in 2008 and GNI of the USA in 2007 cannot be plotted on the same graph, since a time filter cannot work differently for two countries: only 2008 for all countries or 2007 for all countries is available. We wanted this feature in order to do grouping. For instance, figure 6 in the other interesting findings shows countries in colors. We wanted to separate countries with GNI per capita above $30,000, between $10,000 and $30,000, and below $10,000, and then assign the same color to all countries in the same group.
2. The visualization templates provided in Tableau are helpful, especially when exploring the patterns inside the data. However, more flexibility in editing the visualization is desirable when it comes to data presentation.
3. The dashboard function provides some flexibility in combining different visualizations and data selections. For example, by adding the map and the table of countries to one dashboard and setting the map as a global filter, we can select countries according to their geographic locations and group them together.
4. The dataset format is not flexible. Datasets are generally laid out for easy review, with each year occupying a single column. However, Tableau treats these columns as nominal data rather than as a time variable. As a result, the dataset must be reshaped before import, using SQL or other tools. Tableau provides an add-on for Microsoft Excel to do the reshaping; however, it does not work if the dataset is too large. Fortunately, the World Bank provides its own dataset format function. Furthermore, even when the year is arranged in the Tableau-preferred layout, it is still not recognized as a time variable unless it is in day/month/year format. In this assignment, we manually changed the year variables to the required format.
5.
Tableau only supports printing to PDF. Users may expect other useful export formats such as JPEG, PNG, or EPS.
6. Tableau cannot highlight a particular mark in printed output. On screen, users can see detailed information in a small box when the cursor is over a point, curve, or symbol; unfortunately, such details cannot be shown in printed output.
7. The color palette in Tableau is too limited, especially when we want to use colors to represent many different countries around the world.

Conclusion
A good visualization tool not only makes data analysis efficient but also makes it fun. In the World Bank database, we found facts that contradicted our prior understanding. Unexpected findings drive us to dig for the truth and to think from different angles. Besides the above discussion of economics, education, and climate change, we also looked into topics such as poverty and inequality. We see our history through these enormous numbers; they tell us how human beings change the world, for good and for bad. The most important lesson from this assignment is that we now know how to persuade people with a story told through good visualizations.
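As a note on comment 4 above, the reshaping from one column per year to one row per country and year can be done with a short script before import. The following is a minimal sketch, not the procedure used in this assignment (where the year variables were modified manually). It assumes a CSV layout with a `Country` column followed by one column per year; the column names, the placeholder values, and the choice of 1 January as the day and month in the day/month/year date are all assumptions for illustration.

```python
import csv
import io


def wide_to_long(wide_csv_text, id_column="Country"):
    """Turn one-column-per-year CSV text into one row per (country, year)."""
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    long_rows = []
    for row in reader:
        for column, cell in row.items():
            # Skip the identifier column, surplus fields, and missing values.
            if column is None or column == id_column:
                continue
            if cell is None or cell.strip() == "":
                continue
            long_rows.append({
                id_column: row[id_column],
                # Day/month/year string; 01/01 is an arbitrary placeholder date.
                "Date": f"01/01/{column}",
                "Value": float(cell),
            })
    return long_rows


# Placeholder data for illustration only; NOT taken from the dataset.
sample = "Country,1980,1981\nA,1.5,2.5\nB,,4.0\n"
rows = wide_to_long(sample)
for r in rows:
    print(r["Country"], r["Date"], r["Value"])
```

Empty cells are skipped rather than written as zeros, so missing observations stay missing after the reshape.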