DATA VISUALIZATION: FINDING PICTURES IN NUMBERS @PratapVardhan Pratap Vardhan, Data Scientist, Gramener
A DATA VISUALISATION CHALLENGE You will see 3 questions. You have 30 seconds. Try it! Your timer starts now.
1. HOW MANY NUMBERS ARE ABOVE 100?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
2. HOW MANY NUMBERS ARE BELOW 10?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
3. WHICH QUADRANT HAS HIGHEST TOTAL?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
A DATA VISUALISATION CHALLENGE The same questions again, but with a few visual cues. See how long it takes now. Your timer starts now.
1. HOW MANY NUMBERS ARE ABOVE 100?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
2. HOW MANY NUMBERS ARE BELOW 10?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
3. WHICH QUADRANT HAS HIGHEST TOTAL?
23 17 37 62 101 39 11 75 12 29 37 46 3 48 32 21 8 55 56 53 12 10 52 56 23 10 46 56 107 59 45 22 36 69 41 10 25 5 43 19 39 25 72 44 14 64 26 67 69 58 57 30 102 37 50 58 68 12 33 43 26 70 51 104 33 21 11 50 57 22 87 51 41 55 94 48 94 77 7 96 70 81 64 11 84 69 73 97 2 92 88 66 65 95 65 91 77 20 14 58 78 82 59 66 84 81 66 84 63 76 70 18 103 6 73 92 81 78 101 63 9 16 40 92 93 98 82 91 87 88 98 91 79
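One kind of visual cue can be imitated in a terminal. The following is only a sketch, not how the slides were produced: it uses the first three rows of the grid as sample data, and the names `NUMBERS`, `count_matching` and `render` are made up for this illustration. Matching numbers are wrapped in ANSI escape codes so they stand out in bold red.

```python
# Sample data: the first three rows of the grid shown on the slides.
NUMBERS = [
    23, 17, 37, 62, 101, 39, 11, 75, 12, 29, 37, 46, 3, 48,
    32, 21, 8, 55, 56, 53, 12, 10, 52, 56, 23, 10, 46, 56,
    107, 59, 45, 22, 36, 69, 41, 10, 25, 5, 43, 19, 39, 25,
]

# ANSI escape codes: bold red text, then reset to the terminal default.
BOLD_RED = "\033[1;31m"
RESET = "\033[0m"


def count_matching(numbers, predicate):
    """Count the numbers for which predicate(number) is true."""
    return sum(1 for n in numbers if predicate(n))


def render(numbers, predicate, per_row=14):
    """Lay the numbers out in rows, highlighting those that match."""
    rows = []
    for start in range(0, len(numbers), per_row):
        cells = []
        for n in numbers[start:start + per_row]:
            text = f"{n:>4}"
            if predicate(n):
                text = f"{BOLD_RED}{text}{RESET}"
            cells.append(text)
        rows.append("".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    def above_100(n):
        return n > 100

    print(render(NUMBERS, above_100))
    print("Numbers above 100:", count_matching(NUMBERS, above_100))
```

Swapping the predicate (for example, `lambda n: n < 10`) answers the second question with the same two functions.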
WHY VISUALISE? YOU WILL BE SHOWN A SET OF NUMBERS ALONG WITH A SUMMARY (AVERAGE, ETC.). CAN YOU MAKE SENSE OF THE FIGURES?
DO THESE FOUR CITIES LOOK IDENTICAL TO YOU? Take a look at the sales report alongside. A company has branches in 4 cities, and each branch changes the product price every month. This leads to a corresponding change in the sales. Here is the performance of the 4 branches with their monthly price and sales for each month. Looking at the average, the four branches have an identical performance. DO YOU AGREE?

2010
Month      Boston          Chicago         Detroit         New York
           Price  Sales    Price  Sales    Price  Sales    Price  Sales
Jan        10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
Feb         8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
Mar        13.0    7.58    13.0    8.      13.0   12.       8.0    7.
Apr         9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
May        11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
Jun        14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
Jul         6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
Aug         4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
Sep        12.0   10.84    12.0    9.13    12.0    8.       8.0    5.56
Oct         7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
Nov         5.0    5.68     5.0    4.       5.0    5.73     8.0    6.89
Average     9.0    7.50     9.0    7.50     9.0    7.50     9.0    7.50
Variance   10.0    3.75    10.0    3.75    10.0    3.75    10.0    3.75

Average price is the same. Variance in price is the same. Average sales is the same too. So is the variance in sales.
ARE THEY REALLY IDENTICAL? CHECK AGAIN But in fact, the four cities are totally different in behaviour. [Four charts of sales against price: Boston, Chicago, Detroit, New York] Boston's sales have generally increased with price. Detroit has a nearly perfect increase in sales with price, except for one aberration. Chicago shows a decline in sales beyond a price of 10. New York's sales fluctuate despite a nearly constant price.
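The identical summaries can be checked with Python's standard library. This sketch rests on an assumption: the legible cells of the sales report match the published Anscombe's quartet, so the code uses the published values throughout, including for the cells that are cut off in the report. The city labels come from the slide. `BRANCHES` and `summarise` are names made up for this illustration, and the variance is the population variance, which is what reproduces the 10.0 and 3.75 in the report.

```python
from statistics import mean, pvariance

# ASSUMPTION: these are the published Anscombe's quartet values, relabelled
# with the slide's city names. Cells truncated in the report are filled
# from the published dataset, not from the slide itself.
PRICE_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
BRANCHES = {
    "Boston": (
        PRICE_COMMON,
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    ),
    "Chicago": (
        PRICE_COMMON,
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    ),
    "Detroit": (
        PRICE_COMMON,
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    ),
    "New York": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def summarise(price, sales):
    """Return (mean price, price variance, mean sales, sales variance).

    Variances are population variances; all values are rounded to 2 places.
    """
    return (
        round(mean(price), 2),
        round(pvariance(price), 2),
        round(mean(sales), 2),
        round(pvariance(sales), 2),
    )


if __name__ == "__main__":
    for city, (price, sales) in BRANCHES.items():
        print(city, summarise(price, sales))
```

Under these assumptions every branch prints the same four figures, which is exactly why the summary row hides the differences. Plotting each branch's sales against its price is what separates them.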
A data analytics and visualisation company. Gramener visualises your data: we handle terabyte-size data via non-traditional analytics and visualise it in real time. Gramener transforms your data into concise dashboards that make your business problem and solution visually obvious. We help you find insights quickly, based on cognitive research, and our visualisations guide you towards actionable decisions.
INDIAN ODI BATTING GRAMENER.COM/CRICKET/
100 YEARS OF INDIA'S WEATHER [Chart: months Jan to Dec against years from 1901 to 2001]
IN THE 2014 ELECTIONS, WHICH STATE PRODUCED THE MOST CROREPATI CANDIDATES? AND WHICH STATE HAS THE HIGHEST % OF CROREPATI CANDIDATES?
GEOGRAPHY OF CANDIDATE WEALTH Number of Candidates: Uttar Pradesh, with over 400 crorepati candidates, tops the list. Percentage of Crorepati Candidates: The Northeastern states have the largest percentage of crorepati candidates.
AMONG THE MAINSTREAM PARTIES, WHICH PARTY HAS THE HIGHEST % OF CRIMINAL CANDIDATES?
CRIMINAL CASES Size: Number of candidates. Color: % of criminal candidates. MNS seems like a winner here, closely followed by RJD and MDMK.
AND ONE MORE THING... NAMESAKES OF THE 2014 ELECTIONS
CHANDU LALS OF MAHASAMUND Winner's margin: 1,217 votes. Namesakes polled:,000+ votes
MOST OF WHAT I DO TODAY IS VISUALISING DATA ANOMALIES. YOU DON'T NEED SOPHISTICATED ANALYSES FOR THIS. IT CAN BE EASY TO SPOT THEM.
EDUCATION: PREDICTING MARKS What determines a child's marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter?
LET'S LOOK AT YEARS OF US BIRTH DATA
This is a dataset (1975 10) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known. For example: Are birthdays uniformly distributed? Do doctors or parents exercise the C-section option to move dates? Is there any day of the month that has unusually high or low births? Are there any months with relatively high or low births?
[Chart: births on average, for each day of the year (from 1975 to 10). Legend: more births, fewer births]
Some special days like April Fool's Day are avoided, but Valentine's Day is quite popular. Most people prefer not to have children on the 13th of any month, as it is considered an unlucky day. There are relatively few births during the Christmas and Thanksgiving holidays, as well as on New Year and Independence Day. Births are very high in September, but this is fairly well known: most conceptions happen during the winter holiday season.
THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns. For example: Is there an aversion to the 13th, or is there a local cultural nuance? Are holidays avoided for births? Which months have a higher propensity for births, and why? Are there any patterns not found in the US data?
[Chart: births on average, for each day of the year (from 2007 to 2013). Legend: more births, fewer births]
We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, on round-numbered dates. Such round-numbered patterns are a typical indication of fraud. Here, birth dates are brought forward to aid early school admission. Very few children are born in August and thereafter. Most births are concentrated in the first half of the year.
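A check for round-numbered birth dates needs nothing more than a count per day of the month. The sketch below is illustrative only: the data is synthetic, generated to mimic the pattern described on this slide, and `births_by_day_of_month`, `unusually_common_days`, `synthetic_birth_dates` and the `factor` threshold are assumptions, not the method used on the real school admission data.

```python
from collections import Counter
from datetime import date
from statistics import median


def births_by_day_of_month(birth_dates):
    """Count how many birth dates fall on each day of the month."""
    return Counter(d.day for d in birth_dates)


def unusually_common_days(counts, factor=2.0):
    """Return the days whose count exceeds `factor` times the median count.

    Only days 1 to 28 are compared, because they occur in every month.
    """
    typical = median(counts.get(day, 0) for day in range(1, 29))
    return sorted(
        day for day in range(1, 29) if counts.get(day, 0) > factor * typical
    )


def synthetic_birth_dates(year=2010, base=10, extra=30):
    """Build SYNTHETIC dates: `base` births per day, plus `extra` on the
    5th, 10th, 15th, 20th and 25th of every month."""
    dates = []
    for month in range(1, 13):
        for day in range(1, 29):
            n = base + (extra if day % 5 == 0 else 0)
            dates.extend([date(year, month, day)] * n)
    return dates


if __name__ == "__main__":
    counts = births_by_day_of_month(synthetic_birth_dates())
    print("Unusually common days:", unusually_common_days(counts))
```

The median is used as the baseline because a handful of inflated days barely moves it, whereas they would pull a mean upwards.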
THIS ADVERSELY IMPACTS CHILDREN'S MARKS
It's a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer. Children born on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.
[Chart: marks on average, for children born on a given day of the year (from 2007 to 2013). Legend: higher marks, lower marks]
Children born on round-numbered days score lower marks on average, due to a higher proportion of younger children.
EXPLORING THE MAHABHARATA How does the Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics? Can this unstructured data be processed to extract analytical insights? What does sentiment analysis of this tome convey? Is there a better way to explore relations between characters? How can closeness of characters be analysed and visualized?
MMS SPEECHES https://gramener.com/speechopedia
AAP DONATIONS https://gramener.com/aapdonations
FLAGS OF THE WORLD https://gramener.com/flags
CALVIN AND HOBBES
ENERGY UTILITY: DETECTING FRAUD We know meter readings are incorrect, for various reasons. We don't, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is inexperience with the tools and analyses needed to identify such patterns.
BILLING FRAUD AT AN ENERGY UTILITY An energy utility (with over 50 million subscribers) had 10 years' worth of customer billing data available. Most fraud detection software failed to load the data, and sampled data revealed little or no insight. The plot is a simple histogram (or frequency distribution) of usage levels, showing the frequency of all meter readings from Apr-2010 to Mar-2011. Each bar represents the number of customers with a specific bill amount (in units, or kWh). Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary. An unusually large number of readings are aligned with the slab boundaries. This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary. Or, more realistically, there's probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
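The histogram check described on this slide can be sketched as a frequency count followed by a comparison of each slab boundary with its neighbours. Everything in the sketch is an assumption for illustration: the readings are synthetic, the boundary at 200 units is hypothetical (only the 100-unit boundary is implied by the slide), and `boundary_spikes`, `synthetic_readings`, `window` and `factor` are made-up names, not the utility's or Gramener's actual analysis.

```python
from collections import Counter


def boundary_spikes(readings, boundaries, window=5, factor=3.0):
    """Find slab boundaries with unusually many readings.

    For each boundary, compare its frequency with the mean frequency of the
    readings within `window` units on either side. A boundary is reported
    when its frequency exceeds `factor` times that neighbourhood mean.
    Returns {boundary: (frequency, neighbourhood mean)}.
    """
    freq = Counter(readings)
    spikes = {}
    for b in boundaries:
        neighbours = [
            freq.get(b + k, 0) for k in range(-window, window + 1) if k != 0
        ]
        baseline = sum(neighbours) / len(neighbours)
        # Guard against an empty neighbourhood making any count look large.
        if freq.get(b, 0) > factor * max(baseline, 1):
            spikes[b] = (freq[b], baseline)
    return spikes


def synthetic_readings(boundaries=(100, 200)):
    """Build SYNTHETIC readings: 20 per usage level, 120 at each boundary,
    and only 5 at the three levels just above each boundary."""
    readings = []
    for units in range(80, 221):
        if units in boundaries:
            count = 120
        elif any(0 < units - b <= 3 for b in boundaries):
            count = 5
        else:
            count = 20
        readings.extend([units] * count)
    return readings


if __name__ == "__main__":
    data = synthetic_readings()
    for boundary, (count, baseline) in boundary_spikes(data, [100, 200]).items():
        print(f"{boundary} units: {count} readings vs. about {baseline:.1f} nearby")
```

Because the check only needs a frequency table, it can be run over the full set of readings rather than a sample, which matters when the anomaly sits at a few exact values.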
LINKS
Github: https://github.com/pratapvardhan
Elections: https://gramener.com/election/
Speechopedia: https://gramener.com/speechopedia/
AAP: https://gramener.com/aapdonations/
Cricket: https://gramener.com/cricket/
Flags: https://gramener.com/flags/
Try it! All you need is some data and some curiosity to VISUALISE DATA YOURSELF! @PratapVardhan Pratap.Vardhan@gramener.com +91-837-4-9651