Big Data Analytics
Radu State
Short Bio
- Master of Science in Engineering, Johns Hopkins University, USA
- Senior Researcher with INRIA, France
- Professor (University of Nancy 1)
- Research Scientist, SnT/UL, in the Netlab research group
6/11/14 NETLAB Presentation
Outline
- What is Big Data
- Big Data analysis in the team
- Call Detail Records at a country level
- Analysis of network traffic research done at SnT
Big Data at a glance
What are Big Data Architectures?
Phases in Big Data Processing
Map-Reduce: the holy grail in Big Data
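The Map-Reduce idea can be illustrated with a minimal in-process sketch. The record layout (base station, hour, number of calls) and all function names below are hypothetical and chosen only for illustration; a real deployment would run the same three steps on a framework such as Hadoop.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records: (base_station_id, hour, number_of_calls)
records = [
    ("bs1", 2, 5),
    ("bs2", 2, 1),
    ("bs1", 3, 7),
    ("bs2", 2, 4),
]

def map_phase(record):
    """Emit (key, value) pairs; here the key is (station, hour)."""
    station, hour, calls = record
    yield (station, hour), calls

def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)

def map_reduce(data):
    pairs = chain.from_iterable(map_phase(r) for r in data)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

totals = map_reduce(records)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread across many machines.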
Big Data is more than simple HPC
The current Ecosystem
Anomaly Detection in large-scale CDR records
David Goergen, Radu State, Thomas Engel and Veena Mendiratta (Bell Labs, USA)
- One country: Ivory Coast
- Time period: 01.12.2011 to 28.04.2012
- 5 million users
- 1124 base stations (for mobile communications)
- More than 3 billion entries summarizing SMS and voice calls on an hourly basis
- 50,000 mobile users tracked over these months with GPS and call records
What happened in Ivory Coast in 2012?
The silent base stations
Strange calling behaviors... at 2 AM.
Where is most variability in data?
- PCA analysis on duration
[Figure: pcacall; y-axis: Variances, 0 to 140]
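As a rough illustration of how such a variance plot can be produced, the sketch below computes the variance captured by each principal component with NumPy. The data is synthetic and the layout (one row per base station, one column per hour of mean call duration) is an assumption, not the layout used in the study.

```python
import numpy as np

def pca_variances(X):
    """Return the variance captured by each principal component, largest first."""
    Xc = X - X.mean(axis=0)  # centre each column
    # Squared singular values of the centred matrix, divided by (n - 1),
    # are the eigenvalues of the sample covariance matrix.
    s = np.linalg.svd(Xc, compute_uv=False)
    return s ** 2 / (X.shape[0] - 1)

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 base stations x 24 hourly mean call durations
X = rng.normal(size=(200, 24))
X[:, 0] *= 10.0  # give one hour a much larger spread

variances = pca_variances(X)
explained = variances / variances.sum()  # share of total variance per component
```

A component with a large share in `explained` points to the direction (here, the hour) where the data varies most.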
Cloud and Service Management
ALTO solves the general rendezvous problem: given a choice of resources, which one is the best candidate?
2 main abstractions:
- Network Map
- Cost Map
Network specified in terms of Partition/Provider ID (PID): aggregation of endpoints identified by a provider-defined network location identifier.
Normalized costs:
Type: What does the cost represent? (Air-miles, hop count, ...)
- Numerical (virtual coordinates)
- Ordinal (position-based preferences)
[Figure labels: ALTO client, ALTO service discovery, Datacenter 1, Datacenter 2, Datacenter 3, Dynamic network information, Provisioning policies, Routing protocols]
Graphics sources: http://pubs.vmware.com/vi301/intro/images/introduction_chapter.3.2.1.jpg
6/11/14 IIT RTC conference
Optimization of service provisioning (Cloud, CDNs) over large ISPs and networks

FCC dataset specification (200 GB/year)
- The FCC has embarked on a nationwide performance study of residential wireline broadband service
- Measurement points (unit_ids)
- The aim is to use the raw datasets from this study for analysis and to create an ALTO topology map and a cost map from this dataset
- The portfolio is modeled as the specific service of interest
- Using a canonical Map-Reduce (Big Data) computational paradigm on a Hadoop cluster

Major outcomes
- Financial modeling with Sharpe ratios to build ALTO maps in third parties that take both bandwidth and latency into account

Relevant publications and submissions
- David Goergen, Veena B. Mendiratta (Bell Labs, USA), Radu State, Thomas Engel. Identifying abnormal patterns in cellular communication flows (IPTCOMM 2013)
- David Goergen, Vijay K. Gurbani (Bell Labs, USA), Radu State. Of maps and costs: Aggregating large-scale broadband measurements for the Application Layer Traffic Optimization (ALTO) protocol (RTC 2013)
- David Goergen, Vijay Gurbani (Bell Labs, USA), Radu State, Thomas Engel. Making historical connections: Building Application Layer Traffic Optimization (ALTO) network and cost maps from public broadband data (submitted to CNSM 2014)

[Figure 6: Sharpe ratios for upload throughput and latency. Panels: (a) Upload Throughput, (b) Latency. x-axis: month; y-axis: Sharpe ratio. ISPs: Cablevision, Charter, Comcast, Cox, Embarq, MCI, Mediacom, TimeWarner, Windstream]

The Sharpe ratio for a portfolio p is computed by subtracting the risk-free rate of return ($R_f$) from the rate of return of the portfolio itself ($r_p$), and dividing by the standard deviation of the individual portfolio ($\sigma_p$), as shown in Equation (1):

$$S_p = \frac{r_p - R_f}{\sigma_p} \qquad (1)$$

Historical averages alone might not be appropriate if the associated standard deviations are high, because in such cases the effective metric of interest is much lower than the historical average. However, the Sharpe ratio takes the associated standard deviation into account, so it is a better candidate for approximating the function. Higher Sharpe ratios are equivalent to high averages and small standard deviations. Smaller values for Sharpe ratios are due to either high standard deviations or small averages. A negative Sharpe ratio implies that the risk-free rate of return would perform better than the portfolio being analyzed.

Figure 6 shows the Sharpe ratios for upload bandwidth (Figure 6a) and latency (Figure 6b). The figure shows the monthly evolution (x-axis) of the Sharpe ratio (y-axis) for our ISPs. The horizontal line in the graphs (at y = 0) represents the threshold at which the risk-free investment is no longer the best option. These ratios can be compared in terms of absolute values, but also on the temporal evolution for individual ISPs. There are some interesting trends to explore here, which we discuss below before showing how to create cost maps.

First, the data does not show any correlation between upload throughput and latency among ISPs. However, Figure 6b does show that most ISPs deliver acceptable latency according to the risk-free investment benchmark we establish. This is less the case with upload throughput (Figure 6a), where there is a wider variation among the ISPs that deliver what we consider an acceptable upload speed. Second, a number of ISPs show continuous improvement over the year in upload [...]

[...] Time Warner [...] sustained uptick until July [...] acquisition of Insight Communications [...] February 29, 2012 [...] merger completed, it took about 5 months for the operating capacities of the merged network to start showing positive growth again.

We create two cost maps; each cost map is specific to a class of applications [...] seeking PIDs where nodes have a high upload throughput or PIDs where nodes have low latency. The cost map $C_u$ is created for the P2P class of applications that seeks out PIDs that host peers with [...]

Calculating cost maps using the Sharpe ratio (Equation 1) is straightforward. We consider PID to be the set of all ISPs in Table II. The upload throughput cost map, $C_u$, is calculated from the Sharpe ratio using Equations 2 and 3 below.

$$\forall i \in PID,\quad C_u^i = \left(\sum_{m=1}^{12} S_m^i\right) W_i \qquad (2)$$

Here, $C_u^i$ is the upload throughput cost for each PID $i$, which is calculated over the summation of the value of each month's Sharpe ratio for PID $i$ ($S_m^i$) and multiplying by a weight $W_i$ associated with the PID. To reward Sharpe ratios that are positive, we weigh the sum by the fraction of months (from the last 12 months) in which the Sharpe ratio for PID $i$ is positive (i.e., it is above the risk-free investment line).

In ALTO, a lower value for a cost indicates a higher preference for traffic to be sent from a source to a destination. Therefore, the actual cost for each PID is further calculated as a distance from the default cost:

$$\forall i \in PID,\quad Default\_cost = \lceil \max(C_u^1, C_u^2, \ldots, C_u^i) \rceil \qquad (3)$$

The latency cost for $\forall i \in PID$, $C_l$, is calculated in a similar manner. The computations of Equations 2 and 3 result in the cost matrices shown in Tables III and IV for $C_u$ and $C_l$, respectively. In the tables, the first column represents the source PID and the remaining columns the destination PID. Thus, a host in PID Comcast will always prefer hosts in Comcast first (cost: 0), followed by hosts in PID Mediacom (cost: 0.316) if it wants to optimize upload bandwidth, or hosts in PID Time Warner (cost: 0.32) if it wants to minimize latency. The costs serve to connect the disconnected components of Figure 5; Figure 7 depicts the links formed by peers in the largest PID (Comcast) to peers in other PIDs based on the upload bandwidth of Table III. The width of the edges between connected PIDs is a measure of preference; i.e., [...]

Table III: Cost map for upload bandwidth ($C_u$)
Source PID  Cablevision  Charter  Comcast  Cox    Embarq  MCI    Mediacom  TimeWarner  Windstream  Default
Cablevision 0.000        0.799    0.804    0.796  0.377   0.700  0.316     0.508       0.782       1.000
Charter     0.498        0.000    0.804    0.796  0.377   0.700  0.316     0.508       0.782       1.000
Comcast     0.498        0.799    0.000    0.796  0.377   0.700  0.316     0.508       0.782       1.000
Cox         0.498        0.799    0.804    0.000  0.377   0.700  0.316     0.508       0.782       1.000
Embarq      0.498        0.799    0.804    0.796  0.000   0.700  0.316     0.508       0.782       1.000
MCI         0.498        0.799    0.804    0.796  0.377   0.000  0.316     0.508       0.782       1.000
Mediacom    0.498        0.799    0.804    0.796  0.377   0.700  0.000     0.508       0.782       1.000
TimeWarner  0.498        0.799    0.804    0.796  0.377   0.700  0.316     0.000       0.782       1.000
Windstream  0.498        0.799    0.804    0.796  0.377   0.700  0.316     0.508       0.000       1.000

Table IV: Cost map for latency ($C_l$)
Source PID  Cablevision  Charter  Comcast  Cox    Embarq  MCI    Mediacom  TimeWarner  Windstream  Default
Cablevision 0.00         1.67     1.81     1.10   1.41    0.86   1.32      0.32        1.38        2.00
Charter     0.76         0.00     1.81     1.10   1.41    0.86   1.32      0.32        1.38        2.00
Comcast     0.76         1.67     0.00     1.10   1.41    0.86   1.32      0.32        1.38        2.00
Cox         0.76         1.67     1.81     0.00   1.41    0.86   1.32      0.32        1.38        2.00
Embarq      0.76         1.67     1.81     1.10   0.00    0.86   1.32      0.32        1.38        2.00
MCI         0.76         1.67     1.81     1.10   1.41    0.00   1.32      0.32        1.38        2.00
Mediacom    0.76         1.67     1.81     1.10   1.41    0.86   0.00      0.32        1.38        2.00
TimeWarner  0.76         1.67     1.81     1.10   1.41    0.86   1.32      0.00        1.38        2.00
Windstream  0.76         1.67     1.81     1.10   1.41    0.86   1.32      0.32        0.00        2.00
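The cost computation described above (a Sharpe ratio per month, a weighted sum over 12 months, and a default cost taken as the ceiling of the largest value) can be sketched as follows. The PID names and monthly values are made up, the function names are hypothetical, and reading "distance from default cost" as `default_cost - raw_cost` is an assumption rather than something the text spells out.

```python
import math
import statistics

def sharpe_ratio(samples, risk_free):
    """Equation (1): (mean return - risk-free rate) / standard deviation."""
    return (statistics.mean(samples) - risk_free) / statistics.stdev(samples)

def raw_cost(monthly_sharpe):
    """Equation (2): sum of monthly Sharpe ratios, weighted by the
    fraction of months in which the ratio is positive."""
    weight = sum(1 for s in monthly_sharpe if s > 0) / len(monthly_sharpe)
    return weight * sum(monthly_sharpe)

def cost_map(monthly_sharpe_by_pid):
    raw = {pid: raw_cost(s) for pid, s in monthly_sharpe_by_pid.items()}
    default_cost = math.ceil(max(raw.values()))  # Equation (3)
    # ASSUMPTION: "distance from default cost" is read as default - raw,
    # so a PID with better Sharpe ratios ends up with a lower ALTO cost.
    costs = {pid: default_cost - c for pid, c in raw.items()}
    return costs, default_cost

# Made-up monthly Sharpe ratios for two hypothetical PIDs
monthly = {
    "PID_A": [0.25] * 12,               # positive in every month
    "PID_B": [0.5] * 6 + [-0.25] * 6,   # positive in half of the months
}
costs, default_cost = cost_map(monthly)
```

With these inputs `PID_A` receives the lower cost, which matches the ALTO convention that lower cost means higher preference.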
Big Data and Security
Research questions:
- Securing Big Data architectures
  - Which data is accessed and processed?
  - What happens to the outcome?
  - How to protect data in motion?
- Big Data approaches for advanced security analytics and Advanced Persistent Threat (APT) detection
  - massive data collection and event tracking at large scale
  - deeper analytics on unstructured data (honeypots, firewall, DNS)
  - consolidated view of security-related information
  - real-time analysis of streaming AAA security data in order to profile human behavior in the loop
  - predict attacker behavior on other targets
Publications:
- J. François, S. Wang, R. State, and T. Engel, BotTrack: Tracking Botnets using NetFlow and PageRank, in IFIP/TC6 NETWORKING 2011, Springer, Valencia, Spain, May 2011.
- C. Wagner, J. François, R. State, and T. Engel, Machine Learning Approach for IP-Flow Record Anomaly Detection, in IFIP/TC6 NETWORKING 2011, Springer, Valencia, Spain, May 2011.
- S. Hommes, R. State, A. Zinnen, and T. Engel, Detection of Abnormal Behaviour in a Surveillance Environment Using Control Charts, in 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2011, pp. 113-118.
- S. Wang, R. State, M. Ourdane, and T. Engel, Mining NetFlow Records for Critical Network Activities, AIMS 2010, pp. 135-146.
- S. Wang, R. State, M. Ourdane, and T. Engel, FlowRank: ranking NetFlow records, IWCMC 2010, pp. 484-488.
Botnet tracking
Jerome Francois, Radu State, Thomas Engel
- Botnets:
  - Army of controlled compromised machines
  - Powerful attack vector (spam, DDoS, espionage...)
- P2P bots detection:
  - Well interconnected machines to maintain the underlying network
  - Graph analysis (PageRank/Google)
  - Improvement by leveraging honeypots
Source: http://www.csoonline.com/article/348317/what-a-botnet-looks-like, Scott Berinato
2012-05-08 Network Security
Botnet tracking
BotTrack / BotCloud (MapReduce version):
- Results:
  - Stealthy botnet detection (1% of IP addresses)
  - High accuracy ~ 99%
  - Scalability (60,000 flows / second)
- Publications:
  - BotTrack: Tracking Botnets Using NetFlow and PageRank, François J., Wang S., State R., Engel T., IFIP Networking 2011
  - BotCloud: Detecting Botnets Using MapReduce, François J., Wang S., Bronzi W., State R., Engel T., IEEE International Workshop on Information Forensics and Security - WIFS'11
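The graph-analysis step can be pictured with a small PageRank computed by power iteration on a host graph built from flow records. This is only an illustrative sketch: the flow tuples, the damping factor of 0.85, and the function name are assumptions, and it is not the BotTrack or BotCloud implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical flows (source IP, destination IP) taken from NetFlow-like records
flows = [
    ("10.0.0.1", "10.0.0.9"),
    ("10.0.0.2", "10.0.0.9"),
    ("10.0.0.3", "10.0.0.9"),
    ("10.0.0.9", "10.0.0.1"),
]
ranks = pagerank(flows)
most_connected = max(ranks, key=ranks.get)
```

Hosts that many other hosts talk to accumulate rank, which is why well interconnected P2P bots tend to stand out in such a graph.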
Financial Data Analysis
Research questions:
- Model complex relationships for co-lending in the EU banking zone
- Modeling of financial instruments
- Mining of highly unstructured data formats
- Integrate economic (NYSE), regulatory (SEC) and media news (chairman and board member information obtained from social media: Twitter, Google Trends)
- Assess and model risk based on graph modeling and distributed computations
- Analyze loan interest rates (Libor) with respect to additional loan rates and economic indicators
Expected outcomes:
- Link to the local (Luxembourg) economy
- Impact on EU regulatory bodies
- Development of such activities at SnT
Notice: Figure from (1)
[Figure 1: Co-lending network for 2005.]
[Figure 2: Co-lending networks for 2006 to 2009. Panels: 2006, 2007, 2008, 2009.]

[...] that are connected to the ones it is directly connected with. Therefore, even if a bank has very few co-lending relationships itself, it may impact the entire system if it is connected to a few major lenders. Since the matrix $L$ represents the pairwise connectedness of all banks, we may write the impact of bank $i$ on the system as the following equation:

$$x_i = \sum_{j=1}^{N} L_{ij} x_j, \quad \forall i$$

This may be compactly represented as $x = L \cdot x$, where $x = [x_1, x_2, \ldots, x_N]^\top \in \mathbb{R}^{N \times 1}$ and $L \in \mathbb{R}^{N \times N}$. We pre-multiply the left-hand side of the equation above by a scalar $\lambda$ to get $\lambda x = L \cdot x$, i.e., an eigensystem. The principal eigenvector in this system gives the loadings of each bank on the main eigenvalue and represents the influence of each bank on the lending network. This is known as the centrality vector in the sociology literature [5] and delivers a measure of the systemic effect a single bank may have on the lending system. Federal regulators may use the centrality scores of all banks to rank banks in terms of their risk contribution to the entire system and determine the best allocation of supervisory attention.

The data we use comprises a sample of loan filings made by financial institutions with the SEC. Our data covers a [...] presented in Figure 1 for 2005. We see that there are three large components of co-lenders, and three hub banks, with connections to the large components. There are also satellite co-lenders. In order to determine which banks in the network are most likely to contribute to systemic failure, we compute the normalized eigenvalue centrality score described previously, and report this for the top 25 banks. These are presented in Table 1. The three nodes with the highest centrality are seen to be critical hubs in the network; these are J.P. Morgan (node 143), Bank of America (node 29), and Citigroup (node 47). They are bridges between all banks, and contribute highly to systemic risk.

Figure 2 shows how the network evolves in the four years after 2005. Comparing 2006 with 2005 (Figure 1), we see that there still are disjointed large components connected by a few central nodes. From 2007 onwards, as the financial crisis begins to take hold, co-lending activity diminished
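The centrality score discussed above, the principal eigenvector of the co-lending matrix L, can be approximated by power iteration. The sketch below uses a made-up 4-bank matrix and assumes L is symmetric, non-negative, and describes a connected network; the function name and the shift by the identity matrix are illustrative choices, not taken from the source.

```python
import numpy as np

def eigenvector_centrality(L, iterations=1000, tol=1e-12):
    """Approximate the principal eigenvector of a symmetric, non-negative L."""
    n = L.shape[0]
    # Iterating with L + I keeps the eigenvectors of L but avoids the
    # oscillation plain power iteration can show on bipartite networks.
    M = L + np.eye(n)
    x = np.ones(n) / np.sqrt(n)
    for _ in range(iterations):
        x_next = M @ x
        x_next /= np.linalg.norm(x_next)
        if np.linalg.norm(x_next - x) < tol:
            x = x_next
            break
        x = x_next
    return x

# Made-up co-lending matrix: bank 0 lends with banks 1, 2, 3; banks 1 and 2 also co-lend
L = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

centrality = eigenvector_centrality(L)
eigenvalue = centrality @ L @ centrality  # Rayleigh quotient for a unit-norm vector
ranking = np.argsort(-centrality)         # banks ordered by decreasing centrality
```

In this toy network bank 0 ranks first, mirroring how hub banks dominate the centrality scores.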
Questions?