Elbit Systems Land and C4I: Defense Industry & Open Source & Big Data
Speaker: German Gavrilov, germang@elbit.co.il, Elbit Systems Land and C4I, Intelligence Manager, Cyber Domain
Defense Industry & Open Source & Big Data
[Diagram: Big Data, Open Source, Defense Industry]
Agenda
- Need
  - Growth in global data volume
  - Need for intelligence systems
- What is Big Data?
  - 3V Model of Big Data
  - Scale up / Scale out
  - CAP theorem
  - Types of solutions
- The Apache Hadoop project
  - HDFS
  - MapReduce
  - Hadoop Projects
- Example architecture of an information system built with Hadoop
Need: Growth in Global Data Volume
- Twitter produces over 340 million tweets per day and had over 500 million registered users as of 2012.
- Over 32 billion searches were performed on Twitter last month.
- Facebook creates over 30 billion pieces of content, including web links, news, blogs, and photos.
- Zynga processes 1 petabyte of content for players every day.
- More than 2 billion videos are watched on YouTube every day.
- By 2015, nearly 3 billion people will be online, pushing the data created and shared to nearly 8 zettabytes.
Need: Growth in Global Data Volume
[Figure: quantity of global data]
Need: Intelligence Systems
- Ability to ingest large volumes of data in a short time (near real-time)
- Ability to ingest different types of data
- Ability to process large volumes of information
- Ability to run different analyses tailored to the type of information
- Ability to investigate and present information clearly, quickly, and conveniently
The customer wants to be able to read the information that exists in the world in a convenient way.
Examples of images that people uploaded to their Twitter accounts
What is Big Data?
- What is data? Data is information in raw or unorganized form, such as letters, numbers, or symbols.
- What is Big Data? Big Data refers to large datasets that are difficult to store, manage, and analyze.
- Every day, we create over 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
What is Big Data?
- O'Reilly Radar definition: Big data is when the size of the data itself becomes part of the problem.
- EMC/IDC definition: Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.
- IBM says that three characteristics define big data:
  - Volume (terabytes -> zettabytes)
  - Variety (structured -> semi-structured -> unstructured)
  - Velocity (batch -> streaming data)
3V Model of Big Data
Distributing Data Across Machines
- Scale up (vertical scaling): add resources to a single node in a system, typically by adding CPUs or memory to a single computer.
- Scale out (horizontal scaling, distributed systems): add more nodes to a system, such as adding a new computer to a distributed software application.
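As a rough illustration of scaling out, the sketch below spreads records across a set of nodes by hashing each key. The node names and the `node_for` and `distribute` helpers are hypothetical, and plain modulo hashing is assumed only for brevity; real distributed stores typically use more elaborate schemes such as consistent hashing.

```python
import hashlib


def node_for(key: str, nodes: list[str]) -> str:
    """Pick a node for a key using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement: dict[str, list[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10)]
    # Scale out: the same keys are spread over more nodes.
    print(distribute(keys, ["node-a", "node-b"]))
    print(distribute(keys, ["node-a", "node-b", "node-c"]))
```

Scaling up would instead keep a single node and give it more CPU or memory, so no placement logic is needed, but capacity is bounded by one machine.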
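CAP theorem
A distributed system can guarantee at most two of three properties: Consistency (C), Availability (A), and Partition tolerance (P).
- CA: RDBMSs (e.g. MySQL), Greenplum, Vertica, Aster Data
- CP: HBase, MongoDB, Terrastore, BigTable, MemcacheDB
- AP: Cassandra, CouchDB, SimpleDB, Dynamo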
Types of Solutions
- Key-value stores: schema-less. Conceptual structure: Key -> Value.
- Column-oriented databases: storage by column. Conceptual structure (example):
  - Target: Israel, Italia, Turkey
  - Weight: 2.85 kg, 1.23 kg, 3.76 kg
  - Price: 24.00 $, 17.50 $, 27.30 $
- Graph databases: use nodes and edges to represent data. Conceptual structure: Data Node, Data Node, Data Node (connected by edges).
- Document-oriented databases: store documents that are semi-structured; often XML databases. Conceptual structure: Key -> Structured Document (XML).
- Sharded RDBMS (MPP databases). Conceptual structure: RDBMS, RDBMS, RDBMS.
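To make the "storage by column" idea concrete, here is a minimal sketch that reuses the Target/Weight/Price example above. The `to_columns` helper, the field names, and the in-memory layout are illustrative assumptions, not how any particular database stores data on disk.

```python
# Row-oriented layout: one record per entity, all fields kept together.
rows = [
    {"target": "Israel", "weight_kg": 2.85, "price": 24.00},
    {"target": "Italia", "weight_kg": 1.23, "price": 17.50},
    {"target": "Turkey", "weight_kg": 3.76, "price": 27.30},
]


def to_columns(records):
    """Pivot row records into one list per field (column-oriented layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single list
# instead of touching every full record.
total_weight = sum(columns["weight_kg"])
print(columns["target"])
print(round(total_weight, 2))
```

The column layout favors analytical queries that scan a few fields over many records, while the row layout favors reading or writing one whole record at a time.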
Types of Solutions

| Type | Examples | Performance | Horizontal Scalability | Flexibility in Data Variety | Complexity of Operation | Functionality |
|---|---|---|---|---|---|---|
| Key-value stores | Berkeley, Scalaris, MemcacheDB | high | high | high | none | variable (none) |
| Column-oriented databases | Cassandra, HP Vertica, BigTable, HBase, OrientDB | high | high | moderate | low | minimal |
| Graph databases | Neo4j, InfiniteGraph, Titan, OrientDB | variable | variable | high | high | graph theory |
| Document-oriented databases | CouchDB, MongoDB, SimpleDB, Redis | high | variable (high) | high | low | variable (low) |
| Sharded RDBMS (MPP) | HP Vertica, EMC Greenplum, Aster Data | variable | variable | low | moderate | relational |
The Apache Hadoop Project
- hadoop.apache.org: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
- wikipedia.org: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. It enables applications to work with thousands of computation-independent computers and petabytes of data.
The Apache Hadoop Project
Companies that provide Hadoop in their products:
- IBM: InfoSphere BigInsights
- Oracle: Big Data Appliance
- EMC: Pivotal HD
- Microsoft: HDInsight
- Others
Organizations that use Hadoop to run large distributed computations:
Facebook.com, Amazon.com, Ancestry.com, Akamai, American Airlines, AOL, Apple, eBay, Hortonworks, Federal Reserve Board of Governors, Foursquare, Yahoo!, InMobi, Intuit, Joost, Last.fm, LinkedIn, Microsoft, NetApp, Netflix, Ooyala, Riot Games, The New York Times, SAP AG, SAS Institute, StumbleUpon, Twitter, Yodlee, Fox Interactive Media, Gemvara, Google, Hewlett-Packard, IBM
The Apache Hadoop Project: HDFS
HDFS is a distributed, scalable, and portable file system. It is designed to store large amounts of data across many servers in a cluster.
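HDFS achieves this by splitting files into fixed-size blocks and keeping several replicas of each block on different machines. The toy simulation below imitates that idea only in spirit: the block size, replication factor, node names, and round-robin placement are simplifying assumptions, and the helper functions are hypothetical. The actual HDFS placement policy is rack-aware and more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    file_data = b"x" * 10                      # stand-in for a large file
    blocks = split_into_blocks(file_data, 4)   # tiny block size for the demo
    nodes = ["node-1", "node-2", "node-3", "node-4"]
    for block_id, holders in place_replicas(len(blocks), nodes, 3).items():
        print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on more than one node, losing a single machine in this toy model does not lose any block.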
The Apache Hadoop Project: MapReduce
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work across a cluster.
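The classic way to show the MapReduce model is a word count. The sketch below is a single-process imitation of the map, shuffle, and reduce steps in plain Python; the function names are hypothetical and it does not use Hadoop's own API. In Hadoop the same phases run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of input splits."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    splits = ["big data big cluster", "data node data"]
    print(word_count(splits))  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

Each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on its own key, which is what lets a framework run them independently on different machines.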
The Apache Hadoop Project
- Data access / query abilities:
  - Pig (simple query language)
  - Hive (SQL-like queries)
  - Cascading (software abstraction layer)
  - Mahout (machine learning)
  - Hama (scientific computation)
  - Avro (data serialization system)
- MapReduce / distributed processing:
  - Hadoop MapReduce implementation
- Management tools:
  - Ambari (deployment, management, and monitoring tool)
  - Sqoop (data transfer tool)
  - Oozie (workflow scheduler system)
  - ZooKeeper (coordination service)
  - Flume (framework for populating Hadoop)
- Storage / data structure:
  - Hadoop Distributed File System
  - Hue (file browser for HDFS)
  - HBase (column-oriented database)
  - HCatalog (table/storage management service)
Hadoop Ecosystem
Example Architecture of an Information System Built with Hadoop
Summary
- Growth in global data volume
- The need for intelligence systems
- Big Data solutions
- Implementation with Apache Hadoop
Thank You!