IT and Storage for Big Data Analytics
Randy Kerns, Senior Strategist, Evaluator Group

Overview
Big data can mean two different things:
- Storage for large amounts of data
- Analytics against very large amounts of data
  - Usually from machine-to-machine data
  - Called pervasive computing
So, what does this mean for storage?
What It Means for IT
The Storage Way to Say Big Data
Defined by architectural platform, big data storage is:
- Scale-out NAS
- Global Namespace File System
- NAS gateway to SAN and Scale-out SAN
Defined by application, big data storage is:
- Storage for applications that handle large files and require performance
- Storage for an extremely large number of files
- Examples: media & entertainment, oil & gas exploration, life sciences, etc.

The Analytics Way to Say Big Data
Big data analytics is:
- A term for business intelligence (BI) processes that are different from traditional data warehousing
- The ability to tap unstructured data as a source for BI processes
- Information delivered to users in real or near real time (but not an absolute requirement)
- Convergence of multiple data sources
Latency introduced by storage, including networked storage, is often assiduously avoided
Cost is minimized
Data Analytics Model
[Figure: data analytics model. Labels: Customer Profiles (NoSQL DB); HDFS (Logs, Tweets, Location); High Scale Data Reductions; Predictions on Buying Behavior; BI and Analytics; Batch; Low Latency. Numbered steps: 1) Identify User; 2a) Lookup User Profile (NoSQL DB); 2b) Lookup Location (NoSQL DB); 3) Input Into Expert System; 4) Real-time: Determine Best Offer For This User.]
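The numbered low-latency steps in the figure can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the in-memory dictionaries stand in for the NoSQL lookups, and `PROFILES`, `LOCATIONS`, `identify_user`, `best_offer`, and `handle_request` are all hypothetical names. The one-line rule in `best_offer` is only a placeholder for the expert system, which on the slide is fed by batch predictions.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the low-latency NoSQL lookups.
PROFILES: Dict[str, Dict[str, str]] = {"u42": {"segment": "frequent_buyer"}}
LOCATIONS: Dict[str, str] = {"u42": "Denver"}


def identify_user(session: Dict[str, str]) -> Optional[str]:
    """Step 1: identify the user from the incoming request."""
    return session.get("user_id")


def best_offer(profile: Dict[str, str], location: str) -> str:
    """Steps 3 and 4: placeholder rule standing in for the expert system."""
    if profile.get("segment") == "frequent_buyer":
        return f"loyalty discount near {location}"
    return f"welcome offer near {location}"


def handle_request(session: Dict[str, str]) -> Optional[str]:
    user_id = identify_user(session)                # 1) identify user
    if user_id is None:
        return None
    profile = PROFILES.get(user_id, {})             # 2a) look up user profile
    location = LOCATIONS.get(user_id, "unknown")    # 2b) look up location
    return best_offer(profile, location)            # 3) + 4) pick an offer


print(handle_request({"user_id": "u42"}))
```

The batch side of the figure (HDFS, data reductions, predictions) is deliberately left out; it would refresh whatever the expert system consults, on its own schedule.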
Why Should Storage Professionals Care?
Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical
As this happens, data encompassed by these applications becomes the responsibility of people who worry about:
- Security
- Data protection/disaster recovery/business continuance
- Data governance and compliance
- Digital records management and archiving
Shared Storage for the Traditional Data Warehouse
[Figure: traditional data warehouse. Labels: Archive; OLTP; Files / XML data; Log Files; Operational; Extract, Transform, Load (ETL); Data Warehouse; Schedules; Ad hoc Queries; Reports; Dashboards; Notifications.]

Distributed, Shared-Nothing Architectures for Big Data Analytics
[Figure: Network Layer (switch); Compute Layer (nodes 1, 2, 3, ... n); Storage Layer (DAS on each node).]
CAP Theorem
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
A distributed system can satisfy any two of these guarantees at the same time, but not all three
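A toy sketch of the trade-off, assuming a store with just two replicas and a write that always arrives at replica A. It is not a real replication protocol, and `TwoNodeStore` and its `mode` flag are hypothetical. For this sketch, availability is taken to mean that the write is actually served. During a partition, the "CP" setting turns the write away so the replicas stay identical, while the "AP" setting serves the write and lets the replicas diverge.

```python
class Replica:
    def __init__(self) -> None:
        self.value = None


class TwoNodeStore:
    """Toy two-replica store; writes always arrive at replica A."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when A cannot reach B

    def write(self, value) -> bool:
        """Return True if the write was served, False if it was turned away."""
        if not self.partitioned:
            self.a.value = value
            self.b.value = value      # replication succeeds
            return True
        if self.mode == "CP":
            return False              # keep replicas identical, do not serve the write
        self.a.value = value          # AP: serve locally, replicas now differ
        return True

    def consistent(self) -> bool:
        return self.a.value == self.b.value


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write("v1")
    store.partitioned = True
    served = store.write("v2")
    print(mode, "served:", served, "consistent:", store.consistent())
```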
Issue for IT
How to store information for big data
- How much data is there?
- Where did this idea come from?
What are the requirements?
Is it from analytics operations?
- Store original data capture in flight as part of the analytics operation?
- Store as a secondary process?
- Don't save anything, except results?
What about Rental Data?
Shared Storage as Secondary Storage
Is there a place for shared storage in shared-nothing? If so, what does it look like?
[Figure: Network Layer (switch); Compute Layer (nodes 1, 2, 3, ... n); Storage Layer (SAN/NAS).]

Shared Storage as Primary Storage
[Figure: Network Layer (switch); Compute Layer (nodes 1, 2, 3, ... n); Storage Layer (SAN or NAS, but more commonly Scale-out NAS).]
Shared Primary/Secondary Storage
Advantages
- Can reduce latency for queries that span nodes
- Enhances system availability
- Addresses the enterprise storage requirements
  - Security
  - Data protection/disaster recovery/business continuance
  - Data governance and compliance
  - Digital records management and archiving
Disadvantages
- Additional cost
- Crosses a cultural boundary

Why Not Shared Storage?

Big Data Storage for Big Data Analytics
Shared storage as secondary storage for big data analytics
- Data Protection, Database of Record, Archive
- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor
Shared storage as primary storage for big data analytics
- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines
Is Hadoop a Storage Device?
NO
- It's a distributed computing platform
YES
- 1K node cluster w/ 1 TB RAM per node = 1 PB of very high performance storage
- Data protection built in (multiple data copies, but not RAID)
- HDFS
  - Embedded, distributed file system (like scale-out NAS)
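The 1 PB figure is simple multiplication, checked below under the assumption of decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and with OS and application memory overhead ignored. The helper name `aggregate_memory_bytes` is made up for this sketch.

```python
# Decimal storage units, as capacities are usually quoted.
TB = 10**12
PB = 10**15


def aggregate_memory_bytes(nodes: int, ram_per_node_bytes: int) -> int:
    """Total RAM across a cluster, ignoring any per-node overhead."""
    return nodes * ram_per_node_bytes


total = aggregate_memory_bytes(nodes=1_000, ram_per_node_bytes=1 * TB)
print(total / PB)  # 1.0
```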
HDFS: Hadoop File System
- Very large Distributed File System (DFS)
  - 10K nodes, 100 million files, 10 PB
- Uses standard servers with direct attached storage
  - Files are replicated to handle hardware failure (3 copies)
  - Detects failures and recovers from them
- Optimized for batch processing
  - Data locations exposed so that computations can move to where data resides
  - Provides very high aggregate bandwidth
- Runs in user space
  - Heterogeneous OS
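Two of the points above lend themselves to a small sketch: what three copies mean for usable capacity, and how exposed data locations let work move to the data. The code is illustrative only and does not use any Hadoop API. `usable_capacity`, `pick_node_for_task`, the block ID, and the node names are hypothetical, and the slide does not say whether its 10 PB figure is raw or usable.

```python
from typing import Dict, List, Optional

REPLICATION_FACTOR = 3  # from the slide: files are kept as 3 copies


def usable_capacity(raw_capacity: float, replication: int = REPLICATION_FACTOR) -> float:
    """Capacity left for unique data when every block is stored `replication` times."""
    return raw_capacity / replication


def pick_node_for_task(
    block_id: str,
    block_locations: Dict[str, List[str]],
    free_nodes: List[str],
) -> Optional[str]:
    """Prefer a free node that already holds a replica of the block.

    Falls back to any free node (the block must then be read over the
    network), or None when no node is free.
    """
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node
    return free_nodes[0] if free_nodes else None


# Hypothetical block map: one block with its three replica locations.
locations = {"blk_1": ["node-a", "node-b", "node-c"]}
print(usable_capacity(30.0))                                        # raw PB in, usable PB out
print(pick_node_for_task("blk_1", locations, ["node-d", "node-b"]))  # a node holding a replica
```

In the second call the task goes to `node-b` even though `node-d` is listed first, because `node-b` already holds a copy of the block.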
Hadoop File System on Standard Servers
Source: Matt Foley

Typical Hadoop Configuration
[Figure: Network Layer (switch); Compute Layer (nodes 1, 2, 3, ... n); Storage Layer (DAS on each node).]
Hadoop Key Milestones
- Dec 2004: Google GFS paper published
- July 2005: MapReduce first used
- Feb 2006: Becomes Lucene subproject
- Apr 2007: Yahoo! on 1000-node cluster
- Jan 2008: Apache Top Level Project
- May 2009: Hadoop sorts a Petabyte in 17 hours
- Aug 2010: World's largest Hadoop cluster at Facebook
  - 2900 nodes
  - 30+ Petabytes

Evaluating Hadoop as a Storage Device
- Snapshots?
- Scale capacity and performance concurrently?
- SSD and automated tiering?
- Dedupe?
- Insert your hot-button storage feature here:

Evaluating Hadoop as a Storage Device
IT and Big Data Analytics
There will be big data
Circumstances may vary, and change
Participate early
- Data scientists may not have the same concerns or requirements
- Decisions can limit choices
Understand options
- Products / software

Thank You! Questions?
Randy Kerns: randy@evaluatorgroup.com
Twitter: @rgkerns
Blog: http://itknowledgeexchange.techtarget.com/storage-soup/