Ironfan: Your Foundation for Flexible Big Data Infrastructure




Benefits

With Ironfan, you can expect:

Reduced cycle time. Provision servers in minutes, not days.
Improved visibility. Increased transparency means faster problem solving and sharing.
Lower support costs. Experience fewer reactive support issues.
Lower network costs. Only use the nodes you need for the job you are running.

Infochimps brings the power of Big Data infrastructure to your fingertips. Traditional systems configuration is a time-consuming process, vulnerable to human error. Infochimps leverages the power and simplicity of Ironfan as its provisioning and deployment layer, allowing users to easily launch and orchestrate repeatable infrastructure.

The Infochimps Platform reduces the cycle time to provision a server from days or weeks to minutes, enabling simple scaling and rapid system evolution and dramatically lowering the cost of starting new data analysis jobs. Infochimps even enables continual monitoring of your system through automated machine provisioning.

Spend your time finding insights, not building infrastructure.

Lower risk, more agility. Deploy and manage a big data stack with minimal resources.

© 2012 Infochimps, Inc. All rights reserved.

Why Infochimps?

Specialized. Ironfan, Infochimps' systems configuration tool, leverages three years of internal development and external contributions to its code base. This specialized experience helps organizations reduce the initial adoption cost and experimentation necessary to produce well-tuned clusters.

Integrated. Infochimps' tool development and Big Data expertise mean our team understands the entire Big Data ecosystem of an organization and is equipped with the tools to navigate and troubleshoot it.

Flexible Cost. Infochimps' Ironfan lets you take advantage of IaaS (Infrastructure as a Service) providers such as Amazon Web Services. This allows all infrastructure costs to be treated as operating expenses (use what you need) rather than capital expenditures (pay whether you need it or not). Switching from CapEx to OpEx can dramatically lower the funding barrier to adopting Big Data within an enterprise.

Context. Perhaps best of all, the Infochimps Platform, enabled by Ironfan, can be used to provide context to an enterprise's internal data, whether through public opinion mining (via social networks), geo-located information, word corpus training for machine learning, or other commonly useful (but difficult to accumulate) data.

All of these capabilities combine to make Infochimps a great choice for providing Big Data services to the budget- and process-conscious enterprise customer.

Understanding the Tools

What is Chef?

Chef is a configuration management system, designed to be a general-purpose tool for building repeatable infrastructure. It uses a Ruby DSL (Domain Specific Language) that lets you write out specifications (as cookbooks, roles, etc.) for infrastructure that is fully composable. Chef can be used in a number of ways, allowing it to fit into a variety of existing architectures. Its flexibility, however, means that it cannot as easily build higher-level abstractions on top of the architecture it provides.

What is Ironfan?

Ironfan, the foundation of the Infochimps Platform, is a systems provisioning and deployment tool. Ironfan automates not only machine configuration but entire systems configuration, enabling the entire Big Data stack, including tools for data ingestion, scraping, storage, computation, and monitoring.

Ironfan builds on Chef, but is opinionated about its architecture, which allows broader integration between components. It assumes a source repository, a central Chef Server, and a modern POSIX-compliant operating system for a base image. Currently, it works best with Git, Amazon Web Services, and Ubuntu 11.04, with exploration into other virtualization platforms (Vagrant, etc.) and operating systems (CentOS, FreeBSD, etc.) ongoing, both inside and outside of Infochimps.
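To make the idea of composable specifications more concrete, here is a minimal sketch in Python. It is not Chef's Ruby DSL or any Chef or Ironfan API. It is a toy model that assumes run lists are written as `role[...]` and `recipe[...]` strings and shows how nested roles might flatten into one ordered list of recipes. Every role and recipe name below is hypothetical.

```python
# Toy model of composable roles. This is NOT Chef's Ruby DSL or API;
# all role and recipe names are hypothetical illustrations.
ROLES = {
    "base": ["recipe[ntp]", "recipe[users]"],
    "hadoop_worker": [
        "role[base]",
        "recipe[hadoop::datanode]",
        "recipe[hadoop::tasktracker]",
    ],
    "monitored": ["role[base]", "recipe[monitoring::agent]"],
}


def expand_run_list(run_list, roles):
    """Flatten nested role[...] entries into an ordered recipe list without duplicates."""
    recipes = []
    visiting = set()  # roles currently being expanded, used to detect cycles

    def visit(entry):
        if entry.startswith("role[") and entry.endswith("]"):
            name = entry[len("role["):-1]
            if name in visiting:
                raise ValueError(f"role cycle detected at {name!r}")
            visiting.add(name)
            for child in roles[name]:
                visit(child)
            visiting.remove(name)
        elif entry not in recipes:
            # Keep the first occurrence so shared roles are applied only once.
            recipes.append(entry)

    for entry in run_list:
        visit(entry)
    return recipes


# A node that composes two roles, both of which build on the shared "base" role.
expanded = expand_run_list(["role[hadoop_worker]", "role[monitored]"], ROLES)
for recipe in expanded:
    print(recipe)
```

Both roles pull in `base`, yet its recipes appear only once in the result. That is the sense in which small specifications can be combined freely into larger ones.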

Benefits for the Entire Team

For Systems Administrators, Ironfan removes the guesswork from building systems, because it reduces the cycle time to build a server from days or weeks to minutes. Instead of following long lists of manual processes, a systems administrator makes changes to their Ironfan homebase, and then ushers those changes into the appropriate systems with the Chef knife and client programs. This enables rapid iterative development, which has been a practice of Agile programming shops for years. Until recently, this kind of fast-paced development was unavailable to the average systems administrator.

Ironfan also enables repeatable architecture, another powerful tool. Replacing malfunctioning components with completely new ones, built from scratch and loaded with data from live exports or backups, is now a simple, reliable, and rapid process instead of a last-ditch solution.

Finally, Ironfan allows you to make infrastructure inevitable: you can write definitions that automatically attach new servers to your existing architecture. This replaces manually wiring each server into central services like monitoring, log ingestion, or orchestration, and avoids the attendant risk of human error.

For Data Scientists or Business Intelligence Teams, Ironfan can currently build a Hadoop cluster from scratch in less than an hour with just a small handful of commands, and expand it in minutes with a single command. Other large-scale cluster technologies (HBase, ElasticSearch, Redis, Flume, etc.) are just as simple to build. This dramatically reduces the cost of starting new data analysis jobs, allowing for greater experimentation. Because the underlying architecture is rented by the machine-hour, jobs with predictable costs in machine-hours can be optimized for rapid execution without large increases in cost. Should the processing time prove greater than anticipated, clusters can be scaled up while in use to improve the chances of hitting deadlines.
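The machine-hour argument above can be checked with a little arithmetic. The sketch below is illustrative only: the hourly rate is a made-up placeholder, not a real price, and it assumes the job scales perfectly across machines, which real jobs rarely do.

```python
def machine_hours(nodes, hours):
    """Total rented capacity: number of machines times hours each one runs."""
    return nodes * hours


def rental_cost(nodes, hours, rate_per_machine_hour):
    """Cost of renting `nodes` machines for `hours` at a flat hourly rate."""
    return machine_hours(nodes, hours) * rate_per_machine_hour


# Hypothetical placeholder rate, NOT an actual cloud price.
HYPOTHETICAL_RATE = 0.50

# The same 300 machine-hours of work, run on a small or a large cluster.
# Assumes perfect linear scaling, which is an idealization.
slow_run = rental_cost(nodes=10, hours=30, rate_per_machine_hour=HYPOTHETICAL_RATE)
fast_run = rental_cost(nodes=30, hours=10, rate_per_machine_hour=HYPOTHETICAL_RATE)

print(slow_run, fast_run)  # both print as 150.0
```

Under these assumptions, tripling the cluster size cuts the wall-clock time to a third while the bill stays the same. In practice, coordination overhead and billing granularity would add some cost to the faster run.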

For Systems Architects or the Core Infrastructure Team, Ironfan allows you to build the repeatable architecture recommended by ITIL (Information Technology Infrastructure Library) for reliable IT infrastructure. It becomes simpler to scale or evolve systems rapidly. Ironfan takes the grunt-work out of distributing those changes, allowing architects to spend more of their focus on design details instead of implementation details.

Since everything is stored in source control, both architects and administrators can make changes to the infrastructure, confident that they are not obliterating important history. Also, because the same code can be used to create development, staging, and production environments, the usual barriers to deployment caused by differences in the underlying architectures and deployment mechanisms are significantly reduced.

Because starting new instances with Ironfan is trivial, and paid for by the hour, capacity can be managed as OpEx rather than CapEx. This also means that problems with huge capacity spikes can be considered; turning up a thousand nodes for three days, then turning them off again, is no longer a laughable fantasy. Migrations also become significantly easier, as new infrastructure can be spun up in parallel with the old, without a long-term increase in expense.
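As a rough illustration of using the same code for every environment, the Python sketch below builds development, staging, and production specifications from one definition, varying only the cluster size. It is not Ironfan's actual DSL. The class, role names, and worker counts are hypothetical. The last line works out the capacity spike mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of a cluster; not an Ironfan data structure."""
    environment: str
    workers: int
    roles: tuple


# The roles are shared by every environment; only the size differs.
SHARED_ROLES = ("base", "hadoop_worker")
WORKERS_PER_ENVIRONMENT = {"development": 1, "staging": 3, "production": 30}


def build_spec(environment):
    """Build the spec for one environment from the single shared definition."""
    return ClusterSpec(
        environment=environment,
        workers=WORKERS_PER_ENVIRONMENT[environment],
        roles=SHARED_ROLES,
    )


specs = {name: build_spec(name) for name in WORKERS_PER_ENVIRONMENT}

# A thousand nodes for three days, expressed in machine-hours.
spike_machine_hours = 1000 * 3 * 24
print(spike_machine_hours)  # 72000
```

Because every environment comes from the same function, a change to the shared roles reaches development, staging, and production together. This is one way the differences between environments stay small.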

Case Study

How Infochimps Uses Ironfan to Create TrstRank

What is TrstRank?

TrstRank is an Infochimps-developed dataset and API that provides Twitter influence metrics with the click of a button. TrstRank measures Twitter user reputation, importance, and influence in a far more robust way than counting the number of followers. It is a sophisticated measure of a user's relative importance within the entire Twitter network.

Since the launch of Twitter, people have clamored for ways to access and slice and dice its data. One of the most common ways people use the Twitter data corpus is to measure a person's importance and influence. Klout is an example of one product that specializes in this kind of influencer data. A few years ago, we created our own special version of Klout, one that took advantage of our vast historical record of the relationships to create an accurate number describing how influential a Twitter user is. It's called TrstRank, and it ranks a user on a scale of 1-10, with 10 being the most influential you can get.

Coming up with a number like TrstRank is no small task. Setting aside the issues of getting the data, there are some very real Big Data problems surrounding the product that require special tools for getting it done efficiently. And when you're a bootstrapped startup, like we were at the time, you have to be resourceful if you are going to get by. The biggest issue with pursuing a new data product like TrstRank is the same one any company faces when it decides to venture into new territory: the high risk of wasting time and money.

Wasting Time

One of the first problems you run into as a small team trying your hand at data science is the excess time spent on server and machine configuration, instead of focusing on modeling, algorithms, and manipulating the data. Ramp-up time for even the first phase of a project like TrstRank can be a whole day or more of engineering time.

Wasting Money

From our earliest days, Infochimps has been based on the Amazon Web Services (AWS) cloud, taking advantage of the flexibility and scalability it provides. With AWS, you pay for what you use, so you are always inclined to eliminate waste. In our early days we even created decision trees for whether to shut down a cluster, depending on how many hours it would be up but unused. This can set conflicting goals for the data scientist, who would prefer to leave a cluster up overnight, even if it's unused, so they don't have to deal with setting everything up again the next day!

Enter Ironfan

We created Ironfan to solve our own problems of how to save time and money during our data science operations in the cloud. When we came up with the idea for TrstRank, it was a simple operation to spin up a cluster for early analysis and experimentation. We could validate some of our algorithms and ideas on a simple cluster before moving to something more heavyweight.

Ironfan and TrstRank, Now

Ironfan has continued as a key tool for our monthly TrstRank operation. We continue to scrape Twitter for follower information, and with the updated data every month we crunch the TrstRank numbers again. With Ironfan, we're able to run a multi-step operation on 8 billion tweets on clusters of 30 m1.xlarge EC2 machines, while only running the resources we need when they're needed. TrstRank takes 72 hours to complete, with resources being paid for commensurately. Without Ironfan, we'd be looking at 2-3x the costs in time and money!
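The shutdown decision described above comes down to comparing two costs: what the idle cluster costs to keep running, and what it costs in engineering time to rebuild it. The Python sketch below is not the decision tree Infochimps actually used. Every rate and duration in it is a hypothetical placeholder. The final calculation assumes a single 30-machine cluster runs for the full 72 hours.

```python
def idle_cost(nodes, idle_hours, rate_per_machine_hour):
    """What it costs to leave an unused cluster running."""
    return nodes * idle_hours * rate_per_machine_hour


def should_shut_down(nodes, idle_hours, rate_per_machine_hour,
                     setup_hours, engineer_rate_per_hour):
    """Shut down only if idling costs more than rebuilding the cluster later."""
    rebuild_cost = setup_hours * engineer_rate_per_hour
    return idle_cost(nodes, idle_hours, rate_per_machine_hour) > rebuild_cost


# All rates and durations below are hypothetical placeholders, not real prices.
overnight = dict(nodes=30, idle_hours=12, rate_per_machine_hour=0.50,
                 engineer_rate_per_hour=50.0)

# With a full day of manual setup, leaving the cluster up looks cheaper.
manual = should_shut_down(setup_hours=8, **overnight)

# With setup reduced to about an hour, shutting down wins.
automated = should_shut_down(setup_hours=1, **overnight)

print(manual, automated)  # False True

# Machine-hours for one monthly run, assuming a single 30-machine cluster
# stays up for the full 72 hours.
trstrank_machine_hours = 30 * 72
print(trstrank_machine_hours)  # 2160
```

The same idle cluster leads to opposite decisions depending on how long a rebuild takes. Faster provisioning removes the conflict between the data scientist and the budget.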

About Infochimps

Our mission is to make the world's data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

Contact Us

Infochimps, Inc.
1214 W 6th St., Suite 202
Austin, TX 78703
1-855-DATA-FUN (1-855-328-2386)
www.infochimps.com
info@infochimps.com
Twitter: @infochimps

Get a free Big Data consultation

Let's talk Big Data in the enterprise! Get a free conference with the leading big data experts regarding your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations for free.