ebook
Big Data Explained, Analysed, Solved
What you will learn

This ebook gives an overview of what big data is and its growing importance. It discusses some of the different kinds of big data, as well as some of the things you can do with it. The functional section of this book covers applications, tools, managed services and clouds, used together or separately, that will help you benefit most from big data. You can skip directly to any section and focus on what's most important to you, or read the book straight through.

Canonical is involved in Big Data

Canonical, the company behind Ubuntu, works closely with its partners on all aspects of infrastructure and partner solutions to support storing, managing, and analysing big data.
The Author

Bill Bauman, Strategy & Content, Canonical, began his technology career in processor development and has worked in systems engineering, sales, business development, and marketing roles. He holds patents on memory virtualisation technologies and is published in the field of processor performance. Bill has a passion for emerging technologies and explaining how things work. He loves helping others benefit from modern technology.
Contents

Overview
- What is Big Data: a general overview
- The increasing importance of Big Data
- Different types of Big Data
- Big Data analysis and action
- Do I need a cloud for Big Data?

Functional
- Design, deploy, and package Big Data solutions
- Juju Big Data Charms
- Juju Big Data Frameworks
- Ubuntu for Big Data systems
- OpenStack is a Big Data warehouse
- BootStack for Big Data
- Ubuntu Advantage Storage for Big Data
- Machine Containers for Big Data

Partnership
- Canonical as a strategic partner for Big Data
- Conclusion
- About Canonical
Big Data: a general overview

Big Data refers to extremely large sets of data that aren't easily stored or analysed by traditional methods. Typically the data is too large, too varied in nature, or moves too fast for traditional database systems to handle. This is often referred to as volume, variety, and velocity.

Traditional Data

To understand big data, consider some examples of traditional data. Traditional data may be a database of clients with their associated contact information, or a database of cars with their years, makes, and models. This sort of data usually grows gradually in size, and the types of data stored rarely change. Traditional data is generally well-structured and fits predefined or predictable categories.

- Structured database
- Predefined datasets
- Incremental, predictable growth
Big Data

When we look at big data, the data is typically not so neatly organised. Some examples of big data could be random spots on a map, documents, images, huge lists of named or unnamed individuals who happened to be in the same general area at a given time, or the millions of clicks on a web page in a given week. Big data can be structured or unstructured, but generally the database and analysis tools are specially designed for a given purpose and for the tremendously large scale, size, velocity, and variety that most big data datasets represent.

- Often unstructured
- Purpose-specific toolset
- Rapid growth

The general purpose

The reason these gigantic datasets are compiled and stored is so that we can analyse the data. Analysis includes pattern recognition, trends, associations, and more. The outcome of analysis is action that would otherwise not be possible without big data. In the next section of this ebook, we go into further detail about big data analysis: why we do it, why it's important, and the sort of information we're looking for.
The increasing importance of Big Data

Collect

Organisations of all sizes and functions are increasingly gathering more information about their interactions and transactions. They are also looking to third parties to provide additional data. Regardless of how they gather data, the types and quantities collected are increasing. In a modern, data-driven world, an organisation that isn't taking advantage of big data collection, analytics, and action is likely to become uncompetitive with those that are.

Analyse

The analysis of big data can have big returns. The ability to understand the types of data that are collected, to correlate one type of data with another, to observe trends, to identify outliers, and to perform many other analytic functions is increasingly valuable in organisations of all types. Without thorough analysis using modern big data analytics tools, it is easy to miss or overlook important trends, shifts in perspective, or subtle changes in customer interaction. Through analysis, you can learn patterns and predict actions before they occur, and even begin to direct them via the actions discussed in the Act section below.

Act

The ability to do something with the data that is collected and analysed is the most compelling part of big data. Corporations can offer more compelling products and solutions. Governments can better predict and serve the needs of citizens. Even small businesses can identify short- and long-term trends in their sales and interactions with customers, as well as with other businesses. All of these outcomes are about improved efficiencies and experiences for everyone involved, from the provider to the consumer.
Types of Big Data

Big data can be structured or unstructured. New tools and datasets are blurring the lines that separate the two. Below are some common examples of big data types.

Structured big data

Remember, just because data is structured does not mean it isn't big data. Structured big data could be compiled from millions or billions of data points, daily or even hourly.

User input
This is data created via a prompt or requested action to a user. It could come from a ratings system, a survey, a loyalty program, or any other prompt for the user to input specific data into specific fields that are then stored in a structured manner.

Compilations
Compiled big data merges existing or otherwise disparate databases into a single dataset. For example, the data could include names, locations, demographics, account balances, credit scores, and more, all combined into a single compiled dataset.

Transactions
Transactional big data is everything having to do with a transaction, including whether the transaction was even completed. The data could include what was purchased, how long the purchase took, whether it happened online or in-store, and which other items were typically purchased together.
Unstructured Big Data

Big data is most commonly associated with unstructured data. Unstructured data, like photos and IoT datasets, was largely the genesis of modern big data.

User-generated content
Every day, millions of Internet users post pictures, videos, short messages, audio, and more. Much of this data is completely unassociated with any category or field. Essentially, it is completely unstructured, and it is the function of targeted big data applications to aggregate, cull, present, and analyse these datasets.

Passive data
This is generally data that is generated without specific intent or interaction from users. For example, cell phones are perpetually updating the GPS coordinates of their users' respective locations. Logistics information, bar code scans, and delivery information are all data that are passively updated but can provide valuable insights when analysed.
Big Data analysis and action

Predictive analytics
Probably the most common type of analysis, using past patterns or performance to determine future actions is one of the best known uses of big data. It's important to analyse data from a multitude of perspectives, and to include cross-referenced, sometimes loosely-associated data, to establish the most comprehensive patterns and future predictions. Predictive analytics can also be bolstered by machine learning, whereby, over time, the system builds its own intelligence profile on a given subject, individual, or topic.

Descriptive analytics
The focus here is on metrics: a summary of what has happened. This could be views, clicks, counts, posts, and so on. While descriptive metrics are not necessarily very useful on their own, they are the underlying data points that feed more advanced analysis and actions. Descriptive analytics have been used for many years and are the foundation of the many graphs and charts we see on the Internet and in presentations today.

Prescriptive analytics
Largely an intelligent evolution of predictive analytics, a prescriptive approach uses data analysis to determine recommended actions. Where predictive analytics looks at patterns and makes predictions, prescriptive analytics looks at patterns, associates them with additional datasets, determines where individual data points coincide or where there are recurring common descriptors or activities, and then prescribes a potential course of action or solution. Prescriptive analytics are generally underutilised but offer great potential to reduce time to market for solutions, or assessment times for individuals, in various fields.
Do I need a cloud for Big Data?

Even though big data was born in the cloud, that doesn't mean you need a cloud to take advantage of big data solutions or to act on the data. The most important aspect of working with big data is that you have chosen the right tools and the right applications for your solution. Canonical can help you with both.

Canonical has created an open source solution for system design and service modelling called Juju. Juju simplifies the process of designing your solution, then configuring, associating, and deploying the applications in it. Having a tool like Juju means that selecting the right big data applications for your needs is the most important remaining factor. For more information on Juju, see the section Design, deploy, package Big Data solutions.

Although it isn't necessary, a cloud can be tremendously beneficial to big data processing. The nature of big data is that it is constantly changing, and the purpose of that data, the analysis of that data, and the storage of that data can change just as quickly. A tool like Juju can help you keep up with changing usage by deploying new charmed big data solutions. But Juju can't do it all. For system scalability and the ability to easily access different types of storage for different needs, a cloud is recommended. Juju can talk directly to both public and private cloud solutions, like AWS and Canonical OpenStack respectively, as the sketch below illustrates. For more on building your own private cloud, see the sections OpenStack is a Big Data warehouse and BootStack for Big Data later in this ebook.
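As a rough illustration of how the same tooling spans public and private clouds, the commands below bootstrap Juju controllers on AWS and on a private OpenStack cloud. This is a minimal sketch, assuming a Juju 2.x client with clouds and credentials already registered; the cloud and controller names are illustrative.

```bash
# Bootstrap a Juju controller on a public cloud (credentials are assumed
# to have been added beforehand with `juju add-credential aws`)
juju bootstrap aws aws-bigdata

# Bootstrap another controller on a private OpenStack cloud that was
# previously registered with `juju add-cloud` (cloud name is illustrative)
juju bootstrap my-openstack private-bigdata

# The same charms and bundles can then be deployed to either controller
juju switch aws-bigdata
juju status
```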
Choosing the right applications

There are many ways to go about application selection. Some people already know which big data processing solutions they want to use. Others are looking for advice, or looking to explore potential new solutions. In the Juju Big Data Charms section of this book, we outline many of the big data software solutions that are available and give a brief description of their purpose. This is a great starting point to see what's out there, and Juju makes it easy to try them all. Additionally, in the BootStack for Big Data section of this book, we go into detail on how a BootStack cloud helps you start processing big data quickly and efficiently.

Juju is the game-changing service modelling tool that lets you build entire cloud environments with only a few commands. BootStack is your OpenStack private cloud, running on your hardware, in your choice of datacentre, with Canonical's experts responsible for design, deployment and availability.
Design, deploy, package Big Data solutions

Whether in a cloud or on a dedicated system, managing all the applications in a big data solution is best handled by a tool that does more than static configuration management or deployment orchestration. Juju is a service modelling product from Canonical that gives you a blank canvas on which you can visually lay out all of your big data apps. Communications and data paths are defined as relationships between the applications by connecting the apps on your canvas. The visual solution design and all the application relationships can be deployed immediately, and exported and saved as a bundle for future use.

Juju, Charms, and Bundles

The use of Charms is what gives Juju its incredible capability to manage applications in complex infrastructures. Charms are intelligent scripts wrapped around big data applications that allow them to be dynamically configured and deployed without manual configuration. The abstraction of application relationship management by Juju's Charms is what allows big data solutions to be rapidly deployed and seamlessly scaled. Without the application abstraction that Juju provides, big data system services require manual intervention, or iteration of inflexible, static configuration scripts, any time the solution design needs to be updated or changed.

Evolving the solution

When it comes to big data processing, the solution is rarely static. Big data deployments evolve over time, and that often involves adding or removing component services. The same tool that you used to design and deploy the solution can be used to dynamically add and remove components within it, as the sketch below shows. Juju's service modelling approach lets you evolve your solution and keep pace with the rapidly changing big data market.
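To make the charm-and-relation workflow concrete, here is a minimal command-line sketch. It assumes a bootstrapped Juju 2.x controller; the charm names (apache-kafka, apache-zookeeper) are illustrative and should be checked against the Juju Charm Store.

```bash
# Deploy two big data applications as charms
juju deploy apache-zookeeper
juju deploy apache-kafka

# Define the data path between them as a relation; the charms configure
# both sides automatically, with no hand-edited config files
juju add-relation apache-kafka apache-zookeeper

# Evolve the solution later without rewriting any configuration
juju add-unit apache-kafka -n 2     # scale the messaging tier out
juju status                         # watch the model converge
```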
Juju Big Data Charms

Ingest & Messaging
- Message processing: Flume, Kafka
- Message queues: RabbitMQ, ZeroMQ

Structured Data
- MySQL, PostgreSQL, Percona Cluster, MariaDB

Scale-Out Storage
- Ceph, Swift

NoSQL Stack
- ElasticSearch, LogStash, Kibana

Document Databases
- MongoDB, CouchDB, Couchbase

Column & Key-Value
- Cassandra / DSE, quasardb, memcached, Redis

Analytics / Search / Visualisation
- SpagoBI, Saiku, Storm, Spark, Datafari (ManifoldCF, Solr), Zeppelin, IPython Notebook

As discussed on the Design, deploy, package Big Data solutions page, these are a sample of the Charms available for big data. With Juju, you can readily deploy any combination of these Charms and define their configurations and data paths, all from a graphical interface, the CLI, or an API. A brief command-line sketch follows.
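As one example of combining charms from the list above, the sketch below wires ElasticSearch and Kibana together from the CLI. Charm names are illustrative; confirm them in the Juju Charm Store before deploying.

```bash
# Deploy a search backend and a visualisation front end
juju deploy elasticsearch
juju deploy kibana

# Relate them so Kibana discovers the ElasticSearch cluster on its own
juju add-relation kibana elasticsearch

# Make the dashboard reachable from outside the model
juju expose kibana
```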
Juju Big Data Frameworks

Big data frameworks are available for deployment in Juju. You can deploy an entire Hadoop cluster with a Juju Charm bundle, or Spark, Docker, or Kubernetes, for example. The Charms listed on the Juju Big Data Charms page can all be associated with the frameworks listed here, as appropriate. All of these frameworks benefit from Juju's ability to automatically configure application data paths and relationships. A minimal bundle-based sketch follows this list.

Hadoop
- Hadoop flavours: Apache Hadoop, Cloudera Hadoop
- YARN, Hive, Mahout, HBase, Pig, ZooKeeper, Flume, Kafka, Tez, Storm, Hue

Spark
- Spark, Spark Streaming, Spark SQL, SparkML, GraphX

Container Ecosystem & Orchestration
- Docker, LXD / LXC, Kubernetes, Mesos
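A framework such as Hadoop is typically deployed as a bundle: a YAML description of several charms plus the relations between them, deployed in a single step. The sketch below is illustrative only; real bundle names, charm names, and relation endpoints should be taken from the Juju Charm Store.

```bash
# Deploy an entire framework from a pre-packaged bundle in one command
# (the bundle name is illustrative)
juju deploy hadoop-processing

# Or describe your own combination as a local bundle and deploy it
cat > spark-stack.yaml <<'EOF'
# Minimal, illustrative bundle: application and charm names are examples
applications:
  spark:
    charm: spark
    num_units: 1
  zeppelin:
    charm: zeppelin
    num_units: 1
relations:
  - [zeppelin, spark]
EOF
juju deploy ./spark-stack.yaml

# Scale a framework component afterwards
juju add-unit spark -n 2
```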
Ubuntu for Big Data systems

Ubuntu Server is the most popular cloud operating system in use. There are many reasons why Ubuntu is so popular, but one of the primary reasons is that Canonical began focusing on OS scalability many years ago. When you're working with big data, you need a cloud-ready platform, like Ubuntu, that is designed for scalability and reliability. Ubuntu Server can be used as a traditional operating system, and there are also optimised variants for low latency and other task-specific solutions, like big data processing.

Where Ubuntu runs:
- On-premise, in your own cloud
- In an external, private cloud
- On public clouds, like AWS, Azure, Rackspace, Google Cloud Platform, IBM, and many others (see the Ubuntu Certified Public Cloud page for more)

Ubuntu allows you to process your big data anywhere. Keep sensitive information in-house, leverage the public cloud for unpredictable workloads, and use trusted private cloud partners for both.
How Ubuntu runs:

The flexibility of Ubuntu to run anywhere, on almost any architecture, makes it the ideal platform choice for executing big data workloads.

- Bare metal server on x86, ARM, POWER, or z mainframes
- Virtual machine on KVM, VMware, Hyper-V, and other hypervisors
- Public cloud guest instance
- Private cloud guest instance
- Container on bare metal
- Container as a virtual machine
- Container as a cloud instance
OpenStack is a Big Data warehouse

The section Do I need a cloud for Big Data? in this book addresses some of the benefits of clouds for big data. Specifically, an OpenStack cloud is the most popular private cloud solution for big data. OpenStack is a community-based private cloud solution. It is not a single product, but a collection of individual projects designed to interact seamlessly to create a functional cloud. Canonical OpenStack is a production-ready, supported OpenStack distribution, and more.

The best way to build an OpenStack cloud is using Autopilot. Autopilot is a graphical installation tool that allows you to select the components of OpenStack you would like to install and deploys them for you. It can even deploy them with high availability. Autopilot is designed to work with an extended tool set beyond just OpenStack. MAAS (Metal as a Service) automates the configuration of the physical nodes in your OpenStack environment. Juju, discussed further in the Design, deploy, package Big Data solutions section of this ebook, allows you to automatically deploy applications and their respective relationships within your OpenStack cloud (see the sketch below). Landscape manages the Autopilot experience, as well as the cloud itself and the guest instances within it. The comprehensive tool set that comes with a Canonical OpenStack cloud makes deploying big data solutions easier, faster, and more robust - from the bare metal, to the platform operating system, to the applications themselves.

The base platform of Canonical OpenStack is Ubuntu. Ubuntu is not only the most popular cloud operating system, it is also the most popular OpenStack infrastructure operating system. Ubuntu runs on the OpenStack physical nodes, providing critical services like compute, networking, and storage. It is also the platform for your guest instances, whether they are LXD machine containers or virtual machines, where you run your big data applications. Combining OpenStack with Canonical's feature-rich tools and Ubuntu creates a scalable, reliable, automated platform for deploying and managing big data solutions for any type of analytics, monitoring, and more. Canonical even guarantees the upgradeability of your OpenStack big data cloud.
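To illustrate how Juju plugs into a Canonical OpenStack cloud, the sketch below registers a private cloud with a Juju 2.x client and bootstraps a controller on it. The cloud name, region, Keystone endpoint, and charm name are illustrative placeholders for your own deployment.

```bash
# Describe the private OpenStack cloud (endpoint is a placeholder)
cat > mystack.yaml <<'EOF'
clouds:
  mystack:
    type: openstack
    auth-types: [userpass]
    regions:
      RegionOne:
        endpoint: https://keystone.example.com:5000/v3
EOF

# Register the cloud and its credentials with Juju, then bootstrap
juju add-cloud mystack mystack.yaml
juju add-credential mystack
juju bootstrap mystack openstack-bigdata

# Big data charms and bundles now deploy straight onto the private cloud
juju deploy apache-hadoop    # charm name is illustrative
```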
Big Data cloud, quick and easy

BootStack is a unique, managed Canonical OpenStack offering. It is unique in that you may choose to run the solution in your own datacentre, on your own hardware, or in a third-party hosted facility, like IBM SoftLayer, an Ubuntu Certified Public Cloud partner. Canonical's engineers have years of OpenStack experience. With BootStack, you can leverage their know-how and best practices and have a Canonical OpenStack cloud ready for big data processing in days. With BootStack, you focus on the data, and Canonical takes care of the infrastructure. Additionally, when you want, Canonical can transfer total control of your OpenStack environment to you.

All of the tools that make Canonical OpenStack the platform of choice for big data are included in BootStack. Even better, they can be preconfigured for you and ready for use. As soon as your BootStack cloud is ready, you can start using all the big data solutions in the Juju Charm Store. You'll find the core big data solutions you expect and can even start discovering new big data solutions from all our Charm partners.

BootStack is billed on a pay-for-use model, similar to that of Ubuntu Advantage Storage. These unique and innovative pricing models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds. Whether you just want to try it out, don't have the in-house skills, or want to get up and running quickly, BootStack can provide the answer to a big data cloud. To learn more about BootStack, and to use the BootStack calculator to estimate potential savings, visit the BootStack managed cloud page.
Ubuntu Advantage Storage

Ubuntu Advantage Storage is a unique and ideal solution for big data storage and real-time processing. It is based on Software Defined Storage (SDS) solutions, allowing for flexibility and modern data management approaches.

Choose the right technology

Ceph, NexentaEdge, Swift and SwiftStack are all supported by Ubuntu Advantage Storage. That means you choose the right technology for your solution, and it is all directly supported by Canonical. The hardware you choose to run the solution on is just as important, and Canonical's partners and engineers can help you with that as well.

Pay for what you use

Another unique feature of Ubuntu Advantage Storage is its pay-for-use, metered model. As opposed to paying for all the storage in your datacentre, you pay only for the storage that is actively in use. Additionally, you don't pay for replicas or online backups. The cost savings compared to other SDS-based and managed storage solutions can be 2x to 3x, or even more. The pay-for-use model of Ubuntu Advantage Storage is similar to that of our managed OpenStack solution, BootStack. These unique and innovative pricing models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds.

In short: you pay for the capacity your content actually uses, not for the total capacity deployed. Grow your capacity without growing your bill, and increase your redundancy while paying the same.
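To make the metered model concrete, here is a small, purely illustrative calculation with made-up figures; it is not Canonical's pricing formula, only a sketch of the "pay only for used capacity, not replicas" idea described above.

```bash
#!/usr/bin/env bash
# Illustrative figures only - not real pricing
total_capacity_tb=500    # raw capacity deployed in the datacentre
used_tb=120              # data your applications actively store
replica_factor=3         # replica copies are not billed

raw_consumed_tb=$((used_tb * replica_factor))
echo "Raw capacity consumed (with replicas): ${raw_consumed_tb} TB"
echo "Capacity billed under pay-for-use:     ${used_tb} TB"
echo "Capacity deployed in total:            ${total_capacity_tb} TB"
```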
Machine Containers for Big Data

Machine containers are a relatively new technology in the virtualisation ecosystem. Delivered by Ubuntu as a technology called LXD, they provide the manageability of traditional virtual machines without the system overhead. Many big data solutions perform best when run at bare metal speed. That can limit the use of virtualisation, though, and restrict system placement. By using LXD, multiple services can share a single system and all have direct hardware access.

LXD isn't just about performance. There are big data workloads that run in public clouds as guest instances, and almost all of those instances are virtual machines. One of the benefits of LXD machine containers is that they provide process isolation and application mobility (live migration) to running processes. That means increased manageability for public cloud instances, as well as for bare metal and private cloud solutions.
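A quick LXD sketch of what machine containers look like in practice; the image alias and container names are illustrative.

```bash
# Launch an Ubuntu machine container - it boots in seconds and behaves
# like a full Ubuntu machine, without hypervisor overhead
lxc launch ubuntu:16.04 bigdata-node1

# Inspect and use it like any other machine
lxc list
lxc exec bigdata-node1 -- uname -a

# Several such containers can share one physical host while keeping
# direct access to the host's hardware
lxc launch ubuntu:16.04 bigdata-node2
```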
Canonical as a strategic partner for Big Data

Working with Canonical as your valued partner will maximise your success with big data. Your strategic big data partner should understand, and have experience with, designing, building, deploying, and managing scalable infrastructures and big data applications. Ideally that partner brings with it an entire ecosystem of additional big data partners. Canonical works closely with a multitude of big data software and platform providers to ensure choice in solutions while maintaining quality and integrity in the overall stack.

Some of the attributes to keep in mind, and that Canonical delivers, are:
- Scalability
- Application catalogue
- Prebuilt, integrated bundles
- Time to solution
- 24/7 support
- Existing expertise
- Managed offerings
- ...and more
Conclusion

There are many kinds of big data, and there are many big data applications, services, and solutions. Canonical has domain expertise, understands big data, has strong industry partnerships, and can provide a scalable, supported solution. Your data is important, and you need to know how to store, process, and act on it. The overview, explanations, and solutions outlined in this book will get you started, or accelerate your journey to maximising the benefits of the data you have and the new data you will start collecting.

Your best next step is to contact Canonical today. If you're excited to hear more and want to talk to us directly, you can reach us on our Contact Us page. To learn more about a managed solution for big data, download the paper BootStack Your Big Data Cloud. If you want to start trying things out immediately, we highly encourage you to visit Juju solutions for big data.
About Canonical

At Canonical, we are passionate about the potential of open source software to transform business. For over a decade, we have supported the development of Ubuntu and promoted its adoption in the enterprise. By providing custom engineering, support contracts and training, we help clients in the telecoms and IT services industries to cut costs, improve efficiency and tighten security with Ubuntu and OpenStack.

We work with hardware manufacturers like HP, Dell and Intel to ensure the software we create can be delivered on the world's most popular devices. And we contribute thousands of man-hours every year to projects like OpenStack, to ensure that the world's best open source software continues to fulfil its potential.