How To Analyze Big Data With Lua and LuaJIT

Ad-hoc Big-Data Analysis with Lua and LuaJIT
Alexander Gladysh <ag@logiceditor.com> @agladysh
Lua Workshop 2015, Stockholm

Outline: Introduction, The Problem, A Solution, Assumptions, Examples, The Tools, Notes, Questions

Alexander Gladysh: CTO and co-owner at LogicEditor, in löve with Lua since 2005.

The Problem: You have a dataset to analyze that is too large for "small-data" tools, and you have no resources to set up and maintain (or pay for) Hadoop, Google BigQuery, etc., but you do have some processing power available.

Goal: Pre-process the data so it can be handled by R, Excel, or your favorite analytics tool (or Lua!). If the data is dynamic, learn to pre-process it and build a data-processing pipeline.

An Approach: Use Lua! And (semi-)standard tools available on Linux. Go minimalistic while exploring and avoid frameworks; then move on to an industrial solution that fits your newly understood requirements, or roll your own ecosystem! ;-)

Assumptions

Data Format: Plain text, column-based (CSV-like), optionally with free-form data at the end. Typical example: web-server log files.

Data Format Example: Raw Data

2015/10/15 16:35:30 [info] 14171#0: *901195 [lua] index:14: 95c1c06e626b47dfc705f8ee6695091a 109.74.197.145 *.example.com GET 123456.gif?q=0&step=0&ref= HTTP/1.1 example.com

NB: This is a single, tab-separated line from a time-sorted file.
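Because each record is one tab-separated line, standard tools can already slice it column-wise before any Lua is involved. A minimal sketch; the sample line and field positions below are illustrative, not the exact log format above:

```shell
# Slice key columns out of a tab-separated record with cut.
# The sample line and the chosen fields (1 and 3) are assumptions
# for illustration, not the real log layout.
printf 'alice\t2015/10/15\texample.com\tGET /index.html\n' \
  | cut -f1,3   # keep the user-id and host columns
```

The same `cut -f` call scales unchanged from one line to a multi-gigabyte stream.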

Data Format Example: Intermediate Data

alpha.example.com 5
beta.example.com 7
gamma.example.com 1

NB: These are several tab-separated lines from a key-sorted file.

Hardware: As usual, more is better: cores, cache, memory speed and size, HDD speed, network speed... But even a modest VM (or several) can be helpful. Your fancy gaming laptop is good too ;-)

OS: Linux (Ubuntu) Server. The approach will, of course, work in other setups too.

Filesystem: Ideally, keep a copy of the data on each processing node, using identical layouts. A fast network should work too.

Examples

Bash Script Example

time pv /path/to/uid-time-url-post.gz \
  | pigz -cdp 4 \
  | cut -d$'\t' -f 1,3 \
  | parallel --gnu --progress -P 10 --pipe --block=16m \
    "$(cat <<"EOF"
luajit ~me/url-to-normalized-domain.lua
EOF
)" \
  | LC_ALL=C sort -u -t$'\t' -k2 --parallel 6 -S20% \
  | luajit ~me/reduce-value-counter.lua \
  | LC_ALL=C sort -t$'\t' -nrk2 --parallel 6 -S20% \
  | pigz -cp4 >/path/to/domain-uniqs_count-merged.gz
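The same map / sort / reduce / sort shape can be exercised end to end on tiny inline data with coreutils alone. In this sketch awk stands in for the two luajit scripts, and the sample records are invented for illustration:

```shell
# Miniature version of the pipeline: map -> sort -> reduce -> rank.
# awk replaces the luajit map/reduce scripts; data is inline.
printf 'u1\thttp://a.com/x\nu2\thttp://b.com/y\nu3\thttp://a.com/z\n' \
  | awk -F'\t' '{ h = $2; sub(/^https?:\/\//, "", h); sub(/\/.*$/, "", h);
                  print $1 "\t" h }' \
  | LC_ALL=C sort -t"$(printf '\t')" -k2 \
  | cut -f2 \
  | uniq -c \
  | sort -nr
```

Swapping the awk one-liners back out for `luajit` scripts, adding `pigz` at both ends, and feeding it through `parallel --pipe` recovers the full pipeline above.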

Lua Script Example: url-to-normalized-domain.lua

for l in io.lines() do
  local key, value = l:match("^([^\t]+)\t(.*)")
  if value then
    -- url_to_normalized_domain() is defined elsewhere in the project
    value = url_to_normalized_domain(value)
  end
  if key and value then
    io.write(key, "\t", value, "\n")
  end
end

Lua Script Example: reduce-value-counter.lua 1/3

-- Assumes input sorted by VALUE:
--
--   a foo          foo 3
--   a foo   -->    bar 2
--   b foo          quo 1
--   a bar
--   c bar
--   d quo

Lua Script Example: reduce-value-counter.lua 2/3

local last_key, accum = nil, 0

local flush = function(key)
  if last_key then
    io.write(last_key, "\t", accum, "\n")
  end
  accum = 0
  last_key = key -- may be nil
end

Lua Script Example: reduce-value-counter.lua 3/3

for l in io.lines() do
  -- Note the reverse capture order!
  local value, key = l:match("^(.-)\t(.*)$")
  assert(key and value)
  if key ~= last_key then
    flush(key)
    collectgarbage("step")
  end
  accum = accum + 1
end
flush()
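A quick way to cross-check the reducer during development: on input sorted by the counted column, its output should agree with counting runs via `cut | uniq -c`. A sketch on invented inline data:

```shell
# Sanity check for the streaming reducer: on value-sorted input,
# counting runs of the second column with uniq -c should match it.
# The three sample records are illustrative.
printf 'a\tfoo\nb\tfoo\nc\tbar\n' \
  | cut -f2 \
  | uniq -c
```

On real data, diffing this against the Lua reducer's output (modulo column order and whitespace) catches off-by-one flush bugs cheaply.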

Tying It All Together: Basically, you work with sorted data, mapping and reducing it line by line, in parallel wherever possible, while trying to use as much of the available hardware as practical, without running out of memory.

The Tools

The Tools: parallel; sort, uniq, grep; cut, join, comm; pv; compression utilities; LuaJIT.
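Of these, `join` and `comm` are the least known: both merge or compare key-sorted files without loading them into memory. A minimal `join` sketch; the file contents and the mktemp names are invented for illustration:

```shell
# join merges two key-sorted, tab-separated files on their first
# column, streaming both. Sample data is illustrative.
hits=$(mktemp); geo=$(mktemp)
printf 'a.com\t5\nb.com\t7\n'   > "$hits"
printf 'a.com\tUS\nb.com\tSE\n' > "$geo"
join -t"$(printf '\t')" "$hits" "$geo"
rm -f "$hits" "$geo"
```

Because both inputs stream, this joins files far larger than RAM, provided they are sorted on the join key with the same collation (hence `LC_ALL=C` everywhere).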

LuaJIT? Up to a point: 2.1 helps to speed things up, but FFI bogs down development speed. Go plain Lua first (run it with LuaJIT), then roll your own ecosystem as needed ;-)

GNU parallel: xargs for parallel computation; it can run your jobs in parallel on a single machine or on a "cluster".
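GNU parallel is not always installed; plain `xargs -P` is a portable stand-in for the single-machine case of running one command per input concurrently. The inputs below are invented for illustration:

```shell
# Run one command per input line, up to 3 at a time.
# xargs -P is the portable cousin of GNU parallel for this case;
# output order is nondeterministic.
printf 'log1\nlog2\nlog3\n' \
  | xargs -P 3 -n 1 echo processed
```

Note that `xargs -P` has no equivalent of `parallel --pipe --block`, which is what lets the big pipeline above split a single stream into chunks for the luajit mappers.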

Compression: gzip: the default, bad; lz4: fast, handles large files well; pigz: fast, parallelizable; xz: good compression, slow. ...and many more; be on the lookout for new formats!
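All of these compressors speak the same stdin/stdout protocol, so a quick round trip both verifies a candidate and lets you compare sizes on a sample of your real data. Sketched here with gzip; swap in `pigz`, `lz4`, or `xz` unchanged:

```shell
# Round-trip a sample through a compressor and inspect sizes.
# gzip is used here; pigz/lz4/xz take the same -c / -cd style flags.
printf 'some sample log data\n' | gzip -c | wc -c   # compressed size
printf 'some sample log data\n' | gzip -c | gzip -cd
```

Benchmarking on a few hundred megabytes of your actual logs, not synthetic data, is what reveals the real ratio/speed trade-off.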

GNU sort Tricks

LC_ALL=C \
sort -t$'\t' --parallel 4 -S60% \
  -k3,3nr -k2,2 -k1,1nr

Disable the locale. Specify the delimiter. Note that --parallel 4 with -S60% can consume 0.6 * log2(4) = 120% of memory. When doing a multi-key sort, specify the modifiers after each key number.
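The key syntax above can be tried on inline data, minus the GNU-only `--parallel`/`-S` flags: field 3 numeric descending, ties broken by field 2 ascending. The three sample rows are invented for illustration:

```shell
# Multi-key sort: -k3,3nr (field 3, numeric, reverse), then -k2,2.
# $(printf '\t') passes a literal tab portably (bash also has $'\t').
printf 'x\tb\t1\ny\ta\t9\nz\ta\t9\n' \
  | LC_ALL=C sort -t"$(printf '\t')" -k3,3nr -k2,2
```

Putting the `n`/`r` modifiers after each key (as in `-k3,3nr`) scopes them to that key only; a bare `sort -nr -k3 -k2` would apply them to every key.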

grep: see http://stackoverflow.com/questions/9066609/fastest-possible-grep for tips on making grep fast.
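The two levers that thread discusses most are also the cheapest: force the C locale and use fixed-string matching (`-F`) to skip locale-aware and regex machinery. A sketch on invented sample lines:

```shell
# Fast-grep basics: C locale plus fixed-string matching.
# -c counts matching lines; sample input is illustrative.
printf 'GET /a\nPOST /b\nGET /c\n' \
  | LC_ALL=C grep -cF 'GET'   # prints 2
```

On multi-gigabyte logs these two flags alone can change grep's throughput by an order of magnitude.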

Notes and Remarks

Why Lua? Perl and AWK are the traditional alternatives to Lua, but unless you are very disciplined and experienced, they are much less maintainable.

Start Small! Always run your scripts on small, representative excerpts from your datasets, not only while developing them locally, but on the actual data-processing nodes too. It saves time and helps you learn the bottlenecks. Sometimes a large run still blows up in your face, though: monitor resource utilization at run time.
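Carving out an excerpt is a one-liner: take the first N lines of the decompressed stream before attaching the rest of the pipeline. Here `seq` stands in for a real dataset; on real files the same `head` goes after `pigz -cd`:

```shell
# Grab a small excerpt of a large stream for a trial run.
# seq is a stand-in for e.g. `pigz -cd huge-log.gz`.
seq 1 100000 | head -n 5
```

For samples that should not all come from the start of a time-sorted file, GNU `shuf -n` is the random-sampling counterpart.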

Discipline! Many moving parts, long turn-around times, hard to keep tabs. Keep a journal: write down what you ran and how long it took. Store the actual versions of your scripts in a source control system. Don't forget to sanity-check the results you get!

Questions? Alexander Gladysh, ag@logiceditor.com