Big Data Drupal Commercial Open Source Big Data Tool Chain
How did I prepare?
MapReduce Field Work
About Me Nicholas Roberts 10+ years web Webmaster, Project & Product Manager Australian Sonoma County www.niccolox.org Niccolo on drupal.org
What Is Big Data? Big data sets: Batched or Streamed, Planet scale data, Google search index, Facebook social graph, Intelligence. Technology: Hadoop (Spark), HDFS, HBase, MapReduce, Oozie, Zookeeper.
How Big? 1,000 gigabytes is a terabyte. 1,000 terabytes is a petabyte. 1,000 petabytes is an exabyte. 1,000 exabytes is a zettabyte. 1,000 zettabytes is a yottabyte.
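The ladder of units above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the decimal (factor of 1,000) convention used on the slide; the function names `gigabytes_in` and `humanize` are made up for illustration.

```python
# Decimal storage units, each 1,000 times the previous one (as on the slide).
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def gigabytes_in(unit: str) -> int:
    """Return how many gigabytes make up one of the given unit."""
    return 1000 ** UNITS.index(unit)


def humanize(gigabytes: float) -> str:
    """Express a size given in gigabytes using the largest unit that fits."""
    value = float(gigabytes)
    index = 0
    # Step up one unit for every full factor of 1,000.
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    suffix = "" if value == 1 else "s"
    return f"{value:g} {UNITS[index]}{suffix}"


if __name__ == "__main__":
    print(gigabytes_in("petabyte"))  # gigabytes in one petabyte
    print(humanize(2_500_000))       # 2.5 million gigabytes, humanized
```

For example, `gigabytes_in("yottabyte")` is 1000 to the fifth power, which shows how quickly the scale grows from a single gigabyte.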
Google Oregon
Utah Data Center
Big Data Drupal? Intelligent Automation, Inc. NEIMiner consists of four components. The NEI modeling framework defines the scope of NEI modeling and the strategy of integrating NEI models to form a layered, comprehensive predictability similar to the Framework for Risk Analysis of Multi-Media Environmental Systems (FRAMES). The data integration layer brings together heterogeneous data sources related to NEI via automatic web services and web scraping technologies. The data management and access layer reuses and extends a popular Content Management System (CMS), Drupal, and consists of modules that model and enable interactions over a complex data structure for NEI-related bibliography and characterization data. The model discovery and composition layer provides an analysis capability for NEI data.
Software Developer DESIRED SKILLS: Experience in database modeling and access methodologies, SQL and persistent object technologies. Ability to utilize cloud computing and Hadoop technologies. Expertise with LAMP, PHP and Drupal. Knowledge of graphical user interface design and GIS systems. Experience in mobile app development on Android OS. JOB DUTIES: Support software development for projects in multiple areas of data mining, informatics, human-computer interface, mobile app development and cloud computing. https://home2.eease.adp.com/recruit2/?id=6069912&t=1
BigDataDrupal.com
BigDataDrupal.org
Tool-chain Proxmox KVM vm server Debian 7 Solr 3.6 Jetty Aegir BOA Cloudera Nutch 1.6 MapReduce 4 Nutch jobs Hadoop Hue search facets Aegir BOA Open Outreach ApacheSolr Nutch Multisite Views
Proxmox KVM / OpenVZ virtualization server Debian 7 based distro Bare-metal installer or Debian 7 packages CPU socket commercial licenses AGPL open source
Cloudera 50% engineering donated open-source Doug Cutting Jeff Hammerbacher Cloudera Manager! Hadoop ecosystem distro Hue Impala Etc etc etc
Hadoop The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Google Yahoo Facebook Twitter Amazon NSA
MapReduce parallel processing of large data sets
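To make the idea concrete, here is a minimal single-process sketch of the MapReduce pattern using the classic word-count example. It is not Hadoop's API: the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are invented for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big drupal"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is what makes the pattern suit large data sets.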
Hadoop MapReduce Job UI
Hue: Hadoop UI
Nutch Apache Nutch is a highly extensible and scalable open source web crawler. Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing. Nutch 2.x: Takes inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
Aegir Drupal hosting Drupal Drush based Provision Aegir BOA commercial open source installer Open Hosting Platform As A Service Omega8cc
Open Outreach
Drupal & ApacheSolr Nutch 1.6 > Solr 3.6 > Drupal ApacheSolr Integration Module ApacheSolr Examples ApacheSolr Nutch Multisite FacetAPI FacetAPI Pretty Paths ApacheSolr Views
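Once Nutch has pushed crawled pages into Solr, a faceted search is just an HTTP request against Solr's select handler. The sketch below only builds such a URL using standard Solr query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The base URL, the `host` facet field and the helper name `build_facet_query` are assumptions for illustration, not details taken from this deployment or from the Drupal ApacheSolr module.

```python
from typing import List
from urllib.parse import urlencode


def build_facet_query(base_url: str, text: str, facet_fields: List[str], rows: int = 10) -> str:
    """Build a Solr select URL that searches for text and facets on the given fields."""
    params = [
        ("q", text),        # full-text query
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # ask Solr for a JSON response
        ("facet", "true"),  # switch faceting on
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and facet field.
    print(build_facet_query("http://localhost:8983/solr", "big data", ["host"]))
```

In the tool chain on this slide the Drupal modules issue and render such queries themselves; the sketch only shows the kind of request that sits underneath the search facets.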
Future Search API Cloudera CDK / Kiji Remote Entities API Kettle / Pentaho Nutch 2.x / HBase Mulesoft Drupal 8 RapidMiner HyperDrupal R GuzzlePHP Bonita Soft REST / Thrift API
Thanks & Credits Mitchell Tannenbaum Forest Mar Chris McCafferty Peter Wolanin David Stuart Ryan Szrama Doug Cutting
Contacts Nicholas Roberts www.niccolox.org Niccolo.roberts@gmail.com 1. 510 684 8264 Sonoma County www.BigDataDrupal.com www.BigDataDrupal.org