Is Hadoop enterprise-ready? Building a Hadoop cluster
Krzysztof Adamski
Agenda
- About ISP
- Team
- Architecture
- Automated Hadoop deployment
- Monitoring
- Security
- Q&A
About ING Services Polska
ISP Service Catalogue
ISP promotes ambitious goals
490 489.3 85% 4 100% 86% Headcount FTEs WPC (actuals 2014) Process maturity assessed by Ernst & Young Average systems availability (actuals 2014) General IT controls of KPIs on target
ISP has been growing as a solid business partner
- 35+ business partners
- 191 SLAs
- 18 countries
- 8.1 customer satisfaction (1-10 scale, Q4 2014)
- Services: security monitoring, remote management, system hosting, security services
The team
The A team
- Don't hire, train them!
- Break out of the silo mentality
- DevOps
- Agile
- Let them choose their own tools
- Automation
http://www.pragmatictestlabs.com/
Architecture
Hadoop deployment options
Cloud vs on-premise
- Legal and regulatory issues (e.g. data locality, limited responsibility)
- Network speed (we are talking BIG data)
- Time to market
- Initial costs
http://www.softwarefit.com/cloud-erp-vs-on-premise-erp/
Basic network principles
- Machines should be on a network isolated from the rest of the data center
- Machines should have static IPs
- Reverse DNS should be set up
- Top-of-the-rack switches: Hadoop servers are quite chatty
- Multi-homed networks are tricky
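The reverse DNS requirement can be checked before installation. The sketch below is one possible check, not part of the original deck: it resolves a host name to an IP address and back, and reports whether the round trip returns the same FQDN. The function name and the injectable resolver parameters are assumptions made for illustration and testing.

```python
import socket
from typing import Callable


def reverse_dns_matches(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> bool:
    """Return True when fqdn -> IP -> name round-trips to the same FQDN."""
    try:
        ip = forward(fqdn)
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        name, aliases, _addresses = reverse(ip)
    except OSError:
        # Covers socket.gaierror and socket.herror (both subclass OSError)
        return False
    wanted = fqdn.lower().rstrip(".")
    found = {name.lower().rstrip(".")}
    found.update(alias.lower().rstrip(".") for alias in aliases)
    return wanted in found
```

Running `reverse_dns_matches(socket.getfqdn())` on each node gives a quick pass/fail signal; a `False` result usually points at a missing or mismatched PTR record.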
VLAN configuration example
VLAN          Fabric  NIC port  Function                       Failover
vlan160_mgmt  A       eth0      Management, user connectivity  Fabric failover to B
vlan12_hdfs   B       eth1      Hadoop                         Fabric failover to A
vlan11_data   A       eth2      SAN/NAS access, ETL            Fabric failover to B
Cisco reference architecture
ToR vs Cisco ref. architecture
Linux general recommendations
- Use FQDNs (required by Ambari and Kerberos)
- Disable iptables, since we are within an isolated network
- Disable SELinux; enabling it can be very challenging
- Set swappiness to 1
- Set ulimits to 64k
- Disable Transparent Huge Pages
- Disable atime
- Enable NTP
- JBOD for Hadoop drives
- RAID1 for system drives (if dedicated)
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
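A few of these settings are easy to verify automatically. The following is a minimal sketch, assuming the caller reads the kernel files itself and passes their contents in as strings; the function names are hypothetical, and "64k" is interpreted here as 65536 open files, which is an assumption rather than something the slide states.

```python
import re
from typing import List


def active_thp_mode(text: str) -> str:
    """Return the active Transparent Huge Pages mode.

    The kernel file lists all modes and brackets the active one,
    e.g. "always madvise [never]".
    """
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"unexpected THP setting: {text!r}")
    return match.group(1)


def check_node(swappiness_text: str, thp_text: str, nofile_limit: int) -> List[str]:
    """Return a list of deviations from the recommendations (empty if none)."""
    problems = []
    if int(swappiness_text.strip()) != 1:
        problems.append("vm.swappiness should be 1")
    if active_thp_mode(thp_text) != "never":
        problems.append("Transparent Huge Pages should be disabled")
    # Assumption: "64k" means 65536 open file descriptors
    if nofile_limit < 65536:
        problems.append("open-files ulimit should be at least 65536")
    return problems
```

On a live node the inputs would typically come from `/proc/sys/vm/swappiness`, `/sys/kernel/mm/transparent_hugepage/enabled` (RHEL 6 uses `/sys/kernel/mm/redhat_transparent_hugepage/enabled`), and `resource.getrlimit(resource.RLIMIT_NOFILE)`.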
What else do we need?
- Code repository, e.g. Stash, GitLab
- Open source package repositories for Python (pip), Perl (CPAN), R (CRAN); Maven repository manager
- Integration tools, e.g. Jenkins
- Stepping stone (edge) server
- Other RDBMS to store aggregates, e.g. MySQL, PostgreSQL
- Data scientists' server: RStudio, IPython, etc.
Did you know?
Hadoop DR strategy
- No inherent cross-data-center replication
- DistCp can be used for large inter/intra-cluster copying
- Data can be ingested into two separate Hadoop clusters
- WANdisco Non-Stop Hadoop
https://www.wandisco.com/system/files/documentation/wd-datasheet-nonstop-hadoop-hortonworks-web.pdf
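As an illustration of the DistCp option, the sketch below assembles a `hadoop distcp` command line for copying one path between two clusters. The helper name, the NameNode addresses, and the path are hypothetical; `-update` (copy only missing or changed files) and `-m` (maximum number of map tasks) are standard DistCp options.

```python
from typing import List


def distcp_command(src_namenode: str, dst_namenode: str, path: str,
                   max_maps: int = 20) -> List[str]:
    """Build the argument list for an incremental inter-cluster copy."""
    return [
        "hadoop", "distcp",
        "-update",             # skip files that already match at the target
        "-m", str(max_maps),   # cap the number of simultaneous map tasks
        f"hdfs://{src_namenode}{path}",
        f"hdfs://{dst_namenode}{path}",
    ]


# Hypothetical NameNode addresses for a primary and a DR cluster
cmd = distcp_command("nn-primary.example.com:8020",
                     "nn-dr.example.com:8020",
                     "/data/landing")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a scheduler on an edge node; scheduling and failure handling are left out of this sketch.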
Automated deployment
- RHEL Kickstart installation
- BladeLogic jobs to provision software components, e.g. monitoring agents, security monitoring components
- BladeLogic jobs to harden RHEL security according to best practices
- Red Hat Satellite as package distribution and versioning center
UCS Manager - organisation
- Let the Hadoop team manage the servers themselves: create an organization
- Create a server profile template
- Create profiles from the template
UCS Manager fabric interconnect
Ambari
Ambari HA wizard
Ambari blueprints
Ambari blueprint example
{
  "configurations" : [
    {
      "configuration-type" : {
        "property-name" : "property-value",
        "property-name2" : "property-value"
      }
    },
    {
      "configuration-type2" : {
        "property-name" : "property-value"
      }
    }
    ...
  ],
  "host_groups" : [
    {
      "name" : "host-group-name",
      "components" : [
        ...
https://cwiki.apache.org/confluence/display/ambari/blueprints
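To make the blueprint layout concrete, here is a small sketch that generates a blueprint document from a mapping of host groups to components. The helper name, the blueprint name, and the chosen host groups are assumptions for illustration; the top-level keys (`configurations`, `host_groups`, `Blueprints`) follow the structure described on the Ambari blueprints wiki page.

```python
import json
from typing import Dict, List, Optional


def make_blueprint(name: str, stack_name: str, stack_version: str,
                   host_groups: Dict[str, List[str]],
                   configurations: Optional[list] = None) -> dict:
    """Assemble a blueprint dict from host-group -> component-name lists."""
    return {
        "configurations": configurations or [],
        "host_groups": [
            {"name": group, "components": [{"name": c} for c in components]}
            for group, components in host_groups.items()
        ],
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
    }


# Hypothetical two-group layout: one master host group, one worker host group
blueprint = make_blueprint(
    "small-cluster", "HDP", "2.2",
    {
        "master": ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"],
        "worker": ["DATANODE", "NODEMANAGER"],
    },
)
print(json.dumps(blueprint, indent=2))
```

Keeping the blueprint in code (or in the code repository as JSON) makes cluster layouts reviewable and repeatable; the generated document would be registered with Ambari through its blueprints REST resource.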
Ambari REST API
curl -u admin:$password -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Start HDFS via REST"}, "Body": {"ServiceInfo": {"state": "STARTED"}}}' http://ambari_server_host:8080/api/v1/clusters/CLUSTER_NAME/services/HDFS
curl -u admin:$password -H 'X-Requested-By: ambari' -X GET "http://ambari_server_host:8080/api/v1/clusters/ing_hdp/components/?ServiceComponentInfo/category.in(SLAVE,MASTER)&host_components/HostRoles/host_name=clusternode&fields=host_components/HostRoles/component_name,host_components/HostRoles/state"
https://cwiki.apache.org/confluence/display/ambari/api+usage+scenarios%2c+troubleshooting%2c+and+other+faqs
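The same service start call can be prepared from a script. The sketch below builds, but does not send, the equivalent PUT request using only the Python standard library; the function name, host, cluster name, and credentials are placeholders, and the payload mirrors the curl example.

```python
import base64
import json
from urllib.request import Request


def service_state_request(base_url: str, cluster: str, service: str,
                          state: str, user: str, password: str) -> Request:
    """Prepare a PUT request that asks Ambari to move a service to `state`."""
    body = {
        "RequestInfo": {"context": f"Set {service} to {state} via REST"},
        "Body": {"ServiceInfo": {"state": state}},
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{base_url}/api/v1/clusters/{cluster}/services/{service}",
        data=json.dumps(body).encode(),
        headers={
            "X-Requested-By": "ambari",          # required by Ambari for modifying calls
            "Authorization": f"Basic {token}",   # same effect as curl -u user:password
        },
        method="PUT",
    )


# Placeholder host, cluster, and credentials
req = service_state_request("http://ambari_server_host:8080", "CLUSTER_NAME",
                            "HDFS", "STARTED", "admin", "secret")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; using state `INSTALLED` instead of `STARTED` is how a service is stopped through this resource.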
Leverage docker http://blog.sequenceiq.com/blog/2014/07/25/cloudbreak-technology/
Did you know? Upgrading the Hadoop stack can still be a painful process (an 80-page manual).
Ref. http://docs.hortonworks.com/hdpdocuments/ambari-1.7.0.0/Ambari_Upgrade_v170/Ambari_Upgrade_v170.pdf
Monitoring
Hadoop Availability Monitoring (service health)
Hadoop metrics monitoring http://hakunamapdata.com/ganglia-configuration-for-a-small-hadoop-cluster-and-some-troubleshooting/
Did you know? Check your region and language settings ;)
Security
Hadoop security
- Hadoop is not a single product; choose your components wisely
- Until recently there was no single point for user management
- Maintaining ACLs in HDFS is a painful process
- No out-of-the-box Active Directory integration
http://blogs.gartner.com/merv-adrian/2014/01/21/security-for-hadoop-dont-look-now/
Hadoop ring of defense
Apache Knox Gateway
Is there anything we can do?
1. Do not store sensitive data within Hadoop
2. Place the Hadoop environment in a separate network zone (dedicated VLAN(s), firewall-filtered traffic)
3. Kerberize the cluster environment
   a) Watch for unkerberized components
   b) Keep your keytabs safe
4. LDAP for central user management
5. Manage your ACLs: start simple with POSIX groups
6. Auditing
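One small, automatable part of keeping keytabs safe is verifying their file permissions. The check below is a sketch under the assumption that a keytab should be a regular file with no access for group or others; the function name is hypothetical, and real policies may allow a dedicated service group.

```python
import os
import stat


def keytab_is_private(path: str) -> bool:
    """Return True if the file is regular and inaccessible to group and others."""
    mode = os.stat(path).st_mode
    no_group_or_other = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return stat.S_ISREG(mode) and no_group_or_other
```

A periodic job could walk the keytab directory and alert on any file for which this returns `False`.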
IPA
At the most basic level, Red Hat Identity Management is a domain controller for Linux and Unix machines.
IPA server client communication
IPA
Did you know? IPA 3 for RHEL 6 has issues when installed with the external CA option.
Central user and policy management
Ranger
Where to continue from here?
- Hadoop distribution best practices
- Reference architecture papers
http://docs.hortonworks.com/hdpdocuments/hdp2/hdp-2.2.0/Cluster_Plan_Gd_v22/Cluster_Plan_Gd_v22.pdf
http://hortonworks.com/get-started/
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
http://www.slideshare.net/vinnies12/hadoop-security-today-tomorrow-apache-knox
http://www.slideshare.net/hadoop_summit/radia-srinivasjune261120amroom210c
http://www.slideshare.net/kevinminder/knoxhadoopsummit20140505v6pub
http://blog.sequenceiq.com/blog/2014/12/04/multinode-ambari-1-7-0/
Interesting books and docs
Q&A
krzysztof.adamski@ingservicespolska.pl
http://pl.linkedin.com/in/adamskikrzysztof