Kickstart & Puppet @ Booking Kristian Köhntopp, booking.com
What Booking does Facilitates Hotel Room Bookings between Travelers and Hotels. Just that.
Booking Data Hotel Base Data, Brochures, Reviews & Score, Availability by Room, Rate and Date. A large history of stuff.
Booking Tech Frontends w/ Linux, Apache, mod_perl, with different functional classes. Databases: MySQL, also differentiated by function. Lots of infrastructure systems.
Booking Size FE to DB rate of ~ 4-6 to 1. About 160 slaves, about a dozen schemata. About 1000 hosts. Growing fast.
Building a new DC Build a Business Continuity Facility! You are not allowed to touch! Completely automated installation and configuration.
ServerDB MAC addresses pre-announced by vendor. Or gathered from OOB maintenance interface for installed machines. Enter it into ServerDB, Assign function and status.
pxebooting Generate a PXE Boot config and KS file. pxeboot the box first time, Boot order: disk, net, Menu as additional safeguard unless marked in ServerDB.
pxelinux.cfg pxelinux loads pxelinux.cfg/01-$mac. atftpd has been patched: it calls a script for nonexistent files, and the script acts on ServerDB flags.
pxelinux.cfg
[root@bkbuild-01 bin]# tftp_generator --file kstest
Serving pxelinux.cfg file for 00:1E:68:0F:46:F8/kstest
# Generated from data in the serverdb
# See https://wiki/ /ServerDB
PROMPT 1
TIMEOUT 50
DEFAULT co5-x86_64
LABEL local
  LOCALBOOT 0x80
LABEL co5-x86_64
  kernel vmlinuz-co5-x86_64
  append initrd=initrd-co5-x86_64 lang=us pci=bfsort nofb text devfs=nomount ramdisk_size=7168 network ksdevice=eth0 ks=http:// /kick/kstest.dqs.lhr1.booking.com
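A generator along these lines could be sketched in Python. This is only an illustration, not the actual tftp_generator: the function names, the `install` flag standing in for the ServerDB marking, and the example kickstart URL are assumptions. The `01-` prefix followed by the lowercase, dash-separated MAC is the file name pxelinux itself requests.

```python
def pxelinux_filename(mac: str) -> str:
    """Return the per-host file name pxelinux asks for: 01-<mac>, lowercase, dashes."""
    octets = mac.replace("-", ":").lower().split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    return "01-" + "-".join(octets)


def render_pxe_config(profile: str, ks_url: str, install: bool) -> str:
    """Render a pxelinux config for one host.

    `install` is a hypothetical stand-in for the ServerDB flag: only a host
    marked for installation defaults to the kickstart label; every other host
    defaults to booting from local disk and merely sees the menu.
    """
    default = profile if install else "local"
    lines = [
        "PROMPT 1",
        "TIMEOUT 50",
        f"DEFAULT {default}",
        "LABEL local",
        "  LOCALBOOT 0x80",
        f"LABEL {profile}",
        f"  kernel vmlinuz-{profile}",
        f"  append initrd=initrd-{profile} text ksdevice=eth0 ks={ks_url}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The URL below is a placeholder, not the real kickstart server.
    print(pxelinux_filename("00:1E:68:0F:46:F8"))
    print(render_pxe_config("co5-x86_64", "http://ks.example/kick/kstest", install=True))
```

Defaulting to the local-disk label keeps an accidental network boot from reinstalling a machine that is not flagged for installation.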
Kickstart Load .ks file via http. Dynamically generated in Apache from ServerDB.
Kickstart
part /boot --fstype ext3 --size 100 --asprimary
part swap --size 1000
part pv.01 --size=100 --grow
volgroup VolGroup00 pv.01
logvol / --fstype ext3 --name=root --vgname=VolGroup00 --size=100 --grow

%post
/bin/rm -f /etc/yum.repos.d/*
/bin/cat > /etc/yum.repos.d/booking.repo <<EOF
yum -y install puppet ruby-rdoc
/sbin/chkconfig --level 345 puppet on
Overrides If a file exists, the scripts are not called: At pxeboot level, At kickstart level. Alternative: Set state to live or standby in ServerDB: You get the menu.
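The override rule can be sketched as a lookup that prefers a file on disk and only falls back to the generator when none exists. This is a minimal sketch, not the patched TFTP daemon or the Apache handler; `serve` and the `generate` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def serve(root: Path, name: str, generate: Callable[[str], bytes]) -> bytes:
    """Return the static file if one exists, otherwise generated content.

    Note: this sketch does not sanitize `name`; a real service must reject
    path components such as "..".
    """
    candidate = root / name
    if candidate.is_file():
        # Special case: a hand-placed file overrides the generator.
        return candidate.read_bytes()
    # Common case: content is built from the database on request.
    return generate(name)
```

The same pattern applies at both levels named on the slide: a static pxelinux.cfg file or a static kickstart file short-circuits its generator.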
Lessons so far Automate everything. Use a database. Provide an easy way out: Optimize the common case, Forward special cases.
Puppet Migrate to puppet gradually: Run puppetd everywhere: Existing hosts & new hosts. Have it do nothing at first. Roll out node-by-node, service-by-service.
Puppet Right now: 318 nodes in site.pp. 141 databases in site.pp. LDAP planned.
Migration to Puppet Test a new service definition. Roll out to individual nodes via site.pp. If fine, make part of base::common, if applicable to all nodes.
base::common Common services: Cron, Nagios, nsswitch, LDAP, NTP, Puppet, resolver, ssh, SNMP, sudo, sysctl, syslog. Package Management and common packages.
Differentiation Apache (lots of flavors). Service definitions according to function. Databases (partial): MySQL and Merlin deploys, requires storage configuration. Memcaches.
Differentiation
node "mc01lb-01.prod.lhr1.booking.com" { include "s_lb" }
node "sc01static-01.prod.lhr1.booking.com" { include "s_webstatic::static" }
node "mc01avrdb-02.prod.lhr1.booking.com" { include "s_db::avrdb" }
Differentiation Service definitions vary wildly in size: Load balancer: 10 lines. Database: 541 lines. Not even complete yet. About 2 dozen services. About 2 dozen modules.
Benefits Works. Pretty. Cross-platform. Deploy time from power-on: 20 min through Kickstart. Additional 6 to 20 min through puppet.
Possible problems In creating puppet structure, we ran into a number of obstacles. For some of these, solutions exist. For others, workarounds are needed.
Problems: Conceptual Declarative Syntax: Tell Puppet what you want, not how it is done. Hard to do for some services. Task: Generate a my.cnf. No way out? Generator script, then deploy.
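A generator script of the kind mentioned here might look like the following Python sketch. It is not the script used at Booking: the function name and the example values are assumptions; `server-id` and `skip-name-resolve` are ordinary MySQL options chosen only for illustration.

```python
def render_my_cnf(sections: dict[str, dict[str, object]]) -> str:
    """Render nested dicts as my.cnf text, one [section] per top-level key."""
    out = []
    for section, options in sections.items():
        out.append(f"[{section}]")
        for key, value in options.items():
            # A value of None renders a bare flag such as skip-name-resolve.
            out.append(key if value is None else f"{key} = {value}")
        out.append("")  # blank line after each section
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical per-host values; in practice they would come from node data.
    print(render_my_cnf({"mysqld": {"server-id": 17, "skip-name-resolve": None}}))
```

Puppet then only has to declare that the script is deployed and run; the procedural logic stays inside the script.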
Problems: Facter Facter -> Server/Template -> Node. Facts are scalars. Templating at the server. Task: Generate a my.cnf, Manage lvm facts.
Problems: Performance Puppet performs as if it was written in Ruby. mod_ruby is a must. splay does not help a lot.
Problems: Large files As a file transfer service, puppet sucks. Task: Deploy one of several 18M .bin files for Merlin, run a bunch of setup scripts. Lazy solution: Filebucket, which runs OOM. Workaround: pseudo-RPM installed via yum. Fixed in upcoming release.
Problems: Instability Logrotate during puppet run: Puppet crashes. High load during facter run: Crashing facts are cached, server poisoned. All of these are Heisenbugs.
Problems: Ordering Puppet reorders and could parallelize. Dependencies must be declared. That is hard to do and debug. Parse puppet and drop into graphviz: --graph option.
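Why declared dependencies matter can be illustrated with a small Python sketch: given only "requires" edges, a topological sort yields a valid apply order, and the same edges can be dumped as Graphviz DOT for inspection. This is not Puppet's implementation, and the resource names are hypothetical; it uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# resource -> set of resources it requires (hypothetical example)
REQUIRES = {
    "Service[mysqld]": {"File[/etc/my.cnf]", "Package[mysql-server]"},
    "File[/etc/my.cnf]": {"Package[mysql-server]"},
    "Package[mysql-server]": set(),
}


def apply_order(requires):
    """Return the resources so that every requirement precedes its dependent."""
    return list(TopologicalSorter(requires).static_order())


def to_dot(requires):
    """Render the dependency edges as a Graphviz digraph."""
    lines = ["digraph resources {"]
    for resource, deps in sorted(requires.items()):
        for dep in sorted(deps):
            lines.append(f'  "{dep}" -> "{resource}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(apply_order(REQUIRES))
    print(to_dot(REQUIRES))
```

Any edge left undeclared simply does not constrain the order, which is why a missing dependency tends to show up only on some runs.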