High-Volume Writes with PostgreSQL
Major parameters to set
- shared_buffers: 512MB to 8GB
- checkpoint_segments: 16 to 256
- effective_cache_size: typically ¾ of RAM
- wal_buffers: typically 16MB; auto-tuned in 9.1
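As a rough sketch only (illustrative values assuming a dedicated server with around 32GB of RAM, not figures from this talk), a postgresql.conf starting point might look like:

    shared_buffers = 2GB            # somewhere in the 512MB-8GB range
    checkpoint_segments = 64        # 16-256, scaled to write volume
    effective_cache_size = 24GB     # roughly 3/4 of RAM
    wal_buffers = 16MB              # auto-tuned from 9.1 onward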
Checkpoints
- Dirty data in the buffer cache must be flushed
- WAL segments are 16MB each
- Requested checkpoint: after checkpoint_segments worth of writes
- Timed checkpoint: after checkpoint_timeout (5 minute default)
Checkpoint spikes
- 8.3 added spread checkpoints
- Aims to finish the writes at 50% of the way to the next checkpoint
- fsync still flushes everything to disk at the end of the checkpoint
- Optimal behavior: the OS already wrote the data out before the fsync call
- Spreading the sync phase out didn't work usefully
- Spikes still happen
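For reference, the spread behavior is controlled by checkpoint_completion_target, and checkpoint logging is the easiest way to see whether the sync phase is spiking; a minimal postgresql.conf sketch:

    checkpoint_completion_target = 0.5   # default: finish the write phase at 50% of the checkpoint interval
    log_checkpoints = on                 # emits the "checkpoint complete" lines shown on the next slides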
A bad checkpoint
LOG: checkpoint complete: wrote 127961 buffers (12.2%); 0 transaction log file(s) added, 1818 removed, 0 recycled; write=80.190 s, sync=359.823 s, total=520.913 s
An even worse checkpoint
LOG: checkpoint complete: wrote 141563 buffers (13.5%); 0 transaction log file(s) added, 1109 removed, 257 recycled; write=944.601 s, sync=10635.130 s, total=11613.953 s
Types of writes
- Checkpoint write: most efficient
- Background writer write: still good
- Backend write, fsync absorbed by the background writer: fine; the write will be cached by the OS until later
- Backend write, BGW fsync queue filled: the backend does the fsync itself
  - Very bad, multi-hour checkpoints possible
  - Improved in 9.1
bgwriter monitoring
$ psql -x -c "select * from pg_stat_bgwriter"
checkpoints_timed     | 0
checkpoints_req       | 4
buffers_checkpoint    | 6
buffers_clean         | 0
maxwritten_clean      | 0
buffers_backend       | 654685
buffers_backend_fsync | 84
buffers_alloc         | 1225
Time analysis
$ psql -c "select now(),* from pg_stat_bgwriter"
- Sample two points in time
- Buffers are 8K each (normally)
- Compute the time delta and the value delta
- Buffers allocated: read MB/s
- Sum of buffers written: write MB/s
- Compute or graph; Munin has an example
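A minimal SQL sketch of that calculation (Munin does the same thing by graphing successive samples; the temp table name and the wait are made up). The two steps must run as separate statements so the timestamps differ:

    -- First sample
    CREATE TEMP TABLE bgw_sample AS SELECT now() AS ts, * FROM pg_stat_bgwriter;
    -- ...wait a while (for example SELECT pg_sleep(60);), then compute the rates,
    -- assuming the usual 8K buffers
    SELECT (s2.buffers_alloc - s1.buffers_alloc) * 8192
             / extract(epoch FROM s2.ts - s1.ts) / (1024*1024) AS alloc_mb_per_s,
           ((s2.buffers_checkpoint + s2.buffers_clean + s2.buffers_backend)
            - (s1.buffers_checkpoint + s1.buffers_clean + s1.buffers_backend)) * 8192
             / extract(epoch FROM s2.ts - s1.ts) / (1024*1024) AS write_mb_per_s
    FROM bgw_sample s1,
         (SELECT now() AS ts, * FROM pg_stat_bgwriter) s2;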
bgwriter trends
Cache refill
Linux tuning
- ext3 on old kernels does a blocky fsync
- dirty_ratio lowers the write cache size, as a percentage of RAM
- Kernel 2.6.29 is finer grained: dirty_bytes sets an exact amount of RAM
- Cannot go too far: OS write caching is expected
- VACUUM slows down a lot: a 50% drop is possible
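A hedged example of the corresponding Linux knobs (the values here are workload-dependent guesses, not recommendations from the talk):

    # Percentage-based knob on older kernels
    echo 10 > /proc/sys/vm/dirty_ratio
    # Byte-based knob on 2.6.29+ kernels: cap the write cache at ~256MB
    echo 268435456 > /proc/sys/vm/dirty_bytes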
VACUUM
- Cleans up after UPDATE and DELETE
- The hidden cost of MVCC
- Must happen eventually
- Old transaction IDs must eventually be frozen too
Autovacuum
- Cleans up dead rows
- Also updates database statistics
- Large tables: 20% change required
- autovacuum_vacuum_scale_factor = 0.2
VACUUM Overhead
- Intensive when it happens
- Focus on smoothing and scheduling
- Putting it off makes it worse
- Dead rows add invisible overhead
- Table bloat can be very large
- Thresholds can be per-table (see the example below)
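A minimal SQL sketch of a per-table override (table name and numbers are hypothetical):

    -- Vacuum this big, busy table after ~5% churn instead of the global 20%,
    -- and shorten its cost delay so each run finishes sooner
    ALTER TABLE pgbench_accounts SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_vacuum_cost_delay   = 10    -- milliseconds
    );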
Index Bloating
- Indexes can become less efficient after deletes
- VACUUM FULL before 9.0 makes this worse
- REINDEX helps, but it locks the table
- CREATE INDEX can run CONCURRENTLY
  - Build, then rename: simulates REINDEX CONCURRENTLY (sketched below)
  - All open transactions must end before it finishes
- CLUSTER does a full table rebuild
  - Same fresh performance as after a dump/reload
  - Needs a full table lock to do it
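A sketch of the rename trick with made-up index names (this only works for plain indexes, not ones backing constraints):

    -- Build a replacement without blocking writes
    CREATE INDEX CONCURRENTLY orders_created_at_new ON orders (created_at);
    -- Swap it in; DROP INDEX still takes a brief exclusive lock
    DROP INDEX orders_created_at;
    ALTER INDEX orders_created_at_new RENAME TO orders_created_at;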
VACUUM Gone Wrong
- Systems get sized to hit a target peak performance
- VACUUM isn't accounted for in that target
- If you can only just survive peak load, you won't survive peak load plus VACUUM
VACUUM monitoring
- Watch the timestamps in pg_stat_user_tables
- Beware long-running transactions
- log_autovacuum_min_duration
- Sizes of tables/indexes are critical too
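One possible query along those lines (what thresholds to alert on is up to you):

    -- Tables with the most dead rows, plus size and when they were last cleaned up
    SELECT relname,
           n_dead_tup,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
           last_vacuum, last_autovacuum, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;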
Improving efficiency
- maintenance_work_mem: up to 2GB
- shared_buffers & checkpoint_segments (again)
- Hardware write caches
- Tune read-ahead
VACUUM Cost Limits
- vacuum_cost_page_hit = 1
- vacuum_cost_page_miss = 10
- vacuum_cost_page_dirty = 20
- vacuum_cost_limit = 200
- autovacuum_vacuum_cost_delay = 20ms
autovacuum Cost Basics
- Every 20 ms = 50 runs/second
- Each run accumulates up to 200 cost units
- 200 * 50 = 10,000 cost/second
Costs and Disk I/O
- 20ms delay = 10,000 cost/second
- All misses @ 10 cost? 10,000 / 10 = 1,000 reads/second; 1000 * 8192 / (1024*1024) = 7.8 MB/s read
- All dirty @ 20 cost? 10,000 / 20 = 500 writes/second; 500 * 8192 / (1024*1024) = 3.9 MB/s write
- Halve the delay to 10ms? Doubles the rates: 15.6 MB/s read / 7.8 MB/s write
- Double the delay to 40ms? Halves the rates: 3.9 MB/s read / 1.95 MB/s write
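The same arithmetic as a quick SQL calculator, with the parameter values inlined as literals (swap in your own delay, cost limit, and page costs):

    SELECT (1000.0 / 20) * 200 / 10 * 8192 / (1024*1024) AS max_read_mb_per_s,   -- all page misses
           (1000.0 / 20) * 200 / 20 * 8192 / (1024*1024) AS max_write_mb_per_s;  -- all dirtied pages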
Submission for 9.2
- Displays accumulated autovacuum cost
- In the November CommitFest
- Easily applies to older versions
- Not very risky to production: just adds some logging
- Useful for learning how to tune costs
Sample logging output
LOG: automatic vacuum of table "pgbench.public.pgbench_accounts": index scans: 1
    pages: 0 removed, 163935 remain
    tuples: 2000000 removed, 2928356 remain
    buffer usage: 117393 hits, 123351 misses, 102684 dirtied, 2.168 MiB/s write rate
    system usage: CPU 2.54s/6.27u sec elapsed 369.99 sec
Common tricks
- Manual VACUUM during slower periods (see the sketch below)
  - Make sure to set vacuum_cost_delay
  - Start with daily; break down by table size
- Alternate fast/slow configurations
  - Two postgresql.conf files, or an edit script
  - Swap/change using cron or pgAgent
- Aggressive freezing
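A minimal sketch of the off-peak manual run (table name is made up; typically wrapped in a script that cron or pgAgent fires at night). Manual VACUUM is unthrottled by default, so the delay matters:

    -- Keep some throttling so the run doesn't monopolize the disks
    SET vacuum_cost_delay = 20;   -- milliseconds
    VACUUM ANALYZE pgbench_history;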
Write to disk, slow way
- Data page change goes to the pg_xlog WAL
- Checkpoint pushes the page to disk
- Hint bits update the page for faster visibility checks
- Autovacuum marks free space
- Freeze old transaction IDs
Manually maintained path
- Data page change goes to the pg_xlog WAL
- Checkpoint pushes the page to disk
- Manually freeze old transaction IDs
- Tweak vacuum_freeze_min_age and/or vacuum_freeze_table_age
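For example (hypothetical table; VACUUM FREEZE is simply VACUUM with vacuum_freeze_min_age forced to zero):

    -- During a quiet window, freeze everything in an insert-mostly table now,
    -- so autovacuum doesn't have to rewrite it for wraparound protection during peak hours
    VACUUM FREEZE pgbench_history;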
Hardware for fast writes
- Log checkpoints to catch spikes
- Battery-backed write cache
  - Fast commits
- Beware volatile write caches
- http://wiki.postgresql.org/wiki/reliable_writes
Hard Drive Latency

Type       Latency (ms)   Transactions/sec
5400 RPM   11.1           90
7200 RPM   8.3            120
10K RPM    6.0            167
Latency driving TPS
Partitioning
- Time-series data splits most easily
- Monthly partitions are typical
- Setup is manual and requires some code (see the sketch below)
- Queries can only exclude whole partitions
- Old partitions don't need vacuum
  - Once frozen, they're ignored
- Indexes are smaller
  - Less used indexes fade from cache
- Oldest data can be truncated
  - No deletion VACUUM cleanup!
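A minimal sketch of one monthly partition, 9.x style (inheritance plus CHECK constraints; the table, column, and dates are made up, and the INSERT-routing trigger a real setup needs is omitted):

    CREATE TABLE events (
        id       bigint,
        created  timestamptz NOT NULL,
        payload  text
    );
    CREATE TABLE events_2011_10 (
        CHECK (created >= '2011-10-01' AND created < '2011-11-01')
    ) INHERITS (events);
    -- With constraint_exclusion enabled, queries filtered on "created"
    -- skip partitions whose CHECK constraint rules them out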
Skytools
- Proven to handle write scaling
- Database access wrapped in functions
- PL/Proxy routes to the appropriate nodes (any, all, etc.)
- Replication used for shared data
- Rebalancing is tricky
  - The pause feature in pgbouncer helps
- Hard to retrofit to an existing system
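A hedged sketch of what "wrapped in functions" looks like with PL/Proxy (the cluster, function, and parameter names are invented):

    -- Runs on whichever node the username hashes to;
    -- a plain SQL function of the same name must exist on that node
    CREATE FUNCTION get_user_email(i_username text)
    RETURNS text LANGUAGE plproxy AS $$
        CLUSTER 'userdb';
        RUN ON hashtext(i_username);
    $$;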
Special thanks
Some monitoring samples provided by RunKeeper:
- Track, measure, and improve your fitness
- Clients for Android and iPhone
- http://runkeeper.com/
PostgreSQL Books http://www.2ndquadrant.com/books/
Questions
Slides at 2ndQuadrant.com, under Resources / Talks