Batch Processing How-To, or: The Single-Threaded Batch Processing Paradigm
Stefan Rufer, Netcetera
Matthias Markwalder, SIX Card Solutions
Speakers
> Stefan Rufer: studied business IT at the University of Applied Sciences in Bern; Senior Software Engineer at Netcetera. Main interest: server-side application development with JEE.
> Matthias Markwalder: graduated from ETH Zurich; Senior Developer and Framework Responsible at SIX Card Solutions. Main interest: high-performance, high-quality batch processing.
Why are we here?
> Let's learn how to bake an omelet.
AGENDA
> What do we do
> Sharing our experience
> Wrap-up + Q&A
What do we do
> Credit/debit card transaction processing
> Back-office batch processing application, running 24x7x365
> 1.7 million card transactions a day
> Volume will double by the end of 2010: be ready
> Migrated from Forté UDS to JEE
> The code base is more agile now
How do we do it
> Transactional integrity at all times
> Custom batch processing framework (not Spring Batch)
> 1 controller builds the jobs; 35 workers process the steps of the jobs (or as many as you want and your system can take)
> 1 application server (12 cores)
> 1 database server (12 cores, 1.5 TB SAN)
Batch Processing Basics
> It's simple, but parallel: read file(s), process a bit, write file(s). A minimal sketch follows.
> Terminology borrowed from Spring Batch
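A minimal sketch of this read-process-write shape in plain Java. The interface names are invented for illustration; this is the general pattern, not the speakers' framework.

    import java.util.ArrayList;
    import java.util.List;

    interface ItemReader<I> {
        List<I> read();                    // read file(s) or a chunk of records
    }

    interface ItemProcessor<I, O> {
        O process(I item);                 // process a bit
    }

    interface ItemWriter<O> {
        void write(List<O> items);         // write file(s) or database rows
    }

    class SimpleStep<I, O> {
        private final ItemReader<I> reader;
        private final ItemProcessor<I, O> processor;
        private final ItemWriter<O> writer;

        SimpleStep(ItemReader<I> reader, ItemProcessor<I, O> processor, ItemWriter<O> writer) {
            this.reader = reader;
            this.processor = processor;
            this.writer = writer;
        }

        void execute() {
            List<O> output = new ArrayList<O>();
            for (I item : reader.read()) {
                output.add(processor.process(item));
            }
            writer.write(output);
        }
    }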
AGENDA
> What do we do
> Sharing our experience
> Wrap-up + Q&A
Bake an omelet
> 200 g flour, 3 eggs, 2 dl milk, 2 dl water, ½ tablespoon salt
> Stir well, wait 30 minutes
> Stir again
> Put a little butter in a heated pan
> Add 1 dl of batter
> Bake until slightly brown, flip over, bake again for half as long
> Put cheese / marmalade / apple sauce / ... on top, fold
> Enjoy
Jobs run in parallel
> Load balancing
> Complete yesterday's reports while doing today's business
How to achieve:
> Use a batch scheduling application that controls your entire processing.
> Read/modify the categorization of jobs.
Load limitations
> Load balancing
> Generate 70 reports, but at most 20 in parallel (see the sketch below)
How to achieve:
> Limit the number of workers one job can use.
> Set priorities on the steps of a job.
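The speakers cap this via the number of workers a job may use; as a compact, purely illustrative stand-in (their workers are separate JVMs, not threads), a bounded executor expresses the same limit:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ReportJob {
        public static void main(String[] args) throws InterruptedException {
            int maxParallel = 20;                                  // at most 20 reports at a time
            ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
            for (int report = 1; report <= 70; report++) {
                final int id = report;
                pool.submit(new Runnable() {
                    public void run() {
                        generateReport(id);                        // queued until a slot is free
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void generateReport(int id) {
            // produce report no. 'id'...
        }
    }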
Decouple controller + workers
> Scalability
> SETI@home is the model: a central controller hands out independent work units to many workers.
Step trees: sequential, fail on exception
> Avoid structuring steps in code
> Example: collect data, then write a file.
How to achieve:
> Sequential execution
> Fail on exception (rollback the entire step)
Step trees: parallel, continue on exception
> Minimize the work left over
> Example: process 30'000 transactions in 3 steps (see the sketch below).
How to achieve:
> Parallel execution
> Continue on exception (still rollback the entire step)
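An illustrative sketch of such a step tree, with invented types (not the speakers' framework): each node carries the two knobs from these two slides, execution order and exception policy.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    enum Execution { SEQUENTIAL, PARALLEL }
    enum OnException { FAIL, CONTINUE }      // the failing step is rolled back either way

    class StepNode {
        final String name;
        final Execution execution;
        final OnException onException;
        final List<StepNode> children;

        StepNode(String name, Execution execution, OnException onException, List<StepNode> children) {
            this.name = name;
            this.execution = execution;
            this.onException = onException;
            this.children = children;
        }

        static StepNode leaf(String name, Execution execution, OnException onException) {
            return new StepNode(name, execution, onException, Collections.<StepNode>emptyList());
        }
    }

    class StepTreeExample {
        // Collect data sequentially and fail fast, then process 30'000 transactions
        // in three parallel chunks that continue if a sibling chunk fails.
        static StepNode job() {
            return new StepNode("job", Execution.SEQUENTIAL, OnException.FAIL, Arrays.asList(
                    StepNode.leaf("collect-data", Execution.SEQUENTIAL, OnException.FAIL),
                    new StepNode("process-transactions", Execution.PARALLEL, OnException.CONTINUE,
                            Arrays.asList(
                                    StepNode.leaf("chunk-1", Execution.PARALLEL, OnException.CONTINUE),
                                    StepNode.leaf("chunk-2", Execution.PARALLEL, OnException.CONTINUE),
                                    StepNode.leaf("chunk-3", Execution.PARALLEL, OnException.CONTINUE)))));
        }
    }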
Parallelize reading
> Speedup
> Example: a file of 200'000 credit card authorisations and transactions has to be read into the database.
How to achieve:
> Cut the input file into pieces of 10'000 lines each (a plain-Java sketch follows). By the way: perl and sort are unbeaten for this...
> Process each piece in a parallel step.
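A hedged sketch of the splitting step in plain Java; as the slide notes, perl or sort will do the same job faster, this just shows the idea.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FileSplitter {
        // Writes input.part1, input.part2, ... with at most linesPerPiece lines each.
        public static void split(Path input, int linesPerPiece) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                int lineCount = 0;
                int piece = 0;
                BufferedWriter writer = null;
                while ((line = reader.readLine()) != null) {
                    if (lineCount % linesPerPiece == 0) {
                        if (writer != null) writer.close();
                        writer = Files.newBufferedWriter(
                                Paths.get(input + ".part" + (++piece)), StandardCharsets.UTF_8);
                    }
                    writer.write(line);
                    writer.newLine();
                    lineCount++;
                }
                if (writer != null) writer.close();
            }
        }
    }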
Parallelize processing
> Speedup
> Example: summarize accounting data and store the result in the database again.
How to achieve:
> Group the data in chunks of 10'000 and process each chunk in a parallel step.
> Choose the grouping criteria carefully: no overlapping data areas; pass along data that you already had to read for the grouping.
Parallelize processing: how to group
> Structuring your data in parallelizable chunks
> Load balancing
> Example: parallelize processing by client, as the data is distinct by design (see the sketch below).
How to achieve:
> Group by client.
> Group by keys, either ranges or ids: ranges (1..5) can grow very large; key lists (1, 2, 3, 4, 5) can become very many.
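An illustrative sketch of grouping by client (invented class names): because each group touches a distinct data area, every group can be handed to its own parallel step without overlap.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class Transaction {
        private final long clientId;
        Transaction(long clientId) { this.clientId = clientId; }
        long getClientId() { return clientId; }
    }

    class Grouping {
        // One group per client: data areas are distinct by design.
        static Map<Long, List<Transaction>> groupByClient(List<Transaction> transactions) {
            Map<Long, List<Transaction>> groups = new HashMap<Long, List<Transaction>>();
            for (Transaction t : transactions) {
                List<Transaction> group = groups.get(t.getClientId());
                if (group == null) {
                    group = new ArrayList<Transaction>();
                    groups.put(t.getClientId(), group);
                }
                group.add(t);
            }
            return groups;
        }
    }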
Parallelize writing
> Transactional integrity while writing files.
> Easy recovery while writing files.
> Example: collect data for the payment file.
How to achieve:
> Collect the data in parallel and write it to a staging table.
> Keep the staging table content very close to the target file format.
> In a last step, dump the entire content of the staging table to the file (see the sketch below).
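A hedged sketch of that last step: the parallel steps only insert into a staging table whose rows are already close to the target file layout, and a single final step dumps the table to the payment file. Table and column names are invented for illustration.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    class PaymentFileDumpStep {
        void dump(Connection connection, String fileName) throws SQLException, IOException {
            String sql = "SELECT file_line FROM payment_staging ORDER BY line_no";
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery(sql);
                 BufferedWriter writer = Files.newBufferedWriter(
                         Paths.get(fileName), StandardCharsets.UTF_8)) {
                while (rs.next()) {
                    writer.write(rs.getString("file_line"));
                    writer.newLine();
                }
            }
        }
    }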
Different processes write in parallel
> Don't lock each other out
> Example: account information changes while the account balance grows.
How to achieve:
> No optimistic locking.
> Apply deltas to sums and counters (see the sketch below).
> Keep distinct fields for different parallel jobs.
> Be aware of the deadlock potential.
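A hedged sketch of the delta idea: instead of reading the balance, adding to it in Java and writing the new value back (which would need optimistic locking), push the delta into the UPDATE itself so parallel jobs can increment the same row safely. Table and column names are invented for illustration.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class AccountBalanceDao {
        void addToBalance(Connection connection, long accountId, BigDecimal delta)
                throws SQLException {
            String sql = "UPDATE account SET balance = balance + ?, "
                       + "transaction_count = transaction_count + 1 WHERE id = ?";
            try (PreparedStatement ps = connection.prepareStatement(sql)) {
                ps.setBigDecimal(1, delta);
                ps.setLong(2, accountId);
                ps.executeUpdate();
            }
        }
    }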
Avoid insert and update in the same table and step
> Speedup
> Avoid DB locks
> Example: summary rows stored in the same table as the raw data.
How to achieve:
> Normalize your database.
Let the database work for you
> Simple code
> Speedup
> Example: sorting or joining arrays in memory (see the sketch below for the SQL alternative).
How to achieve:
> Code review.
> Book an SQL course.
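A hedged sketch of what the SQL course buys you: instead of loading two arrays and joining and sorting them in Java, let one statement do the join, aggregation and ordering. Table and column names are invented for illustration.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class BookingReportQuery {
        void printBookingsPerClient(Connection connection) throws SQLException {
            String sql = "SELECT c.name, SUM(b.amount) AS total "
                       + "FROM booking b JOIN client c ON c.id = b.client_id "
                       + "GROUP BY c.name ORDER BY total DESC";
            try (PreparedStatement ps = connection.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + ": " + rs.getBigDecimal("total"));
                }
            }
        }
    }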
Read long, write short
> Keep lock contention on the database minimal.
> Keep the transactional DB overhead minimal.
> Example: fully process the whole batch of 1'000 records before starting to write to the DB.
How to achieve:
> 1 (one) "writing" database transaction per step (a usage sketch follows):

    interface IModifyingStepRunner {
        void prepareData();
        void writeData();
    }
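A hedged sketch of how such a step runner might be driven, assuming Spring's TransactionTemplate (the executor class name is invented): prepareData() does all reading and computation outside any writing transaction; writeData() is then wrapped in the single, short "writing" transaction of the step.

    import org.springframework.transaction.TransactionStatus;
    import org.springframework.transaction.support.TransactionCallbackWithoutResult;
    import org.springframework.transaction.support.TransactionTemplate;

    class StepExecutor {
        private final TransactionTemplate writeTransaction;

        StepExecutor(TransactionTemplate writeTransaction) {
            this.writeTransaction = writeTransaction;
        }

        void run(final IModifyingStepRunner runner) {
            runner.prepareData();                              // read long: no write locks held here
            writeTransaction.execute(new TransactionCallbackWithoutResult() {
                protected void doInTransactionWithoutResult(TransactionStatus status) {
                    runner.writeData();                        // write short: one transaction per step
                }
            });
        }
    }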
This omelet did not taste like grandma's!
> Despite following the recipe, there are hidden corners.
> Let's have a look at some pitfalls.
Don't forget to catch Error
> Application integrity is delegated to the DB.
> An OutOfMemoryError caused half of a batch to be committed. Fatal, as a rerun cannot fix the inconsistency.
How to fix:
> Catch Throwable (not just Exception) around the transaction, roll back, then rethrow:

    try {
        result = action.doInTransaction(status);
    } catch (Throwable err) {
        transactionManager.rollback(status);
        throw err;
    }
    transactionManager.commit(status);
Use BufferedReader / BufferedWriter
> Speedup (file reading time cut in half)
> We forgot to use a BufferedReader in the file reading framework.
How to fix:
> Code review (a minimal sketch of the fix follows).
> Profile if performance "feels not right".
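A minimal sketch of the fix: wrap the raw FileReader/FileWriter in their buffered counterparts so I/O happens in large blocks instead of character by character.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    class BufferedCopy {
        static void copy(String in, String out) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(in));
                 BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }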
Use 1 thread only
> Simplicity for the programmer
> Safety (no concurrent access)
> Singletons, synchronized blocks, static variables, stateful step runners: we had it all...
How to achieve:
> Configure the framework to use one JVM per worker.
Cache wisely
> Speedup
> Limit memory use
> Tax rates do not change during a processing day: cache them long.
> Customer data will be reused when processing transactions of the same customer: cache it short.
How to achieve:
> Cache per worker (see the sketch below).
> Cache lifetimes: worker / step / on demand.
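An illustrative sketch with invented names: one cache instance per worker JVM, size-bounded, with its lifetime chosen per data type (tax rates kept for the whole processing day, customer data cleared at the end of a step).

    import java.util.LinkedHashMap;
    import java.util.Map;

    class WorkerCache<K, V> {
        private final int maxEntries;
        private final Map<K, V> entries;

        WorkerCache(int maxEntries) {
            this.maxEntries = maxEntries;
            // Simple LRU eviction keeps memory use bounded per worker.
            this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > WorkerCache.this.maxEntries;
                }
            };
        }

        V get(K key) { return entries.get(key); }
        void put(K key, V value) { entries.put(key, value); }
        void clear() { entries.clear(); }    // called at step end for short-lived caches
    }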
Support JDBC batch operations
> Speedup

    List<Booking> bookings = new ArrayList<Booking>();
    ...
    bookingDao.update(bookings);

How to achieve (a sketch follows below):
> Enhance your database layer with a built-in JDBC batch facility.
> Execute the batch after 1000 items have been added.
> Automatically re-run a failed batch using single JDBC statements.
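A hedged sketch of such a batch facility (table, columns and the Booking class are invented for illustration): updates are added to a PreparedStatement batch, flushed every 1000 items, and a failed chunk is re-run with single statements so the offending row identifies itself.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class Booking {
        private final long id;
        private final BigDecimal amount;
        Booking(long id, BigDecimal amount) { this.id = id; this.amount = amount; }
        long getId() { return id; }
        BigDecimal getAmount() { return amount; }
    }

    class BookingDao {
        private static final int BATCH_SIZE = 1000;
        private static final String SQL = "UPDATE booking SET amount = ? WHERE id = ?";

        void update(Connection connection, List<Booking> bookings) throws SQLException {
            for (int from = 0; from < bookings.size(); from += BATCH_SIZE) {
                List<Booking> chunk =
                        bookings.subList(from, Math.min(from + BATCH_SIZE, bookings.size()));
                try (PreparedStatement ps = connection.prepareStatement(SQL)) {
                    for (Booking b : chunk) {
                        ps.setBigDecimal(1, b.getAmount());
                        ps.setLong(2, b.getId());
                        ps.addBatch();
                    }
                    ps.executeBatch();                 // fast path: one round trip per chunk
                } catch (SQLException batchFailed) {
                    updateOneByOne(connection, chunk); // slow path: pinpoint the offending row
                }
            }
        }

        private void updateOneByOne(Connection connection, List<Booking> chunk) throws SQLException {
            try (PreparedStatement ps = connection.prepareStatement(SQL)) {
                for (Booking b : chunk) {
                    ps.setBigDecimal(1, b.getAmount());
                    ps.setLong(2, b.getId());
                    ps.executeUpdate();                // the first failing row throws here
                }
            }
        }
    }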
Structured patching
> Risk management
> Stay agile in production
> Example: a bug is found, fixed and unit tested. Deploy to production asap.
How to achieve:
> An Eclipse wizard creates the patch (all files involved in fixing the bug).
> A patch script applies the .class file / SQL script / whatever...
Never, ever, update primary keys
> Good database design
> Speedup
> Example: a homemade library always wrote the entire row to the database.
How to fix:
> Only write changed fields (dirty flags); see the sketch below.
> Make primary keys immutable on your objects.
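An illustrative sketch of both fixes (invented entity and column names): the primary key is final, and setters record which fields they touched so the DAO can build an UPDATE over the changed columns only.

    import java.util.LinkedHashSet;
    import java.util.Set;

    class Account {
        private final long id;                         // primary key: never updated
        private String holderName;
        private final Set<String> dirtyFields = new LinkedHashSet<String>();

        Account(long id, String holderName) {
            this.id = id;
            this.holderName = holderName;
        }

        long getId() { return id; }

        void setHolderName(String holderName) {
            this.holderName = holderName;
            dirtyFields.add("holder_name");            // column name, invented for illustration
        }

        Set<String> getDirtyFields() { return dirtyFields; }
    }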
AGENDA
> What do we do
> Sharing our experience
> Wrap-up + Q&A
Future
> Scalability is an issue with a single database server. Partitioning options are used, but not to their full extent. Will Moore's law save us again?
> Processing double the volume still has to be proven...
If you remember just three things...
Java batch processing works and is cool :-)
Trade-offs:
> Do not stockpile the work; start.
> Single-threaded, many JVMs.
> Designing for scalability and stability needs experts.
http://www.google.ch/search?q=how+to+flip+an+omelet
Stefan Rufer
Netcetera AG
stefan.rufer@netcetera.ch
www.netcetera.ch

Matthias Markwalder
SIX Card Solutions
matthias.markwalder@six-group.com
www.six-group.com
Links / References
> http://en.wikipedia.org/wiki/Batch_processing
> http://static.springframework.org/spring-batch/
> http://www.bmc.com/products/offering/control-m.html
> http://www.javaspecialists.eu/
And to really learn how to bake fine omelets, buy a book:
> http://de.wikipedia.org/wiki/Marianne_Kaltenbach
> http://www.oreilly.de/catalog/geeksckbkger/
Other batch processing frameworks (public only)
> http://www.bmap4j.org/
> http://freshmeat.net/projects/jppf
> http://hadoop.apache.org/