COCOMO II and Big Data
Rachchabhorn Wongsaroj*, Jo Ann Lane, Supannika Koolmanojwong, Barry Boehm
*Bank of Thailand and Center for Systems and Software Engineering, Computer Science Department, Viterbi School of Engineering, University of Southern California
28th International Forum on COCOMO and System/Software Cost Modeling
Outline
- Big Data Concepts
- COCOMO II Cost Factors
- COCOMO II Cost Factors and Big Data
- Future Work
Big Data Concept
Big Data: datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze (McKinsey Global Institute).
The 3 Vs of Big Data (IBM):
- Volume: the amount of data generated
- Variety: the different data types and sources
- Velocity: the speed at which data is generated and moves around
Source: IBM
Big Data Concept
Volume, Variety, and Velocity span People-to-People, People-to-Machine, and Machine-to-Machine interactions:
- 8 billion messages/day; 845M active users
- 20 hours of video uploaded every minute
- 340 million tweets/day; 140M active users
Source: IBM
Big Data Landscape (figure; Sources: Sajal Das, Keith Marzullo; IBM)
Big Data Landscape (cont.) (figure; Source: blogs.forbes.com/davefeinleib)
Big Data Problems
- World interconnection
- Data quality
- Data quantity: lots of data is being created and collected
- Data timeliness
- Data variety
COCOMO Black Box Model
Inputs to COCOMO II: product size estimate; product, process, platform, and personnel attributes; reuse, maintenance, and increment parameters; organizational project data (recalibration to organizational data).
Outputs: development and maintenance cost and schedule estimates; cost and schedule distribution by phase, activity, and increment.
COCOMO II Cost Factors
Significant factors of development cost:
- Scale drivers are sources of exponential effort variation
- Cost drivers are sources of linear effort variation (product, platform, personnel, and project attributes); effort multipliers are associated with cost-driver ratings
Each factor is rated between Very Low and Extra High per rating guidelines.
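The exponential role of the scale drivers and the multiplicative role of the cost drivers can be sketched with the COCOMO II effort and schedule equations. The constants A, B, C, D below are the published COCOMO II.2000 calibration values; the example project size and ratings are hypothetical, not from a real dataset:

```python
# Sketch of the COCOMO II effort and schedule equations.
# A, B, C, D are the COCOMO II.2000 calibration constants;
# the scale-factor values used below are the nominal ratings.

def cocomo_ii_effort(ksloc, scale_factors, effort_multipliers,
                     A=2.94, B=0.91):
    """Effort (person-months) = A * Size^E * product(EM),
    where E = B + 0.01 * sum(scale factors)."""
    E = B + 0.01 * sum(scale_factors)
    pm = A * ksloc ** E
    for em in effort_multipliers:
        pm *= em                      # cost drivers act linearly
    return pm

def cocomo_ii_schedule(pm, scale_factors, C=3.67, D=0.28, B=0.91):
    """Calendar months = C * PM^F, where F = D + 0.2 * (E - B)."""
    E = B + 0.01 * sum(scale_factors)
    F = D + 0.2 * (E - B)
    return C * pm ** F

# Hypothetical 100-KSLOC project, all scale drivers at Nominal,
# all 17 cost drivers at Nominal (effort multiplier 1.0):
sf = [3.72, 3.04, 4.24, 3.29, 4.68]   # PREC, FLEX, RESL, TEAM, PMAT
pm = cocomo_ii_effort(100, sf, [1.0] * 17)
tdev = cocomo_ii_schedule(pm, sf)
```

Because the scale factors sit in the exponent, doubling size more than doubles effort whenever E > 1, which is why the big-data discussion below focuses on whether the existing driver ratings stretch far enough.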
Scale Drivers
- Precedentedness (PREC): degree to which the system is new and past experience applies
- Development Flexibility (FLEX): need to conform with specified requirements
- Architecture/Risk Resolution (RESL): degree of design thoroughness and risk elimination
- Team Cohesion (TEAM): need to synchronize stakeholders and minimize conflict
- Process Maturity (PMAT): SEI CMM process maturity rating
Scale Factors (Wi), rated Very Low / Low / Nominal / High / Very High / Extra High:
- Precedentedness (PREC): thoroughly unprecedented / largely unprecedented / somewhat unprecedented / generally familiar / largely familiar / thoroughly familiar
- Development Flexibility (FLEX): rigorous / occasional relaxation / some relaxation / general conformity / some conformity / general goals
- Architecture/Risk Resolution (RESL)*: little (20%) / some (40%) / often (60%) / generally (75%) / mostly (90%) / full (100%)
- Team Cohesion (TEAM): very difficult interactions / some difficult interactions / basically cooperative interactions / largely cooperative / highly cooperative / seamless interactions
- Process Maturity (PMAT): weighted average of "Yes" answers to the CMM Maturity Questionnaire
* % significant module interfaces specified, % significant risks eliminated
Precedentedness (PREC)
Elaboration of the PREC rating scale (Very Low / Nominal-High / Extra High):
- Organizational understanding of product objectives: General / Considerable / Thorough
- Experience in working with related software systems: Moderate / Considerable / Extensive
- Concurrent development of associated new hardware and operational procedures: Extensive / Moderate / Some
- Need for innovative data processing architectures, algorithms: Considerable / Some / Minimal
Cost Drivers
- Product Factors: Reliability (RELY), Data (DATA), Complexity (CPLX), Reusability (RUSE), Documentation (DOCU)
- Platform Factors: Time constraint (TIME), Storage constraint (STOR), Platform volatility (PVOL)
- Personnel Factors: Analyst capability (ACAP), Programmer capability (PCAP), Applications experience (APEX), Platform experience (PLEX), Language and tool experience (LTEX), Personnel continuity (PCON)
- Project Factors: Software tools (TOOL), Multisite development (SITE), Required schedule (SCED)
Cost Drivers and Big Data
- Product Factors: Reliability (RELY), Data (DATA), Complexity (CPLX), Reusability (RUSE), Documentation (DOCU)
- Platform Factors: Time constraint (TIME), Storage constraint (STOR), Platform volatility (PVOL)
- Personnel Factors: Analyst capability (ACAP), Programmer capability (PCAP), Applications experience (APEX), Platform experience (PLEX), Language and tool experience (LTEX), Personnel continuity (PCON)
- Project Factors: Software tools (TOOL), Multisite development (SITE), Required schedule (SCED)
Product Factors (cont'd)
Required Software Reliability (RELY)
Measures the extent to which the software must perform its intended function over a period of time. Ask: what is the effect of a software failure?
RELY descriptors: Very Low: slight inconvenience / Low: low, easily recoverable losses / Nominal: moderate, easily recoverable losses / High: high financial loss / Very High: risk to human life
Product Factors (cont'd)
Data Base Size (DATA)
Captures the effect large data requirements have on development: the effort to generate the test data that will be used to exercise the program. Calculate the data/program size ratio:
D/P = Database Size (Bytes) / Program Size (SLOC)
DATA ratings: Low: D/P < 10 / Nominal: 10 <= D/P < 100 / High: 100 <= D/P < 1000 / Very High: D/P >= 1000
IBM: database sizes in Big Data scale from terabytes to zettabytes.
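The D/P calculation is simple to automate. A minimal sketch, with thresholds taken from the rating table; the example database and program sizes are hypothetical:

```python
# Sketch of the DATA cost driver's D/P test-data ratio.
# Thresholds follow the COCOMO II rating table; note that the
# scale currently tops out at Very High (D/P >= 1000).

def data_rating(db_bytes, sloc):
    """Return the DATA cost-driver rating from the D/P ratio."""
    dp = db_bytes / sloc
    if dp < 10:
        return "Low"
    elif dp < 100:
        return "Nominal"
    elif dp < 1000:
        return "High"
    return "Very High"

# Hypothetical 1 TB test database against a 100-KSLOC program:
rating = data_rating(db_bytes=10**12, sloc=100_000)  # D/P = 10**7
```

Even a modest terabyte-scale dataset yields D/P far beyond 1000, which is why the conclusion below proposes defining an Extra High DATA rating for terabyte-to-zettabyte projects.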
(Figure; Source: (c) 2012 Enterprise Strategy Group)
Product Factors (cont'd)
Product Complexity (CPLX)
Complexity is divided into five areas: control operations, computational operations, device-dependent operations, data management operations, and user interface management operations. Select the area or combination of areas that characterize the product or a subsystem of the product.
Product Factors (cont'd)
Module Complexity Ratings vs. Type of Module
Use a subjective weighted average of the attributes, weighted by their relative product importance.
Control Operations:
- Very Low: Straight-line code with a few non-nested structured programming operators: DOs, CASEs, IF-THEN-ELSEs. Simple module composition via procedure calls or simple scripts.
- Low: Straightforward nesting of structured programming operators. Mostly simple predicates.
- Nominal: Mostly simple nesting. Some intermodule control. Decision tables. Simple callbacks or message passing, including middleware-supported distributed processing.
- High: Highly nested structured programming operators with many compound predicates. Queue and stack control. Homogeneous distributed processing. Single-processor soft real-time control.
- Very High: Reentrant and recursive coding. Fixed-priority interrupt handling. Task synchronization, complex callbacks, heterogeneous distributed processing. Single-processor hard real-time control.
- Extra High: Multiple resource scheduling with dynamically changing priorities. Microcode-level control. Distributed hard real-time control.
Computational Operations:
- Very Low: Evaluation of simple expressions: e.g., A=B+C*(D-E)
- Low: Evaluation of moderate-level expressions: e.g., D=SQRT(B**2-4.*A*C)
- Nominal: Use of standard math and statistical routines. Basic matrix/vector operations.
- High: Basic numerical analysis: multivariate interpolation, ordinary differential equations. Basic truncation, round-off concerns.
- Very High: Difficult but structured numerical analysis: near-singular matrix equations, partial differential equations. Simple parallelization.
- Extra High: Difficult and unstructured numerical analysis: highly accurate analysis of noisy, stochastic data. Complex parallelization.
Product Factors (cont'd)
Device-dependent Operations:
- Very Low: Simple read, write statements with simple formats.
- Low: No cognizance needed of particular processor or I/O device characteristics. I/O done at GET/PUT level.
- Nominal: I/O processing includes device selection, status checking, and error processing.
- High: Operations at physical I/O level (physical storage address translations; seeks, reads, etc.). Optimized I/O overlap.
- Very High: Routines for interrupt diagnosis, servicing, masking. Communication line handling. Performance-intensive embedded systems.
- Extra High: Device timing-dependent coding, micro-programmed operations. Performance-critical embedded systems.
Data Management Operations:
- Very Low: Simple arrays in main memory. Simple COTS-DB queries, updates.
- Low: Single-file subsetting with no data structure changes, no edits, no intermediate files. Moderately complex COTS-DB queries, updates.
- Nominal: Multi-file input and single-file output. Simple structural changes, simple edits. Complex COTS-DB queries, updates.
- High: Simple triggers activated by data stream contents. Complex data restructuring.
- Very High: Distributed database coordination. Complex triggers. Search optimization.
- Extra High: Highly coupled, dynamic relational and object structures. Natural language data management.
User Interface Management:
- Very Low: Simple input forms, report generators.
- Low: Use of simple graphical user interface (GUI) builders.
- Nominal: Simple use of widget set.
- High: Widget set development and extension. Simple voice I/O, multimedia.
- Very High: Moderately complex 2D/3D, dynamic graphics, multimedia.
- Extra High: Complex multimedia, virtual reality.
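The "subjective weighted average of the attributes" across the five CPLX areas can be sketched as follows. The numeric level scale and the example weights are illustrative assumptions, not part of the official model:

```python
# Sketch of combining the five CPLX areas into one level via a
# weighted average. The 1..6 numeric scale and the weights are
# illustrative; COCOMO II leaves the weighting subjective.

LEVELS = {"Very Low": 1, "Low": 2, "Nominal": 3,
          "High": 4, "Very High": 5, "Extra High": 6}

def cplx_level(area_ratings, weights):
    """Weighted average of per-area ratings; weights reflect each
    area's relative importance to the product."""
    total_w = sum(weights.values())
    score = sum(LEVELS[area_ratings[a]] * w for a, w in weights.items())
    return score / total_w

# Hypothetical big-data analytics product where data management
# operations dominate the complexity:
ratings = {"control": "Nominal", "computation": "High",
           "device": "Low", "data_mgmt": "Very High", "ui": "Nominal"}
weights = {"control": 1, "computation": 2,
           "device": 1, "data_mgmt": 4, "ui": 1}
level = cplx_level(ratings, weights)
```

For a big-data product the data management area tends to dominate (distributed database coordination, complex data restructuring), pulling the weighted average toward the High/Very High end of the scale.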
Platform Factors
Execution Time Constraint (TIME)
Measures the constraint imposed on a system in terms of the percentage of available execution time expected to be used by the system.
TIME ratings (% use of available execution time): Nominal: <= 50% / High: 70% / Very High: 85% / Extra High: 95%
http://www.parstream.com/product/
Platform Factors
Main Storage Constraint (STOR)
Measures the degree of main storage constraint imposed on a software system or subsystem.
STOR ratings (% use of available storage): Nominal: <= 50% / High: 70% / Very High: 85% / Extra High: 95%
The largest big data practitioners (Google, Facebook, Apple, etc.) run what are known as hyper-scale computing environments.
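TIME and STOR share the same percentage thresholds, so a single lookup can map a measured utilization onto a rating. A minimal sketch, with threshold values taken from the tables above and a hypothetical example utilization:

```python
# Sketch mapping a resource-utilization percentage to the shared
# TIME/STOR rating levels. Thresholds are the COCOMO II table
# values; the 80% example figure is hypothetical.

def constraint_rating(utilization_pct):
    """Rating for % use of available execution time or main storage."""
    if utilization_pct <= 50:
        return "Nominal"
    elif utilization_pct <= 70:
        return "High"
    elif utilization_pct <= 85:
        return "Very High"
    elif utilization_pct <= 95:
        return "Extra High"
    raise ValueError("utilization above 95% is off the rating scale")

rating = constraint_rating(80)  # e.g., a node using 80% of its memory
```

Note the asymmetry with DATA: here the scale already reaches Extra High, so heavily loaded big-data clusters can be rated without extending the model.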
Big Data Storage
The key requirements of big data storage:
- It must be capable of handling large volumes of data
- It must scale to keep pace with data growth
- It must provide the input/output operations per second (IOPS) needed to deliver data to analytic tools
Personnel Factors
Analyst Capability (ACAP)
Analysts work on requirements, high-level design, and detailed design. Consider analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate.
ACAP ratings: Very Low: 15th percentile / Low: 35th / Nominal: 55th / High: 75th / Very High: 90th
Programmer Capability (PCAP)
Evaluate the capability of the programmers as a team rather than as individuals. Consider ability, efficiency and thoroughness, and the ability to communicate and cooperate.
PCAP ratings: Very Low: 15th percentile / Low: 35th / Nominal: 55th / High: 75th / Very High: 90th
Personnel Factors (cont'd)
Applications Experience (AEXP)
Assess the project team's equivalent level of experience with this type of application.
AEXP ratings: Very Low: 2 months / Low: 6 months / Nominal: 1 year / High: 3 years / Very High: 6 years
Personnel Factors (cont'd)
Platform Experience (PEXP)
Assess the project team's equivalent level of experience with this platform, including the OS, graphical user interface, database, networking, and distributed middleware.
PEXP ratings: Very Low: 2 months / Low: 6 months / Nominal: 1 year / High: 3 years / Very High: 6 years
Conclusion - Scale Drivers and Big Data
Scale Drivers: Precedentedness (PREC), Development Flexibility (FLEX), Architecture/Risk Resolution (RESL), Team Cohesion (TEAM), Process Maturity (PMAT)
All five scale drivers are covered by COCOMO II as-is for Big Data projects.
(c) USC CSSE
Conclusion - Cost Drivers and Big Data
- Reliability (RELY): covered by COCOMO II
- Data (DATA): future work; need to define an Extra High cost rating for terabyte-to-zettabyte data projects
- Complexity (CPLX): covered, but needs more detail for Big Data custom-developed solutions (25% of all projects)
- Reusability (RUSE), Documentation (DOCU), Time constraint (TIME), Storage constraint (STOR), Platform volatility (PVOL): covered by COCOMO II
Conclusion - Cost Drivers and Big Data (cont'd)
- Analyst capability (ACAP), Programmer capability (PCAP), Applications experience (APEX), Platform experience (PLEX), Language and tool experience (LTEX), Personnel continuity (PCON), Software tools (TOOL), Multisite development (SITE), Required schedule (SCED): covered by COCOMO II
References
- Boehm, B. W., et al. (2000). Software Cost Estimation with COCOMO II. Prentice Hall, New Jersey.
- Boehm, B. W. (1981). Software Engineering Economics. Prentice Hall, New Jersey.
- McKinsey Global Institute (June 2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. www.mckinsey.com/mgi
- Zikopoulos, P., and Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media.
- Enterprise Strategy Group (2012). Research Report: The Convergence of Big Data Processing and Integrated Infrastructure.
- http://en.wikipedia.org/wiki/Big_data