The KPMG-NL Big Data team 16 March 2015
The development line

Core analysis tools
  SQL
  Anaconda
  SciPy
  Matplotlib
  CERN C++ for advanced data science
  Statistical tools widely used in social sciences
[Diagram: ETL of raw data. Source types: XML, CSV, web, video, audio]
Service deployment model

[Diagram legend: KPMG development; already existing open source; open source component adopted as installation platform]

Code repository: Git server (e.g. GitHub for OSS components)
Package repository: e.g. Apache server

Layers:
  Add-on KPMG services + tools + libs
  Add-on open-source services
  Ambari core services

Components shown: GitLab, LDAP, Hadoop, Core analytics, MongoDB, JBoss, Storm, Future tools, TWiki, StormSD, Archiva, Jenkins, Apache, Sonar, Hive, Ganglia, Ambari

KAVE gathers together a toolkit of pre-existing third-party open-source software components. These components are governed by their own licenses, which the KAVE installer does not modify or supersede; please consult the originating authors. Together, the components carry a mixture of the following licenses: Apache 2.0, GPL 2.0, AGPL, LGPL, ZPL, MIT, PSF, BSD and some BSD-like simple licenses. For scipy and ipython see: http://docs.continuum.io/anaconda/licenses.html.
Topic: Insertion of malicious code by a malicious third party
Impact: High
Chance: Zero
Mitigation: Select open-source software with a wide user base or a security-critical function, since it will have been scrutinized by thousands of people who are experts in their field. Do not initially permit contributions to our software directly without our own review process. In principle, inserting malicious code is much harder in OSS than in proprietary software.

Topic: Reputational risk if software fails to perform adequately
Impact: Medium
Chance: Low
Mitigation: Legal aspects can be handled with explicit limited-liability licensing and explicit contracts, should engagements revolve around KAVE. We use this software ourselves, for our own engagements, and at each stage we use our professional judgment about the performance of the tools included, so we would be the first to notice shortfalls in functionality. Additionally, we can gain reputation by contributing bug reports and feature requests to existing open-source products.

Topic: Reputational risk by association with other open-source providers
Impact: Low
Chance: Low
Mitigation: Installing an open-source product does not in and of itself associate us with any individual or entity that contributed to that product. It is nevertheless necessary to consider carefully the current reputation of organizations so associated, and we use our professional judgment based on known software quality and company history. Additionally, we can gain reputation by becoming part of the community.

Topic: Risk of withdrawal or lack of maintenance of baseline product
Impact: Medium
Chance: Low
Mitigation: We understand that the tools needed for Big Data will evolve with time. Should a more widely used alternative come along at a later date, we will adopt it. For now we choose tools that are considered mainstream and in heavy use, with an active user base and active contributions. In our opinion, the withdrawal risk of OSS has historically been smaller than that of proprietary software.

Topic: Risk of using as-is limited software if it includes infringing content
Impact: High
Chance: Low
Mitigation: The Apache foundation has very strict rules for becoming an Apache product, which include checking for existing conflicts such as copyright infringement. We base our installer on Ambari, which was adopted as an Apache product, and we prefer Apache products over others where there is a choice. However, we recognize that, historically, organizations have sued individuals for unknown, unintentional or debatable infringement. By selecting products that are already in use by large companies, we can be reasonably assured that this risk is minimal.

Topic: Reverse reputational risk by generating revenue from an open-source product without contributing
Impact: High
Chance: High (if we don't release the software)
Mitigation: Ethically speaking, if KPMG is using an open-source product and generating revenue from others' work, our team feels we are obliged to contribute to the community in some way, and so we intend to release our platform as open source under an Apache-2.0 license.
Branch-based development
Test-driven development
Services we add
Feature-based releases
Merging by central authority
Updating paths
Integration testing
Self-hosting
Agile development
Prioritization by users
[Diagram: Apache projects and governance]
  Hadoop, Storm, Mesos, web servers
  150 projects
  835 committers
  Facebook, Twitter, Yahoo, Google
  Billions of end users of their products
  Named project leaders
  Defined project structure
  Strict consensus-driven project management

[Diagram: project lifecycle]
  Establishment; Candidate; Acceptance; Podling engagement; Project; rejection; termination; rejection

Boilerplate for each file includes copyright owner:

  Copyright [yyyy] [name of copyright owner]

  Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Full license available at http://www.apache.org/licenses/LICENSE-2.0

Apache License, Version 2.0

  Apache License
  Version 2.0, January 2004
  http://www.apache.org/licenses/

  TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

  Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
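The per-file boilerplate above lends itself to automation. The following is a minimal sketch, not part of KAVE itself, of how the header could be generated and prepended to source files that lack it. The function names (license_header, has_license_header, ensure_license_header), the "# " comment prefix and the 20-line search window are all assumptions made for illustration.

```python
from datetime import date

# Standard Apache-2.0 per-file boilerplate, with year and owner left open.
APACHE_BOILERPLATE = """\
Copyright {year} {owner}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""


def license_header(owner, year=None, comment_prefix="# "):
    """Return the boilerplate as a comment block (hypothetical helper)."""
    year = year if year is not None else date.today().year
    text = APACHE_BOILERPLATE.format(year=year, owner=owner)
    # Prefix every line; strip trailing spaces left on empty lines.
    lines = [(comment_prefix + line).rstrip() for line in text.splitlines()]
    return "\n".join(lines) + "\n"


def has_license_header(source, search_lines=20):
    """Cheap check: do the first lines mention the Apache-2.0 grant?"""
    head = "\n".join(source.splitlines()[:search_lines])
    return "Licensed under the Apache License, Version 2.0" in head


def ensure_license_header(source, owner, year=None, comment_prefix="# "):
    """Prepend the header unless one already appears near the top."""
    if has_license_header(source):
        return source
    return license_header(owner, year, comment_prefix) + "\n" + source
```

Calling ensure_license_header twice on the same text leaves it unchanged after the first call, so the helper could run safely as a pre-commit or review step. A real tool would also need to handle shebang lines and other comment syntaxes, which this sketch ignores.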
PLANNING
  Added to backlog: bug report, improve feature, or new feature
  Categorize: trivial or not trivial
  Review with team
    Define priority
    Identify dependencies
    Consider TDD
    Schedule in sprint
  New features:
    1. Exploratory install
    2. Product demo
    3. Assess against KAVE principles

SPRINTING
  Development loop: Develop, Review, Integration test, Merge

  Develop on most appropriate branch
    1. Implementation
    2. Developer tests
    3. Automated tests
    a. Fast review stage
    b. Fast (unit) testing stage
    c. Trivial merge

  Diverge a feature-specific branch (don't ever develop on the master)
    1. Implementation
    2. Developer tests
    3. Automated tests
    a. Code review by different person
    b. Reviewer tests
    c. Automated tests

RELEASING
  Product demo of changes
  Packaging and release
  Release of new version
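The planning and sprinting flow above can be sketched as a small routing function. This is an illustrative reading of the diagram, not the team's actual tooling: it assumes that trivial items take the fast path (most appropriate branch, fast review, trivial merge), that non-trivial items go through team review and a feature-specific branch, and that only new features get the three extra checks. The names Kind and workflow are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    """The three kinds of item that can be added to the backlog."""
    BUG_REPORT = "bug report"
    IMPROVE_FEATURE = "improve feature"
    NEW_FEATURE = "new feature"


# Extra planning checks listed for new features.
NEW_FEATURE_CHECKS = [
    "exploratory install",
    "product demo",
    "assess against KAVE principles",
]

# Planning steps assumed to apply to non-trivial items only.
TEAM_PLANNING = [
    "review with team",
    "define priority",
    "identify dependencies",
    "consider TDD",
    "schedule in sprint",
]

# Steps shared by both development paths.
DEVELOP = ["implementation", "developer tests", "automated tests"]

FAST_PATH = (
    ["develop on most appropriate branch"]
    + DEVELOP
    + ["fast review stage", "fast (unit) testing stage", "trivial merge"]
)

FULL_PATH = (
    ["diverge a feature-specific branch"]
    + DEVELOP
    + ["code review by different person", "reviewer tests", "automated tests"]
)


def workflow(kind, trivial):
    """Return the ordered steps for one backlog item (assumed mapping)."""
    steps = ["added to backlog", "categorize"]
    if kind is Kind.NEW_FEATURE:
        steps += NEW_FEATURE_CHECKS
    if trivial:
        steps += FAST_PATH
    else:
        steps += TEAM_PLANNING + FULL_PATH
    return steps


if __name__ == "__main__":
    for step in workflow(Kind.NEW_FEATURE, trivial=False):
        print(step)
```

Keeping the two paths as plain lists makes the difference between them easy to see: they share the same three development steps and differ only in the branch used and in how heavy the review stage is.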
© 2015 KPMG Advisory N.V., registered with the trade register in the Netherlands under number 33263682, a member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative ("KPMG International"), a Swiss entity. All rights reserved. Printed in the Netherlands. The KPMG name, logo and "cutting through complexity" are registered trademarks of KPMG International. Produced by Create Graphics. Document number CRT039089.