Best Practices in Scalable Web Development


MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Best Practices in Scalable Web Development

MASTER THESIS

Martin Novák

May, 2014
Brno, Czech Republic

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with a complete reference to the due source.

Thesis supervisor: doc. RNDr. Tomáš Pitner, Ph.D.

Acknowledgement

First of all I want to thank the university for becoming my home and alma mater in the past years, providing an unforgettable experience, professional knowledge and insights that I can now use in the real software industry. My gratitude belongs especially to doc. RNDr. Tomáš Pitner, Ph.D. for his trust, support and leadership, not only for my master thesis but also during my whole studies and in our activities in the Lab of Software Architectures and Information Systems, LaSArIS.

Abstract

The thesis covers the interconnections among the best practices for scalable web development on different levels, from backend architecture design up to modern frontend clients for end users. The backend architecture design incorporates practices connected to the modern environments of cloud computing and shows how to design a scalable application that is able to leverage both distributed and single server environments. We describe the available storage options, which include the file system, relational databases and modern No-SQL engines, in relation to different requirements on performance and reliability. On a higher level we move to service oriented architecture, which allows a modern application design that provides simple expansibility, connections among different systems and a service to different clients with high performance and security. On the client level we discuss the requirements on access through a desktop web browser, through a mobile web browser and through a custom mobile application.

Keywords

Cloud computing, distributed server environment, service oriented architecture, SOA, storages, caching strategies, web applications, web services, REST, SOAP, JavaScript MVC, mobile application, web browser application

Contents

1 Introduction: Background and motivation; Aim of the thesis; Requirements; Thesis outline
2 Web Application Development: Development Environments; Development Environment; QA Environment and Staging Environment; Production Environment; Backup Environment; Testing and Continuous Delivery; Continuous Integration; Automated Testing; Continuous Delivery Process; Impact of Development Best Practices on Requirements
3 Web Application Architectures: Distributed application design; Data Layer; Compute Layer; Internet Layer; Users; 3-Tier Architecture; Service Oriented Architecture; Web Services; Web Service Orchestration; Impact of Architectures on Requirements
4 Server Environments: Single Server environment; Advantages of Single Server Environment; Disadvantages of Single Server Environment; Server Housing; Cloud Computing; Advantages of Cloud Computing; Disadvantages of Cloud Computing
5 Data Persistency: Relational Databases; PostgreSQL; MySQL; Oracle Database; Microsoft SQL Server; Conclusion; Big Data; Big Data Origins; Examples of Big Data usage; No-SQL; Column Storage; Document Storage; Key-value Storage; Graph Storage; Recommendations
6 Frontend client application design: Common Web Application Frontend; JavaScript MVC; Responsive Design; Mobile Applications
7 Conclusion
Bibliography

Chapter 1 Introduction

1.1 Background and motivation

Modern web development is challenged by new technologies and methodologies on all levels, and it is demanding to understand how the new approaches connect together to deliver more reliable solutions and bring profit to their creators. For a backend developer the biggest revolution lies in cloud computing, which allows even small companies to develop a scalable solution for a distributed server environment. It brings the advantage of technologies formerly available only to large companies, but also the challenge of a higher demand on the right system design. At the other end, a frontend developer faces new trials in tackling both web browser and mobile applications with different resolutions and screen sizes, which creates the need for a responsive graphical design that provides the best experience on different devices.

Web applications do not live on lonely islands apart from each other but provide public, protected or private interfaces to connect with each other and share data and services. Service oriented architecture brings a natural way to build a system of services on top of various other services that work together to create value for the end user. It also provides a guide on how to create a service so that it can easily work together with other services or be used by different kinds of frontend clients.

The motivation is to describe how all those modern approaches should be designed to work together, taking advantage of each of them to deliver more value with minimized risks and maximized reliability that could not be achieved any other way.

1.2 Aim of the thesis

The thesis analyses available technologies and methodologies to bring best practices on all levels of web application development and examines how they can work together to bring higher synergic value.
There is a high number of technologies to choose from, and it is important to investigate not only the possibilities of the technologies but also the level of available documentation, support and stability. These are key for companies that search for a modern approach but can't afford the risk of an unreliable solution.

For our work we consider a general core of web application systems that does not include any specific features to solve unique business requirements. In a real application such a core would have to be customized and optimized to serve the needs of a specific solution, as there is a difference between an ERP system (footnote 1) and a social network application.

Modern web development does not depend on using the latest buzz technology available but rather on implementing advanced methods and architectures that enhance the development process itself, such as service oriented architecture and continuous delivery. These are in the focus of large software companies, but even small startup companies can take advantage of them. We consider only these requirements:

Requirements

Scalability

The ability to seamlessly handle traffic that grows over time is important for growing companies. Vertical scaling, that is increasing hardware resources without changing the number of nodes, is limited. Advanced scalability can be obtained by focusing on an architectural design that supports a distributed environment and by building independent services with horizontal scaling. [6]

Reliability and Availability

Reliability, as an attribute ensuring trust that the system keeps working over time in terms of integrity and consistency, is important for overall system quality and for the customers' trust in the quality of the company's services. Modern services also need to focus on guaranteeing accessibility of the system even in difficult conditions such as a hardware malfunction. High reliability and availability can be hard to achieve with the limited finances of small companies, but they can be enhanced by taking advantage of third party offerings and cloud computing, which bring certified reliable platforms including high security, data safety, automated failovers and backups as well as protection through the use of various geographic locations. [7]

Expansibility and Maintainability

Companies need the ability to grow the system in reaction to new requirements and to maintain it easily over time. This ability is highly dependent on using best practices related to architectural system design.
In the best situation the development team creates independent modules and/or services that are described by their application interface. Other parts of the system see them as abstracted functionality and are not impacted by their internal changes or deprecation as long as the original interface is supported. This provides the ability to easily replace whole parts or layers of the system, improving its features without negative impacts. [8]

Footnote 1: ERP system: Enterprise Resource Planning system.

Support of both distributed and single server environments

Modern web applications need to be designed to grow with the number of users, which can increase rapidly thanks to the potential virality of a new startup [9]. Therefore such a system should use technologies and an architecture that at the beginning run easily on a low budget on a single server or web hosting, and that can later expand into a distributed environment without difficulties caused by higher latency [10] or by the inability to use local data storage, which needs to be centralized. To optimize costs but still support growth, a web application should therefore be designed to support both distributed and single server environments.

Web services available through HTTP interface

Successful web systems do not live separately but connect through services to create increased value through a network effect. Web services can be directly connected to a business model, or they can be focused on a general improvement of the architecture by separating functionality into independent services and dividing backend logic from frontend generation. [11]

Support of different sizes of screens

Screen sizes and resolutions grew naturally with the evolution of computer technology, but with the introduction of mobile technologies web application developers face the challenge of providing their content on various types of devices with different screen resolutions and pixel densities. [12][13]

Support of mobile devices

Companies offering their services in the form of a web application are challenged by the need to make their service accessible through a native mobile application, as a new trend shows growing consumption through cell phones and tablets. The original assumption that the HTML5 web would replace native applications is not supported by market development [14]. Since 2013 Facebook has had more people accessing it through a mobile device than from a desktop. [15]

1.3 Thesis outline

The thesis consists of seven chapters.

Chapter 1 Introduction describes the background and motivation of the work, introduces the system requirements and briefly discusses the thesis outline.

Chapter 2 Web Application Development covers the setup of development environments for a web application project and introduces the concepts of Continuous Integration and Continuous Delivery.

Chapter 3 Web Application Architectures outlines the connections among various architectural approaches and development best practices, including 3-Tier architecture, the Model-View-Controller practice and Service Oriented Architecture.

Chapter 4 Server Environments describes how application design and its effect on the original requirements differ depending on server placement in connection to the architecture.

Chapter 5 Data Persistency outlines the challenges of Relational Database Management Systems, the Big Data problems of volume, velocity and variety and their connection to No-SQL databases, as well as recommendations for data persistency in different use cases.

Chapter 6 Frontend client application design covers the current requirements on web browser clients regarding JavaScript MVC frameworks and responsive design with support of mobile devices, also in relation to Service Oriented Architecture.

Chapter 7 Conclusion summarizes the best practices of scalable web application development, also in relation to the requirements described in chapter 1.

Chapter 2 Web Application Development

In the second chapter we describe recommendations for setting up a development environment and the methods that achieve higher reliability and efficiency through the introduction of Continuous Integration (CI) and Continuous Delivery (CD).

2.1 Development Environments

Even for a single independent developer it is not recommended to develop and test code on a production server. The most basic setup therefore consists of one server for development and testing and one production server. An advanced setup can look as shown in Figure 1.

Figure 1 - Development Environments

2.1.1 Development Environment

The development environment serves for basic development of a web application on a local copy. It can be placed either directly on a developer's computer or on a virtual machine. The development environment needs a strategy that covers the parallel work of many developers through versioning and potential locking of code.

Daily code check-ins support better overall development practices [16], as they force the programmer not to work on large pieces of untested code but promote agile development of smaller testable pieces. In agile development testing is directly connected with code development, unlike the waterfall approach where the phases happen separately one after another. This practice supports continuous development of runnable software, which integrates easily with Continuous Integration and Continuous Delivery, instead of developing code in a black box where the result is visible only after large pieces are finished. [17]

2.1.2 QA Environment and Staging Environment

Quality Assurance (QA) is responsible for testing on several levels, either manually or automatically through various kinds of test suites. Whether a separate QA machine is needed depends on organizational processes, and it can be replaced by a staging environment. Testing should be done in an environment that is as similar to production as possible.

Different testing strategies for production deployment can be introduced. One of the best practices is to test first on a staging server with staging storages and remote mock services. [18] After the tests pass we stay on the staging server but switch the storages and services to production to test full compatibility with the production environment, and only after that we deploy to production.

With the same logic we can use a separate set of hidden production servers and deploy to them first. After the hidden machines are tested, we swap them with the current production through the load balancer. This way we can also easily switch between versions as a failover strategy. This approach is often directly supported by Cloud Computing Platform as a Service offers. [19]

2.1.3 Production Environment

The production environment is directly accessible to users and connected to production data. With proper system design we should be able to run several servers in parallel with the same code to achieve higher performance. The minimum recommended number of servers is two, so that if one server fails the load balancer can automatically redirect traffic to the second server. [20]

Load balancers are a critical part of a distributed architecture. They improve the availability of a web application by distributing load across the machines and handling simultaneous connections, which in the end is exactly what allows horizontal scaling of the system. A load balancer can be either software or a physical hardware device and can use various algorithms, including random selection, round robin or selection based on criteria such as CPU load or memory usage. [6]
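
The selection algorithms named above can be illustrated with a minimal sketch. It is not taken from the thesis or from any particular load balancer product: the `Server` and `LoadBalancer` classes, the `cpu_load` field and the `healthy` flag are hypothetical names chosen for illustration, and a real balancer would obtain load and health data from monitoring.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cpu_load: float       # hypothetical metric between 0.0 and 1.0
    healthy: bool = True  # would come from health checks in a real setup


class LoadBalancer:
    """Illustrative selection strategies over a fixed pool of servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def _healthy(self):
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy server available")
        return healthy

    def pick_random(self, rng=random):
        # Random selection among the healthy servers.
        return rng.choice(self._healthy())

    def pick_round_robin(self):
        # Walk the pool in order and skip servers that are down;
        # one full lap is enough to find a healthy server if any exists.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server.healthy:
                return server
        raise RuntimeError("no healthy server available")

    def pick_least_loaded(self):
        # Selection based on a criterion, here the lowest CPU load.
        return min(self._healthy(), key=lambda s: s.cpu_load)
```

Marking a server as unhealthy removes it from all three strategies, which corresponds to the automatic redirection of traffic when one of the two recommended production servers fails.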

Figure 2 - Load Balancer

2.1.4 Backup Environment

All production data and servers should be replicated to a backup environment, which should if possible be placed in a different geographic location to account for natural disasters that can destroy the main datacenter. [21] Backup strategies should include these important factors:

- Understand what you can afford to lose, and every time you introduce a new piece into your system include a strategy for its backup and failover. This is especially important for No-SQL types of databases, which can be difficult to back up.
- What can be automated for backups should be automated. Availability and reliability of business services are highly important, and only automation of backups and failovers can provide fast issue mitigation.
- Always test the backups as well to make sure that they are reliable. The worst time to find out that something is not being backed up properly is when your production servers are down. Test different scenarios in which different parts of the environment fail, not only a complete switch.
- A disaster recovery plan and business continuity strategy need to be defined on a company-wide level to determine which systems take priority over others. Executives should understand what happens, and how, when something goes down. Otherwise the IT staff trying to fix the problem will come under even more pressure from people making inquiries during manual failovers.

Strategies for switching to the backup environment in reaction to failures of any kind vary depending on the structure of the whole environment and its remote parts. The switch can be decided manually or triggered automatically by events fired by server monitoring. Proper server monitoring is an important recommendation in any case. A mitigation process for server failures can look as shown in Figure 3.

Figure 3 - Failure Mitigation Process

2.2 Testing and Continuous Delivery

Continuous Delivery (CD) is a design best practice for achieving safe, automated and regular deployment of new software versions. Integral parts of CD are Continuous Integration (CI) and Automated Testing (AT).

Continuous Integration

CI is a software engineering best practice usually connected to test driven development. Its basic principle is that every piece of code checked in by a developer into the version control system triggers automated tests that ensure it works and is compatible with the rest of the software. The code is therefore continuously integrated and every version in the version control system is potentially runnable. [22]

Figure 4 - Continuous Integration

As described in the part of the thesis covering the development environment, check-ins should be done daily in an agile development approach. CI therefore becomes an important piece that integrates the created software daily, which allows problems to be detected early and greatly improves the reliability of the whole system. [23]

Continuous Integration needs an agile development process for maximum effectiveness. Unlike waterfall development, the agile approach allows parallel development of code and its tests and a fast reaction to errors and changes in requirements. [24]

Automated Testing

Test automation leverages automated test suites that compare the real software output with predicted outcomes, which especially improves the testing of repeatable tasks (for example testing a long set of web redirects). [24] The Quality Assurance process should define how code is tested, and the person who develops the code should not be the one who writes its test cases.

As a part of CI and CD, automated tests are necessary to allow code to be built and deployed automatically. Important questions are code coverage and speed. In a large system, going through all testing stages can be very time consuming and become a bottleneck for development, therefore the right testing strategy is necessary.

Automated testing does not consist only of unit testing but should also cover integration testing, behavioral testing and other categories. Behavioral testing automatically walks through the website using its natural web interface in the same sequence a user would, and then checks that the expected content or critical messages are present on the page, thereby verifying the behavior. [25] Behavioral testing is especially important for the quality assurance of features connected to user use cases, that is, the functions of the web application that provide the most business value to the customer and therefore have an important impact on reliability.

Continuous Delivery Process

The deployment pipeline in Figure 5 shows which steps are followed between code being checked into a version control system and a new software release. The first steps after a check-in should be fully automated. The CD system should be able to run a build and the unit tests written for the code, as well as other automated acceptance tests, as part of continuous integration. [26] Although automated testing is very important for CD, it is not forbidden to also perform manual test steps. It is expected that user acceptance tests (depending on the company process) are performed and that final approvals before release are given manually by a responsible manager.

Figure 5 - Continuous Delivery

CD also depends on the more general company development processes that are in place and generally correlates with agile methodologies. In SCRUM development the result of every sprint should be a releasable package that the product owner can decide to release, and CD can be the exact tool to support it. It is also important to understand that CD does not shorten sprints in SCRUM. One of the basic ideas of SCRUM is that every item should be finished as releasable based on the definition of done, and with CD the company can release every day and still have 14-day sprints. [27]
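
The order of automated and manual steps in such a pipeline can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the `Stage` class, the `run_pipeline` function and the stage names are hypothetical, and each stage is reduced to a callable that returns whether it passed. It is not the interface of any real CI/CD tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes
    manual: bool = False     # True for steps a person has to approve


def run_pipeline(stages, approvals=frozenset()):
    """Run stages in order; stop at the first failure or missing approval."""
    log = []
    for stage in stages:
        if stage.manual and stage.name not in approvals:
            log.append((stage.name, "waiting for approval"))
            return False, log
        passed = stage.run()
        log.append((stage.name, "passed" if passed else "failed"))
        if not passed:
            return False, log
    return True, log


# Hypothetical pipeline: automated stages first, manual gates at the end.
PIPELINE = [
    Stage("build", lambda: True),
    Stage("unit tests", lambda: True),
    Stage("automated acceptance tests", lambda: True),
    Stage("user acceptance tests", lambda: True, manual=True),
    Stage("release approval", lambda: True, manual=True),
]
```

Calling `run_pipeline(PIPELINE)` without approvals stops at the first manual gate, while passing both stage names in `approvals` lets the release go through. This mirrors the idea that every check-in can produce a releasable package and the decision to release stays with a responsible person.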

2.3 Impact of Development Best Practices on Requirements

The development environment best practice of daily check-ins, which supports agile continuous development of small pieces, improves the quality of software and leads to better overall reliability. Quality and reliability can be further improved by separating the QA environment, by using mock objects and by taking advantage of the support for different testing environments provided directly by Platform as a Service Cloud Computing offers. The staging server then becomes the final frontier where testing is performed in a production-like environment.

Load balancers supporting a distributed environment are important for maximal availability of the service, as they are able to divide the load or even mitigate a failure of one of the machines. Backup and failover strategies are a very important piece of system reliability that makes a disaster recovery plan and business continuity strategy possible.

Continuous Integration and Continuous Delivery are approaches that improve the reliability and maintainability of our systems throughout their development, release and maintenance.

Chapter 3 Web Application Architectures

This part of the thesis describes the best practices for web systems architecture by introducing a design for distributed environments that deals with the separation of layers from the point of view of technology. Then we introduce 3-Tier architecture, describe its connection with the Model-View-Controller pattern and extend it with Service Oriented Architecture to best address our requirements.

3.1 Distributed application design

Even if we are not currently going to use cloud computing offers, which are basically an example of a distributed environment [28], and are not presently planning to migrate to a distributed server environment, our web application should be designed to follow best practices that improve its performance and scalability and allow it to be used in a distributed environment in the future. Thinking of a distributed environment brings the need to address these design concerns:

- Latency: on a single server there is low latency between the computing unit, the database and physical storage, but in a distributed environment such latencies can have a crucial impact on application performance and therefore need to be addressed by caching strategies and distribution of application layers. [10]
- Data centralization: multiple servers need to be able to share data and user sessions, and therefore the application needs to be designed to access all data from a centralized repository rather than just from the server where the computing unit is placed. This also allows easy scaling of performance through simple cloning of production servers, which are designed to process requests but do not need to continuously hold a session with a user. [1]

As an example of the server layers of a web application, we divide it into the data layer, compute layer, Internet layer and users, and describe how each layer should be designed to take advantage of the possible architecture.

Figure 6 - Distributed Application Design

Data Layer

We use the data layer as a centralized storage accessed by the compute layer and designed for maximal performance. As an example of three different data storages we use the RDBMS PostgreSQL, the key-value No-SQL store Memcached and the column No-SQL database HBase.

The RDBMS is used to store relational data used by the application. The advantages of an RDBMS are its ability to capture complex interconnected structures of data and its reliability. For example, an RDBMS is ideal for storing information about user licenses, accounts, etc. [29]

HBase or other column No-SQL storages should be used to deal with Big Data problems. An example could be a system that captures gigabytes or even terabytes of data through logs and diagnostics. HBase can then allow parallel processing in a cluster and optimized storage. [30] The results of the processing can then be stored in the RDBMS. In a distributed server setting we can also use column storage as a centralized place for application logs. Other types of data can use different specialized storages, such as document storage for an ERP document subsystem.

Memcached allows only key-value storing of data, but thanks to its in-memory operation it is much faster than disk-based storage. [31] Application logic should be designed to use this kind of storage for caching frequently used data, including centralized user sessions. A multiuser system that wants to improve parallel access of many users to the same data can be designed to read such data from Memcached storage that is invalidated in a predefined period. For example, users read the data from memory 100 times, then the entry is invalidated and refreshed once, and the next 100 reads are again served from memory.
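
The periodic invalidation described above can be illustrated with a small in-process sketch. It is an assumption-laden stand-in, not the Memcached client API: the `TtlCache` class, its `get_or_load` method and the `loader` callback are hypothetical names, and a plain dictionary plays the role of the shared memory store.

```python
import time


class TtlCache:
    """In-process stand-in for a key-value cache with a fixed lifetime per entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry can be tested
        self._items = {}    # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: served from memory, the slow storage is not touched.
            return entry[1]
        # Missing or expired entry: load once from the slow storage and cache it.
        value = loader(key)
        self._items[key] = (now + self.ttl, value)
        return value
```

With a lifetime of, say, 60 seconds, any number of reads inside that window results in a single call to `loader`, which would be the database query in a real application. The cost of the design is that readers may see data that is up to one lifetime old.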

Storing binary files

We have several options for storing binary data. An RDBMS allows storing such data but is usually considered the worst choice, because such data do not take advantage of the relational schema yet take a lot of space. Specialized No-SQL databases such as document databases provide an advantage when we need to store data with additional information (tags, metadata, ...) and perform operations over them. [32]

The most common choice for storing binary data is still some kind of physical or virtual hard drive. In our own setting we can share a disk on a dedicated server, or we can use cloud technologies such as Amazon S3, Windows Azure Blob Storage and others. The challenge of using these storages usually lies in high latency. Application logic should therefore be designed not to access these data through the compute layer unless necessary but only to reference them. To improve performance we can use a Content Delivery Network as described in the Internet Layer section. [33]

Unique binary files should not be stored on compute servers, because there they are not available to other servers and get lost when the virtual machine is deleted. Virtual machines can, however, store them as a caching strategy for files that are processed on the compute layer. In MVC the Model decides where and how which data are stored and controls the caching strategies.

Compute Layer

The compute layer should contain the whole application logic that processes client requests, but without holding any state between different requests and without storing user sessions locally. Load balancers can be set up to always direct one user to one particular server to allow persistence through a session, but avoiding this brings the advantage of independence, which allows improved load balancing, information sharing, simple cloning and failovers without the risk of losing unsaved data. If one of the machines enters any kind of error state, the load balancer can simply redirect traffic and restart the machine, or remove it completely and create a new one. All clients access the compute layer only through the load balancer. [6]

The application architecture can support a model in which different kinds of logic are divided among different groups of machines running different code. We can therefore have one set of machines that processes client requests and another set of machines that runs in a cycle processing data, etc.

In MVC logic an application is divided into Model, View and Controller, or in MVP into Model, View and Presenter. The Model contains the logic for accessing and storing data, usually through an ORM. The Controller/Presenter accesses data only through the Model and does not need any information about how and where such data are stored. The processed information is then returned as a View. [34]
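
The stateless compute layer described in this section can be sketched as follows. The sketch assumes a shopping cart as the session content; `CentralSessionStore` and `ComputeNode` are hypothetical names, and the dictionary inside the store stands in for a shared storage such as Memcached.

```python
class CentralSessionStore:
    """Stand-in for a shared session storage that all compute nodes can reach."""

    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        # Return a copy so that nodes never share mutable state by accident.
        return dict(self._sessions.get(session_id, {}))

    def save(self, session_id, data):
        self._sessions[session_id] = dict(data)


class ComputeNode:
    """A node keeps no per-user state between requests."""

    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions

    def add_to_cart(self, session_id, item):
        session = self.sessions.load(session_id)
        cart = session.get("cart", []) + [item]
        session["cart"] = cart
        self.sessions.save(session_id, session)
        return {"served_by": self.name, "cart": cart}
```

Because the cart lives only in the central store, two consecutive requests of the same user can be served by different nodes, and a node can be restarted or replaced without losing session data.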

Using Service Oriented Architecture logic, everything should be accessible as a service through REST or SOAP interfaces and consumed by a client, as described in detail in an independent chapter later.

The compute layer should be designed for maximal performance. The usual bottlenecks are in waiting for data, and therefore caching and the right data layer architecture are of the essence.

Internet Layer

The Internet layer covers the dependency on external services. In Service Oriented Architecture we can use internal services or depend on services available publicly or privately through the Internet as remote services. How to incorporate such services is described in the chapter discussing SOA.

To serve static content (images, documents, videos, ...) to users globally with high availability and performance we should use a Content Delivery Network (CDN). CDNs consist of datacenters spread across the world to ensure geographic availability and use replication of files for load balancing. If one of your videos suddenly becomes popular in India, the CDN will replicate the file into several copies on different servers and balance the load to ensure a seamless experience for all users. [33]

Static content should not be stored on the same servers that perform computation. Such servers are usually set up to add unnecessary session management, and their bandwidth and performance come under avoidable stress. For servers with a low number of parallel users it does not make a difference, but with higher load it is a necessity.

Users

When thinking of users, today's developers need to keep in mind that customers use their web application through very different interfaces, from large home PC screens to very small smart phones. There are many techniques to achieve optimal results for all users, and it is recommended to use Service Oriented Architecture so that the application becomes and stays independent of the end consumer of its service, whether it is a JavaScript based webpage or an application on a tablet device.

3.2 3-Tier Architecture

The 3-Tier architecture model for web applications separates application logic into three layers, typically represented by the presentation tier, the domain logic tier and the data storage tier. [35]

The presentation tier represents frontend generation and the user interface. It creates a request on the domain logic tier and translates the result back to the user.

The logic tier coordinates the processing of data output and input between the presentation tier and the data tier.

The data tier takes care of data persistency and of accessing different databases and storages as an abstraction for the domain logic tier.

3-Tier architecture is nowadays commonly used in software development, as it also allows the separation of roles for backend and frontend developers, but it comes with the disadvantages of not enforcing complete separation of the presentation layer and not supporting higher modularity. 3-Tier architecture does not directly support the creation of services but can be extended with it. [36] But this separation of layers improves reliability, scalability and modularity.

Model-View-Controller (MVC) can and should be used together with 3-Tier architecture. They don't map completely because MVC is triangular. Controllers and Views should be considered a part of the presentation layer, because in the end they manage the presentation of data that are processed through the Model on the domain logic tier. The Model also expands into data access logic, which is covered by the data tier in 3-Tier architecture. [37]

Figure 7 - 3-Tier Architecture with MVC

3.3 Service Oriented Architecture

Service Oriented Architecture (SOA) looks at a whole system from the perspective of interconnected services: on one side we have a system producing a service, and on the other side a consumer that uses it, which can be another system or a client that displays data and an interface to its users. [38] SOA is not a direct competitor to 3-Tier architecture but rather expands it by adding rules for complete separation of the presentation tier from the other tiers, breaking functionality into separate independent services, and having those communicate through messaging and application interfaces. [36]

In October 2009 the SOA Manifesto was published with these basic principles: [39]

• Business value over technical strategy
• Strategic goals over project-specific benefits
• Intrinsic interoperability over custom integration
• Shared services over specific-purpose implementations
• Flexibility over optimization
• Evolutionary refinement over pursuit of initial perfection

3.4 Web Services

A web service is an Internet-based interface that abstracts its backend and exposes an API, usually through REST or Simple Object Access Protocol (SOAP), to be consumed by another system or a client. SOAP is used more frequently for system-to-system integration because of automated service discovery. [39] REST is more frequently used by end clients because of its lower bandwidth demands, which is especially important for mobile devices running over a cellular network. [40]

One service can depend on other services, but such a dependency should not be visible to the end consumer. One service should always represent one piece of independent functionality.

In the traditional 3-Tier web application design the backend is a direct or indirect producer of the frontend user interface, and therefore both sides of the design are coupled together. In SOA the backend produces only a service; it can be aware of the consuming client, but the service stands independently.
Both approaches can use the common Model-View-Controller (MVC) design, with the difference lying in the view: traditionally it would be an interactive web page, but in SOA it is the output of the service (XML, JSON, YAML). One disadvantage of using services is the difficulty of finding out which other systems and clients depend on a service; therefore its API should never be changed directly but rather marked as deprecated and kept supported if possible.

Nowadays services are easy to develop, as they are directly supported in programming frameworks for all major programming languages. Client frameworks such as JavaScript MVC/MVVM frameworks are also usually designed to work seamlessly with a REST interface. Integrated Development Environments support SOAP discovery and generate prepared code based on it. [41]

SOA can be used to design the full backend server; the frontend can then be generated by a separate server using a lightweight backend that consumes services on the model level of MVC.

When designing a system or client that depends on external services, never forget a fallback: design a scenario for how the system should react if the service becomes unavailable or unreliable.

3.5 Web Service Orchestration

Web Service Orchestration adds a system layer that deals with dependencies among services in terms of their overall workflow. Orchestration aligns business or system needs with the application, data, services and underlying infrastructure in a way that allows for maximum reliability of the system, and it possibly adds measures for centralized management of the resource pool with monitoring. Orchestration then allows multiple services to be exposed to the consumer as a single high-level service. Because services are created as independent pieces, orchestration is necessary to create business logic in which the services can cooperate, usually through a messaging service such as an enterprise service bus (ESB). [42]

An example of orchestration is a high-level service that creates a new user in a larger system. The user enters user information through a web interface. Through the service API the request is delivered to the central messaging service, and orchestration invokes other services to make a record in a database, to check for background information through an e-mail address search, and to call the Customer Relationship Management (CRM) system to connect data for marketing purposes.
Testing scenarios for orchestration should again include situations in which different underlying services are unavailable. In case of failure the centralized messaging service should not lose data but should be able to wait for the dependent service to come online again.

All major programming languages have systems and frameworks available to ease the development of orchestration. For example, the .NET Framework uses BizTalk, and Java EE offers several options such as Mule ESB or Apache Synapse.
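The user-creation example and the requirement that no message is lost while a dependent service is offline can be illustrated with a minimal sketch. It is not how BizTalk, Mule ESB or Apache Synapse work internally; the `Orchestrator` class, the `ServiceUnavailable` exception and the service names are hypothetical and stand in for a real messaging service.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) service that cannot be reached right now."""


class Orchestrator:
    """Exposes several independent services as one high-level operation."""

    def __init__(self, services: Dict[str, Callable[[dict], None]]) -> None:
        self.services = services
        # Messages for services that were unavailable; they are kept, not dropped.
        self.pending: Deque[Tuple[str, dict]] = deque()

    def create_user(self, user: dict) -> None:
        """Deliver the same request to every registered service."""
        for name in self.services:
            self._deliver(name, user)

    def _deliver(self, name: str, message: dict) -> bool:
        """Call one service; queue the message if the service is unavailable."""
        try:
            self.services[name](message)
            return True
        except ServiceUnavailable:
            self.pending.append((name, message))
            return False

    def retry_pending(self) -> int:
        """Try each queued message once; return how many are still waiting."""
        for _ in range(len(self.pending)):
            name, message = self.pending.popleft()
            self._deliver(name, message)
        return len(self.pending)
```

A test scenario in the spirit of the paragraph above would register a CRM service that raises `ServiceUnavailable`, call `create_user`, and check that the message stays in `pending` until `retry_pending` succeeds.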

3.6 Impact of Architectures on Requirements

Distributed application design brings a set of best practices with a direct impact on how technologies are organized to accommodate different requirements.

Reliability is added by separating functional units, which clarifies the design and makes each part easier to replace in a failover strategy. This separation also improves availability and supports horizontal scalability through the division of functionality into variable nodes. Therefore, if compute power becomes a bottleneck we can easily scale it independently while it is still served by the same data layer. Other improvements are then introduced through different technologies.

The granularity of scaling an application can be improved further by using independent services, which can then live on separate machines. Services in lower demand can be grouped on a single server, and services in high demand can be operated on their own machines.

A No-SQL database can improve availability thanks to its performance, and reliability thanks to its simpler data organization. A Content Delivery Network then improves availability and scalability by supporting easy distribution of static content based on actual demand.

3-Tier architecture brings the recommendation to distribute systems into different layers instead of holding different logic together, which again improves both the reliability and the scalability of the system. This is further improved by using the Model-View-Controller pattern and by extending the design to Service Oriented Architecture.

Service Oriented Architecture is also the best solution for the requirement that a web application be available through a service interface and support different consumers, both in a web browser and through a native mobile application.

By using best practices for architecture we can easily achieve great maintainability and expandability, because we lower the complexity of resolving issues or addressing new requirements.
Because the solutions are modular and independent, we can easily replace, improve or add new parts with minimal or no impact on their surroundings.

Chapter 4 Server Environments

The fourth chapter directly relates to architectures but looks at them from the perspective of physical and virtual server placement and the impact of these choices on our predefined requirements. It describes a simple single server environment, introduces virtualization and own server management in server housing, and continues to practices used with modern Cloud Computing offerings.

4.1 Single Server Environment

A single server environment is most popular with independent developers and small companies because of its low cost and because a small web application is easier to develop when it places lower demands on architecture.

With a single server environment it is important to think about a failover strategy. A good practice is to have a second server that synchronizes data and has the same server configuration. Physically, such a server should be placed as far from the first server as possible to account for natural disasters. The failover strategy should then actually support switching to the backup server, which is possible, for example, through an automated load balancer that can detect unresponsiveness of the production server. [6]

Advantages of Single Server Environment

• Easier development, which does not need to take performance and scalability into account but is highly limited
• Low latency between the software server, database engine and other storages, which makes it easy to achieve better availability, but with limitations

Disadvantages of Single Server Environment

• Server performance can be extended only by obtaining a stronger server, which is insufficient for a web solution that has to handle very large traffic
• Does not adapt easily to hardware or software failures, which impacts reliability
• Hard to migrate to a distributed environment, which impacts maintainability
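The failover strategy from Section 4.1, in which a load balancer switches to the backup once the production server stops responding, can be sketched in a few lines. This is only an illustration under stated assumptions: `choose_server` and the `is_responsive` callback are hypothetical names, and a real load balancer would perform network health checks rather than call a local function.

```python
from typing import Callable, Optional, Sequence


def choose_server(
    servers: Sequence[str],
    is_responsive: Callable[[str], bool],
) -> Optional[str]:
    """Return the first responsive server in priority order, or None.

    `servers` lists the production server first and the backup after it;
    `is_responsive` is an assumed health check supplied by the caller.
    """
    for server in servers:
        if is_responsive(server):
            return server
    return None
```

Listing the production server first means traffic returns to it automatically once its health check passes again, which is one possible policy, not the only one.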

4.2 Server Housing

In this case the company or organization manages its servers itself and can create its own distributed environment by means of virtualization (nowadays by a server we rarely mean a physical machine but rather a virtual software device). With this approach the starting costs are higher but long-term costs can be lower, and the company can benefit from complete control over its own environment, which also removes the common concern about security in Cloud Computing. [43]

Server virtualization is a method of software emulation of hardware that creates a virtual pool of resources which are dynamically available and support horizontal scalability. [44]

Server housing with virtualization also provides easier expansibility for a distributed system. Growth costs are always higher because new physical devices must be purchased, but the distributed system can grow seamlessly on these servers, which can also host different virtual machines for development, quality assurance, staging and backup. For the backup server, however, the recommendation remains to preferably use a different geographic location.

4.3 Cloud Computing

Cloud computing is a distributed server environment that is commonly available even to independent developers and small companies. The idea is to provide a seemingly unlimited pool of virtual resources, from data storage to CPU power. It is a form of distributed computing with the benefits of utilization, remote provisioning of scalable resources with reduced investment and proportional costs, and increased scalability, availability and reliability. [1]

Cloud uses abstraction and virtualization to hide details of system specifications from users and developers. Virtualization pools and shares scalable resources that can be provisioned with multi-tenancy enabled.
[2] Cloud computing is traditionally divided into these categories:

• Infrastructure as a service (IaaS). Examples: Amazon Web Services, Rackspace
• Platform as a service (PaaS). Examples: Google App Engine, Windows Azure
• Software as a service (SaaS). Examples: Google Docs, Windows Live

All three types are connected by the principles of abstraction and virtualization. The consumer of the service does not deal with any underlying structures and only consumes virtual units of a service, which depend on its type. Therefore, in the case of IaaS, for example, the consumer can flexibly decide to run two servers and one database and to utilize 1 TB of data storage, and tomorrow this can simply be expanded to 6 servers without the need to understand all the infrastructure and technology behind such a service.

There are already many IaaS and PaaS offers to choose from. Some of the best known are Amazon Web Services, Windows Azure, OpenShift and Google App Engine. Nowadays the basics of

those services are very similar in terms of both pricing and technology. The richest offer might be found with Amazon Web Services, which operates since If you are using Microsoft technologies, the best integration and price offer will come with Microsoft's Windows Azure.

Cloud Computing offerings come with higher reliability than is possible to achieve on a low budget in small companies. For example, Microsoft provides a % monthly service level agreement for its compute service. But even Cloud Computing does not provide 100 % reliability. For example, on February 29, 2012 Windows Azure did not anticipate a leap year and ran into serious issues that made services unavailable. [6][7]

Advantages of Cloud Computing

• On-demand service allows the consumer to easily utilize a large scale of virtual resources at any time
• Resource pooling provides dynamic allocation and reallocation of virtual resources through abstraction, from virtual machines to data storages
• Rapid elasticity permits scaling resources up and down fast and seemingly without limits
• Cloud datacenters are considered highly reliable thanks to automated failovers and load balancers, especially in comparison with set-ups available to small companies
• Lower costs are offered to the consumer through the high efficiency of large datacenters and usually by removing the need to buy expensive server software, security, safety and maintenance
• Low prices provide a low barrier to entry and therefore make advanced distributed server technology available even to small companies and independent developers

Disadvantages of Cloud Computing

• For large companies that can afford their own datacenters, cloud offers may not provide a financial advantage
• The consumer does not control the physical storage of possibly sensitive data; privacy and security are therefore the main concerns of using Cloud Computing, which may be in conflict with national law in some countries (for example, the Czech Republic does not allow storing users' credit card data outside of national borders)

Chapter 5 Data Persistency

This chapter describes various relational database management systems as well as No-SQL solutions. We then describe the Big Data challenge of volume, velocity and variety of data and identify categories of No-SQL databases with their use cases. Especially important are the recommendations at the end of the chapter, which compare use cases for ACID, BASE and the CAP theorem.

5.1 Relational Databases

The concept of the relational database has been known for over 40 years and is therefore considered very dependable. It is based on organizing data into tables and describing connections (relations) between them through keys. Datasets are organized through structures and indexes that allow very fast access to specific data through queries. The four most used relational database management systems (RDBMS) are PostgreSQL, MySQL, Oracle Database and Microsoft SQL Server. [45]

5.1.1 PostgreSQL

PostgreSQL is a multiplatform open source RDBMS supporting both Windows and Linux platforms. As an open source system it competes especially with the other open source RDBMS, MySQL, and although it is less popular, PostgreSQL is considered a faster and more reliable system. PostgreSQL is most commonly used together with PHP, Ruby on Rails and Java, and it has been a part of the cloud Amazon Relational Database Service since November [46]

5.1.2 MySQL

MySQL is a multiplatform open source RDBMS supporting both Windows and Linux platforms. It is the most popular open source RDBMS as part of the LAMP platform and has been owned by Oracle since [47] Contrary to common belief, MySQL has for years supported all important RDBMS features including transactions, procedures, triggers, views, full-text search and clustering. MySQL was the first database engine supported by the cloud Amazon Relational Database Service since November [46] It is most commonly used together with PHP.

5.1.3 Oracle Database

Oracle Database is a multiplatform proprietary RDBMS supporting both Windows and Linux platforms. It is traditionally considered the fastest and most reliable database engine in the world. Oracle is most commonly used together with Java on the Linux platform, and it has been a part of the cloud Amazon Relational Database Service since June [46]

5.1.4 Microsoft SQL Server

Microsoft SQL Server is a Windows-only proprietary RDBMS. It is considered a reliable and advanced engine used mainly together with ASP/.NET. SQL Server has been customized as Azure SQL for the Windows Azure cloud computing platform, and it has also been a part of Amazon Relational Database Service since May [46]

5.1.5 Conclusion

All RDBMS contain similar features. For projects without specific demands or heavy performance pressure it is recommended to use MySQL with PHP, or PostgreSQL with PHP or any other language. For mid-sized projects, and especially projects built on Microsoft technologies, it is recommended to use Microsoft SQL Server, which is optimized for Microsoft-based solutions and accounts for enterprise requirements. Oracle Database is recommended together with Java for highly demanding enterprise applications, which can justify its higher price.

5.2 Big Data

Big data is data that exceeds the processing capacity of conventional database systems in one or more of the categories of volume, velocity or variety. [48] Big Data therefore focuses on processing large volumes of data (financial statistics, weather forecasting, etc.), processing data that flows at an increased rate (fast-moving trading data, real-time network monitoring, etc.), and processing data with high diversity (raw feeds, network flow, etc.). In technology this trend is closely connected with No-SQL (Not Only SQL) databases, which for example allow real-time processing of large amounts of data in a cluster to identify important information.

5.2.1 Big Data Origins

Companies analyzing Big Data focus on either internal or external data. Based on a study done by IBM and the University of Oxford we can identify these main sources: [49]

• Internal
  o Transactions 88 %
  o Log data 73 %
  o E-mails 57 %
• External
  o Social media 43 %
  o Audio 38 %
  o Photos and videos %

5.2.2 Examples of Big Data usage

The categories are taken from the Big Data Infographic [50] and extended with further description.

• Marketing
  o Determining marketing campaign effectiveness
    Capturing large data from a wide marketing campaign and running factor-dependent analysis to determine its effectiveness.
  o Determining marketing channel effectiveness
    Monitoring a marketing channel while it runs and analyzing it for statistically significant data to determine channel effectiveness or provide processed visual output.
  o Tailoring marketing campaigns and promotional offers
    Running multifactor analysis over customer data measuring responses to find similarities and improve the overall optimization of campaigns and promotional offers.
• Customer Service
  o Identifying customers who are at risk of dropping our product/service
    Analysis of Big Data to find emerging patterns that correlate with the behavior of a customer at risk of dropping a product or service.
  o Analyzing behavior of customers using the company's website to see which pages are most and least useful
    Monitoring of data beyond limited visit statistics to find correlations and patterns that are used to describe and improve the value of specific business-critical pages.
  o Identifying patterns in customer complaints

    Big Data analysis searching for patterns that correlate product data with the topics of customer complaints through wide statistics of a large number of users.
• Research & Development
  o Monitoring product quality
    Analysis of data provided by installed products in software improvement programs that collect quality statistics and relations between multiple factors.
  o Identifying customer needs for new products and enhancements to existing products
    Monitoring of user behavior to determine the situations in which a user is most likely to purchase a new product in given categories, in order to provide direct offerings or optimize marketing campaigns.
  o Testing new product designs
    Capturing user data as responses to new designs to find patterns relevant to user satisfaction in comparison with alternatives.
• Human Resources
  o Improving employee retention by determining who is most likely to leave and trying to discourage them
    Monitoring and analyzing company data in search of employees who are likely to leave the company, based on previous cases.
  o Identifying effectiveness of recruiting campaigns
    Capturing large data from a wide marketing campaign and running factor-dependent analysis to determine its effectiveness for recruiting.
  o Determining employees to promote and provide other rewards
    Analysis of company data based on comparison with previous cases to identify employees whose skills could be leveraged in a higher position or who could be motivated through other rewards.
• Sales
  o Identifying customers with the most value/potential value
    Processing of Big Data captured in a customer relationship management system to determine customer value based on previous behavior patterns, taking into account multiple factors including network effects of customers bringing new customers.
  o Identifying cross-selling opportunities

    Analysis of the customer environment based on available data to determine opportunities to offer additional products or services that would bring supplementary value to the customer.
  o Determining optimal sales approaches/techniques
    Monitoring of statistics available through sales to find which patterns lead to sales optimization and which lower the chances of purchase.
• Manufacturing
  o Product quality/defect tracking
    Analyzing and capturing data that was previously connected to patterns leading to defects, or tracking historical data of product quality.
  o Supply planning
    Forecasting market behavior based on Big Data to improve the effectiveness of supply planning.
  o Manufacturing process defect tracking
    Tracking statistical data of manufactured products to determine trends of improving or declining quality.
• Logistics/Distribution
  o Monitoring product shipments
    Real-time monitoring of shipments in big companies with a high number of simultaneous deliveries.
  o Determining locations of inventory shrinkage
    Monitoring the inventory situation to capture real-time forecasts of potential drops.
  o Identifying spikes in logistics costs, and where and why they are occurring
    Monitoring patterns in logistics data that correlate with increasing or falling costs depending on various statistics considering locations, weather, time, employees and others.
• Finance
  o Budgeting/forecasting/planning
    Valuable cash flow analysis to find new patterns and correlations in data that make forecasts of future development possible.
  o Measuring risk
    Tracking risks and determining their values for risk management.
  o Determining financing amounts of customers

    Analyzing options for potential prospects based on the behavior of large numbers of current customers.

5.3 No-SQL

No-SQL databases are an alternative to traditional databases that provide mechanisms for Big Data. The data structure varies depending on the particular No-SQL storage in order to serve specific needs, usually focused on simplicity of design, scaling and performance. No-SQL databases gained their popularity together with Cloud Computing and are offered by all major cloud platforms as part of Big Data to tackle the problems of volume, velocity and variety. [5]

Compared to relational databases, No-SQL storages are still developing fast, and implementing them can bring risks of undocumented behavior, breaking changes between versions and instability. In general, No-SQL storages are considered less reliable by design than relational databases.

No-SQL databases can be classified into 4 categories: Column (HBase), Document (MongoDB), Key-value (Memcached), Graph (Neo4j). [51]

5.3.1 Column Storage

An example of a column-based No-SQL storage is Apache HBase running on the Apache Hadoop Distributed Filesystem. HBase includes compression, in-memory operation, and Bloom filters on a per-column basis. The aim of HBase as a BigTable-style database is fault-tolerant storage of large amounts of sparse data that allows finding the 0.1 % of information among 99.9 % unimportant records, in order to deal with the Big Data volume challenge. Hadoop is often used as a part of Cloud Computing offerings. Windows Azure, for example, provides HDInsight based on Apache Hadoop.

Figure 8 - Map-Reduce

An example of how the processing works can be seen in the figure above. A high volume of data enters clusters to be first processed in parallel, using mapping of input key-value pairs to produce output key-value pairs of processed data. The reducer then reduces the values that share a key to a smaller set of values. The output data can then be passed to a traditional relational database, which can use the information for Business Intelligence. [52] Map-Reduce is recommended to handle the Big Data velocity problem.

5.3.2 Document Storage

Document-oriented databases focus on storing various kinds of documents such as XML and JSON, or even PDF and Excel. Different implementations vary in how they organize and group information using collections, metadata, etc. [5]

An example of such a storage is MongoDB, which uses JSON-like documents with dynamic schemas, stored in a binary format called BSON. It allows searching by field, range queries and regular expressions, and supports indexing. MapReduce processing can usually be used similarly to Column Storages.

Performance is improved through scaling by automated sharding, which balances load across multiple servers. Replication improves the availability and throughput of data. Each replica can then be either a primary (performs all writes and reads) or a secondary (copy, failover, reads). [53]
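Returning to the Map-Reduce flow of Figure 8, which also applies to document storages, the map, group and reduce steps can be sketched on a single machine. This is a simplified illustration that counts words in log records; it does not use the Hadoop or MongoDB APIs, and the function names are hypothetical. A real cluster would run the map and reduce phases in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map one input record to output key-value pairs (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all mapped values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce the values of one key to a single smaller value."""
    return key, sum(values)


def map_reduce(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over the records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

The resulting dictionary is the kind of small, aggregated output that the text suggests passing on to a relational database for Business Intelligence.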

Figure 9 - Replication

5.3.3 Key-value Storage

Key-value storages are optimized for maximal performance through fast access to data by keys, or for easy storing of large amounts of simply structured data.

An example of a key-value storage is the Memcached distributed in-memory caching system. Keys are up to 250 bytes long and values can be at most 1 megabyte in size.

Figure 10 - Caching

As examples of cloud-based storages optimized for large amounts of data we can use AWS DynamoDB or Azure Table Storage. Their advantage lies in being far cheaper for the same amount of data and being very fast for predefined queries. [54]
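The caching use of a key-value storage can be sketched as follows. The `InMemoryCache` class is a hypothetical in-process stand-in, not the Memcached client API; it only mirrors the key and value size limits quoted above. The `get_or_load` helper shows one common way of using such a cache, where the slower data source is consulted only on a cache miss.

```python
from typing import Callable, Dict, Optional

MAX_KEY_BYTES = 250            # key limit quoted in the text
MAX_VALUE_BYTES = 1024 * 1024  # value limit quoted in the text (1 megabyte)


class InMemoryCache:
    """Hypothetical stand-in for a key-value cache with size limits."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        """Return the cached value, or None on a cache miss."""
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> bool:
        """Store the value; refuse keys or values over the size limits."""
        if len(key.encode("utf-8")) > MAX_KEY_BYTES:
            return False
        if len(value) > MAX_VALUE_BYTES:
            return False
        self._data[key] = value
        return True


def get_or_load(
    cache: InMemoryCache,
    key: str,
    load: Callable[[str], bytes],
) -> bytes:
    """Return the cached value, loading and caching it on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load(key)       # e.g. a slow database query (assumed)
    cache.set(key, value)   # oversized values are simply not cached
    return value
```

Values that exceed the limit are still returned to the caller but are not cached, so every request for them reaches the underlying data source.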

5.3.4 Graph Storage

Graph databases are used to store data with a variable number of relations between items, such as social relations or network topologies. Every element contains a direct pointer to its adjacent elements and does not need index lookups.

Figure 11 - Graph Dataset

An example of such a database is Neo4j. A graph can start with a single property and grow to a few million, but it should be organized and distributed into multiple nodes with explicit relationships. An algorithm navigates from starting nodes to related nodes and is able to answer questions such as "Which programming language paradigm does my friend prefer, and which popular language under this paradigm does he not use?" Graph databases are ideal for handling data variety with complex properties and relationships. [55]

Recommendations

Unless your application needs to serve specific needs with pressure on large amounts of data and performance, it is still preferable to use a reliable relational database if possible. It might seem that RDBMS cannot scale, but they certainly can, using similar methods of sharding, partitioning and replicating to master and slave structures. [56]

No-SQL databases are specifically designed to tackle the explicit challenges of Big Data related to volume, velocity and variety. That does not mean that RDBMS cannot handle large volumes of data, but a Column Database designed specifically for this purpose can do it efficiently, or can play the role of preprocessing for an RDBMS behind it if needed.

In specific cases an application can be served better by a specialized No-SQL storage. The simplest to implement are key-value based storages such as Memcached that can be easily used