Efficiency of Web Based SAX XML Distributed Processing



Similar documents
Efficiency Considerations of PERL and Python in Distributed Processing

EFFICIENCY CONSIDERATIONS BETWEEN COMMON WEB APPLICATIONS USING THE SOAP PROTOCOL

Web Pages. Static Web Pages SHTML

ASP.NET: THE NEW PARADIGM FOR WEB APPLICATION DEVELOPMENT

1. Overview of the Java Language

Virtual Credit Card Processing System

CatDV Pro Workgroup Serve r

Fig (1) (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.

Glassfish, JAVA EE, Servlets, JSP, EJB

Internet Technologies_1. Doc. Ing. František Huňka, CSc.

ASP &.NET. Microsoft's Solution for Dynamic Web Development. Mohammad Ali Choudhry Milad Armeen Husain Zeerapurwala Campbell Ma Seul Kee Yoon

Web Development Frameworks

LinuxWorld Conference & Expo Server Farms and XML Web Services

A Generic Database Web Service

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems

Chapter 13 Computer Programs and Programming Languages. Discovering Computers Your Interactive Guide to the Digital World

Developing XML Solutions with JavaServer Pages Technology

Apache Jakarta Tomcat

GWIF: A Generic Web Application Integration Framework

Oracle9i Application Server: Options for Running Active Server Pages. An Oracle White Paper July 2001

Automatic Configuration and Service Discovery for Networked Smart Devices

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Course Name: Course in JSP Course Code: P5

Web Application Development

Short notes on webpage programming languages

PHP Skills and Techniques

If your organization is not already

What is a Web service?

ISM/ISC Middleware Module

White paper. IBM WebSphere Application Server architecture

CentOS Linux 5.2 and Apache 2.2 vs. Microsoft Windows Web Server 2008 and IIS 7.0 when Serving Static and PHP Content

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Lesson 7 - Website Administration

ICE Trade Vault. Public User & Technology Guide June 6, 2014

Performance Comparison of Database Access over the Internet - Java Servlets vs CGI. T. Andrew Yang Ralph F. Grove

EVALUATION OF SERVER-SIDE TECHNOLOGY FOR WEB DEPLOYMENT

GenericServ, a Generic Server for Web Application Development

Contents. Client-server and multi-tier architectures. The Java 2 Enterprise Edition (J2EE) platform

MASTER THESIS. TITLE: Analysis and evaluation of high performance web servers

Research on the Model of Enterprise Application Integration with Web Services

Credits: Some of the slides are based on material adapted from

4 Understanding. Web Applications IN THIS CHAPTER. 4.1 Understand Web page development. 4.2 Understand Microsoft ASP.NET Web application development

Trollhättan, Sweden

XML Processing and Web Services. Chapter 17

FROM RELATIONAL TO OBJECT DATABASE MANAGEMENT SYSTEMS

Web Hosting Features. Small Office Premium. Small Office. Basic Premium. Enterprise. Basic. General

Galina Bogdanova, Todor Todorov, Dimitar Blagoev, Mirena Todorova

E-commerce. Web Servers Hardware and Software

A Performance Comparison of Web Development Technologies to Distribute Multimedia across an Intranet

Instructor: Betty O Neil

ACM Crossroads Student Magazine The ACM's First Electronic Publication

What Is the Java TM 2 Platform, Enterprise Edition?

INTRODUCTION TO JAVA PROGRAMMING LANGUAGE

WEB PROGRAMMING DESKTOP INTERACTIVITY AND PROCESSING

A Comparison of Software Architectures for E-Business Applications

Web-JISIS Reference Manual

HP Education Services

How To Install Linux Titan

Enterprise Application Integration

Client/server is a network architecture that divides functions into client and server

In this chapter, we lay the foundation for all our further discussions. We start

THE CHALLENGE OF ADMINISTERING WEBSITES OR APPLICATIONS THAT REQUIRE 24/7 ACCESSIBILITY

COM 440 Distributed Systems Project List Summary

Chapter 1 Programming Languages for Web Applications

Web. Services. Web Technologies. Today. Web. Technologies. Internet WWW. Protocols TCP/IP HTTP. Apache. Next Time. Lecture # Apache.

Xtreeme Search Engine Studio Help Xtreeme

Application Servers G Session 2 - Main Theme Page-Based Application Servers. Dr. Jean-Claude Franchitti

Capacity Planning Guide for Adobe LiveCycle Data Services 2.6

QuickSpecs. QuickSpecs. Description. HP OpenVMS Application Modernization and Integration Infrastructure Package,Version 2.3

Tutorial: Building a Web Application with Struts

Functions of NOS Overview of NOS Characteristics Differences Between PC and a NOS Multiuser, Multitasking, and Multiprocessor Systems NOS Server

Pemrograman Web. 1. Pengenalan Web Server. M. Udin Harun Al Rasyid, S.Kom, Ph.D

3-Tier Architecture. 3-Tier Architecture. Prepared By. Channu Kambalyal. Page 1 of 19

LAMP Server A Brief Overview

How To Understand Programming Languages And Programming Languages

An introduction to creating JSF applications in Rational Application Developer Version 8.0

Detailed Table of Contents


PHP. Introduction. Significance. Discussion I. What Is PHP?

Inmagic Content Server Workgroup Configuration Technical Guidelines

A Tool for Evaluation and Optimization of Web Application Performance

INTRODUCTION TO WEB TECHNOLOGY

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages

REVIEW PAPER ON PERFORMANCE OF RESTFUL WEB SERVICES

Network operating systems typically are used to run computers that act as servers. They provide the capabilities required for network operation.

Applets, RMI, JDBC Exam Review

4.2 Understand Microsoft ASP.NET Web Application Development

Java Application Developer Certificate Program Competencies

Transcription:

Efficiency of Web Based SAX XML Distributed Processing R. Eggen Computer and Information Sciences Department University of North Florida Jacksonville, FL, USA A. Basic Computer and Information Sciences Department University of North Florida Jacksonville, FL, USA Abstract: Given the interest and need for Web processing and programming of distributed systems, researchers are exploring methods and techniques for facilitating programming of such systems. In this paper we explore the performance characteristics of XML processing using SAX for scripting languages PERL and PHP as compared to C (CGI) and Java (Servlets). We use SAX to process XML files via a Web based interface in each of the four environments. Keywords: PERL, PHP, Java Servlets, C CGI, Distributed Parallel Processing 1 Introduction Distributed parallel processing belongs to one of the most important areas of research in our modern computing world. From the Web to large scale numerical algorithms, distributed and parallel methods dominate distributed processing applications.[1] It is important for researchers and software engineers to have a wealth of tools available to them for programming such distributed systems. In this paper, we explore the efficiency of extensible markup language (XML) based parallel communications and processing of common Web languages. We consider pratical extraction report language (PERL), PHP, common gateway interface (CGI) with C and Java Servlets. We compare the speed and efficiency of these vehicles to demonstrate the efficiency of each implementation of Simple API for XML (SAX) to distribute data and determine how quickly each processes data after the distribution. This study is highly relevant for a typical Web based communication and corresponding processing. 2 Fundamentals A distributed system is a collection of independent computers interconnected by a network and capable of cooperating to achieve the solution of a problem. Cooperation is required by any Web access. The computers are considered loosely coupled if they do not share memory and have individual processors. Important programming strides have been made in recent years to provide applications programming environments that make distributed computing more common and easier to implement than in the past. The message passing interface (MPI), parallel virtual machine (PVM), Common Object Request Broker Architecture (CORBA), Java s Remote Method Invocation (RMI), PERL s SOAP interface, and others have extended ordinary languages to provide methods, functions, or procedures facilitating distributed computing. Web based computing is now common. PERL, PHP, CGI and Servlets are typical tools used in this environment. Communication and processing occur when Web sites are

accessed in a variety of ways, some of which are distributed processing applications. In this paper, we study the basis of Web distributed communication over a network by processing XML files that distribute a set of integers to a variety of nodes in a network. The nodes process their data, ultimately returning the processed data to the client. We use PERL, PHP, CGI, and Servlets with SAX to process the XML files. 3 PERL PERL is an open source, cross-platform programming language capable of performing a wide variety of applications. PERL is used in both the public and private sectors and is supported by UNIX, Macintosh, Windows, Linux and many more operating systems. PERL comprises the best features of many languages, including C, awk, sed, sh and BASIC. PERL works with SAX, HTML, and other markup languages to support Unicode. PERL supports both procedural and object oriented paradigms. PERL is extensible. Because of its excellent text processing capabilities, PERL is perhaps the most widely used Web programming language. [2][3] 4 PHP PHP is a widely-used general-purpose scripting language that is especially suited for Web development and can be embedded into HTML. [4] PHP is an HTML-embedded scripting language. The syntax is very similar to that of C, Java and Perl with a couple of unique PHP-specific features thrown in. The goal of the language is to allow Web developers to write dynamically generated pages quickly.[5] PHP is a complete programming language unlike the hypertext markup language. 5 CGI CGI is a standard way for a Web server to pass a Web user's request to an application program and to receive data which is forwarded to the user. When the user requests a Web page (for example, by clicking on a highlighted word or entering a Web site address), the server sends back the requested page. However, when a user fills out a form on a Web page and sends it in, it usually needs to be processed by an application program. The Web server typically passes the form information to a small application program that processes the data and may send back a confirmation message. CGI defines this method of sending data back and forth. It is part of the Web's Hypertext Transfer Protocol (HTTP). [6] 6 SERVLETS JavaServer Pages (JSP) is used to dynamically generate HTML information presented on the World Wide Web via XML. The JSP syntax adds additional XML tags, called JSP actions, to be used to invoke built-in functionality. Additionally, the technology allows for the creation of JSP tag libraries that act as extensions to the standard HTML or XML tags. Tag libraries provide a platform independent way of extending the capabilities of a Web server. The term Web server can mean one of two things:

1. a computer responsible for serving Web pages, mostly HTML documents, via the HTTP protocol to clients, mostly Web browsers; 2. a software program that is working as a daemon serving Web documents. Every Web server (sense 1) is running a Web server program (sense 2). The most common Web or HTTP server programs are: Apache HTTP Server from the Apache Software Foundation Internet Information Server (IIS) from Microsoft Zeus Web Server from Zeus Technology Sun ONE - Sun Microsystems JSPs are compiled into Servlets. The Java Servlet API allows a software developer to add dynamic content to a Web server using the Java platform. The generated content is commonly HTML, but may be other data such as XML. Servlets are the Java counterpart to dynamic Web content technologies such as CGI. However, unlike CGI, (but like PHP), it has the ability to maintain state after many server transactions. This is done using HTTP Cookies, session variables, or URL rewriting. A JSP compiler is a program that parses JavaServer Pages (JSPs), and transforms them into executable Java Servlets. A program of this type is usually embedded into an application server and run automatically the first time a JSP is accessed, but pages may also be precompiled for better performance, or compiled as a part of the build process to test for errors. [7] 7 SAX Each of the technologies were implemented using the Simple Application Programmer Interface for XML (SAX). SAX was the first widely adopted API for XML in Java, and is a de facto standard. The current version is SAX 2.0.1, and there are versions for several programming language environments other than Java including PERL, C and PHP (as used for this paper). [8] We considered each of these technologies, implemented them on a Beowulf cluster and measured their relative performance both in terms of communication alone as well as data processing. The project is Web enabled creating a very friendly interface. 8 Hardware and Operating System The hardware consists of a Beowulf cluster of computers all running RedHat linux v9. The machines are 0.5 ghz machines with 512 megabytes of main memory connected by gigabit fast ethernet. 9 Programming a Distributed System Programming a distributed system, regardless of the programming environment used, involves several fundamentals. Typically, one of three scenarios are used: boss/worker where the boss distributes a portion of the work uniquely, worker crew where each processor does essentially the same

processing on different portions of the data, often referred to as single instruction multiple data (SIMD), and a pipeline where each worker processes a portion of the data, passing that partial solution to the next worker. In this research we use the boss/worker environment as shown in Figure 1. A boss process defines the task to be accomplished and decides the method of distribution. Several worker processes do the actual work. The worker starts by listening to the socket for the arrival of work from the boss. The boss is the controlling process that has the responsibility of dividing the task appropriately. A portion of the task is sent to each of the workers. The boss waits to receive the completed tasks. Each of the workers execute in parallel. They each receive their task from the boss without interaction with other workers. Worker Worker... Worker Boss Figure 1 Communication Dependency 10 Evaluation The purpose of this research is to determine the relative efficiencies of each of the technologies for a Web based distributed application. We chose a simple sorting algorithm. Sorting is well understood and ubiquitous in computing and Web queries. We programmed an n 2 sort so reasonable time would be taken on each of the server machines, creating a measure of both communication and computation times. Boss and workers are programmed using PERL, PHP, CGI and Java, as previously explained. Each application is implemented with exactly the same functionality. The boss machine creates sets of integers whose cardinality vary between 2 and 128,000. Each set was then evenly divided and distributed among the worker machines. The 2 measures just communication time since exactly one integer is sent to each worker node. Each worker sorts their portion of the data, yielding a highly parallel solution. The communication from boss to worker to boss was handled by creating an XML file, processing the file, communicating with the workers, performing the work, and returning the results. As is often the case, the boss did not participate in the primary task, but rather distributed the work and waited for results. Timings included dividing and distributing the data, sorting each data set, receiving the sorted results from each worker, and merging to arrive at a totally sorted set of data. 11 Results Datasets of sizes 2-4-8, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, and 128000, were distributed, sorted, and timed. Since the machines used in this experiment were not dedicated to the researchers, times were taken at 3:00AM when network traffic and users were at a minimum. These machines are lightly used, so unplanned impact on communication is virtually nonexistent. The authors have carefully checked the algorithms, each is correct and implemented in a consistent manner. The

results for the complete processing are in Table 1. The results for processing just the XML file using SAX are shown in Table 2. Table 1 Efficiency Timings TIME IN SECONDS DATA PHP PERL CGI JSP PHP PERL CGI JSP PHP PERL CGI JSP SIZE Total Total Total Total Total Total Total Total Total Total Total Total 2 nodes 4 nodes 8 nodes 2,4,8 0.02 0.06 0.02 0.02 0.05 0.11 0.03 0.04 0.17 0.16 0.07 0.07 500 0.06 0.11 0.20 0.05 0.08 0.13 0.05 0.07 0.23 0.17 0.19 0.10 1000 0.12 0.18 0.06 0.09 0.13 0.19 0.07 0.11 0.15 0.23 0.10 0.13 2000 0.24 0.33 0.14 0.20 0.23 0.32 0.19 0.18 0.26 0.34 0.19 0.22 4000 0.65 0.78 0.16 0.55 0.48 0.62 0.27 0.39 0.46 0.61 0.25 0.36 8000 2.11 2.35 1.13 1.92 1.30 1.53 0.11 1.11 0.96 1.20 0.15 0.77 16000 7.83 8.31 7.01 7.50 4.26 4.67 3.16 3.85 2.61 4.00 1.12 2.20 32000 31.69 33.02 29.10 30.96 15.78 16.59 14.04 14.94 8.48 9.51 8.54 7.70 64000 154.90 157.45 153.16 152.95 63.70 65.61 60.14 62.21 31.47 32.96 28.01 29.87 128000 805.82 795.39 796.42 794.56 307.82 314.07 303.50 305.38 128.03 130.30 121.17 123.80 Table 2 Time for SAX to Process the XML file TIME IN SECONDS DATA PHP PERL CGI JSP SIZE SAX SAX SAX SAX 2 0.00 0.04 0.00 0.00 500 0.03 0.08 0.00 0.02 1000 0.06 0.11 0.00 0.03 2000 0.11 0.19 0.01 0.07 4000 0.22 0.34 0.01 0.13 8000 0.45 0.64 0.02 0.25 16000 0.89 1.23 0.04 0.50 32000 1.78 2.43 0.08 0.99 64000 3.73 4.89 0.17 2.01 128000 7.17 9.61 0.34 4.06 12 Summary and Conclusions Surprisingly, all four technologies processed the data in similar times. Interestingly, each scaled very nicely. When the number of processors doubled the time decreased by nearly half. This implies that processing time dominated, as we would expect. The first row of Table 1 essentially sends one integer to each node, thereby timing only the communication. When two workers were present, two integers were sent, one to each worker. When four workers were present, four integers were sent, again one to each node and similarly for eight workers. When more worker nodes are present, additional time is required since communication is between more computers. PERL and PHP ran slower in this area, which is somewhat surprising. PERL and PHP do not process data quickly as determined by the author s experiments in other research, but normally established communication quickly.

Interestingly, each technology processed large amounts of data at nearly the same rate. This, too, is a surprising result. Table 2 shows the time to process the XML files. Here we see a major discrepancy in timing among the technologies. CGI ran approximately 30 times faster as did JSP until large amounts of data were present. This indicates that for general Web processing where little data manipulation is involved, CGI and JSP are preferable over PERL or PHP. I find these results very interesting and not what I would have guessed. 13 Future Research There are emerging technologies for Web processing including C# and the.net framework. Since Java s JSP and the older technology of CGI performed well, an interesting study would be to compare them to the.net environment. 14 References [1] Liu, M. Distributed Computing. Boston: Addison Wesley, 2004. [2] The Perl Directory. The Perl Foundation, 2005 <http://www.perl. org/>. [3] Wall, L., Christiansen, T., and Orwant, J. Programming Perl. Sebastopol, CA:O Reilly, 2000. [4] The PHP Group. 2005 <http://www.php.net/>. [5] The PHP Manual. PHP. 2005 <http://us3.php.net/manual/en/faq>. [6] Search SQLServer.com. Techtarget. 2005 <http://searchdatabase.techtarget.com/>. [7] TheFreeDictionary.com. Farlex, Inc. 2005 <http://encyclopedia.thefree dictionary.>. [8] AboutSax. SourceForge.net. 2005 <http://www.saxproject.org/>.