EE 122: The World Wide Web
Ion Stoica
TAs: Junda Liu, DK Moon, David Zats
http://inst.eecs.berkeley.edu/~ee122/
(Materials with thanks to Vern Paxson, Jennifer Rexford, and colleagues at UC Berkeley)

Goals of Today's Lecture
- Main ingredients of the Web
  - URIs, HTML, HTTP
- Key properties of HTTP
  - Request-response, stateless, and resource meta-data
- Performance of HTTP
  - Parallel connections, persistent connections, pipelining
- Web components
  - Clients, proxies, and servers
  - Caching vs. replication

The Web History (I)
Vannevar Bush (1890-1974)
- 1945: Vannevar Bush, Memex:
  - "a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
- (See http://www.iath.virginia.edu/elab/hfl0051.html)
[Figure: Memex]

The Web History (II)
Ted Nelson
- 1967: Ted Nelson, Xanadu:
  - A world-wide publishing network that would allow information to be stored not as separate files but as connected literature
  - Owners of documents would be automatically paid via electronic means for the virtual copying of their documents
- Coined the term "Hypertext"

The Web History (III)
Tim Berners-Lee
- World Wide Web (WWW): a distributed database of "pages" linked through the HyperText Transfer Protocol (HTTP)
  - First HTTP implementation: 1990
    - Tim Berners-Lee at CERN
  - HTTP/0.9: 1991
    - Simple GET command for the Web
  - HTTP/1.0: 1992
    - Client/server information, simple caching
  - HTTP/1.1: 1996

Web Components
- Content
  - Objects
- Clients
  - Send requests / Receive responses
- Servers
  - Receive requests / Send responses
  - Store or generate the responses
- Proxies
  - Placed between clients and servers
    - Act as a server for the client, and a client to the server
  - Provide extra functions
    - Caching, anonymization, logging, transcoding, filtering access
  - Explicit or transparent ("interception")

HTML
Content: How?
- A Web page has several components
  - Base HTML file
  - Referenced objects (e.g., images)
- HyperText Markup Language (HTML)
  - Representation of hypertext documents in ASCII format
  - Web browsers interpret HTML when rendering a page
  - Several functions:
    - Format text, reference images, embed hyperlinks (HREF)
- Straightforward to learn
  - Syntax easy to understand
  - Authoring programs can auto-generate HTML
  - Source almost always available

URI
Content: How?
- Uniform Resource Identifier (URI)
- Uniform Resource Locator (URL)
  - Provides a means to get the resource
    http://www.ietf.org/rfc/rfc3986.txt
- Uniform Resource Name (URN)
  - Names a resource independent of how to get it
    urn:ietf:rfc:3986 is a standard URN for RFC 3986

URL Syntax
Content: How?

    protocol://hostname[:port]/directorypath/resource
    (e.g., http://inst.eecs.berkeley.edu/~ee122/fa08/index.html)

    protocol        http, ftp, https, smtp, rtsp, etc.
    hostname        Fully Qualified Domain Name (FQDN), IP address
    port            Defaults to protocol's standard port, e.g. http: 80/tcp, https: 443/tcp
    directory path  Hierarchical, often reflecting file system
    resource        Identifies the desired resource

- Can also extend to program executions:
    http://us.f413.mail.yahoo.com/ym/showletter?box=%40B%40Bulk&MsgId=2604_1744106_29699_1123_1261_0_28917_3552_1289957100&Search=&Nhead=f&YY=31454&order=down&sort=date&pos=0&view=a&head=b

HTTP
Client-Server: How?
- HyperText Transfer Protocol (HTTP)
  - Client-server protocol for transferring resources
- Important properties:
  - Request-response protocol
  - Resource metadata
  - Stateless
  - ASCII format

    % telnet www.cs.berkeley.edu 80
    GET /istoica/ HTTP/1.0
    <blank line, i.e., CRLF>

HTTP Big Picture
[Figure: client-server timeline. Request text, transfer text; request image 1, transfer image 1; request image 2, transfer image 2; finish, display page.]

Client-to-Server Communication
- HTTP Request Message
  - Request line: method, resource, and protocol version
  - Request headers: provide information or modify request
  - Body: optional data (e.g., to POST data to the server)

    GET /somedir/page.html HTTP/1.1      (request line)
    Host: www.someschool.edu             (header lines)
    User-agent: Mozilla/4.0
    Connection: close
    Accept-language: fr
    (blank line)                         (carriage return, line feed: indicates end of message)
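
As a rough illustration of how the URL syntax and the request message format fit together, the sketch below splits a URL into its parts and assembles the corresponding GET request. The helper name `build_get_request` and the `DEFAULT_PORTS` table are made up for this example; only `urllib.parse.urlsplit` comes from the Python standard library.

```python
from urllib.parse import urlsplit

# Standard ports for the two protocols listed on the URL syntax slide.
DEFAULT_PORTS = {"http": 80, "https": 443}


def build_get_request(url, extra_headers=None):
    """Split a URL and build an HTTP/1.1 GET request message (hypothetical helper)."""
    parts = urlsplit(url)
    # The port defaults to the protocol's standard port when the URL omits it.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    # Directory path + resource (plus any query string) go on the request line.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [f"GET {target} HTTP/1.1", f"Host: {parts.hostname}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends in CRLF; an empty line marks the end of the headers.
    message = "\r\n".join(lines) + "\r\n\r\n"
    return parts.hostname, port, message.encode("ascii")


host, port, request = build_get_request(
    "http://inst.eecs.berkeley.edu/~ee122/fa08/index.html",
    {"Connection": "close"},
)
print(host, port)
print(request.decode("ascii"))
```

The returned bytes are what a client would write to a TCP connection opened to `host` on `port`, much like typing the request into telnet.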

Client-to-Server Communication (cont.)
- Request methods include:
  - GET: Return current value of resource, run program, ...
  - HEAD: Return the meta-data associated with a resource
  - POST: Update resource, provide input to a program, ...
- Headers include:
  - Useful info for the server (e.g., desired language)

Server-to-Client Communication
- HTTP Response Message
  - Status line: protocol version, status code, status phrase
  - Response headers: provide information
  - Body: optional data

    HTTP/1.1 200 OK                          (status line: protocol, status code, status phrase)
    Connection: close                        (header lines)
    Date: Sun, 06 Aug 2006 12:00:15 GMT
    Server: Apache/1.3.0 (Unix)
    Last-Modified: Thu, 22 Jun 2006...
    Content-Length: 6821
    Content-Type: text/html
    (blank line)
    data data data data data...              (data, e.g., requested HTML file)

Server-to-Client Communication (cont.)
- Response code classes
  - Similar to other ASCII app. protocols like SMTP

    Code  Class          Example
    1xx   Informational  100 Continue
    2xx   Success        200 OK
    3xx   Redirection    304 Not Modified
    4xx   Client error   404 Not Found
    5xx   Server error   503 Service Unavailable

Web Server: Generating a Response
- Return a file
  - URL matches a file (e.g., /www/index.html)
  - Server returns file as the response
  - Server generates appropriate response header
- Generate response dynamically
  - URL triggers a program on the server
  - Server runs program and sends output to client
- Return meta-data with no body

HTTP Resource Meta-Data
- Meta-data
  - Info about a resource
  - A separate entity
- Examples:
  - Size of a resource
  - Last modification time
  - Type of the content
    - Data format classification, e.g., Content-Type: text/html
    - Enables browser to automatically launch an appropriate viewer
    - Borrowed from e-mail's Multipurpose Internet Mail Extensions (MIME)
- Usage example: Conditional GET Request
  - Client requests object with "If-modified-since"
  - If object hasn't changed, server returns "HTTP/1.1 304 Not Modified"
  - No body in the server's response, only a header

HTTP is Stateless
Client-Server: How?
- Stateless protocol
  - Each request-response exchange treated independently
  - Servers not required to retain state
- This is good
  - Improves scalability on the server-side
    - Don't have to retain info across requests
    - Can handle higher rate of requests
    - Order of requests doesn't matter
- This is bad
  - Some applications need persistent state
    - Need to uniquely identify user or store temporary info
    - e.g., shopping cart, user preferences and profiles, usage tracking, ...
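
To make the response format and the status-code classes from the preceding slides concrete, here is a minimal sketch of a parser for a response message. The function name `parse_response` and the sample message are assumptions made for illustration; a real client would also have to handle details such as chunked bodies and repeated headers.

```python
# Status-code classes, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def parse_response(raw):
    """Split a raw HTTP response into status line, headers (meta-data), and body."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    # Status line: protocol version, status code, status phrase.
    version, code, phrase = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "version": version,
        "code": int(code),
        "phrase": phrase,
        "class": STATUS_CLASSES[int(code) // 100],
        "headers": headers,
        "body": body,
    }


# Hypothetical sample message, shaped like the slide's example response.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Content-Length: 5\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"hello"
)
response = parse_response(sample)
print(response["code"], response["class"], response["headers"]["content-type"])
```

The `content-type` header is the meta-data a browser would consult to pick a viewer for the body.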

State in a Stateless Protocol: Cookies
- Client-side state maintenance
  - Client stores small(?) state on behalf of server
  - Client sends state in future requests to the server
- Can provide authentication
[Figure: client-server exchange. Request; Response + "Set-Cookie: XYZ"; Request + "Cookie: XYZ".]

Putting All Together
Client: How?
- Client-Server
  - Request-response
    - HTTP
  - Stateless
    - Get state with cookies
- Content
  - URI/URL
  - HTML
  - Meta-data

Web Browser
- Is the client
- Generates HTTP requests
  - User types URL, clicks a hyperlink or bookmark, clicks "reload" or "submit"
  - Automatically downloads embedded images
- Submits the requests (fetches content)
  - Via one or more HTTP connections
- Presents the response
  - Parses HTML and renders the Web page
  - Invokes helper applications (e.g., Acrobat, MediaPlayer)
- Maintains cache
  - Stores recently-viewed objects and ensures freshness

Web Browser History
- 1990: WorldWideWeb, Tim Berners-Lee, NeXT computer
- 1993: Mosaic, Marc Andreessen and Eric Bina
- 1994: Netscape
- 1995: Internet Explorer
- ...

Web Server
- Handle client request:
  1. Accept a TCP connection
  2. Read and parse the HTTP request message
  3. Translate the URI to a resource
  4. Determine whether the request is authorized
  5. Generate and transmit the response
- Web site vs. Web server
  - Web site: one or more Web pages and objects united to provide the user an experience of a coherent collection
  - Web server: program that satisfies client requests for Web resources

5 Minute Break
Questions Before We Proceed?
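
The cookie exchange from the cookies slide (Set-Cookie in a response, Cookie in later requests) can be sketched without any networking. Everything here is a simplified assumption: `server_handle`, `Client`, and the `visits` table are invented names, and the cookie is a bare token as on the slide, whereas real cookies are name=value pairs with attributes.

```python
import itertools

_session_ids = itertools.count(1)
visits = {}  # server-side table, keyed by the cookie value


def server_handle(request_headers):
    """Handle one request on its own; the cookie is the only link to earlier requests."""
    session = request_headers.get("Cookie")
    response_headers = {}
    if session is None or session not in visits:
        # First contact: hand the client an identifier to present next time.
        session = f"id{next(_session_ids)}"
        visits[session] = 0
        response_headers["Set-Cookie"] = session
    visits[session] += 1
    return response_headers, f"visit number {visits[session]}"


class Client:
    """Stores the cookie on behalf of the server and echoes it back."""

    def __init__(self):
        self.cookie = None

    def get(self):
        headers = {}
        if self.cookie is not None:
            headers["Cookie"] = self.cookie
        response_headers, body = server_handle(headers)
        if "Set-Cookie" in response_headers:
            self.cookie = response_headers["Set-Cookie"]
        return body


alice, bob = Client(), Client()
results = [alice.get(), alice.get(), bob.get(), alice.get()]
print(results)
```

The server keeps no per-connection state; it can tell the two clients apart only because each request carries the cookie the client stored.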

HTTP Performance
- Most Web pages have multiple objects ("items")
  - e.g., HTML file and a bunch of embedded images
- How do you retrieve those objects?
  - One item at a time
- What transport behavior does this remind you of?

Fetch HTTP Items: Stop & Wait
[Figure: client-server timeline. Start fetching page; request item 1, transfer item 1; request item 2, transfer item 2; request item 3, transfer item 3; finish, display page.]
- 2 RTTs per object

Concurrent Requests & Responses
- Use multiple connections in parallel
- Does not necessarily maintain order of responses
  - Client = Why?
  - Server = Why?
  - Network = Why?
- Is this fair?
  - N parallel connections use bandwidth N times more aggressively than just one
  - What's a reasonable/fair limit as traffic competes with that of other users?
[Figure: requests R1, R2, R3 and transfers T1, T2, T3 on parallel connections.]

Pipelined Requests & Responses
- Batch requests and responses
  - Reduce connection overhead
  - Multiple requests sent in a single batch
  - Small items (common) can also share segments
- Maintains order of responses
  - Item 1 always arrives before item 2
- How is this different from concurrent requests/responses?
[Figure: client sends Request 1, Request 2, Request 3; server returns Transfer 1, Transfer 2, Transfer 3.]

Persistent Connections
- Enables multiple transfers per connection
  - Maintain TCP connection across multiple requests
  - Including transfers subsequent to current page
  - Client or server can tear down connection
- Performance advantages:
  - Avoid overhead of connection set-up and tear-down
  - Allow TCP to learn more accurate RTT estimate
  - Allow TCP congestion window to increase
  - i.e., leverage previously discovered bandwidth
- Default in HTTP/1.1
- Can use this to batch requests on a single connection

Improving HTTP Performance
- Many clients transfer same information
  - Generates redundant server and network load
  - Clients experience unnecessary latency

Example: 5 objects, RTT = 50 ms
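
One way to work through the "5 objects, RTT = 50 ms" example is a back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not in the slides: transmission time is ignored, all object names are known up front, connection set-up costs one RTT, and the parallel case opens one connection per object. The function name `fetch_time_ms` is made up.

```python
def fetch_time_ms(num_objects, rtt_ms, mode):
    """Estimate total fetch time in RTTs, ignoring transmission time (assumption)."""
    if mode == "stop-and-wait":
        # New connection per object: 1 RTT set-up + 1 RTT request/response each.
        rtts = 2 * num_objects
    elif mode == "parallel":
        # One connection per object, all opened and used at the same time.
        rtts = 2
    elif mode == "persistent":
        # One set-up, then one request/response RTT per object, back to back.
        rtts = 1 + num_objects
    elif mode == "pipelined":
        # One set-up, then all requests sent in a single batch.
        rtts = 1 + 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rtts * rtt_ms


for mode in ("stop-and-wait", "parallel", "persistent", "pipelined"):
    print(f"{mode:14s} {fetch_time_ms(5, 50, mode)} ms")
```

Under these assumptions stop-and-wait takes 500 ms, a persistent connection 300 ms, and both parallel connections and pipelining 100 ms; real transfers add transmission time and TCP slow start on top.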

Caching on the Client
Caching: How?
- Modifier to GET requests:
  - If-modified-since: returns "not modified" if resource not modified since specified time
- Response header:
  - Expires: how long it's safe to cache the resource
  - No-cache: ignore all caches; always get resource directly from server

Example: Conditional GET Request
- Return resource only if it has changed at the server
  - Save server resources!
- Request from client to server:

    GET /~ee122/fa08/ HTTP/1.1
    Host: inst.eecs.berkeley.edu
    User-Agent: Mozilla/4.03
    If-Modified-Since: Sun, 27 Aug 2006 22:25:50 GMT
    <CRLF>

- How?
  - Client specifies "if-modified-since" time in request
  - Server compares this against "last modified" time of desired resource
  - Server returns "304 Not Modified" if resource has not changed
  - ... or a "200 OK" with the latest version otherwise

Caching with Reverse Proxies
- Cache documents close to server
  - Decrease server load
- Typically done by content providers
- Only works for static content
[Figure: reverse proxies in front of the servers.]

Caching with Forward Proxies
- Cache documents close to clients
  - Reduce network traffic and decrease latency
- Typically done by ISPs or corporate LANs
[Figure: reverse proxies near the servers, forward proxies near the clients.]

Caching w/ Content Distribution Networks
- Integrate forward and reverse caching functionality
  - One overlay network (usually) administered by one entity
  - e.g., Akamai
- Provide document caching
  - Pull: Direct result of clients' requests
  - Push: Expectation of high access rate
[Figure: forward proxies and the CDN.]

Caching with CDNs (cont.)
- Also do some processing
  - Handle dynamic web pages
  - Transcoding
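
The interplay of Expires and If-Modified-Since in a pull-style cache (browser cache or proxy) can be sketched as follows. This is a toy model under stated assumptions: times are plain integers in seconds, the `Origin` and `PullCache` classes are invented for the example, and the origin grants a fixed 60-second lifetime to every response.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    body: str
    last_modified: int  # when the origin last changed the resource
    expires: int        # until when the copy may be served without asking


class Origin:
    """Hypothetical origin server; resources maps path -> (body, last_modified)."""

    def __init__(self):
        self.resources = {}
        self.requests = 0

    def get(self, path, now, if_modified_since=None, ttl=60):
        self.requests += 1
        body, modified = self.resources[path]
        if if_modified_since is not None and modified <= if_modified_since:
            return 304, None, modified, now + ttl  # Not Modified: header only
        return 200, body, modified, now + ttl      # OK: latest version


class PullCache:
    """Fills itself on demand, as a direct result of client requests."""

    def __init__(self, origin):
        self.origin = origin
        self.entries = {}

    def get(self, path, now):
        entry = self.entries.get(path)
        if entry and now < entry.expires:
            return entry.body, "hit"               # still fresh: no origin contact
        since = entry.last_modified if entry else None
        status, body, modified, expires = self.origin.get(path, now, since)
        if status == 304:
            entry.expires = expires                # copy revalidated, keep the body
            return entry.body, "not modified"
        self.entries[path] = Entry(body, modified, expires)
        return body, "fetched"


origin = Origin()
origin.resources["/a"] = ("v1", 0)
cache = PullCache(origin)
log = [cache.get("/a", now=10), cache.get("/a", now=20), cache.get("/a", now=100)]
origin.resources["/a"] = ("v2", 120)               # resource changes at the origin
log += [cache.get("/a", now=130), cache.get("/a", now=200)]
print(log, origin.requests)
```

Note the request at time 130: the cache still serves "v1" because its copy has not expired, which is the freshness trade-off a cache accepts in exchange for fewer origin requests.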

Example: Akamai
- Akamai creates new domain names for each client content provider
  - e.g., a128.g.akamai.net
- The CDN's DNS servers are authoritative for the new domains
- The client content provider modifies its content so that embedded URLs reference the new domains
  - "Akamaize" content, e.g.: http://www.cnn.com/image-of-the-day.gif becomes http://a128.g.akamai.net/image-of-the-day.gif

Example: Akamai (cont.)
[Figure: Akamai request flow. Labels: DNS server for www.cnn.com; www.cnn.com "Akamaizes" its content; "get http://www.cnn.com"; Akamaized response object has inline URLs for secondary content at a128.g.akamai.net and other Akamai-managed DNS names; "lookup a128.g.akamai.net"; local DNS server; akamai.net DNS servers; Akamai servers (a, b, c) store/cache secondary content for "Akamaized" services.]

Caching vs. Replication
- Why move content closer to users?
  - Reduce latency for the user
  - Reduce load on the network and the server
- How?
  - Caching
    - Replicate content "on demand" after a request
    - Store the response message locally for future use
    - Challenges: May need to verify if the response has changed ... and some responses are not cacheable
  - Replication
    - Planned replication of content in multiple locations
    - Update of resources handled outside of HTTP
    - Can replicate scripts that create dynamic responses

Conclusions
- Key ideas underlying the Web
  - Uniform Resource Identifier (URI), Locator (URL)
  - HyperText Markup Language (HTML)
  - HyperText Transfer Protocol (HTTP)
  - Browser helper applications based on content type
- Performance implications
  - Concurrent connections, pipelining, persistent conns.
- Main Web components
  - Clients, servers, proxies, CDNs
- Next lecture: drilling down to the link layer
  - K & R 5.1-5.3, 5.4.1