Lesson 2: Web as Graph / Web as System
Helmar Burkhart
Computer Science, Universität Basel
Helmar.Burkhart@...
WT-2-1

Learning Objectives and Content
Learning objectives:
- Recognize the Web as a graph
- See the basic elements of social networks
- Understand HTTP
- Apply HttpUnit
Content:
- Graph package: Python/NetworkX
- Internet fundamentals
- HTTP protocol
- HTTP live
- HttpUnit
WT-2-2
WWW as a Graph
http://www.aharef.info/static/htmlgraph/
- Web pages connected by links
- Social web connected by people
http://mlg.ucd.ie/summer
WT-2-3

WWW Architecture
Source: Andrew S. Tanenbaum, Computer Networks
http://www.youtube.com/watch?v=4vxpazla0zc
WT-2-4
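The idea of "web pages connected by links" can be made concrete with a small sketch. It uses only the Python standard library and a hypothetical three-page site (the page names and HTML are invented for illustration); each page becomes a node and each link a directed edge.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_link_graph(pages):
    """Map each page name to the list of pages it links to (adjacency list)."""
    graph = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        graph[name] = collector.links
    return graph


# Hypothetical three-page site, made up for this example.
PAGES = {
    "index.html": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "a.html": '<a href="b.html">B</a>',
    "b.html": '<a href="index.html">Home</a>',
}

graph = build_link_graph(PAGES)

# In-degree: how many links point at each page.
in_degree = {page: 0 for page in graph}
for targets in graph.values():
    for target in targets:
        in_degree[target] = in_degree.get(target, 0) + 1
```

If NetworkX is installed, the same adjacency dictionary can be passed to `networkx.DiGraph(graph)` for further analysis.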
OSI Reference Model
WT-2-5

Identification of Information
Access via URI (Uniform Resource Identifier):
- URN (Uniform Resource Name)
- URL (Uniform Resource Locator)
http://www.w3.org/addressing/
General format of a URL:
  <url> ::= httpaddress | ftpaddress
  <httpaddress> ::= http:// hostport [/path] [?search]
http://www.w3.org/addressing/url/5_bnf.html
WT-2-6
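The `<httpaddress>` rule above can be checked against a concrete URL. The following sketch uses `urllib.parse.urlsplit` from the Python standard library; the example URL and the function name `split_http_url` are made up for illustration.

```python
from urllib.parse import urlsplit


def split_http_url(url):
    """Split a URL into the parts named in the httpaddress rule."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,    # "http"
        "host": parts.hostname,    # host part of hostport
        "port": parts.port,        # port part of hostport, None if absent
        "path": parts.path,        # [/path]
        "search": parts.query,     # [?search], without the "?"
    }


# Illustrative URL, not taken from the slides.
example = split_http_url("http://www.example.org:8080/docs/index.html?q=uri")
```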
URI Syntax
  http://domain-name/directory+file
  https://domain-name/directory+file
  ftp://domain-name/directory+file
  file:///directory+file
  news:domain-name
  news:name@domain-name
  mailto:name@domain-name
WT-2-7

Domain Name System
- Translates host names to IP addresses.
- Distributed database served by name servers.
- Performance: through caching.
- Reliability: through redundancy.
RFC 1034, 1035 (1987)
http://de.wikipedia.org/wiki/domain_name_system
WT-2-8
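"Performance: through caching" can be illustrated with a minimal sketch of a resolver that remembers answers for a fixed time. The class name `CachingResolver`, the default lifetime of 300 seconds, and the injectable `lookup` function are assumptions of this example, not part of DNS itself.

```python
import time


class CachingResolver:
    """Wrap a lookup function and cache its answers for ttl_seconds."""

    def __init__(self, lookup, ttl_seconds=300, clock=time.monotonic):
        self._lookup = lookup    # function: host name -> IP address
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._cache = {}         # host name -> (IP address, expiry time)

    def resolve(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached and cached[1] > now:
            return cached[0]     # still fresh: no lookup needed
        address = self._lookup(host)
        self._cache[host] = (address, now + self._ttl)
        return address
```

With a network connection, `CachingResolver(socket.gethostbyname)` would ask the system resolver only on the first call for each host name.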
DNS Hierarchy
.                    root
.com  .org           generic Top Level Domains
.ch  .de             country code Top Level Domains
.unibas              subdomain
.cs                  "subdomain"
fgb                  alias for a host computer
TLD administration by IANA (www.iana.org/)
WT-2-9

DNS Database
Consists of resource records with the format name, value, type, ttl.
Type A: address
- name: symbolic name of host
- value: IP address
Type CNAME: canonical name
- name: alias for host
- value: name of host
Type NS: name server
- name: symbolic name of domain
- value: authoritative name server
Type SOA: start of authority
- name: symbolic name of domain
- value: administrator
WT-2-10
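How the record types work together can be sketched with a tiny in-memory table in the format name, value, type, ttl. The host names, addresses, and the function `resolve_address` are hypothetical; a real resolver queries name servers rather than a list.

```python
# Hypothetical records in the order (name, value, type, ttl).
RECORDS = [
    ("host1.example.org.", "192.0.2.10", "A", 3600),
    ("www.example.org.", "host1.example.org.", "CNAME", 3600),
    ("example.org.", "ns1.example.org.", "NS", 86400),
]


def resolve_address(name, records, max_hops=8):
    """Follow CNAME aliases until an A record is found; None if there is none."""
    for _ in range(max_hops):
        by_type = {
            rtype: value
            for rname, value, rtype, _ttl in records
            if rname == name
        }
        if "A" in by_type:
            return by_type["A"]
        if "CNAME" in by_type:
            name = by_type["CNAME"]    # alias: continue with the canonical name
            continue
        return None
    return None                        # too many aliases in a row


address = resolve_address("www.example.org.", RECORDS)
```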
Sample DNS Database
  ; zone 'unibas.ch' last serial 1427
  ; from 131.152.1.1 at Fri Dec 14 12:10:52 2001
  $ORIGIN ch.
  unibas  IN SOA   iser.urz.unibas.ch. zimak1.ubaclu.unibas.ch. (
                   1428 7200 3600 604800 86400 )
          IN NS    iser.urz.unibas.ch.
          IN NS    maser.urz.unibas.ch.
  $ORIGIN ifi.unibas.ch.
  eudora  IN A     131.152.85.65
  pepper  IN A     131.152.85.88
  molly   IN A     131.152.85.83
  volley  IN A     131.152.85.87
  www     IN CNAME eudora.ifi.unibas.ch.
http://www.kloth.net/services/dig-de.php
WT-2-11

Resolution of URL
http://www.sample.net:8888/web/ex.html
  URL part          Resolved via               Result
  www.sample.net    DNS lookup, then ARP       156.111.1.1, then 00:05:f8:22:1c:4a
  8888              Port                       server process listening on port 8888
  web/ex.html       Filesystem                 file on the server
WT-2-12
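The resolution chain for the example URL can be sketched as follows. The two dictionaries stand in for the DNS lookup and for ARP and hold only the values shown on the slide; the function name `resolve_url` and the fallback to port 80 are assumptions of this example.

```python
from urllib.parse import urlsplit

# Stand-ins for the real lookups, filled with the values from the slide.
DNS_TABLE = {"www.sample.net": "156.111.1.1"}
ARP_TABLE = {"156.111.1.1": "00:05:f8:22:1c:4a"}


def resolve_url(url):
    """Show which part of the URL is resolved by which mechanism."""
    parts = urlsplit(url)
    ip = DNS_TABLE[parts.hostname]         # host name -> IP address (DNS)
    return {
        "ip": ip,
        "mac": ARP_TABLE[ip],              # IP address -> hardware address (ARP)
        "port": parts.port or 80,          # selects the server process
        "file": parts.path.lstrip("/"),    # looked up in the server's filesystem
    }


resolved = resolve_url("http://www.sample.net:8888/web/ex.html")
```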
Basic Browser Functions (Client / Server)
1. Reformat the URL entered as a valid HTTP request.
2. Establish a TCP connection using the IP address of the server; creation of a socket.
3. Send the request message to the web server and wait.
4. Server sends the response message to the client.
5. Server closes the connection.
6. Display the document, which for HTML means rendering.
WT-2-13

Typical Status Messages
Resolving host www.example.org
  Requested IP address from DNS; waiting for response.
Connecting to www.example.org
  Creating TCP connection to server.
Waiting for www.example.org
  Sent HTTP request; waiting for response.
Transferring data from www.example.org
  HTTP response has begun, but has not completed.
Done
  HTTP response has been received; further processing may be needed before the document is displayed.
WT-2-14
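The basic browser functions (reformat the URL, open a TCP connection, send the request, read until the server closes the connection) can be sketched with the standard `socket` module. The function names `build_request` and `fetch` are made up for this example, and rendering is left out.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Reformat a URL as an HTTP/1.0 GET request; returns (host, port, bytes)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host = parts.hostname
    port = parts.port or 80
    host_header = host if port == 80 else f"{host}:{port}"
    request = f"GET {path} HTTP/1.0\r\nHost: {host_header}\r\n\r\n"
    return host, port, request.encode("ascii")


def fetch(url):
    """Send the request over TCP and read until the server closes the connection."""
    host, port, request = build_request(url)
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # empty read: server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("http://www.example.org/")` needs network access and returns the raw response message, headers included.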
HTTP Protocol
- HTTP = HyperText Transfer Protocol
- HTTP takes place through TCP/IP sockets (default port 80).
- HTTP is a stateless protocol.
- HTTP is used to transmit resources (files or server-side script output).
- HTTP/1.0 (RFC 1945, 1996), HTTP/1.1 (RFC 2068, 1997; revised as RFC 2616, 1999)
References:
  http://www.w3.org/protocols/
  http://www.freeprogrammingresources.com/http.html
  HTTP Made Really Easy (James Marshall, 1997)
WT-2-15

Message Format
The formats of request and response messages are similar:
- initial request/response line
- zero or more header lines
- a blank line (CRLF)
- optional message body
Request:
  GET / HTTP/1.0
  Host: www.unibas.ch
  User-Agent: Mozilla/4.0
Response:
  HTTP/1.1 200 OK
  Content-Length: 2579
  Content-Type: text/html

  <HTML><HEAD>...
WT-2-16
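The message format (initial line, header lines, blank line, optional body) can be sketched as a small parser. The function name `parse_message` and the shortened sample response are made up for illustration; a production parser would also handle repeated and folded headers.

```python
def parse_message(raw):
    """Split a raw HTTP message into (initial line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # names are case-insensitive
    return start_line, headers, body


# Shortened sample response, invented for this example.
RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 13\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<HTML></HTML>"
)

start_line, headers, body = parse_message(RESPONSE)
```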
Initial Request and Response Line
A request line has three parts:
- HTTP method name (uppercase)
- Request-URI (path of the resource)
- HTTP version identification (uppercase)
A response line also has three parts:
- HTTP version identification
- 3-digit response status code (2xx: success, 4xx: client error)
- Reason phrase
WT-2-17

Header Lines and Message Body
Header lines are typically one line per header with the format
  Header_Name: Value
HTTP/1.0 defines 16 headers; none are required.
Examples: Host, Accept, User-Agent, From, Server, Last-Modified.
The message body is the requested resource sent to the client. Typical header lines that describe the body are
- Content-Type: MIME type such as text/html or image/gif
- Content-Length: number of bytes
WT-2-18
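The three-part structure of the request line and the response line can be sketched as follows. The function names are made up for this example; the status classes for 1xx, 3xx and 5xx come from the HTTP specification rather than from the slide, which lists only 2xx and 4xx.

```python
# First digit of the status code -> class of response.
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}


def parse_request_line(line):
    """Return (method, request URI, HTTP version)."""
    method, request_uri, version = line.split(" ")
    return method, request_uri, version


def parse_status_line(line):
    """Return (HTTP version, status code, reason phrase, status class)."""
    # The reason phrase may contain spaces, so split at most twice.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason, STATUS_CLASSES.get(code[0], "unknown")


request = parse_request_line("GET /index.html HTTP/1.0")
status = parse_status_line("HTTP/1.0 404 Not Found")
```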
HTTP Methods
WT-2-19

Message Headers
WT-2-20
HTTP Methods
GET: Retrieve the information identified by the request URI.
HEAD: Server must not return a message body (validity check, last modification, etc.).
POST: Send data to a server.
- There is a (large) block of data to be sent.
- The request URI is a program that handles the data.
- HTML forms are usually sent this way.
URL encoding: form data are pairs of name and value strung together:
  name1=value1&name2=value2&
WT-2-21

HTTP Live
via browser: Firefox or Internet Explorer
- https://addons.mozilla.org/en-US/firefox/addon/httpfox/
- http://www.httpwatch.com/
via web viewer
- http://www.rexswain.com/httpview.html
via standalone Java program
- http://www.httpunit.org/
WT-2-22
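URL encoding of form data and its use in a POST request can be sketched with `urllib.parse.urlencode`. The path, host, and field values are hypothetical, and `build_post` is a helper written for this example.

```python
from urllib.parse import parse_qsl, urlencode


def build_post(path, host, fields):
    """Build a POST request whose body carries URL-encoded form data."""
    body = urlencode(fields).encode("ascii")    # name1=value1&name2=value2
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


# Hypothetical form handler and form fields.
message = build_post(
    "/cgi-bin/register",
    "www.example.org",
    {"name": "Ada Lovelace", "course": "WT"},
)

# The receiving program decodes the pairs again.
decoded = dict(parse_qsl("name=Ada+Lovelace&course=WT"))
```

Spaces are encoded as `+` and other reserved characters as `%XX`, so the body stays a single line of name and value pairs.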
HttpUnit
The center of HttpUnit is the WebConversation class, which takes the place of a browser talking to a single site. It is responsible for maintaining session context. To use it, one must create a request and ask the WebConversation for a response.
  WebConversation wc = new WebConversation();
  WebRequest req = new GetMethodWebRequest( "http://www.informatik.unibas.ch" );
  WebResponse resp = wc.getResponse( req );
http://www.httpunit.org
WT-2-23

HTTP 1.1 Extensions
Superset of HTTP/1.0
- from 16 to 46 headers
- "Conditional GET" if a header such as If-Modified-Since is used.
- "Partial GET" if the request includes a Range header field (If-Range makes the range request conditional).
Support of multiple domains
- always include Host:
Chunked encoding
- Long script output can be sent in chunks.
Persistent connection
- Connection is not automatically closed.
Cache support
- Validation model
WT-2-24
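Chunked encoding can be illustrated with a minimal decoder sketch: each chunk starts with its size in hexadecimal on its own line, and a chunk of size 0 ends the body. The function name `decode_chunked` is made up for this example, and optional trailer headers after the last chunk are ignored.

```python
def decode_chunked(data):
    """Reassemble a message body sent with Transfer-Encoding: chunked."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line may carry extensions after ";", which are skipped here.
        size_field = data[pos:line_end].split(b";")[0]
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:                  # last chunk: end of body
            break
        body += data[pos:pos + size]
        pos += size + 2                # skip chunk data and its trailing CRLF
    return body


decoded_body = decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
```

Because the size of each chunk is announced separately, a server can start sending script output before it knows the total length.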
WebDAV
Web-based Distributed Authoring and Versioning. Extension of HTTP/1.1.
  WebDAV (Distributed Authoring Protocol):  PROPFIND, PROPPATCH, LOCK, UNLOCK, MKCOL, COPY, MOVE
  HTTP (HyperText Transfer Protocol):       GET, HEAD, POST, OPTIONS, PUT, DELETE, TRACE
Open standard for Internet-based management of files.
Applications: virtual Internet storage (iDisk), collaborative editing, versioning.
http://www.w3.org/jigsaw/doc/user/webdav.html
WT-2-25

Internet Security
Confidential data (e.g., online banking) require authentication and encryption techniques.
Protocol stack (top to bottom):
  Application
  HTTPS
  SSL/TLS
  TCP
  IP
  Data Link
  Physical
SSL: Secure Sockets Layer
TLS: Transport Layer Security
Handshake protocols using certificates.
HTTPS: HTTP over TLS/SSL.
WT-2-26
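"HTTP over TLS" can be sketched with Python's standard `ssl` module: the TCP socket is wrapped in a TLS layer, and the unchanged HTTP message travels through it. The function name `https_head` is made up for this example; the default context checks the server certificate during the handshake.

```python
import socket
import ssl

# Default context: the server certificate and the host name are verified.
context = ssl.create_default_context()


def https_head(host, path="/"):
    """Send a HEAD request over TLS and return the first bytes of the response."""
    request = (
        f"HEAD {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    # TCP connection first, then the TLS handshake on top of it (default port 443).
    with socket.create_connection((host, 443), timeout=10) as tcp:
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            tls.sendall(request.encode("ascii"))
            return tls.recv(4096)
```

Calling `https_head("www.example.org")` needs network access; the request text is the same as for plain HTTP, only the transport underneath changes.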