
CSE 135

Server Side Web Languages

Lecture # 2

HTTP Overview


It's all about the network

If you want to do Web programming right, you need to know the ins and outs of HTTP
If the network has problems, you and your users have problems, far more often than you are probably aware

Sadly, most people don't know as much as they think they do

Easily demonstrated by performance and security problems
A few tests:
URLs: case sensitive? Length limits?
GET vs. POST?
Cookies
Layer 8 error correction: the "meat layer"


HTTP Intro
HTTP (Hypertext Transfer Protocol)
It is an application-layer protocol, similar to SMTP, POP, IMAP, NNTP, FTP, etc.
A simple protocol that defines the standard way clients request data from Web servers and how these servers respond
Typically runs on top of TCP/IP

Three versions have been used (0.9, 1.0, 1.1) and two are still commonly used
RFC 1945: HTTP 1.0 (1996)
RFC 2616: HTTP 1.1 (1999)


HTTP and TCP/IP


HTTP sits atop the TCP/IP protocol stack
Application Layer: HTTP
Transport Layer: TCP
Network Layer: IP
Data Link Layer: Network Interfaces


HTTP and TCP/IP, contd.


IP provides packets that are routed based on source and destination IP addresses
TCP provides segments that ride inside the IP packets and add connection information based on source and destination ports
The ports let TCP carry multiple protocols, each connecting to a service that runs on a default port
HTTP on port 80
HTTP with SSL (HTTPS) on port 443
FTP on port 21
SMTP on port 25
SSH on port 22


HTTP and TCP/IP, contd.


TCP also provides mechanisms to make the connection a reliable bit pipe
3-way handshake, sequence numbers, checksums, control flags
A data stream is chopped up into chunks that are reassembled, complete and in the correct order, at the other endpoint of the connection

TCP segments, riding inside IP packets, carry the chunks of data

When HTTP is the Application Layer protocol on top of the stack, these chunks of data are the contents of the HTTP message
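
The chopping and reassembly can be illustrated with a toy sketch. This is not real TCP: the function names and the 8-byte segment size are assumptions made for illustration, and real sequence numbers start from a random initial value. It only shows why numbering the chunks lets the receiver rebuild the HTTP message even when segments arrive out of order.

```python
import random

def segment(data: bytes, size: int):
    """Split a byte stream into (sequence_number, chunk) pairs.

    The sequence number here is simply the byte offset of the chunk,
    a simplification of what TCP does.
    """
    return [(offset, data[offset:offset + size])
            for offset in range(0, len(data), size)]

def reassemble(segments) -> bytes:
    """Rebuild the stream by sorting the chunks on their sequence number."""
    return b"".join(chunk for _, chunk in sorted(segments))

message = b"GET /index.html HTTP/1.1\r\nHost: www.hostname.com\r\n\r\n"
segments = segment(message, 8)
random.shuffle(segments)          # simulate out-of-order arrival
rebuilt = reassemble(segments)    # identical to the original message
```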


HTTP over TCP/IP Examples

GET /index.html HTTP/1.1<CRLF> Host: www.hostname.com Con

The HTTP message's data stream is chopped up into chunks small enough to fit in a TCP segment

The chunks ride inside TCP segments, whose headers are used to reassemble them correctly at the other end of the connection

The segments are shipped to the right destination inside IP datagrams


HTTP over TCP/IP Issues?


HTTP/1.0 opens and closes a new TCP connection for each operation.
Since most Web objects are small, this practice means a high fraction of packets are simply TCP overhead for opening and closing connections.
Add the previous point to TCP's slow-start congestion control mechanism and you find HTTP/1.0 operations use TCP at its least efficient.
HTTP 1.1 addresses these concerns with persistent connections, which are the default in 1.1 (HTTP 1.0 needed the Keep-Alive extension).
See http://www.w3.org/Protocols/HTTP/Performance for papers and information on HTTP performance issues


Basic HTTP Request/Response Cycle

[Figure: the HTTP client sends an HTTP Request asking for a resource by its URL, http://www.foo.com/page.html; the HTTP server www.foo.com returns the resource /page.html in an HTTP Response]


HTTP Request/Response Chain Example 2


[Figure labels: HTTP Client (LAN), Proxy, DMZ, Internet with Transparent Proxies, Reverse Proxy and HTTP Server (network at hosting provider); Local DNS, External DNS Servers, Root DNS Servers]


Types and Uses of Proxy Servers


Proxies are HTTP intermediaries
All act as both clients and servers
Major types of proxies can be distinguished by where they live and how they get traffic
Explicit
Transparent/Intercepting
Reverse/Surrogate

Three primary uses for proxies
1. Security
2. Performance
3. Content Filtering


HTTP Requests
HTTP requests and responses are both types of Internet Messages (RFC 822), and share a general format:
A Start Line, followed by a CRLF
Request Line for requests
Status Line for responses

Zero or more Message Headers
field-name: [field-value] CRLF

An empty line
Two consecutive CRLFs mark the end of the Headers

An optional Message Body if there is a payload
All or part of the Entity Body or Entity
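
As a rough sketch of this format, the function below splits a raw message into its start line, headers and body. The name parse_http_message is an assumption for illustration. It ignores repeated headers and folded header lines, so treat it as a teaching aid rather than a real parser.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first empty line (two consecutive CRLFs) ends the header section.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/plain\r\n"
            b"Content-Length: 5\r\n"
            b"\r\n"
            b"hello")
start_line, headers, body = parse_http_message(response)
```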


Making a simple HTTP request


You can do the last example with a tool, or just use telnet to access the default HTTP port (80)
C:\>telnet www.google.com 80

Ask for a resource using a minimal request syntax:
GET / HTTP/1.1<CRLF>
Host: www.google.com<CRLF><CRLF>

Note: A Host header is required for HTTP 1.1 requests, though not for HTTP 1.0
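
The same minimal request can be sketched in Python with a plain socket instead of telnet. The function names are illustrative assumptions. The Connection: close header is an addition not shown on the slide; it asks the server to close the connection so the read loop knows when the response is complete.

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request, ending with an empty line."""
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n").encode("ascii")

def fetch(host: str, path: str = "/", port: int = 80) -> bytes:
    """Send the request over TCP and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

request_bytes = build_request("www.google.com")
```

Calling fetch("www.google.com") needs network access and returns the status line, headers and body exactly as the server sent them.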


HTTP Request Example #1


HTTP Request Example #2


A Closer Look at the Request Line


Consists of three major parts
The Request Method followed by a space
Methods in HTTP 1.1 include: GET, POST, HEAD, TRACE, OPTIONS, PUT, DELETE and CONNECT
GET, POST, and HEAD are the most common
Extension methods also exist, such as those specified by WebDAV (RFC 2518)

The Request URI followed by a space
The URL associated with the resource to be fetched or acted upon

The HTTP Version followed by the CRLF
0.9, 1.0, 1.1


A Closer Look at the Request Methods


GET
By far the most common method
Retrieves a resource from the server
Supports passing of query string arguments

HEAD
Retrieves only the headers associated with a resource but not the entity itself
Highly useful for protocol analysis and diagnostics

POST
Allows passing of data in the entity rather than the URL
Can transmit far larger arguments than GET
Arguments are not displayed in the URL


More Request Methods


OPTIONS
Shows methods available for use on the resource (if given a path) or the host (if given a *)

TRACE
Diagnostic method for assessing the impact of proxies along the request-response chain

PUT, DELETE
Used in HTTP publishing (e.g., WebDAV)

CONNECT
A common extension method for tunneling other protocols through HTTP

There are even more methods if you look at WebDAV


Why do I care?
If you are doing Web programming, you may have to form raw requests with headers yourself.

Example: in JavaScript using Ajax, you will have to form raw HTTP requests using GET and POST (or even HEAD if you like) to transmit your data

// ajaxhttp() is a helper assumed to return an XMLHttpRequest object
xmlhttp = ajaxhttp();
xmlhttp.open("POST", url, true);
xmlhttp.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
xmlhttp.send("ret=" + encodeURIComponent(param));

Also in HTML forms, when you set the method attribute (<form method="GET|POST">) you are specifying the HTTP method used to transmit the data
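
The same idea can be sketched outside the browser. The Python below builds (but does not send) a POST request with the form-encoding Content-Type header, using the standard urllib module. The URL and field value are made-up placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_post(url: str, fields: dict) -> Request:
    """Build a form-encoded POST request object without sending it."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

# Hypothetical endpoint and parameter, for illustration only.
req = build_post("http://www.example.com/handler", {"ret": "a value & more"})
```

Passing req to urllib.request.urlopen would actually transmit it; that step is left out so the sketch runs without a network.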


A Closer Look at the Request URI


Absolute URI vs. Absolute Path
Explicit proxies require Absolute URIs
Client is connected directly to the proxy
Protocol and host name are needed to resolve the request
You might need full URLs too, especially for Web services
Watch out for www vs. no-www issues

Grammar of the Absolute Path
Like the Absolute URI minus the http://hostname
Initial / is the equivalent of the host's document root
In HTTP 1.1 with name-based virtual hosting, the Host header directs the request to the appropriate document root


The HTTP Response


Response status line consists of three major parts
The HTTP Version followed by a space
The Status Code followed by a space
3-digit integers in 5 groups indicating the result of the attempt to satisfy the request
1xx are informational
2xx are success codes
3xx are for alternate resource locations (redirects)
4xx indicate client-side errors
5xx indicate server-side errors

The Reason Phrase followed by the CRLF
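
A small sketch of pulling a status line apart and naming the class of its code. The function name and the class labels are illustrative assumptions; http.HTTPStatus is the standard library's table of codes and reason phrases. It assumes the two separating spaces are present, as the format above requires.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}

def parse_status_line(line: str):
    """Return (version, code, reason, class_name) for a status line."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    code = int(code)
    # The first digit of the code selects one of the five groups.
    return version, code, reason, STATUS_CLASSES[code // 100]

parsed = parse_status_line("HTTP/1.1 404 Not Found\r\n")
standard_phrase = HTTPStatus(204).phrase   # reason phrase for 204
```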


Observation: One-Way Requests and 204s
There are many details of HTTP that people don't consider but that are highly useful. One example is the 204 response, which sends back no data.
Observe Google using this in its search results page to send what I dub a "flare gun" request to see exactly what the user clicked on
The purpose of this is to improve search quality and defeat those folks who reverse engineer the Google algorithm
The human filter, if you like


HTTP Headers
Headers come in four major types, some for requests, some for responses, some for both:
General Headers
Provide info about messages of both kinds

Request Headers
Provide request-specific info

Response Headers
Provide response-specific info

Entity Headers
Provide info about request and response entities

Extension headers are also possible


A Closer Look at General Headers


Connection lets clients and servers manage connection state
Connection: Keep-Alive (HTTP 1.0)
Connection: close (HTTP 1.1)

Date is when the message was created
Date: Sat, 31 May 2003 15:00:00 GMT

Via shows the proxies that handled the message
Via: 1.1 www.myproxy.com (Squid/1.4)

Cache-Control is among the most complex of headers and carries caching directives
Cache-Control: no-cache


Why do I care? Unfriendly Caches


If you issue a GET request and then issue it again, the browser may not bother to talk to the server (depending on browser settings, including the defaults), but instead pull the data from its cache. This can cause lots of screwups
Example: someone looking at stale content
Example: Ajax-style apps that never see new data because the browser keeps using cached data

The obvious solution to stale caches is to add cache control headers (or to change resource names), but that defeats the value of caching
Better to know about caching and do it properly
Consider typical Web pages: what would you want to cache?


A Closer Look at Request Headers


Host is the hostname (and optionally port) of the server to which the request is being sent
Required for name-based virtual hosting
Host: www.port80software.com

Referer is the URL of the resource from which the current request URI came
Misspelled in the specification
Referer: http://www.host.com/login.asp

User-Agent is the name of the requesting application, used in browser sensing
User-Agent: Mozilla/4.0 (Compatible; MSIE 6.0)


Some More Request Headers


Accept and its variants inform servers of the client's capabilities and preferences
Enables content negotiation
Accept: image/gif, image/jpeg;q=0.5
Accept- variants for Language, Encoding, Charset

If-Modified-Since and other conditionals
Frequently used by browsers to manage caches
If-Modified-Since: Sat, 31 May 2003 15:00:00 GMT

Cookie is how clients pass cookies back to the servers that set them
Cookie: id=23432; level=3


Using Request Headers: Browser Sniffing
User-Agent is often used in browser detection to serve a different type of page to a different type of accessing agent
Similarity problem
Everything looks like old Mozilla

Spoofing or removal problem

A better approach is to combine this with an injected script or program that profiles the device. In the long run, as device diversity grows, the concept of "browser" will evolve significantly


Using Request Headers: Anti-Leeching


Oftentimes people leech your bandwidth by hotlinking directly to your object (GIF, Flash, etc.) without fetching the other related objects
This would certainly be bad if your business model depended on people seeing the related ads around the stolen object

Since the Referer header is sent with requests made from the base page, a simple form of anti-leeching is to check for it before sending a dependent object
Of course, the bad guy now moves to forge the header
Class question: can you think of other countermeasures?
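
A minimal sketch of the Referer check, assuming the request headers arrive as a dictionary. The host names and the function name are hypothetical. As noted above, the header is easy to forge, and some legitimate clients omit it, so rejecting requests with no Referer is a policy choice, not a requirement.

```python
from urllib.parse import urlsplit

# Hypothetical hosts whose pages may embed our images.
ALLOWED_HOSTS = {"www.foo.com", "foo.com"}

def allow_hotlink(headers: dict) -> bool:
    """Allow a dependent object only if the Referer is one of our pages."""
    referer = headers.get("Referer", "")
    host = urlsplit(referer).hostname   # None when the header is missing
    return host in ALLOWED_HOSTS

own_page = allow_hotlink({"Referer": "http://www.foo.com/page.html"})
leecher = allow_hotlink({"Referer": "http://leech.example/gallery.html"})
no_referer = allow_hotlink({})
```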


Using Request Headers: Content Negotiation
The user agent sends an Accept header indicating the types of content it can handle


Using Request Headers: Content Negotiation
A q-rating (q-value) can indicate the preference the user agent has for the data requested
Content negotiation allows us to ask for something like "logo" and then get the appropriate image (PNG, JPG, etc.) based upon what the device can handle.
This leads to extensionless URLs, which aid long-term maintainability
We'll see that file extensions don't mean much, really

Content negotiation can also allow language to be automatically negotiated.
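
A simplified sketch of how a server might pick a representation from an Accept header. The function names are assumptions, and real negotiation has more rules (parameter matching, specificity ordering); a malformed q value would raise an error here.

```python
def parse_accept(header: str):
    """Parse an Accept header into a list of (media_type, q) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0], 1.0      # q defaults to 1.0 when absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        if media_type:
            prefs.append((media_type, q))
    return prefs

def choose(available, accept_header: str):
    """Return the available media type the client prefers, or None."""
    prefs = dict(parse_accept(accept_header))

    def score(media_type: str) -> float:
        major = media_type.split("/")[0]
        # Exact match first, then type wildcard, then full wildcard.
        for key in (media_type, major + "/*", "*/*"):
            if key in prefs:
                return prefs[key]
        return 0.0

    best = max(available, key=score, default=None)
    return best if best is not None and score(best) > 0 else None

logo_type = choose(["image/jpeg", "image/gif"], "image/gif, image/jpeg;q=0.5")
```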


Using Request Headers: HTTP Compression


Compressed HTTP is enabled via Accept headers
The user agent sends an Accept-Encoding header indicating the compression it accepts (gzip or deflate).
The server, using mod_gzip, httpZip, etc., sends compressed content or not.
Pays off mainly on text (HTML, CSS, JS), where compression of up to 70% or more is possible
Time to first byte increases, so high-speed connections may not see as much benefit, though bandwidth is saved.
Low-speed connections clearly see a benefit.
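
A sketch of the server-side decision, using Python's gzip module in place of mod_gzip or httpZip. The function name is an assumption, and the check is simplified: it does not honor q=0 on an encoding.

```python
import gzip

def maybe_compress(body: bytes, accept_encoding: str):
    """Return (payload, extra_headers), gzipping only if the client allows it."""
    encodings = [e.split(";")[0].strip().lower()
                 for e in accept_encoding.split(",")]
    if "gzip" in encodings:
        compressed = gzip.compress(body)
        if len(compressed) < len(body):    # only worth it if it got smaller
            return compressed, {"Content-Encoding": "gzip",
                                "Vary": "Accept-Encoding"}
    return body, {}

html = b"<p>Hello, HTTP compression!</p>\n" * 200
payload, extra_headers = maybe_compress(html, "gzip, deflate")
saved = 1 - len(payload) / len(html)       # fraction of bytes saved
```

Repetitive markup like this compresses far better than typical pages, so the ratio here is not representative of real content.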


HTTP Compression Example


Compression Considerations
Increased origin server CPU load, or wasted cycles?
TTFB vs. TTLB considerations, and LANs
Decompression times
Nasty little bugs

http://support.microsoft.com/default.aspx?scid=kb;en-us;823386&Product=ie600
In Internet Explorer, The bytes that remain to be decoded in the buffer may be small (8 bytes or less) and the data contained in the buffer decompresses to 0 bytes. When Mshtml receives 0 bytes, it thinks that all the data is read and closes the data stream. As a result, the HTML page sometimes appears truncated. Typically, if it is for a referenced file such as a .js or a .css file type, the HTTP connection stops responding.

Most solved by commercial server-side implementations

People in the know use this stuff


Response Headers
Server is the server's name and version
Server: Microsoft-IIS/5.0
Can be problematic for security reasons
Security by obscurity?

Vary tells client and proxy caches which headers were used for content negotiation
Vary: User-Agent, Accept

Set-Cookie is how a server sets a cookie on a client
Set-Cookie: id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT; secure
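
Both directions of the cookie exchange can be sketched with Python's standard http.cookies module: the server builds a Set-Cookie header, and later parses the Cookie header the client sends back. The variable names are illustrative.

```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header like the one above.
outgoing = SimpleCookie()
outgoing["id"] = "234"
outgoing["id"]["path"] = "/shop"
outgoing["id"]["secure"] = True
set_cookie_line = outgoing.output()   # a full "Set-Cookie: ..." header line

# Server side, on a later request: parse the Cookie header sent back.
incoming = SimpleCookie()
incoming.load("id=23432; level=3")
values = {name: morsel.value for name, morsel in incoming.items()}
```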


A Closer Look at Entity Headers


Allow lists the request methods that can be used on the entity
Allow: GET, HEAD, POST

Location gives the alternate or new location of the entity
Used with 3xx response codes (redirects)
Location: http://www.ibm.com/us/

Content-Encoding specifies the encoding performed on the body of the response
Corresponds to the Accept-Encoding request header
Content-Encoding: gzip


More Entity Headers


Content-Length is the size of the entity body in bytes
Value shrinks when compression is applied
Content-Length: 24000

Content-Location is the actual URL of the resource if different from its request URL
Often used to show the index or default page
Content-Location: http://www.foo.com/home.html


More Entity Headers


Content-Type specifies the media (MIME) type of the entity body
Corresponds to the Accept header
Content-Type: image/png

This is the most important header to the browser. The data in this header tells the browser what it is receiving. Now it should make sense why file extensions don't really matter and are arbitrary.
Server: file extension -> MIME type
Browser: MIME type -> action (display, download, etc.)

Note: Without HTTP, the browser relies on the file extension, for example when loading a file off the local disk.


Why do I care?
Because sometimes you need to stamp outgoing data on the server side with the appropriate MIME type
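
A sketch of the server-side mapping from file extension to MIME type, using Python's standard mimetypes table. The function name and the fallback type are choices made for this example.

```python
import mimetypes

def content_type_for(filename: str,
                     default: str = "application/octet-stream") -> str:
    """Pick the Content-Type value to stamp on a response for this file."""
    media_type, _encoding = mimetypes.guess_type(filename)
    # Unknown or missing extensions fall back to a generic binary type.
    return media_type or default

png_type = content_type_for("logo.png")
html_type = content_type_for("index.html")
unknown_type = content_type_for("README")
```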


More Entity Headers : Caching Related


Expires gives the expiration time for the instance of the resource, for use in caching
Expires: Sat, 31 May 2003 19:00:00 GMT

Last-Modified is the date/time the entity was last changed (or created)
Last-Modified: Fri, 30 May 2003 09:00:00 GMT


More Entity Headers : Caching Related


ETag uniquely identifies a particular instance of a given resource
Used with conditional request headers to validate cached instances of the resource
If-Match, If-None-Match

ETag: "adkskdashjgk07563AF"
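
A sketch of how a server might answer a conditional request: 304 Not Modified when the cached copy is still valid, 200 otherwise. The function name is an assumption, and the ETag comparison is simplified (it ignores the weak W/ prefix).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def status_for(request_headers: dict, etag: str,
               last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is current, else 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        tags = [t.strip() for t in if_none_match.split(",")]
        return 304 if etag in tags or "*" in tags else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304
        except (TypeError, ValueError):
            return 200        # unparseable date: send the full response
    return 200

etag = '"adkskdashjgk07563AF"'
modified = datetime(2003, 5, 30, 9, 0, 0, tzinfo=timezone.utc)
fresh = status_for({"If-Modified-Since": "Sat, 31 May 2003 15:00:00 GMT"},
                   etag, modified)
```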


Why do I care?
You could go beyond the basic Cache-Control and Pragma headers and use Expires and other forms of cache hints.
Ultimately, you may be forced to use a query string or an alternate file name to stop misconfigured caches from causing you problems


Sending data via HTTP
Data can be sent to a server application in two primary ways:
1. Query string sent via a GET request
2. Data body sent via a POST request

In both cases the data is encoded in a special manner called x-www-form-urlencoded, which replaces spaces with + symbols, replaces special characters with %hex values equivalent to the character being escaped, and separates the individual arguments with ampersand (&) characters.
Note: Data may also be sent via HTTP headers, mostly in the form of cookie-based data. Other HTTP headers such as User-Agent, Referer, etc. can be tapped as well, but these are generally not user supplied; instead they constitute the environment in which the Web transaction takes place.
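
The encoding rules can be tried out with Python's standard urllib.parse module. The field names and values below are sample data chosen for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Encoding: spaces become +, arguments are joined with &.
body = urlencode({"Name": "Al Smith", "Age": 30, "Sex": "male"})

# Special characters are replaced with %hex escapes.
tricky = urlencode({"q": "a&b = 50%"})

# Decoding: pull the query string out of a URL and split it into arguments.
query = urlsplit("http://www.pint.com/cgi-bin/toit.pl?x=5&y=7").query
args = parse_qs(query)            # each value is a list: keys can repeat
round_trip = parse_qs(tricky)
```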


Sending Data with GET


In the case of GET, we see the submitted data (often from a fill-in form, though it may be hard-coded in links) in the actual request URL
The passed data is called the query string and follows a ? character in the URL
Example: http://www.pint.com/cgi-bin/doit.pl?name=Thomas
Example: http://www.pint.com/cgi-bin/toit.pl?x=5&y=7

These "dirty" URLs have potential downsides, including:
Technology exposure (visual reconnaissance)
Easy fiddling of parameters
Poor usability (may be a good thing as well)
Lack of long-term maintainability
Size limit is dependent on URL size limits

However, GET string based URLs are portable: you can bookmark them, send them to friends, etc.


Sending Data with GET Contd.


The GET-based data can be submitted in one of two ways
Hard-coded into a link
<a href="http://www.google.com/search?q=Web+server+software">Run query</a>
As a result of a form submission
<form action="http://www.google.com/search" method="get">
<label>Query: <input type="text" name="q" /></label>
<input type="submit" value="Submit" />
</form>


Sending Data with GET Contd.
Now it should start to make sense what query strings mean and how they are formed


Sending Data with GET Contd.
Behind the scenes, you see that the data is indeed transmitted in the request line itself


Sending Data with Post


In the case of POST, you always generate the request either programmatically or, more likely, with a form
<form action="http://www.fakesite.com/cgi-bin/submitquery.pl" method="post">
Query: <input type="text" name="query" />
<input type="submit" value="Submit" />
</form>

The POST request sends the data in the message body, but does so in x-www-form-urlencoded as well, so we might have a message body like
Name=Al+Smith&Age=30&Sex=male
No size limit, but browsers have to address the lack of safe redos: "Repost form data?"


Sending Data with Post Contd.
The network trace shows the difference between POST and GET


Why do I care?
GET and POST have different uses
GET is used when the request is idempotent, meaning repeated requests leave the server in the same state and return the same result.
POST should be used when you change the state of the server
Lots of folks will use GET for state changes because of ease of coding
Downside: inadvertent state changes by spiders, browsers, etc.


HTTP Considerations
HTTP is a stateless protocol

No memory from one request to the next

Question: How can you keep track of information from one page to the next?
Answer:
Hidden form fields that are posted back to the server
E.g. Microsoft's VIEWSTATE value in .NET

Data passed in "dirty" URL query strings
Cookies
Two types: memory (session) cookies and persistent (disk) cookies

Many programming environments go to significant lengths to provide easy state management. More on this later!


Web Servers Overview


The Programmer/IT Divide

The obvious "why": division of labor
IT: Don't touch those knobs! Put them in a sandbox so they don't hurt things, etc.
Dev: I wanna try new framework/software X. I am too busy programming to read logs / install patches

Mind the gap!
Security
Performance
Huge waste of time and money because we don't know enough to meaningfully interact


Hmm... something smells fishy

versus

http://www.channelregister.co.uk/2010/07/29/cray_1_replica/
"An irony is that the resulting scale model Cray-1 [...] is probably more powerful than Cray's original near 40-year-old design."

Today you need scads of PCs to serve simple Web apps?


The Role of a Web Server


A box and a service
Web servers serve various resources
As file (document) servers
As application front ends

A physical server can of course serve many protocols (SMTP, FTP, etc.) or may be protocol specific
Web servers are of course HTTP servers


Planning Web Server Deployments


Major issues to consider when planning a Web server or Web site deployment
What is the appropriate form of Web hosting?
What type of server software will be used?
What are the sizing requirements?
How will DNS be handled?

There are no fixed answers to any of these questions
Planning should be guided by the goals of the deployment and should harmonize with the related business processes


Choosing Among the Hosting Options


Host your own
Pro: Complete control over the physical box
Con: Expensive and difficult to maintain well

Hosting provider schemes

Dedicated Server
Pro: Control without the hardware purchase
Con: Must manage the box remotely

Co-located Server
Pro: Admin control of entire box
Con: Must purchase box and manage remotely

Virtual Hosting
Pro: Cheapest and easiest-to-maintain solution
Con: Server is shared, admin access limited


Choosing Server Software


Apache
Best reputation historically
Fun with usage stats for public sites (e.g., Netcraft)
Features rapidly extended and refined via a modular, open development model
Strong administrator ethos = well-managed boxes

IIS
Included in the Windows server environment
Security black eye (or is it from the OS?)
Favored in business and intranets
IIS 6 is solid, IIS 7 is VERY Apache-like

Beware of sectarian quarrels about Web servers


Choosing Server Software


There actually is more than just Apache / IIS
One notable example is Zeus (www.zeus.com)
High-performance Web server used by some really large sites
Tries to provide APIs from both main worlds

Many app servers (Tomcat, Zope, etc.) include Web servers (or Apache) as part of their distribution

See www.serverwatch.com for a list of some Web and app servers


Choosing Server Software, cont.


In the real world, usually a conditioned choice if not a foregone conclusion
Biggest single factors are the type of deployment and prior commitment to an underlying OS
Apache on UNIX and Linux predominates in universities, research institutes and virtual hosting setups, and has the majority of hosted domains
Netscape/iPlanet used to have the large enterprise market almost to itself; now it is nearly gone
IIS started with smaller companies, often as part of a LAN server, but has now taken over Sun/Netscape's leading role in the enterprise


Sizing a Web Server


Sizing is the process of determining the physical resources required to meet anticipated demand
Processing power and memory are not typically a problem for the Web server
The basic HTTP server job of fetching files is not processor intensive
Resource constraints on the box are probably an effect of other server-side mechanisms
Automated session management by app servers
Manipulation of large database queries
Lots of non-optimized code in Web applications
Network concerns! (next slide)


Sizing a Web Server: Network Bottlenecks

Network bottlenecks
Available bandwidth should accommodate the maximum HTTP operations (hits) under peak load
Given an average file size, could you figure out the peak load for a T1 line?
So would file size be an important consideration for Web design on a high-traffic site?

Bandwidth sizing should be adjusted based on your actual request frequency and size
Assume peaks at triple or more the average load

Also watch out for collisions and overloading of routers, switches, hubs and NICs on the network
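
The T1 question can be worked through as back-of-the-envelope arithmetic. This sketch assumes the nominal T1 rate of 1.544 Mbit/s and, by default, ignores protocol overhead; the 10,000-byte average object size and the function names are made-up inputs for illustration.

```python
T1_BITS_PER_SECOND = 1_544_000   # nominal T1 line rate

def max_hits_per_second(avg_object_bytes: int,
                        link_bps: int = T1_BITS_PER_SECOND,
                        overhead: float = 0.0) -> float:
    """Upper bound on hits per second a link can carry.

    overhead is the fraction of the link lost to headers, retransmits, etc.
    """
    usable_bps = link_bps * (1 - overhead)
    return usable_bps / (avg_object_bytes * 8)

def required_bps(avg_hits_per_second: float, avg_object_bytes: int,
                 peak_factor: float = 3.0) -> float:
    """Bandwidth needed if peaks run at peak_factor times the average load."""
    return avg_hits_per_second * peak_factor * avg_object_bytes * 8

t1_capacity = max_hits_per_second(10_000)   # about 19.3 hits per second
needed = required_bps(5, 10_000)            # 1,200,000 bit/s for 5 hits/s average
```

Halving the average object size doubles the hits the same line can carry, which is why file size matters on a high-traffic site.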


Common Web Server Tasks


Set up / install the Web server
Done here, but usually you set the name, IP and root directories
Define protected directories with basic authentication, etc.
Configure error pages for 404 errors, 403 errors, etc.
Turn off directory browsing, maybe?
Small security changes (remove or modify the Server header)
Set up aliases and other redirects
Tune for performance
Monitor for security, performance and usage
Logs

Support Web applications
Installation of frameworks, files, etc.


Server Example: IIS - Main Settings Dialog


Server Example: IIS - Directory Configuration


Server Example: IIS - Default Documents


Server Example: IIS - Error Messages


Server Example: IIS - Access Control


Server Example: IIS Log Files


Server Example: IIS Headers and Misc.


Server Example: IIS MIME Types


So you see here that file extensions map to MIME types, which is really what the user agent cares about when deciding what to do with an HTTP response
File extensions are often the key to how a server-side scripting engine works


Server Example IIS App Filter List

Organizing a Server - Virtual and Physical Site Structure

Think of a site as having not one structure but two: virtual and physical
Virtual structure is described by the URLs used to request resources from the site
This is the public view of the site: the site as visitors will see it when they browse to it

Physical structure is the organization of the files and directories in the file system on the host machine's hard disk
This is the private view of the site, seen only by you and those users you choose to give access

It will become obvious why this distinction is necessary to keep things straight


Configuring Virtual-Physical Mappings


The Document Root
A directory in the file system of the host machine where the Web server looks for the files that constitute the Web site
Also called the root directory

Often given an index or default document that serves as the homepage of the site.
Corresponds to the / at the end of the hostname portion of the URL:
http://www.foo.com/index.html (virtual)
Ex: /var/www/index.html (physical)
Ex: C:\inetpub\wwwroot\index.html (physical)


Configuring Virtual-Physical Mappings


Notice how the hostname portion of the URL maps to the same place pointed to by the physical path that lies to the left of the / representing the document root
The URL is virtual to the left of the document root, but it seems to be physical to the right of the document root

In fact, a URL is purely virtual: there is no guarantee that the path to the right of the document root looks this way on disk
Could http://www.foo.com/index2.html map to C:/foo/a/b/c/myfile.html?
Sure, you can do this with aliases, redirects, local OS mappings, all sorts of stuff


Configuring Virtual-Physical Mappings


A virtual directory or alias in the URL path preempts the lookup in the document root
This extends the virtual structure to the right of (or below) the root / in the URL path
http://www.foo.com/virtual/index2.html -> /htdocs/physical/index2.html
You can (and should) take advantage of this virtual/physical distinction to:
Preserve the site's URL scheme even if the physical structure has to change
Avoid broken links due to site expansion/revision
Manage directory and file locations in ways that minimize security risks and facilitate backup procedures
Allow developers to keep relative URLs in source code simple
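
The virtual-to-physical mapping can be sketched as a lookup: aliases are checked first, then the document root. The root, the alias table and the function name below are assumptions built from the slide's example paths; real servers do this through their own configuration.

```python
import posixpath

DOCUMENT_ROOT = "/var/www"                     # example root from the slides
ALIASES = {"/virtual/": "/htdocs/physical/"}   # virtual prefix -> physical dir

def map_url_path(url_path: str) -> str:
    """Map the path part of a URL to a file system path."""
    # Normalize first so ".." segments cannot climb above the root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    if url_path.endswith("/") and clean != "/":
        clean += "/"
    # An alias preempts the lookup in the document root.
    for prefix, target in ALIASES.items():
        if clean.startswith(prefix):
            return posixpath.join(target, clean[len(prefix):])
    if clean.endswith("/"):
        clean += "index.html"                  # default document
    return DOCUMENT_ROOT + clean

aliased = map_url_path("/virtual/index2.html")
homepage = map_url_path("/")
```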


Virtual Hosting
We know the hostname part of the URL is a virtual locator for files that live (physically) in a site's document root
The idea of virtual hosting takes this a step further by allowing a single server to host many domains, each with its own document root
Two methods of virtual hosting
Old way: multiple IP addresses per server
New way: name-based, using Host headers
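
Name-based virtual hosting comes down to a lookup keyed on the Host header. The host names, directories and function name in this sketch are hypothetical, and it does not handle bracketed IPv6 literals.

```python
from typing import Optional

# Hypothetical sites sharing one server and one IP address.
VHOSTS = {
    "www.foo.com": "/var/www/foo",
    "www.bar.com": "/var/www/bar",
}
DEFAULT_ROOT = "/var/www/default"

def document_root_for(host_header: Optional[str]) -> str:
    """Pick the document root for a request from its Host header."""
    if not host_header:
        return DEFAULT_ROOT          # e.g. an HTTP 1.0 request without Host
    # Drop an optional :port and compare case-insensitively.
    hostname = host_header.split(":")[0].strip().lower()
    return VHOSTS.get(hostname, DEFAULT_ROOT)

foo_root = document_root_for("www.foo.com")
bar_root = document_root_for("WWW.BAR.COM:8080")
```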


Managing Users and Hosts


Users (developers) will need remote access allowing them to transfer files to and from the site's physical structure
FTP (and other file transfer mechanisms) allow the administrator to restrict this access to sub-sections of the site by user account or client IP

These restrictions should be backed up by access control lists on the directories that enforce the principle of least access (least privilege)


Managing Users and Hosts


Similar rules apply to managing access to the Web site itself by visitors
ACLs in the Web site's physical file structure should be set to the minimum required by the Web server to serve the resources on the site
This gets tricky with server-side programming

If the Web site (or part of it) does not need to be available for anonymous access from everywhere, then users, groups, hosts and IPs should be restricted
HTTP Authentication can also be employed to make all or part of a site private and require login


Managing Users and Hosts


Although HTTP authentication now offers safeguards like checksums and password encryption, it is not very secure
Lack of end-to-end encryption of the entire message transmission makes hijacking, scanning and spoofing easy

If all or part of the site requires authentication and serious security for users' login credentials, form-based authentication over SSL is the only choice
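
To see why plain HTTP authentication is weak without SSL, consider the Basic scheme: the credentials are only base64-encoded, not encrypted, so anyone who can read the request can decode them. The function names are illustrative; the user name and password are the well-known example from the HTTP authentication RFC.

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def decode_basic_auth(header_value: str):
    """Recover (user, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header_value = basic_auth_header("Aladdin", "open sesame")
recovered = decode_basic_auth(header_value)   # an eavesdropper can do this too
```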


Basic SSL Configuration


Initiate an application for a certificate from a recognized Certificate Authority (CA)
The site (domain) owner will have to prove they are who they say they are

Create a Certificate Signing Request (CSR)
  Contains the site's public key, which matches up with a private key that is created at the same time and stored on the server

Submit the request to the CA and pay up
Retrieve the certificate and install it
Test the certificate with an HTTPS request
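
The final test step can be sketched with Python's standard ssl module: connect over HTTPS, let the default context verify the chain and host name, and check how long the certificate remains valid. The function names are assumptions for illustration, and fetch_certificate needs network access to a real host.

```python
import socket
import ssl
import time


def fetch_certificate(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect with TLS and return the verified peer certificate as a dict."""
    context = ssl.create_default_context()  # verifies chain and host name
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_remaining(cert: dict, now: float = None) -> int:
    """Whole days until the certificate's notAfter date."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    current = time.time() if now is None else now
    return int((expires - current) // 86400)
```

If the certificate is not trusted or does not match the host name, wrap_socket raises an ssl.SSLError subclass, which is itself a useful test result.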

Supporting Web Applications


Comparing static and dynamic sites
Static site demands
  Few performance demands on Web server
    Serving files is light work
    Caching is easy to do
    State management probably not an issue
  Few security risks
    Tight permissions possible
    No interaction with other executables or processes
  Developer support relatively simple
    Basic access control and monitoring

Supporting Web Applications, cont.


Demands introduced by dynamic page generation on server side
  Significantly heavier performance demands
    Code execution
    Database access
    Caching more difficult to do
    Complex state management schemes
  Security risks go way up
    Higher level permissions required
    Buffer overflows, code injection, hijacking
  Significantly more complex developer support
    Install, maintain application environments
    Potentially help debug the actual applications

A Digression on Web Server Internals


Server-side processing makes a simple model significantly more complex
Basic internal request/response cycle
  Read request
  Do authentication if any
  Process other headers
  Map URL to physical path
  Read file or retrieve cached response
  Send response
  Log
  Cleanup
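
The cycle above can be sketched for a static file as follows. This is a simplified illustration under stated assumptions: it handles GET only, skips authentication, header processing and caching, and the names handle_static_request and REASONS are hypothetical.

```python
from pathlib import Path

REASONS = {200: "OK", 400: "Bad Request", 403: "Forbidden",
           404: "Not Found", 405: "Method Not Allowed"}


def _respond(status: int, body: bytes, request_line: str, log: list) -> bytes:
    # Log, then build the response bytes to send.
    log.append(f'"{request_line}" {status} {len(body)}')
    head = f"HTTP/1.1 {status} {REASONS[status]}\r\nContent-Length: {len(body)}\r\n\r\n"
    return head.encode("latin-1") + body


def handle_static_request(raw_request: bytes, doc_root: Path, log: list) -> bytes:
    # Read request: only the request line, e.g. "GET /index.html HTTP/1.1".
    request_line = raw_request.split(b"\r\n", 1)[0].decode("latin-1")
    parts = request_line.split(" ")
    if len(parts) != 3:
        return _respond(400, b"bad request", request_line, log)
    method, target, _version = parts
    if method != "GET":
        return _respond(405, b"method not allowed", request_line, log)
    # Map URL to physical path, refusing anything outside the document root.
    root = doc_root.resolve()
    physical = (root / target.split("?", 1)[0].lstrip("/")).resolve()
    if root not in physical.parents:
        return _respond(403, b"forbidden", request_line, log)
    # Read file and send response.
    if not physical.is_file():
        return _respond(404, b"not found", request_line, log)
    return _respond(200, physical.read_bytes(), request_line, log)
```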

Web Server Internals, cont.


Server programming adds a new dimension
  Read request, set up internal data structures
  Do authentication if any
  Process other headers
  Map URL to script or program
    Script or program diverts request handling into new code paths
    Server must wait for result of processing before it finds out what it is supposed to send back
  Send response
  Log
  Cleanup

3 Server-Side Programming Models


What happens when the request gets diverted from the server's own internals?
  Classic CGI model: fork and exec
    Web server creates new child process, passing it request data as environment variables
    CGI script issues response using standard I/O stream mechanisms
  Server API model
    Web server runs additional request handling code inside its own process space
  Web application frameworks
    Web server calls API application, which may manage request within its own pool of resources and using its native objects
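
The classic CGI model can be sketched from the server's side: start a child process, pass request data in environment variables, and read the response from the child's standard output. REQUEST_METHOD, QUERY_STRING and GATEWAY_INTERFACE are standard CGI variable names; the inline script and the function name run_cgi are stand-ins assumed for illustration.

```python
import os
import subprocess
import sys

# Stand-in for a CGI script on disk; a real server would exec the mapped file.
CGI_SCRIPT = r'''
import os
print("Content-Type: text/plain")
print()
print("method=" + os.environ["REQUEST_METHOD"])
print("query=" + os.environ["QUERY_STRING"])
'''


def run_cgi(method: str, query_string: str) -> tuple:
    """Run the script in a new process and split its output into headers and body."""
    env = dict(os.environ,
               GATEWAY_INTERFACE="CGI/1.1",
               REQUEST_METHOD=method,
               QUERY_STRING=query_string)
    result = subprocess.run([sys.executable, "-c", CGI_SCRIPT],
                            env=env, capture_output=True, text=True, check=True)
    # A blank line separates the script's headers from the response body.
    head, _, body = result.stdout.partition("\n\n")
    headers = dict(line.split(": ", 1) for line in head.splitlines())
    return headers, body
```

Every request pays for a new process here, which is the cost the other two models try to avoid.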

3 Server-Side Programming Models

Classic CGI: fork and exec
Server API: running inside Web server's address space
Web application framework: running inside Web server process but managing its own pool of resources via IPC

3 Server-Side Programming Models


Each model has its pros and cons
  Classic CGI model
    Pro: isolation means easiest in principle to secure, least damaging if something goes wrong
    Con: isolation makes it slow & resource intensive
  Server API model
    Pro: very fast & low overhead if written properly
    Con: hard to write; blows up server if done wrong
  Web application frameworks
    Pro: ideally combines efficiency of API model with safety of CGI; adds helpful encapsulation of routine tasks like state management
    Con: built-in tools can be resource hogs in wrong hands; ease of use may encourage carelessness

3 Server-Side Programming Models


Many examples of each
  Classic CGI
    Scripts written in Perl
    Programs written in C
  Server API
    Apache modules
    ISAPI filters and extensions
  Web application frameworks
    All descended from Server Side Includes (SSI), the original parsed HTML solution that allowed interspersing of executable code with markup
    ASP, ASP.NET, ColdFusion, JSP/Servlets, Python, PHP, etc.

Server Sizing with Dynamic Content


In high traffic scenarios with dynamic pages, when bandwidth is plentiful, disk access can be the major bottleneck
Especially problematic when backend databases are being accessed to build pages

Reading from disk is always slower than reading from memory, thus add tons of memory? Memcache?
A sliding scale of solutions
  Use fast disk controllers (SCSI) or SSD (memory again!)
  Exploit caching mechanisms to keep as much data as possible in memory
  Add hardware! (and give it specialized roles)
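
Keeping data in memory can be sketched as a small time-limited cache placed in front of a slow lookup such as a database query. This is an in-process illustration of the idea, not a memcache client; the class name TTLCache and its parameters are assumptions.

```python
import time


class TTLCache:
    """Serve repeated lookups from memory until an entry is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds: float, clock=time.monotonic):
        self._loader = loader      # slow function, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}           # key -> (time stored, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Missing or expired: go to the slow source and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (now, value)
        return value
```

The hit and miss counters give a rough measure of how much disk or database work the cache is saving.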

A complex server farm configuration


[Figure: server farm diagram showing load balancers, reverse proxies with memcache, Web and application servers, and DB clusters]

Web Applications and Site Structure


With server-side programming it becomes even more important to treat the URL as virtual rather than physical
  Each file called by a URL can generate many different responses
  At the extreme, some methodologies call for a single file to generate all pages in the site
  Many different physical resources, including database tables and additional files (includes), might be required to produce one response
  Filters or modules might preempt or rewrite certain URLs altogether
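
The single-file extreme mentioned above is often called a front controller. A minimal sketch, assuming hypothetical URL patterns and handler names, maps virtual URLs to functions instead of to files:

```python
import re

ROUTES = []  # (compiled pattern, handler) pairs, checked in order


def route(pattern: str):
    """Register a handler for URLs matching the given regular expression."""
    def register(handler):
        ROUTES.append((re.compile(pattern), handler))
        return handler
    return register


@route(r"^/products/(?P<product_id>\d+)$")
def show_product(product_id: str) -> str:
    return f"<h1>Product {product_id}</h1>"


@route(r"^/about$")
def about() -> str:
    return "<h1>About us</h1>"


def dispatch(url_path: str) -> tuple:
    """Return (status, body) for a URL path; no file with that name need exist."""
    for pattern, handler in ROUTES:
        match = pattern.match(url_path)
        if match:
            return 200, handler(**match.groupdict())
    return 404, "<h1>Not Found</h1>"
```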

Web Analytics Overview


Log File Formats, Configuration, Management
Why do Log Analysis?
  Traffic Analysis (internal and external)
  Quality of Service Analysis
  Security audits
  Performance analysis
Statistics, Tracking, Reporting
  Basic Concepts
  Limitations and Caveats
  Free and commercial tools

Setting up a Robust Logging System

Log File Formats


Apart from error logs, Web servers generate access or transfer logs that record per request activity
Two formats
  Common Logfile Format (CLF)
    remotehost rfc1413 authuser [date] "request" status bytes
    Combined Logfile Format adds referer and user-agent
  Extended Logfile Format (ELF)
    Two required directives (Version and Fields) at the top tell consumers of the log file how to parse it
    #Version: 1.0
    #Fields: date time c-ip sc-bytes time-taken cs-version
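
A Common Logfile Format entry can be parsed with one regular expression. The sketch below assumes the usual layout of host, identity, user, bracketed date, quoted request, status and byte count; the pattern and the function name parse_clf are illustrative and do not cover every variation a server can be configured to write.

```python
import re
from datetime import datetime

CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line: str):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" byte count means no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    return fields
```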

More on Extended Logfile Format


date and time are standard fields
Beyond those, the administrator is free to specify a wide range of extended fields
  In IIS: c-ip cs-username s-sitename s-computername s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs-version cs-host cs(User-Agent) cs(Cookie) cs(Referer)
Apache has particularly customizable formatting
  Arbitrary ordering of fields, interspersing of text and formatting
  Conditional logging using environment variables or regular expressions on the URL
  Routing of certain entries to specialized logs
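
Because the Fields directive names the columns, a consumer of an extended format log can handle whatever selection the administrator chose. A minimal sketch follows, assuming whitespace-separated values with no embedded spaces; the function name and the sample lines in the test are hypothetical.

```python
def parse_extended_log(lines):
    """Turn extended-format lines into dicts keyed by the names in #Fields."""
    fields = []
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line such as "#Version: 1.0" or "#Fields: date time ...".
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()
        # Skip entries that do not match the declared field list.
        if fields and len(values) == len(fields):
            records.append(dict(zip(fields, values)))
    return records
```

A new Fields directive part-way through a file simply changes the keys used for the entries that follow it.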

Managing Logs: Best Practices


Log everything you need, but not what you do not need
Rotate log files at intervals appropriate for your analysis and archiving requirements
Write logs to a convenient, distinct, ample and secure location
For heavy duty analysis on high traffic sites, consider using dedicated database server(s)
  Records can be inserted directly or asynchronously
  Analysis carried out without burdening site
  Especially necessary for analysis of logging that covers extended time periods (i.e., longer than a single day)
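
Loading log records into a database for analysis can be sketched with SQLite from the Python standard library. A dedicated database server would take its place in production; the table layout and function names here are assumptions for illustration.

```python
import sqlite3


def open_log_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database with a simple access log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_log ("
        "day TEXT, client_ip TEXT, url TEXT, status INTEGER, bytes_sent INTEGER)"
    )
    return conn


def insert_records(conn: sqlite3.Connection, records) -> None:
    """Insert a batch of (day, client_ip, url, status, bytes_sent) tuples."""
    with conn:  # commits the whole batch as one transaction
        conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?, ?)", records)


def bytes_per_day(conn: sqlite3.Connection) -> dict:
    """Total bytes sent per day, across as many days as have been loaded."""
    rows = conn.execute(
        "SELECT day, SUM(bytes_sent) FROM access_log GROUP BY day ORDER BY day"
    )
    return dict(rows.fetchall())
```

Batching the inserts is one simple way to load records asynchronously, after the requests themselves have been served.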

Why do Log Analysis? (Traffic)


Optimize content or ad pricing or positioning, assess popularity of site areas/features
  Most popular pages
  Top entry point pages
Billing in hosting environment or resource allocation in enterprise environment
  Most active domains
Search engine activity
  Indexing and query frequency
Campaign tracking
  Top referring sites/domains/URLs
  Time/event based spikes or dips
Audience analysis
  IP geography, language preference, client host type (.com, .edu, .org, etc.)
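
Two of the traffic reports above, most popular pages and top referring domains, reduce to counting. The sketch below assumes each log entry has already been parsed into a dict with hypothetical keys url and referer.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_pages(entries, n: int = 3):
    """Most requested URLs with their request counts."""
    return Counter(entry["url"] for entry in entries).most_common(n)


def top_referring_domains(entries, n: int = 3):
    """Most common referring host names, ignoring empty or '-' referers."""
    domains = []
    for entry in entries:
        referer = entry.get("referer") or "-"
        hostname = urlsplit(referer).hostname if referer != "-" else None
        if hostname:
            domains.append(hostname)
    return Counter(domains).most_common(n)
```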

Why do Log Analysis? (QoS)


Optimize first views, adjust site structure
  Top entry point pages
Adjust for browser capabilities
  User agents
Identify points of failure
  Error codes and counts (404, 500)
Identify navigation patterns and frequent exit points
  IP, referrer and cookie tracking
  Not easy to do, but maybe worth the effort for finding out if users are aborting an application path early

Why do Log Analysis? (Security)


Identify leeching or scraping activity
  Most requested files
  IP, referrer and cookie tracking
  Entry point pages
  Bandwidth utilization
Track sources and methods of reconnaissance attempts, exploits and attacks
  Error codes
  Attempted access of shells, scripts, etc.
  Attack and worm signatures
  Long/malformed request URLs
  Unusually large request entities (POST)
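
Several of these checks can be sketched as simple rules applied to each parsed log entry. The thresholds and signature strings below are illustrative assumptions, not a vetted rule set, and the entry keys (url, method, request_bytes) are hypothetical.

```python
MAX_URL_LENGTH = 2048          # assumed threshold for a "long" URL
MAX_POST_BYTES = 1_000_000     # assumed threshold for a "large" POST entity
PROBE_MARKERS = ("cmd.exe", "/etc/passwd", "../")  # illustrative signatures only


def suspicious_reasons(entry: dict) -> list:
    """Return the reasons, if any, that a log entry deserves a closer look."""
    reasons = []
    url = entry["url"]
    if len(url) > MAX_URL_LENGTH:
        reasons.append("long URL")
    if any(marker in url.lower() for marker in PROBE_MARKERS):
        reasons.append("probe signature")
    if entry["method"] == "POST" and entry.get("request_bytes", 0) > MAX_POST_BYTES:
        reasons.append("large POST")
    return reasons
```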

Why do Log Analysis? (Performance)


Verify or update Web server sizing estimates by using actual data
Issue or verify bandwidth bills
  Bytes sent (within given time frame)
  Request frequencies, especially peaks and valleys over given periods of time
Assess caching efficiency
  Harder to do but possible by looking at (dependent) requests per page and 304 response codes
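
Bytes sent per time frame and the share of 304 responses can both be computed from parsed entries. This sketch assumes each entry is a dict with hypothetical keys time (an ISO 8601 string), bytes and status.

```python
from collections import defaultdict


def bytes_per_hour(entries) -> dict:
    """Sum bytes sent for each hour, keyed like '2024-01-15T10'."""
    totals = defaultdict(int)
    for entry in entries:
        hour = entry["time"][:13]  # 'YYYY-MM-DDTHH' prefix of the timestamp
        totals[hour] += entry["bytes"]
    return dict(totals)


def not_modified_ratio(entries) -> float:
    """Fraction of requests answered with 304 Not Modified."""
    entries = list(entries)
    if not entries:
        return 0.0
    hits = sum(1 for entry in entries if entry["status"] == 304)
    return hits / len(entries)
```

A 304 means the client already held a valid copy, so a higher ratio suggests that client-side caching is saving bandwidth, although requests served entirely from cache never appear in the log at all.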

Statistics, Tracking, Reporting


Basic concepts
  Counting hits versus counting page views
    Distinguishing page views from hits
      File name
      File type
      Web server response code (to exclude errors)
      Client host (if excluding internals)
  Counting unique visitors
    Sets of page views attributable to one user
    MUCH harder to do and IMPOSSIBLE TO DO RELIABLY, no matter what anyone tells you
    Requires a unique identifier to serve as a proxy for physical presence of the virtual visitor
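
Separating page views from hits can be sketched as a filter on file type, response code and client host. The set of page extensions and the internal address prefix are assumptions that a real site would tune, and the entry keys are hypothetical.

```python
import posixpath

PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}  # assumed "page" file types
INTERNAL_PREFIXES = ("10.",)                     # assumed internal client range


def is_page_view(entry: dict) -> bool:
    """True if the hit looks like a successful page request from an outside client."""
    path = entry["url"].split("?", 1)[0]
    extension = posixpath.splitext(path)[1].lower()
    return (
        extension in PAGE_EXTENSIONS
        and 200 <= entry["status"] < 300
        and not entry["client"].startswith(INTERNAL_PREFIXES)
    )


def count_hits_and_page_views(entries) -> tuple:
    entries = list(entries)
    return len(entries), sum(1 for entry in entries if is_page_view(entry))
```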

Statistics, Tracking, Reporting, cont.


Counting Unique Visitors, continued
  Client IP is easiest identifier to use, but also least reliable
    Dynamic IPs, proxies with NAT
  Login is highly reliable (except for sharing) but limited in applicability to sites/sections where it won't discourage users
  Cookies (transparently placed) are the best all-purpose compromise, but still have limits
    Must have backup if disabled on client
    Still not guaranteed to be persistent
    Bound to machine rather than user
    Cannot be shared across domains
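
Transparent cookie placement can be sketched with the standard http.cookies module: reuse the identifier if the browser sent one, otherwise mint a new one and return a Set-Cookie value. The cookie name visitor_id and the one-year lifetime are assumptions for illustration.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"          # hypothetical cookie name
ONE_YEAR_SECONDS = 365 * 24 * 3600


def identify_visitor(cookie_header):
    """Return (visitor id, Set-Cookie value or None) for a request's Cookie header."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if COOKIE_NAME in cookies:
        return cookies[COOKIE_NAME].value, None  # returning visitor
    # New (or cookie-less) visitor: issue a fresh random identifier.
    visitor_id = uuid.uuid4().hex
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["path"] = "/"
    outgoing[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    return visitor_id, outgoing[COOKIE_NAME].OutputString()
```

A client that refuses cookies gets a new identifier on every request, which is why a backup identifier is still needed.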

Statistics, Tracking, Reporting, cont.


Be aware of limitations and caveats when counting requests and page views
  Browser and proxy caching stop requests from ever reaching the server and its logs, deflating actual page views by actual users
    Can be partially mitigated by use of HTTP cache control headers, but this is neither guaranteed to work nor cost-free in bandwidth terms
    A good compromise is to flag pages as non-cacheable but take advantage of caching for relatively persistent images
  Request counts will also be inflated by bot and script activity (desirable or undesirable)
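
The compromise described above can be sketched as a function that picks a Cache-Control header by resource type. The extension list and the one-day lifetime for images are assumptions; no-cache still allows a stored copy but forces revalidation, so page requests keep reaching the server and its logs.

```python
import posixpath

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}  # assumed image types


def cache_headers(url_path: str, image_max_age: int = 86400) -> dict:
    """Choose response headers: cacheable images, revalidated pages."""
    extension = posixpath.splitext(url_path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        # Relatively persistent images may be cached by browsers and proxies.
        return {"Cache-Control": f"public, max-age={image_max_age}"}
    # Pages must be revalidated with the server before each reuse.
    return {"Cache-Control": "no-cache"}
```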

Statistics, Tracking, Reporting, cont.


Tracking the elusive Visit
  How long a unique visitor spends on the site before exiting
  The concept has tremendous potential utility for marketing and quality of service analysis
  Stateless nature of HTTP makes it next to impossible to determine a 100% accurate visit time
  What is commonly done is to use a rule of thumb: a visit is a series of page requests by a visitor without 30 consecutive minutes of inactivity (a common assumption)
  Maybe JavaScript can be used to help though?
  Otherwise remember that session length is totally arbitrary, so you can make visit averages longer or shorter by adjusting this assumption
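
The 30 minute rule of thumb can be sketched as a function that splits one visitor's request times into visits. The function name is an assumption, and the timeout is a parameter to show how much the result depends on it.

```python
from datetime import datetime, timedelta


def split_into_visits(timestamps, timeout: timedelta = timedelta(minutes=30)) -> list:
    """Group one visitor's request times into visits separated by idle gaps."""
    visits = []
    for moment in sorted(timestamps):
        # Continue the current visit only if the gap is shorter than the timeout.
        if visits and moment - visits[-1][-1] < timeout:
            visits[-1].append(moment)
        else:
            visits.append([moment])
    return visits
```

Running the same request times through a 60 minute timeout can merge visits that the 30 minute rule keeps apart, which is the arbitrariness the slide warns about.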

Analytics Beyond Logs


JavaScript Analytics (e.g., Google Analytics) is far more popular than log based analysis
  Pros: degree of detail, no log access problems, easy for Web developers to add tracking, consolidation of tracking data
  Cons: requires JavaScript (execution limits, particularly with bots; failures; load time concerns)

Network tap based systems also exist, which provide insight into delivery
Given the three sides of the Web equation, one wonders if this isn't again a question not of one versus another but of working together for a full view

Dealing with Bots and Spiders


Automated User Agents
  Bots, Robots, Crawlers, Spiders, etc.
  Most are capable of automated site traversal
Bots come in both benign and malign forms
  Search engine indexers, link checkers, monitors
  Spam bots, leechers & scrapers, attack bots
Benign bots usually (not always!) announce themselves with unique User Agent headers
  Frequently updated lists of common search agent bots are available online
  googlebot and other well-known variations
Benign bots are usually (not always!) well-behaved
  Crawl at rates well below DoS levels
  Obey Robot exclusion directives
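
Since benign bots usually announce themselves, a first-pass classification can look for known tokens in the User-Agent header. The token lists below are short illustrative assumptions, not a maintained registry, and a malign bot can send any string it likes.

```python
KNOWN_BOT_TOKENS = ("googlebot", "bingbot")       # illustrative, not complete
GENERIC_BOT_HINTS = ("bot", "crawler", "spider")  # assumed generic markers


def classify_user_agent(user_agent: str) -> str:
    """Label a User-Agent string by what it claims to be."""
    lowered = user_agent.lower()
    if any(token in lowered for token in KNOWN_BOT_TOKENS):
        return "known bot"
    if any(hint in lowered for hint in GENERIC_BOT_HINTS):
        return "self-declared bot"
    return "unclassified"
```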

Special Handling for Search Agents


What to do about indexing bots and dynamic pages?
  May need to exclude them to prevent indexing of content that will vary per user or request
  May need to provide spider-friendly versions of dynamic pages to expose content to desired indexing
  Alternate, search-optimized pages can be helpful, but proceed with caution! That could be black hat
    Bots can impersonate UAs to prevent/punish spamming (bait pages, stealth)
    Content should not vary, only presentation

Using the Robot Exclusion Protocol


Place a robots.txt file in the site's document root
  Careful: don't show your soft spots? Or maybe you want to?

Well-behaved bots will request this first, and obey its directives
  #sample robots.txt file
  User-Agent: *
  Disallow: /newtoday
  Disallow: /downloads
  User-Agent: newsbot
  Disallow: /pressreleases
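
How a well-behaved bot reads these directives can be sketched with urllib.robotparser from the Python standard library. The rules are those of the sample file, with the last path assumed to be /pressreleases; the agent name otherbot and the URLs in the checks are hypothetical.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
#sample robots.txt file
User-Agent: *
Disallow: /newtoday
Disallow: /downloads
User-Agent: newsbot
Disallow: /pressreleases
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
# A bot with its own record follows only that record, not the "*" rules.
newsbot_may_read_news = rules.can_fetch("newsbot", "/newtoday/story.html")
otherbot_may_read_news = rules.can_fetch("otherbot", "/newtoday/story.html")
```

Note that the file is purely advisory: it tells cooperative bots what to skip and tells everyone else where the interesting paths are.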

Beyond the Robot Exclusion Protocol


For controlling unfriendly bots, robot exclusion is insufficient
Access control is hard to do, since neither IP ranges nor User Agents are reliable identifiers of unfriendlies
Access control based on traversal pattern and rate is possible
  Using IP and request path against time elapsed, it should be possible to identify a traversal and dynamically block it
  Nontrivial to program and subject to countermeasures if it catches on
  Passive bot detection is certainly an open area for research; notice all the CAPTCHAs as well
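
Rate-based traversal detection can be sketched as a sliding window per client IP that counts distinct paths requested. The class name, thresholds and blocking decision are assumptions; a production version would also need to expire idle clients and cope with many users behind one address.

```python
from collections import defaultdict, deque


class TraversalDetector:
    """Flag a client that requests too many distinct paths within a time window."""

    def __init__(self, max_paths: int = 20, window_seconds: float = 10.0):
        self.max_paths = max_paths
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> deque of (time, path)

    def observe(self, ip: str, path: str, now: float) -> bool:
        """Record a request; return True if this client should be blocked."""
        recent = self._recent[ip]
        recent.append((now, path))
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0][0] > self.window_seconds:
            recent.popleft()
        distinct_paths = {seen_path for _, seen_path in recent}
        return len(distinct_paths) > self.max_paths
```

A bot that slows down to stay under the threshold will not be caught, which is the kind of countermeasure the slide anticipates.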

Server and Site Monitoring


Monitoring Site Availability
  Content monitors request portions of key pages and compare actual to expected results to verify that site is alive and working properly
  Application monitors submit form data and analyze result to verify backend systems are up
Monitoring Server Uptime
  Service monitors warn when services go down or become unreachable
  Automated restart can be attempted
All monitors usually alert via email, pager, SMS
Thresholds can be set to allow for transient errors & delays, or warn of degrading performance
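
A content monitor with a failure threshold can be sketched as follows. The class and function names are assumptions; the fetch function is passed in so the comparison and threshold logic can be exercised without a network, and fetch_over_http shows one possible real implementation.

```python
import urllib.request


def fetch_over_http(url: str, timeout: float = 5.0) -> str:
    """Fetch a page body; raises OSError (for example URLError) on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class ContentMonitor:
    """Alert only after several consecutive checks fail, to ride out transient errors."""

    def __init__(self, fetch, expected: str, failure_threshold: int = 3):
        self._fetch = fetch
        self._expected = expected
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self, url: str) -> bool:
        """Run one check; return True if an alert should be raised."""
        try:
            healthy = self._expected in self._fetch(url)
        except OSError:
            healthy = False
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold
```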

Server and Site Monitoring, cont.


More active monitoring is also possible
Can be useful especially in testing and diagnostic situations
  Process monitors allow for isolation of specific processes to pinpoint trouble spots, especially resource bottlenecks and leaks
  Performance monitors, especially in conjunction with stress tools that simulate traffic, help in accurate dimensioning
  Network monitors allow examination of packet level data and protocol details for uncovering connection related problems

Server Tuning
Many recommended optimizations are highly specific to Web server vendor/version
Some common elements
  Disable reverse DNS lookups in logging
  Shorten connection timeouts (trades some bandwidth for server resources)
  Remove unneeded server API modules
  Minimize other application overhead
  Optimize process & thread pools and limits
