Understanding the HTTP Protocol

We’ve seen a number of network protocols so far.

Each of them consisted of a set of allowed commands and possible responses, messages that are passed from client to server and then from server to client.

In each protocol we’ve seen so far, these commands and responses have been delimited.

And for each of the protocols we’ve seen so far, that delimiter has been consistent: <CRLF>

A further consistency is shared between these protocols. In each case we’ve seen so far, the client is responsible for initiating the interaction. The server is passive until it receives some sort of request.

The HTTP Protocol

HTTP is no different.

HTTP is also message-centered, with two-way communications we call requests and responses.

  • Requests (Asking for information)
  • Responses (Providing answers)

HTTP Request (Ask for information):

GET /index.html HTTP/1.1
Host: www.example.com
<CRLF>

HTTP Response (Provide answers):

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges:  none
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8
<CRLF>
<438 bytes of content>

HTTP Req/Resp Format

Both share a common basic format:

  • Line separators are <CRLF> (familiar, no?)
  • A required initial line (a command or a response code)
  • A (mostly) optional set of headers, one per line
  • A blank line
  • An optional body
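
The common format above can be sketched in a few lines of Python 3. This is only an illustration: the name split_message is made up for this example, and it assumes the whole message has already been read and decoded into a single string.

```python
CRLF = "\r\n"


def split_message(raw):
    """Split a raw HTTP message into (initial_line, header_lines, body)."""
    # The first blank line (two CRLFs in a row) separates the head from the body.
    head, _, body = raw.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    # The first line is the required initial line; the rest are headers.
    return lines[0], lines[1:], body


request = "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
initial_line, header_lines, body = split_message(request)
```

Here initial_line holds the request line, header_lines holds the single Host header, and body is empty.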

Let’s look at each a bit more closely.

HTTP Requests

In HTTP 1.0, the only required line in an HTTP request looks like this:

GET /path/to/index.html HTTP/1.0
<CRLF>

As virtual hosting grew more common, that was not enough, so HTTP 1.1 adds a single required header, Host:

GET /path/to/index.html HTTP/1.1
Host: www.mysite1.com:80
<CRLF>
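
As a rough sketch, a minimal HTTP 1.1 request like this one could be assembled in Python 3 as follows. The function name build_request is hypothetical, and the sketch assumes the method, URI and host contain only ASCII characters.

```python
def build_request(method, uri, host):
    """Build the bytes of a minimal HTTP/1.1 request with no body."""
    lines = [
        "{} {} HTTP/1.1".format(method, uri),  # the required initial line
        "Host: {}".format(host),               # the one header 1.1 requires
        "",                                    # joining these two empty strings
        "",                                    # yields the closing blank line
    ]
    return "\r\n".join(lines).encode("ascii")


request_bytes = build_request("GET", "/path/to/index.html", "www.mysite1.com:80")
```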

Every HTTP request must begin with a single line, broken by whitespace into three parts:

GET /path/to/index.html HTTP/1.1

The three parts are the method, the URI, and the protocol.
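
Splitting the request line into its three parts might look like this in Python 3. The name parse_request_line is made up for illustration, and raising ValueError for a malformed line is just one possible choice.

```python
def parse_request_line(line):
    """Return (method, uri, protocol) from the first line of a request."""
    parts = line.split()
    if len(parts) != 3:
        # A well-formed request line always has exactly three parts.
        raise ValueError("malformed request line: {!r}".format(line))
    method, uri, protocol = parts
    return method, uri, protocol


method, uri, protocol = parse_request_line("GET /path/to/index.html HTTP/1.1")
```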

Let’s look at each in turn.

HTTP Methods

GET /path/to/index.html HTTP/1.1

  • Every HTTP request must start with a method

  • There are four main HTTP methods:

    • GET
    • POST
    • PUT
    • DELETE
  • There are others, notably HEAD, but you won’t see them too much

These four methods can be mapped to the four basic steps (CRUD) of persistent storage:

  • POST = Create
  • GET = Read
  • PUT = Update
  • DELETE = Delete

Safe <–> Unsafe

HTTP methods can be categorized as safe or unsafe, based on whether they might change something on the server:

  • Safe HTTP Methods
    • GET
  • Unsafe HTTP Methods
    • POST
    • PUT
    • DELETE

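This is a normative distinction: the specification says how each method should behave, but nothing enforces it. A carelessly written server can change state in response to a GET, so be careful.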

Idempotent <–> Non-Idempotent

HTTP methods can be categorized as idempotent or non-idempotent, based on whether repeating a given request leaves the server in the same state as making it once:

  • Idempotent HTTP Methods
    • GET
    • PUT
    • DELETE
  • Non-Idempotent HTTP Methods
    • POST

Again, normative. The developer is responsible for ensuring that it is true.
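
A toy illustration of the difference, in Python 3. Everything here (the resources dictionary and the put and post functions) is hypothetical and stands in for real server state. It shows only why repeating a PUT is harmless while repeating a POST is not.

```python
import itertools

# Hypothetical in-memory "server state", for illustration only.
resources = {}
_ids = itertools.count(1)


def put(uri, data):
    """Update (or create) the resource at a URI the client already knows."""
    resources[uri] = data


def post(data):
    """Create a new resource; the server chooses the URI."""
    uri = "/items/{}".format(next(_ids))
    resources[uri] = data
    return uri


put("/items/a", "hello")
put("/items/a", "hello")   # repeating the PUT still leaves one resource
first = post("hello")
second = post("hello")     # repeating the POST creates a second resource
```

After these calls the store holds one resource from the two PUT requests and two from the two POST requests.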

HTTP Requests: URI

GET /path/to/index.html HTTP/1.1

  • Every HTTP request must include a URI used to determine the resource to be returned
  • URI or URL? See http://stackoverflow.com/questions/176264/whats-the-difference-between-a-uri-and-a-url/1984225#1984225
  • In static systems, the URI maps directly to a filesystem location on the server.
  • In dynamic systems, it may still do so (PHP, CGI)
    • It may also be used to determine what code object should be used to build a response
  • Static or dynamic, we call whatever that end point might be a resource
  • Resource? Files (html, img, .js, .css), but also:
    • Dynamic scripts
    • Raw data
    • API endpoints

In any server application, this job of connecting the URI requested to the appropriate end point is very important.
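
One way to picture that job is a small lookup that tries dynamic end points first and falls back to the filesystem. This is a sketch in Python 3 under stated assumptions: DOCUMENT_ROOT, ROUTES, hello, static_path and resolve are all invented for the example and do not come from any particular framework.

```python
import posixpath
from urllib.parse import unquote, urlsplit

DOCUMENT_ROOT = "/var/www"  # hypothetical location of static files


def static_path(uri, root=DOCUMENT_ROOT):
    """Map a URI onto a file path that always stays under root."""
    path = unquote(urlsplit(uri).path)
    # normpath collapses "." and ".." segments, so "/../x" becomes "/x".
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return root.rstrip("/") + normalized


def hello():
    """A hypothetical dynamic end point."""
    return "Hello, world"


ROUTES = {"/hello": hello}  # URI path -> code object that builds a response


def resolve(uri):
    """Return ("dynamic", callable) or ("static", file_path) for a URI."""
    path = urlsplit(uri).path
    if path in ROUTES:
        return ("dynamic", ROUTES[path])
    return ("static", static_path(uri))
```

Normalizing the path before joining it to the root is a deliberate choice: without it, a request for /../../etc/passwd could escape the document root.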

HTTP Responses

In both HTTP 1.0 and 1.1, a proper response consists of an initial line, followed by optional headers, a single blank line, and then optionally a response body:

HTTP/1.1 200 OK
Content-Type: text/plain
<CRLF>
this is a pretty minimal response

As with requests, the initial line of the response is strictly formatted, divided by whitespace into three parts: the protocol version, a response code, and an explanation.

HTTP Response Codes

HTTP/1.1 200 OK

All HTTP responses must include a response code indicating the outcome of the request.

  • 1xx (HTTP 1.1 only) - Informational message
  • 2xx - Success of some kind
  • 3xx - Redirection of some kind
  • 4xx - Client Error of some kind
  • 5xx - Server Error of some kind

The response code is a machine-readable number. The explanation that follows is meant as a way to make the responses more human-friendly.
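
A client could take the initial line of a response apart and classify the code roughly like this. The sketch is Python 3, and the names STATUS_CLASSES and parse_status_line are made up for illustration.

```python
# The first digit of the code determines its class.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line):
    """Return (protocol, code, explanation, status_class)."""
    # Split at most twice so a multi-word explanation stays in one piece.
    protocol, code, explanation = line.split(None, 2)
    code = int(code)
    return protocol, code, explanation, STATUS_CLASSES[code // 100]


result = parse_status_line("HTTP/1.1 404 Not Found")
```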

Common Response Codes

There are certain HTTP response codes you are likely to see (and use) most often:

  • 200 OK - Everything is good
  • 301 Moved Permanently - You should update your link
  • 304 Not Modified - You should load this from cache
  • 404 Not Found - You’ve asked for something that doesn’t exist
  • 500 Internal Server Error - Something bad happened

Do not be afraid to use other, less common codes in building good apps. There are a lot of them for a reason. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
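
In Python 3 the standard library already knows the standard codes and their explanations, through the HTTPStatus enumeration in the http module. A small sketch of building an initial line from a numeric code (the function name status_line is an invention for this example):

```python
from http import HTTPStatus


def status_line(code, protocol="HTTP/1.1"):
    """Build the initial line of a response from a numeric code."""
    status = HTTPStatus(code)  # raises ValueError for an unknown code
    return "{} {} {}".format(protocol, status.value, status.phrase)


not_found = status_line(404)
```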

HTTP Headers

After the required initial line of a response, the HTTP protocol allows the server to send additional information to the client in the form of headers.

In fact, both requests and responses can contain headers. So a client can also use them to send extra data to the server.

Headers take the form <Name>: <Value>

  • HTTP 1.0 has 16 valid headers, 1.1 has 46
  • Any number of spaces or tabs may separate the name from the value
  • If a header line starts with spaces or tabs, it is considered part of the value for the previous header
  • Header names are not case-sensitive, but values may be
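
The rules above can be sketched as a small parser in Python 3. The name parse_headers is hypothetical, and the sketch simplifies one thing: if a header appears more than once, the last value wins.

```python
def parse_headers(lines):
    """Turn a list of header lines into a dict keyed by lower-cased name."""
    headers = {}
    last_name = None
    for line in lines:
        if line[:1] in (" ", "\t") and last_name is not None:
            # A line starting with a space or tab continues the previous value.
            headers[last_name] += " " + line.strip()
            continue
        # Split on the first colon only; values may contain colons themselves.
        name, _, value = line.partition(":")
        # Names are not case-sensitive, so normalize them to lower case.
        last_name = name.strip().lower()
        # Any spaces or tabs around the value are dropped.
        headers[last_name] = value.strip()
    return headers


parsed = parse_headers([
    "Content-Type: text/html",
    "X-Note: first part",
    "   second part",
    "HOST:\twww.mysite1.com:80",
])
```

The X-Note header is an invented name, used only to show a continuation line.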

It’s well worth being familiar with the possible headers for HTTP. You can read more about them here.

Common HTTP Response Headers

There are a couple of headers we’ll talk about immediately, because they are so common.

The first is the Content-Type header. It tells the client how to treat the data that is being returned in the body of the response.

  • It uses a MIME type (MIME stands for Multipurpose Internet Mail Extensions)
  • foo.jpeg - Content-Type: image/jpeg
  • foo.png - Content-Type: image/png
  • bar.txt - Content-Type: text/plain
  • baz.html - Content-Type: text/html

There are many mime-type identifiers.

The Python standard library provides a module that helps in determining the mimetype of a given file. It’s called mimetypes.

Using it, you can guess the mime-type of a file based on the filename or map a file extension to a type:

>>> import mimetypes
>>> textfile = "/path/to/textfile.txt"
>>> mimetypes.guess_type(textfile)
('text/plain', None)
>>> import os
>>> text_extension = os.path.splitext(textfile)
>>> text_extension
('/path/to/textfile', '.txt')
>>> mimetypes.types_map[text_extension[1]]
'text/plain'
>>> imagefile = "/path/to/imagefile.png"
>>> mimetypes.guess_type(imagefile)
('image/png', None)
>>> image_extension = os.path.splitext(imagefile)
>>> image_extension
('/path/to/imagefile', '.png')
>>> mimetypes.types_map[image_extension[1]]
'image/png'
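
When the type cannot be guessed, guess_type returns None for the type, so a server needs a fallback. A minimal sketch in Python 3 follows. The helper name content_type_for is made up, and using application/octet-stream (generic binary data) as the default is an assumption about what the server should do.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Guess a Content-Type value for a file, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # guess_type returns (None, None) when it does not recognize the name.
    return guessed if guessed is not None else default


text_type = content_type_for("/path/to/textfile.txt")
unknown_type = content_type_for("/path/to/README")
```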

Another common HTTP header is the Date header. It represents the date and time that a response was generated.

The value for this header must be expressed in GMT, not local time, and has a very particular format:

Fri, 12 Feb 2010 16:23:03 GMT

The Python standard library also provides a way of getting exactly this format. Since the format is almost exactly the same as that required for email headers, this method is found in a slightly unexpected module:

>>> import email.utils
>>> email.utils.formatdate(usegmt=True)
'Fri, 12 Feb 2010 16:23:03 GMT'

A third common HTTP header is the Content-Length header, used to inform the client just how much data to expect in the body of a response.

Since HTTP does not specify a delimiter for a response body (unlike the SMTP, POP3 and IMAP protocols), this header is particularly important.

The value for the header should correspond to the number of bytes of data that will be returned. For binary files like images, calculating this value is quite straightforward:

>>> with open('Mars1.jpg', 'rb') as file_handle:
...     mars_image = file_handle.read()
...
>>> length = len(mars_image)
>>> length
1161387

However, when text is involved it gets a bit more complicated. Best practice in Python is to keep text that you are working with as unicode objects:

>>> body = u'éclaire'
>>> len(body)
7

Remember though that a socket can only transmit bytes, not decoded unicode objects, so in Python you must be sure that the content of the response body you send has been encoded:

>>> encoded = body.encode('utf-8')  # avoid shadowing the built-in name "bytes"
>>> len(encoded)
8

Notice that the length of the encoded byte string is longer than the decoded unicode string. This is because the encoded form of the é character is actually two bytes in length.

When sending text back to a client, it is best practice to include information about what codec was used to encode the bytes you send.

It’s tempting to think of the Content-Encoding header as the proper place to send this data, but in fact that is used to inform the client of compressed data (gzip or similar).

Instead, the correct way to inform the client of the encoding used is to append a charset=<name> parameter to the Content-Type header:

Content-Type: text/plain; charset=utf-8
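
Putting the three headers together, a complete text response might be assembled like this. It is a sketch in Python 3, not a definitive implementation: the name build_response is hypothetical, and the status is fixed at 200 OK to keep the example short.

```python
import email.utils


def build_response(body_text, content_type="text/plain", charset="utf-8"):
    """Build the bytes of a 200 OK response carrying a text body."""
    # Encode first: Content-Length counts bytes, not characters.
    body = body_text.encode(charset)
    lines = [
        "HTTP/1.1 200 OK",
        "Date: " + email.utils.formatdate(usegmt=True),
        "Content-Type: {}; charset={}".format(content_type, charset),
        "Content-Length: {}".format(len(body)),
        "",  # joining these two empty strings
        "",  # yields the blank line that ends the headers
    ]
    head = "\r\n".join(lines).encode("ascii")
    return head + body


response = build_response(u"éclaire")
```

For the seven-character string used above, the Content-Length header reports 8, matching the length of the encoded bytes.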