We’ve seen a number of network protocols so far.
Each of them consisted of a set of allowed commands and possible responses, messages that are passed from client to server and then from server to client.
In each protocol we’ve seen so far, these commands and responses have been delimited.
And in each of them, that delimiter has been the same: <CRLF>
A further consistency is shared between these protocols. In each case we’ve seen so far, the client is responsible for initiating the interaction. The server is passive until it receives some sort of request.
HTTP is no different.
HTTP is also message-centered, with two-way communications we call requests and responses.
HTTP Request (Ask for information):
GET /index.html HTTP/1.1
Host: www.example.com
<CRLF>
HTTP Response (Provide answers):
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: none
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8
<CRLF>
<438 bytes of content>
Both share a common basic format:
- a required initial line
- optional headers
- a single blank line (<CRLF>)
- an optional body
Let’s look at each a bit more closely.
In HTTP 1.0, the only required line in an HTTP request looks like this:
GET /path/to/index.html HTTP/1.0
<CRLF>
As virtual hosting grew more common, that was not enough, so HTTP 1.1 adds a single required header, Host:
GET /path/to/index.html HTTP/1.1
Host: www.mysite1.com:80
<CRLF>
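As a rough sketch of what a client puts on the wire, the helper below assembles that same HTTP/1.1 request as bytes. The name `build_get_request` is made up for illustration. Note that every line ends in CRLF and that an empty line marks the end of the headers.

```python
CRLF = "\r\n"


def build_get_request(host, path="/"):
    """Return a minimal HTTP/1.1 GET request as bytes (hypothetical helper)."""
    lines = [
        "GET {} HTTP/1.1".format(path),  # the required initial line
        "Host: {}".format(host),         # the one header HTTP/1.1 requires
        "",                              # empty line: end of the headers
        "",
    ]
    # Joining with CRLF terminates each line and adds the final blank line.
    return CRLF.join(lines).encode("ascii")


request = build_get_request("www.mysite1.com:80", "/path/to/index.html")
```

The resulting bytes could be handed directly to a connected socket's `sendall` method.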
Every HTTP request must begin with a single line, broken by whitespace into three parts:
GET /path/to/index.html HTTP/1.1
The three parts are the method, the URI, and the protocol.
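On the server side, splitting that first line apart might look something like the sketch below. The function name `parse_request_line` is hypothetical, and a real server would need to decide how to answer a malformed line rather than only raising an error.

```python
def parse_request_line(line):
    """Split an HTTP request line into (method, uri, protocol)."""
    # strip() drops the trailing CRLF; split() breaks on any whitespace.
    parts = line.strip().split()
    if len(parts) != 3:
        raise ValueError("malformed request line: {!r}".format(line))
    method, uri, protocol = parts
    return method, uri, protocol


method, uri, protocol = parse_request_line("GET /path/to/index.html HTTP/1.1\r\n")
```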
Let’s look at each in turn.
GET /path/to/index.html HTTP/1.1
Every HTTP request must start with a method.
There are four main HTTP methods:
- GET
- POST
- PUT
- DELETE
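There are others, notably HEAD, but you won’t see them nearly as often.
These four methods can be mapped to the four basic steps (CRUD) of persistent storage:
- POST = Create
- GET = Read
- PUT = Update
- DELETE = Delete
HTTP methods can be categorized as safe or unsafe, based on whether they might change something on the server:
- Safe: GET, HEAD
- Unsafe: POST, PUT, DELETE
This is a normative distinction: the protocol says how a method should behave, but nothing enforces it, so be careful.
HTTP methods can also be categorized as idempotent or not, based on whether repeating a given request will always have the same result:
- Idempotent: GET, HEAD, PUT, DELETE
- Not idempotent: POST
Again, this is normative. The developer is responsible for ensuring that it is true.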
GET /path/to/index.html HTTP/1.1
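The second part of the request line is the URI, which identifies the resource the client is asking for. In any server application, the job of connecting the requested URI to the appropriate endpoint is very important.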
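One very simple way to make that connection is a lookup table from URI to a function that produces the content. The sketch below is only an illustration: the handler functions, the `ROUTES` table and the `resolve` function are all invented names, and real applications usually match patterns rather than exact strings.

```python
def show_index():
    return "this is the index page"


def show_about():
    return "this is the about page"


# Hypothetical routing table: each known URI maps to its endpoint.
ROUTES = {
    "/index.html": show_index,
    "/about.html": show_about,
}


def resolve(uri):
    """Return (response code, body) for the requested URI."""
    handler = ROUTES.get(uri)
    if handler is None:
        # Nothing is registered for this URI.
        return 404, "Not Found"
    return 200, handler()
```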
In both HTTP 1.0 and 1.1, a proper response consists of an initial line, followed by optional headers, a single blank line, and then optionally a response body:
HTTP/1.1 200 OK
Content-Type: text/plain
<CRLF>
this is a pretty minimal response
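A minimal sketch of assembling such a response in Python follows. The helper name `build_response` is an assumption made for illustration, and it always answers 200 OK, which a real server would not do.

```python
def build_response(body, content_type="text/plain"):
    """Return a complete HTTP response as bytes; body must already be bytes."""
    head = "\r\n".join([
        "HTTP/1.1 200 OK",                        # initial line
        "Content-Type: {}".format(content_type),  # an optional header
        "",                                       # blank line ends the headers
        "",
    ]).encode("ascii")
    return head + body


response = build_response(b"this is a pretty minimal response")
```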
As with requests, the initial line of the response is strictly formatted. It is divided by whitespace into the protocol, a response code, and an explanation:
HTTP/1.1 200 OK
All HTTP responses must include a response code indicating the outcome of the request.
The response code is a machine-readable number. The explanation that follows is meant as a way to make the responses more human-friendly.
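There are certain HTTP response codes you are likely to see (and use) most often:
- 200 OK
- 301 Moved Permanently
- 304 Not Modified
- 400 Bad Request
- 404 Not Found
- 500 Internal Server Error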
Do not be afraid to use other, less common codes in building good apps. There are a lot of them for a reason. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
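In Python 3 the standard library already knows the registered codes and their explanations, through the `http.HTTPStatus` enumeration. The small helper below, whose name `status_line` is invented here, uses it to build the initial line of a response.

```python
from http import HTTPStatus


def status_line(code, protocol="HTTP/1.1"):
    """Build the initial response line for a numeric response code."""
    status = HTTPStatus(code)  # raises ValueError for an unknown code
    return "{} {} {}".format(protocol, status.value, status.phrase)


ok_line = status_line(200)
missing_line = status_line(404)
```

Here `ok_line` is `'HTTP/1.1 200 OK'` and `missing_line` is `'HTTP/1.1 404 Not Found'`.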
After the required initial line of a response, HTTP allows the server to send additional information to the client in the form of headers.
In fact, both requests and responses can contain headers. So a client can also use them to send extra data to the server.
Headers take the form <Name>: <Value>
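Because of that simple form, header lines are easy to pick apart. The sketch below assumes the lines have already been split on CRLF, and `parse_headers` is a hypothetical name. Header names are case-insensitive, so it lowercases them before storing.

```python
def parse_headers(lines):
    """Turn a list of '<Name>: <Value>' strings into a dictionary."""
    headers = {}
    for line in lines:
        # partition splits at the FIRST colon only, so a value such as
        # 'www.mysite1.com:80' keeps its own colon.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_headers([
    "Host: www.mysite1.com:80",
    "Content-Type: text/html; charset=UTF-8",
])
```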
It’s well worth being familiar with the possible headers for HTTP. You can read more about them here.
There are a couple of headers we’ll talk about immediately, because they are so common.
The first is the Content-Type header. It tells the client how to treat the data that is being returned in the body of the response.
There are many MIME type identifiers, such as text/html, text/plain and image/png.
The Python standard library provides a module that helps in determining the MIME type of a given file. It’s called mimetypes.
Using it, you can guess the MIME type of a file based on the filename or map a file extension to a type:
>>> import mimetypes
>>> textfile = "/path/to/textfile.txt"
>>> mimetypes.guess_type(textfile)
('text/plain', None)
>>> import os
>>> text_extension = os.path.splitext(textfile)
>>> text_extension
('/path/to/textfile', '.txt')
>>> mimetypes.types_map[text_extension[1]]
'text/plain'
>>> imagefile = "/path/to/imagefile.png"
>>> mimetypes.guess_type(imagefile)
('image/png', None)
>>> image_extension = os.path.splitext(imagefile)
>>> image_extension
('/path/to/imagefile', '.png')
>>> mimetypes.types_map[image_extension[1]]
'image/png'
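When the module cannot guess a type, `guess_type` returns `None` in the first position, so a server needs some fallback. The sketch below falls back to `application/octet-stream`, the generic type for arbitrary bytes. The function name `content_type_for` and the choice of default are assumptions for illustration.

```python
import mimetypes

DEFAULT_TYPE = "application/octet-stream"  # generic "arbitrary bytes" type


def content_type_for(filename):
    """Guess a Content-Type value for a file, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed if guessed is not None else DEFAULT_TYPE
```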
Another common HTTP header is the Date header. It represents the date and time that a response was generated.
The value for this header must be expressed in GMT, not local time, and has a very particular format:
Fri, 12 Feb 2010 16:23:03 GMT
The Python standard library also provides a way of getting exactly this format. Since the format is almost exactly the same as that required for email headers, this method is found in a slightly unexpected module:
>>> import email.utils
>>> email.utils.formatdate(usegmt=True)
'Fri, 12 Feb 2010 16:23:03 GMT'
A third common HTTP header is the Content-Length header, used to inform the client just how much data to expect in the body of a response.
Since HTTP does not specify a delimiter for a response body (unlike the SMTP, POP3 and IMAP protocols), this header is particularly important.
The value for the header should correspond to the number of bytes of data that will be returned. For binary files like images calculating this value is quite straightforward:
>>> with open('Mars1.jpg', 'rb') as file_handle:
...     mars_image = file_handle.read()
...
>>> length = len(mars_image)
>>> length
1161387
However, when text is involved it gets a bit more complicated. Best practice in Python is to keep text that you are working with as unicode objects (plain str in Python 3):
>>> body = u'éclaire'
>>> len(body)
7
Remember though that a socket can only transmit bytes, not decoded unicode objects, so in Python you must be sure that the content of the response body you send has been encoded:
>>> encoded = body.encode('utf-8')
>>> len(encoded)
8
Notice that the length of the encoded byte string is longer than the decoded unicode string. This is because the encoded form of the é character is actually two bytes in length.
When sending text back to a client, it is best practice to include information about what codec was used to encode the bytes you send.
It’s tempting to think of the Content-Encoding header as the proper place to send this data, but in fact that header is used to tell the client that the body has been compressed (with gzip or similar).
Instead, the correct way to inform the client of the encoding used is to append a charset=<name> parameter to the Content-Type header:
Content-Type: text/plain; charset=utf-8
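Putting the three headers together, a text response might be assembled roughly as follows. This is a sketch under stated assumptions: `build_text_response` is an invented name, the response is always 200 OK, and the key point is that Content-Length is computed from the encoded bytes rather than from the unicode text.

```python
import email.utils


def build_text_response(text, codec="utf-8"):
    """Encode text and wrap it in a response with matching headers."""
    body = text.encode(codec)  # sockets carry bytes, so encode first
    head_lines = [
        "HTTP/1.1 200 OK",
        "Date: {}".format(email.utils.formatdate(usegmt=True)),
        "Content-Type: text/plain; charset={}".format(codec),
        "Content-Length: {}".format(len(body)),  # bytes, not characters
    ]
    head = "\r\n".join(head_lines) + "\r\n\r\n"
    return head.encode("ascii") + body


response = build_text_response(u"éclaire")
```

For the seven-character string used above, the Content-Length header reports 8, matching the encoded length shown earlier.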