******************************* Understanding the HTTP Protocol ******************************* We've seen a number of network protocols so far. Each of them consisted of a set of allowed commands and possible responses, *messages* that are passed from client to server and then from server to client. In each protocol we've seen so far, these commands and responses have been *delimited* And for each of the protocols we've seen so far, that delimiter has been consistent: A further consistency is shared between these protocols. In each case we've seen so far, the *client* is responsible for initiating the interaction. The *server* is passive until it receives some sort of request. The HTTP Protocol ================= HTTP is no different HTTP is also message-centered, with two-way communications we call *requests* and *responses*. * Requests (Asking for information) * Responses (Providing answers) HTTP Request (Ask for information):: GET /index.html HTTP/1.1 Host: www.example.com HTTP Response (Provide answers):: HTTP/1.1 200 OK Date: Mon, 23 May 2005 22:38:34 GMT Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux) Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT Etag: "3f80f-1b6-3e1cb03b" Accept-Ranges: none Content-Length: 438 Connection: close Content-Type: text/html; charset=UTF-8 <438 bytes of content> HTTP Req/Resp Format -------------------- Both share a common basic format: * Line separators are (familiar, no?) * A *required* initial line (a command or a response code) * A *(mostly) optional* set of headers, one per line * A blank line * An *optional* body Let's look at each a bit more closely. HTTP Requests ============= In HTTP 1.0, the only required line in an HTTP request looks like this:: GET /path/to/index.html HTTP/1.0 As virtual hosting grew more common, that was not enough, so HTTP 1.1 adds a single required *header*, **Host**: GET /path/to/index.html HTTP/1.1 Host: www.mysite1.com:80 Every HTTP request **must** begin with a single line, broken by whitespace into three parts:: GET /path/to/index.html HTTP/1.1 The three parts are the *method*, the *URI*, and the *protocol* Let's look at each in turn. HTTP Methods ------------ **GET** ``/path/to/index.html HTTP/1.1`` * Every HTTP request must start with a *method* * There are four main HTTP methods: * GET * POST * PUT * DELETE * There are others, notably HEAD, but you won't see them too much These four methods can be mapped to the four basic steps (*CRUD*) of persistent storage: * POST = Create * GET = Read * PUT = Update * DELETE = Delete Safe <--> Unsafe ---------------- HTTP methods can be categorized as **safe** or **unsafe**, based on whether they might *change something* on the server: * Safe HTTP Methods * GET * Unsafe HTTP Methods * POST * PUT * DELETE This is a *normative* distinction, which is to say **be careful** Idempotent <--> Non-Idempotent ------------------------------ HTTP methods can be categorized as **idempotent**, based on whether a given request will *always* have the same result: * Idempotent HTTP Methods * GET * PUT * DELETE * Non-Idempotent HTTP Methods * POST Again, *normative*. The developer is responsible for ensuring that it is true. HTTP Requests: URI ------------------ ``GET`` **/path/to/index.html** ``HTTP/1.1`` * Every HTTP request must include a **URI** used to determine the **resource** to be returned * URI?? http://stackoverflow.com/questions/176264/whats-the-difference-between-a-uri-and-a-url/1984225#1984225 * In static systems, the URI maps directly to a filesystem location on the server. * In dynamic systems, it may still do so (PHP, CGI) * It may also be used to determine what code object should be used to build a response * Static or dynamic, we call whatever that end point might be a *resource* * Resource? Files (html, img, .js, .css), but also: * Dynamic scripts * Raw data * API endpoints In any server application, this job of connecting the URI requested to the appropriate end point is very important. HTTP Responses ============== In both HTTP 1.0 and 1.1, a proper response consists of an intial line, followed by optional headers, a single blank line, and then optionally a response body:: HTTP/1.1 200 OK Content-Type: text/plain this is a pretty minimal response As with requests, the initial line of the response is strictly formatted, divided by whitespace into a *response code* and an *explanation* HTTP Response Codes ------------------- ``HTTP/1.1`` **200 OK** All HTTP responses must include a **response code** indicating the outcome of the request. * 1xx (HTTP 1.1 only) - Informational message * 2xx - Success of some kind * 3xx - Redirection of some kind * 4xx - Client Error of some kind * 5xx - Server Error of some kind The response code is a machine-readable number. The explanation that follows is meant as a way to make the responses more human-friendly. Common Response Codes --------------------- There are certain HTTP response codes you are likely to see (and use) most often: * ``200 OK`` - Everything is good * ``301 Moved Permanently`` - You should update your link * ``304 Not Modified`` - You should load this from cache * ``404 Not Found`` - You've asked for something that doesn't exist * ``500 Internal Server Error`` - Something bad happened Do not be afraid to use other, less common codes in building good apps. There are a lot of them for a reason. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html HTTP Headers ------------ After the required initial line of a response, the HTTP protocol allows the server to send additional information to the client in the form of **headers** In fact, both requests and responses can contain headers. So a client can also use them to send extra data to the server. Headers take the form ``: `` * HTTP 1.0 has 16 valid headers, 1.1 has 46 * Any number of spaces or tabs may separate the *name* from the *value* * If a header line starts with spaces or tabs, it is considered part of the *value* for the previous header * Header *names* are **not** case-sensitive, but *values* may be It's well worth being familiar with the possible headers for HTTP. You can `read more about them here`_. .. _read more about them here: http://www.cs.tut.fi/~jkorpela/http.html Common HTTP Response Headers ---------------------------- There are a couple of headers we'll talk about immediately, because they are so common. The first is the ``Content-Type`` header. It tells the client how to treat the data that is being returned in the body of the response. * it uses **mime-type** (Multi-purpose Internet Mail Extensions) * foo.jpeg - ``Content-Type: image/jpeg`` * foo.png - ``Content-Type: image/png`` * bar.txt - ``Content-Type: text/plain`` * baz.html - ``Content-Type: text/html`` There are *many* `mime-type identifiers`_. .. _mime-type identifiers: http://www.webmaster-toolkit.com/mime-types.shtml: The Python standard library provides a module that helps in determining the mimetype of a given file. It's called ``mimetypes``. Using it, you can guess the mime-type of a file based on the filename or map a file extension to a type: .. code-block:: pycon >>> textfile = "/path/to/textfile.txt" >>> mimetypes.guess_type(textfile) ('text/plain', None) >>> import os >>> text_extension = os.path.splitext(textfile) >>> text_extension ('/path/to/textfile', '.txt') >>> mimetypes.types_map[text_extension[1]] 'text/plain' >>> imagefile = "/path/to/imagefile.png" >>> mimetypes.guess_type(imagefile) ('image/png', None) >>> image_extension = os.path.splitext(imagefile) >>> image_extension ('/path/to/imagefile', '.png') >>> mimetypes.types_map[image_extension[1]] 'image/png' Another common HTTP header is the ``Date`` header. It represents the date and time that a response was generated. The value for this header must be expressed in GMT, not local time, and has a very particular format:: Fri, 12 Feb 2010 16:23:03 GMT The Python standard library also provides a way of getting exactly this format. Since the format is almost exactly the same as that required for email headers, this method is found in a slightly unexpected module: .. code-block:: pycon >>> import email.utils >>> email.utils.formatdate(usegmt=True) 'Fri, 12 Feb 2010 16:23:03 GMT' A third common HTTP header is the ``Content-Length`` header, used to inform the client just how much data to expect in the body of a response. Since HTTP does not specify a delimiter for a response body (unlike the SMTP, POP3 and IMAP protocols), this header is particularly important. The value for the header should correspond to the number of bytes of data that will be returned. For binary files like images calculating this value is quite straightforward: .. code-block:: pycon >>> with open('Mars1.jpg', 'rb') as file_handle: ... mars_image = file_handle.read() ... >>> length = len(mars_image) >>> length 1161387 However, when text is involved it gets a bit more complicated. Best practice in Python is to keep text that you are working with as ``unicode`` objects: >>> body = u'éclaire' >>> len(body) 7 Remember though that a socket can **only** transmit bytes, not decoded unicode objects, so in Python you must be sure that the content of the response body you send has been encoded: >>> bytes = body.encode('utf-8') >>> len(bytes) 8 Notice that the length of the encoded byte string is *longer* than the decoded unicode string. This is because the encoded form of the ``é`` character is actually *two bytes* in length. When sending text back to a client, it is best practice to include information about what *codec* was used to encode the bytes you send. It's tempting to think of the ``Content-Encoding`` header as the proper place to send this data, but in fact that is used to inform the client of *compressed* data (.zip or similar). Instead, the correct way to inform the client of the encoding used is to append a ``charset `` value to the ``Content-Type`` header:: Content-Type: text/plain; charset=utf-8