Scraping web pages is tedious and inherently brittle.
If the owner of the website updates their layout, your code breaks.
But there is another way to get information from the web in a more normalized fashion.
Web Services
A web service is described by the W3C as:
“a software system designed to support interoperable machine-to-machine interaction over a network”
In general, a web service provides a defined set of functionality and returns structured data in response to well-formed requests.
RSS is one of the earliest forms of Web Services.
A single web-based endpoint provides a dynamically updated listing of content.
RSS is implemented in pure HTTP. The return value is an XML document.
Atom is a competing, but similar standard.
An RSS document might look something like this:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.someexamplerssdomain.com/main.html</link>
    <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
    <pubDate>Mon, 06 Sep 2009 16:45:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Example entry</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.wikipedia.org/</link>
      <guid>unique string per item</guid>
      <pubDate>Mon, 06 Sep 2009 16:45:00 +0000</pubDate>
    </item>
    ...
  </channel>
</rss>
In Python, the best tool for consuming RSS, Atom and other related feeds is feedparser. It’s not part of the standard library, but it is very well supported and commonly used.
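Since feedparser is a third-party package, here is a dependency-free sketch that reads a feed shaped like the sample above with the standard library's xml.etree.ElementTree instead. The trimmed feed string is an assumption made for illustration, and this approach is only reasonable for well-formed input you trust. Real-world feeds are messier, which is exactly why feedparser is the usual choice.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample feed, inlined so the sketch is self-contained.
RSS_SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <link>http://www.someexamplerssdomain.com/main.html</link>
    <item>
      <title>Example entry</title>
      <link>http://www.wikipedia.org/</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(RSS_SAMPLE)      # the <rss> element
channel = root.find("channel")        # RSS 2.0 has a single <channel>
feed_title = channel.findtext("title")

# Collect the title and link of every <item> in the channel.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.findall("item")
]

print(feed_title)
for entry in items:
    print(entry["title"], entry["link"])
```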
RSS provides a pre-defined data set. Can we also call procedures to get more dynamic data?
We can! Enter XML-RPC (Remote Procedure Call)
The Python standard library provides modules both for producing and for consuming XML-RPC services.
Services can be built by registering functions or even entire class instances with the SimpleXMLRPCServer class from the SimpleXMLRPCServer module.
The xmlrpclib module provides client methods to consume XML-RPC services.
It is worth noting that there are a number of known vulnerabilities in the XML standard and that xmlrpclib is not secured against them. When interacting with XML-RPC online, it behooves you to pay close attention to the sources of your data.
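To make the register-then-call pattern concrete, here is a minimal sketch that runs a server and a client in the same process. The add function, the loopback address and the use of port 0 (let the OS pick a free port) are all assumptions made for illustration. The imports try the Python 3 module names (xmlrpc.server, xmlrpc.client) first and fall back to the Python 2 names used in these notes.

```python
import threading

try:  # Python 3 names
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy
except ImportError:  # Python 2 names, as used in these notes
    from SimpleXMLRPCServer import SimpleXMLRPCServer
    from xmlrpclib import ServerProxy


def add(a, b):
    """A hypothetical procedure to expose over XML-RPC."""
    return a + b


# Bind to the loopback interface only; port 0 asks the OS for a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
host, port = server.server_address

# Serve requests in a background thread so the client can run below.
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

# The client side: attribute access on the proxy becomes a remote call.
proxy = ServerProxy("http://%s:%d/" % (host, port))
result = proxy.add(2, 3)
print(result)

server.shutdown()
server.server_close()
```

The proxy call looks like an ordinary function call, but the arguments and the return value travel as XML in the body of an HTTP POST.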
You have a brief XML-RPC walkthrough assignment that will help you to familiarize yourself with this form of Web Services.
Enter SOAP: Simple Object Access Protocol
SOAP extends XML-RPC in a couple of useful ways:
There is no standard library module that supports SOAP directly. The best third-party option has been suds.
HOWEVER
The source for suds has not seen a commit since February 13, 2012, and the last release of the package to PyPI was even longer ago, in 2010.
I have recently discovered that there is an updated version of this package available. It’s called suds-jurko after the username of the developer who stepped up to lead development forward.
I cannot speak to the quality of this package, but it does show a lot of activity, with the most recent commit less than one week ago, and the most recent release coming just last month (Jan 2014).
I’m not going to bother providing an exercise in SOAP. I’ve never had an interaction with a SOAP service that was a positive experience, and if you have any other option, you should avoid it.
If you do have to use a SOAP web service, prepare yourself for some significant time spent trying to get things to work properly.
SOAP (and XML-RPC) have some issues.
Moreover, for SOAP suds is the best we have, and it hasn’t been updated since Sept. 2010 by its official owner. What does that say about the utility of the standard? (At least as perceived by the Python community?)
So, if not XML, then what data format should one use to provide responses to Web Service calls?
JavaScript Object Notation:
JSON is based on two structures:
These both end up looking quite suspiciously like code we are familiar with.
In addition JSON provides a few basic data types (see http://json.org/):
Note that the boolean values must be typed as all lowercase letters. This is JavaScript, not Python. It’s an easy mistake to make.
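A quick sketch of how the json module in the standard library handles that spelling difference. The sample list is arbitrary; the point is only that Python's True, False and None come out as lowercase true, false and null, and that a Python-style capitalized True is not valid JSON.

```python
import json

# Python literals are translated to their JSON spellings on the way out.
encoded = json.dumps([True, False, None, 1, 1.5, "text"])
print(encoded)

# Going the other way, a capitalized "True" is rejected by the parser.
try:
    json.loads("True")
    accepted = True
except ValueError:
    accepted = False
print(accepted)
```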
Also note that there is no representation of a date type in JSON.
No date type? OMGWTF??!!1!1
You have a couple of options for how to deal with dates and time in JSON.
Option 1 - Unix Epoch Time (number):
>>> import time
>>> time.time()
1358212616.7691269
Option 2 - ISO 8601 (string):
>>> import datetime
>>> datetime.datetime.now().isoformat()
'2013-01-14T17:18:10.727240'
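Either option has to be converted by hand on both sides of the wire. The sketch below round-trips one timestamp through JSON each way. The "created" key and the fixed timestamp are assumptions for illustration, and the naive datetime is treated as UTC, which you would need to agree on with whoever consumes your data.

```python
import calendar
import datetime
import json

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"
EPOCH = datetime.datetime(1970, 1, 1)

stamp = datetime.datetime(2013, 1, 14, 17, 18, 10, 727240)

# Option 1: whole seconds since the Unix epoch, sent as a JSON number.
# Sub-second precision is dropped here; the value is treated as UTC.
as_number = json.dumps({"created": calendar.timegm(stamp.timetuple())})
seconds = json.loads(as_number)["created"]
from_number = EPOCH + datetime.timedelta(seconds=seconds)

# Option 2: an ISO 8601 string. Note that isoformat() omits the fractional
# part when microseconds are zero, so this format string would not match.
as_string = json.dumps({"created": stamp.isoformat()})
from_string = datetime.datetime.strptime(
    json.loads(as_string)["created"], ISO_FORMAT
)

print(from_number)
print(from_string)
```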
You can encode Python objects to JSON, and decode JSON back to Python:
>>> import json
>>> array = [1,2,3]
>>> json.dumps(array)
'[1, 2, 3]'
>>> orig = {'foo': [1,2,3], 'bar': u'my resumé', 'baz': True}
>>> encoded = json.dumps(orig)
>>> encoded
'{"baz": true, "foo": [1, 2, 3], "bar": "my resum\\u00e9"}'
>>> decoded = json.loads(encoded)
>>> decoded == orig
True
Notice that by default, non-ASCII characters in unicode strings are escaped as ASCII-compatible \uXXXX sequences.
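If you would rather keep the characters as they are, json.dumps accepts an ensure_ascii keyword argument. A small sketch, with the sample string written using an escape so the source file itself stays ASCII:

```python
import json

text = u"my resum\u00e9"

# Default: the non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(text)

# ensure_ascii=False leaves the character in the output unescaped.
literal = json.dumps(text, ensure_ascii=False)

print(escaped)

# Both forms decode back to the same string.
same = json.loads(escaped) == json.loads(literal) == text
print(same)
```

If you turn escaping off, the encoding you use when you write the result to a file or a socket matters, so be explicit about it.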
Also note that encoding a Python tuple to JSON and then decoding it back to Python cannot produce a tuple:
>>> foo = (1, 2, 3)
>>> encoded = json.dumps(foo)
>>> encoded
'[1, 2, 3]'
>>> bar = json.loads(encoded)
>>> bar
[1, 2, 3]
>>> bar == foo
False
>>>
This happens because JSON, like JavaScript, has no equivalent of Python's immutable tuple type. A tuple is encoded as a JSON array, and an array always decodes to a list.
In addition to loads and dumps, json provides load and dump. The difference between these functions is in what argument they accept as the value to be decoded or encoded.
The json.loads function takes a string. But json.load accepts any object which is file-like, meaning that it has a .read() method, and reads the JSON text from it.
The json.dumps function directly returns a string. But json.dump requires a second positional argument which must be a file-like object, something with a .write() method. The value that would normally have been returned will instead be written to the provided object.
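A short sketch of the file-based pair, using a temporary file as the file-like object. The sample data is arbitrary.

```python
import json
import tempfile

data = {"foo": [1, 2, 3], "baz": True}

# "w+" opens the temporary file in text mode for both writing and reading.
with tempfile.TemporaryFile("w+") as handle:
    json.dump(data, handle)       # writes the JSON text, returns nothing
    handle.seek(0)                # rewind before reading it back
    restored = json.load(handle)  # reads from the file and decodes

print(restored == data)
```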
Both json.loads and json.dumps can take an optional cls keyword argument that allows you to pass a subclass of json.JSONDecoder or json.JSONEncoder, respectively, to be used instead of the standard class.
This can allow you to create powerful, customized encoders and decoders for specialized data types not recognized by the standard implementations:
>>> import json
>>> class ComplexEncoder(json.JSONEncoder):
...     def default(self, obj):
...         if isinstance(obj, complex):
...             return [obj.real, obj.imag]
...         # Let the base class default method raise the TypeError
...         return json.JSONEncoder.default(self, obj)
...
>>> json.dumps(2 + 1j, cls=ComplexEncoder)
'[2.0, 1.0]'
>>> ComplexEncoder().encode(2 + 1j)
'[2.0, 1.0]'
>>> list(ComplexEncoder().iterencode(2 + 1j))
['[', '2.0', ', ', '1.0', ']']
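The encoder above only covers half of the trip: a bare two-item list gives a decoder no way to know it was once a complex number. One possible sketch of a round trip tags the value on the way out and uses the object_hook keyword argument of json.loads on the way back, which is often simpler than subclassing json.JSONDecoder. The "__complex__" marker key and the class and function names here are assumptions chosen for this example, not a standard.

```python
import json


class TaggedComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, complex):
            # Tag the value so the decoder can recognize it later.
            return {"__complex__": True, "real": obj.real, "imag": obj.imag}
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)


def as_complex(dct):
    """Called for every decoded JSON object; rebuild tagged complex values."""
    if "__complex__" in dct:
        return complex(dct["real"], dct["imag"])
    return dct


encoded = json.dumps({"value": 2 + 1j}, cls=TaggedComplexEncoder)
decoded = json.loads(encoded, object_hook=as_complex)
print(decoded)
```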
SOAP was invented in part to provide completely machine-readable interoperability through WSDL.
Does that really work in real life?
Hardly ever.
Another reason was to provide extensibility via custom types defined in WSDL.
Does that really work in real life?
Hardly ever.
So, if neither of these goals is really achieved by using SOAP, why pay all the overhead required to use the protocol?
Instead, a new style of Web Service was defined: REST.
It stands for Representational State Transfer.
Seriously. Buy the book and read it (http://www.crummy.com/writing/RESTful-Web-Services/)
In XML-RPC/SOAP the same HTTP method is used for all calls, and the endpoint is a different function. For a commenting system on a blog, this might look something like this:
RESTful Web Services are designed to use HTTP Methods as they were intended to be used. So the same set of functionality can be expressed like this:
We call this approach Resource Oriented Architecture because each function is implemented as a different method of the same web-based resource.
The URL represents the resource we are working with.
The HTTP Method represents the action to be taken.
Similarly, the HTTP status code returned by a RESTful service can tell us the result of our action.
For example, in our putative commenting system, consider a POST request to create a new comment:
POST /comment HTTP/1.1
Possible responses might include:
Or for a PUT request to edit an existing comment:
PUT /comment/<id> HTTP/1.1
And a DELETE request to remove a comment:
DELETE /comment/<id> HTTP/1.1
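To tie the pieces together, here is a toy sketch of the commenting resource as a plain Python function, with no web framework involved. The in-memory store, the handle function and the particular status codes chosen (201 for a created comment, 200 for a read or an edit, 204 for a delete with nothing to return, 404 for an unknown comment, 405 for an unsupported method) are assumptions for illustration. A real service might reasonably choose differently.

```python
import itertools

comments = {}                  # hypothetical in-memory store: id -> text
next_id = itertools.count(1)   # hands out 1, 2, 3, ...


def handle(method, path, body=None):
    """Return (status_code, payload) for a request on the comment resource."""
    parts = [part for part in path.split("/") if part]
    if not parts or parts[0] != "comment" or len(parts) > 2:
        return 404, None

    if len(parts) == 1:
        # The collection URL: the only action supported here is creation.
        if method == "POST":
            comment_id = next(next_id)
            comments[comment_id] = body
            return 201, {"id": comment_id, "text": body}
        return 405, None

    # An item URL: /comment/<id>
    if not parts[1].isdigit() or int(parts[1]) not in comments:
        return 404, None
    comment_id = int(parts[1])
    if method == "GET":
        return 200, {"id": comment_id, "text": comments[comment_id]}
    if method == "PUT":
        comments[comment_id] = body
        return 200, {"id": comment_id, "text": body}
    if method == "DELETE":
        del comments[comment_id]
        return 204, None
    return 405, None


created = handle("POST", "/comment", "First!")
edited = handle("PUT", "/comment/1", "First! (edited)")
removed = handle("DELETE", "/comment/1")
missing = handle("GET", "/comment/1")
print(created, edited, removed, missing)
```

The URL never changes to say what should happen. Only the method does, and the status code alone tells the caller how it went.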
A fundamental truth of HTTP is that the protocol is stateless.
I’ll repeat that, for emphasis.
HTTP is fundamentally STATELESS
No individual request may be assumed to know anything about any other request.
If we are considering a request as a call to take some action on an application (our web service), then the following conclusion may be drawn:
All the required information representing the possible actions to take should be present in every response.
This brings us to the definition of HATEOAS
Hypermedia As The Engine Of Application State
A State Engine is like a machine. The individual parts of the machine may be said to be in a given state at any one time. For example, the engine of a car can be running or stopped. If the engine is running, the drive shaft may be engaged or disengaged via a transmission.
The act of starting the engine moves the state of the engine from stopped to running. The act of putting the car in gear moves the drive shaft from the state of disengaged to engaged.
In a state engine we call these acts transitions. They exist to move the resources of the application between available states.
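The car engine can be sketched as a tiny state engine in a few lines. The table of transitions and the function names are assumptions made for illustration; only the engine's two states from the example are modeled.

```python
# Each key is (current state, action); each value is the resulting state.
TRANSITIONS = {
    ("stopped", "start"): "running",
    ("running", "stop"): "stopped",
}


def available_actions(state):
    """List the transitions that are possible from the given state."""
    return sorted(
        action for (source, action) in TRANSITIONS if source == state
    )


def transition(state, action):
    """Apply an action and return the new state, or refuse it."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError("cannot %s while %s" % (action, state))


state = "stopped"
state = transition(state, "start")
print(state, available_actions(state))
```

Notice that available_actions answers the question "what can I do from here?", which is the question a client of a stateless service has to ask on every request.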
In HTTP, we have said that no individual request can be asserted to be aware of the current state of the application to which it will be sent.
This means that a RESTful web service response must:
Tonight, you’ll do a walkthrough of using a RESTful web service to extend the data you built about apartment listings yesterday.
As you do so, make note of how the information the web service provides you indicates what else you can do with the data you get back.
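As a rough picture of what to look for, here is a sketch of a response body for a single comment that carries its own list of possible next actions. The field names ("links", "rel", "method", "href") and the comment_representation function are assumptions made for illustration. Real services each have their own conventions, so read the one you are using closely.

```python
import json


def comment_representation(comment_id, text):
    """Build a response body that advertises the available transitions."""
    base = "/comment/%d" % comment_id
    return {
        "id": comment_id,
        "text": text,
        "links": [
            {"rel": "self", "method": "GET", "href": base},
            {"rel": "edit", "method": "PUT", "href": base},
            {"rel": "delete", "method": "DELETE", "href": base},
        ],
    }


response_body = json.dumps(comment_representation(42, "Nice post"))

# A client needs no prior knowledge of the URL layout: it reads the links.
available = {
    link["rel"]: (link["method"], link["href"])
    for link in json.loads(response_body)["links"]
}
print(available["delete"])
```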