Consuming Data from a RESTful Web Service

As an example of a RESTful web service, let’s add some more information to our list of apartment rentals from previous exercises.

We’ll use a common, public API provided by Google.


Geocoding

Geocoding with Google APIs

https://developers.google.com/maps/documentation/geocoding

Open a Python interpreter using your souptests virtualenv:

[souptests]
heffalump:souptests cewing$ python

Then, import the requests library and prepare to make an HTTP request to the Google geocoding service resource:

>>> import requests
>>> import json
>>> from pprint import pprint
>>> url = 'http://maps.googleapis.com/maps/api/geocode/json'
>>> addr = '511 Boren Ave. N, Seattle, 98109'
>>> parameters = {'address': addr, 'sensor': 'false' }
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [{u'long_name': u'511',
                                        u'short_name': u'511',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Boren Avenue North',
                                        u'short_name': u'Boren Ave N',
                                        u'types': [u'route']},
                                       {u'long_name': u'South Lake Union',
                                        u'short_name': u'SLU',
                                        u'types': [u'neighborhood',
                                                   u'political']},
                                       {u'long_name': u'Seattle',
                                        u'short_name': u'Seattle',
                                        u'types': [u'locality',
                                                   u'political']},
                                       {u'long_name': u'King County',
                                        u'short_name': u'King County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'Washington',
                                        u'short_name': u'WA',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'98109',
                                        u'short_name': u'98109',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {u'location': {u'lat': 47.6235481,
                                           u'lng': -122.336212},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 47.6248970802915,
                                                          u'lng': -122.3348630197085},
                                           u'southwest': {u'lat': 47.6221991197085,
                                                          u'lng': -122.3375609802915}}},
               u'types': [u'street_address']}],
 u'status': u'OK'}
>>>

You can also do the reverse: provide a location as latitude and longitude and receive address information back:

>>> location = data['results'][0]['geometry']['location']
>>> latlng = "{lat},{lng}".format(**location)
>>> parameters = {'latlng': latlng, 'sensor': 'false'}
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [{u'long_name': u'511',
                                        u'short_name': u'511',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Boren Avenue North',
                                        u'short_name': u'Boren Ave N',
                                        u'types': [u'route']},
                                       {u'long_name': u'South Lake Union',
                                        u'short_name': u'SLU',
                                        u'types': [u'neighborhood',
                                                   u'political']},
                                       {u'long_name': u'Seattle',
                                        u'short_name': u'Seattle',
                                        u'types': [u'locality',
                                                   u'political']},
                                       {u'long_name': u'King County',
                                        u'short_name': u'King County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'Washington',
                                        u'short_name': u'WA',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'98109',
                                        u'short_name': u'98109',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {u'location': {u'lat': 47.6235481,
                                           u'lng': -122.336212},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 47.6248970802915,
                                                          u'lng': -122.3348630197085},
                                           u'southwest': {u'lat': 47.6221991197085,
                                                          u'lng': -122.3375609802915}}},
               u'types': [u'street_address']},
              ...
              ],
 u'status': u'OK'}
>>>

Notice that the response actually contains a number of results. These are decreasingly specific designations for the location you provided; the types values in each result indicate its level of geographical specificity.
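One quick way to see this is to pull out just the types and formatted_address of each result. Continuing the same interpreter session (the exact entries will depend on what Google returns):

>>> [(r['types'], r['formatted_address']) for r in data['results']]

The first entry is the most specific (here a street_address); later entries broaden out toward neighborhood, city, county, and so on.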

Mashup!

Let’s create a simple mashup of this data with the apartment listings we built by scraping Craigslist in an earlier exercise.

Open your scraper.py file in your editor and add a new function. Call it add_address. This function should:

  • take a single listing from our craigslist work
  • format the location data provided in that listing properly
  • make a reverse geocoding lookup using the google api above
  • add the best available address to the listing
  • return the updated listing

Can you write this function without looking at the solution below? Try it.

Solution

Here are the changes I made to scraper.py to add this function:

# add an import
import json

# and a function
def add_address(listing):
    api_url = 'http://maps.googleapis.com/maps/api/geocode/json'
    loc = listing['location']
    latlng_tmpl = "{data-latitude},{data-longitude}"
    parameters = {
        'sensor': 'false',
        'latlng': latlng_tmpl.format(**loc),
    }
    resp = requests.get(api_url, params=parameters)
    resp.raise_for_status()  # no-op on success; raises HTTPError for 4xx/5xx
    data = json.loads(resp.text)
    if data['status'] == 'OK':
        best = data['results'][0]
        listing['address'] = best['formatted_address']
    else:
        listing['address'] = 'unavailable'
    return listing
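
Before wiring this in, you can sanity-check the function by hand. Here's a sketch in the interpreter, using coordinate values borrowed from our earlier lookup (any lat/lng strings would do):

>>> from scraper import add_address
>>> sample = {'location': {'data-latitude': u'47.6235481',
...                        'data-longitude': u'-122.336212'}}
>>> add_address(sample)['address']  # expect a formatted street address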

You’ll need to bolt the new function into your script so that the results it gives are added to each listing.

Make the following changes to your __main__ block:

if __name__ == '__main__':
    from pprint import pprint
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)    # above here is the same
    for listing in extract_listings(doc): # change everything below
        listing = add_address(listing)
        pprint(listing)

Give it a whirl, using the test approach so you don’t hit Craigslist while trying it out:

[souptests]
heffalump:souptests cewing$ python scraper.py test
{'address': u'12339-12399 78th Avenue South, Seattle, WA 98178, USA',
 'description': u'2 BEDROOM 2 BATHROOM Zero Down   Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'},
 'price': u'$960',
 'size': u'3br'}
{'address': ...
...
[souptests]
heffalump:souptests cewing$

Nifty, eh?

Reduce Your Footprint

At the moment, all of our results need to be held in memory at the same time. In this case it probably isn’t too big a deal, but it’s good to practice being kind to your resources.

Update the extract_listings function to turn it into a generator. Then we can process a single apartment listing at a time, decreasing the memory requirements of our script:

def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    # delete the line where you create a list in which to store
    # your listings
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')
        price_span = listing.find('span', class_='price')
        this_listing = {
            'location': location,
            'link': link.attrs['href'],
            'description': link.string.strip(),
            'price': price_span.string.strip(),
            'size': price_span.next_sibling.strip(' \n-/')
        }
        # delete the line where you append this result to a list
        yield this_listing  # This is the only change you need to make

When you make that change, each individual listing will be yielded from the extract_listings generator, and you will be able to add an address to each without building all the rest first.
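You can see the lazy behavior in the interpreter. Calling the function no longer builds a list; it hands back a generator object, and no listing is extracted until you ask for one (a sketch, assuming you have a parsed doc on hand):

>>> gen = extract_listings(doc)
>>> gen
<generator object extract_listings at 0x...>
>>> first = next(gen)  # only now is the first listing built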

Going Further

This would also be a great opportunity to use asynchronous processing.

Can you think of a way to add an address to each listing with an asynchronous call, using gevent or tornado?

Consider this a standing challenge.
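
If you'd like a nudge, here is one possible shape using gevent. Treat it as a sketch: it assumes gevent is installed in your virtualenv, and the add_addresses helper is my name for it, not part of the exercise.

# ideally run the monkey-patching before requests is imported,
# so its socket operations become cooperative
from gevent import monkey
monkey.patch_all()

import gevent

def add_addresses(listings):
    # one greenlet per listing; the geocoding requests can then
    # wait on the network concurrently instead of one at a time
    jobs = [gevent.spawn(add_address, listing) for listing in listings]
    gevent.joinall(jobs)
    return [job.value for job in jobs]

Note that spawning a job per listing pulls every listing out of the generator up front, so you trade back some of the memory savings for concurrency.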