As an example of a RESTful web service, let’s add some more information to our list of apartment rentals from previous exercises.
We’ll use a common, public API provided by Google.
Geocoding
https://developers.google.com/maps/documentation/geocoding
Open a Python interpreter using your souptests virtualenv:
[souptests]
heffalump:souptests cewing$ python
Then, import the requests library and prepare to make an HTTP request to the Google geocoding service:
>>> import requests
>>> import json
>>> from pprint import pprint
>>> url = 'http://maps.googleapis.com/maps/api/geocode/json'
>>> addr = '511 Boren Ave. N, Seattle, 98109'
>>> parameters = {'address': addr, 'sensor': 'false'}
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [{u'long_name': u'511',
                                        u'short_name': u'511',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Boren Avenue North',
                                        u'short_name': u'Boren Ave N',
                                        u'types': [u'route']},
                                       {u'long_name': u'South Lake Union',
                                        u'short_name': u'SLU',
                                        u'types': [u'neighborhood',
                                                   u'political']},
                                       {u'long_name': u'Seattle',
                                        u'short_name': u'Seattle',
                                        u'types': [u'locality',
                                                   u'political']},
                                       {u'long_name': u'King County',
                                        u'short_name': u'King County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'Washington',
                                        u'short_name': u'WA',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'98109',
                                        u'short_name': u'98109',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {u'location': {u'lat': 47.6235481,
                                           u'lng': -122.336212},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 47.6248970802915,
                                                          u'lng': -122.3348630197085},
                                           u'southwest': {u'lat': 47.6221991197085,
                                                          u'lng': -122.3375609802915}}},
               u'types': [u'street_address']}],
 u'status': u'OK'}
>>>
You can also do the reverse: provide a location as latitude and longitude and receive address information back:
>>> location = data['results'][0]['geometry']['location']
>>> latlng = "{lat},{lng}".format(**location)
>>> parameters = {'latlng': latlng, 'sensor': 'false'}
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [...],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {...},
               u'types': [u'street_address']},
              ...
              ],
 u'status': u'OK'}
>>>
Notice that the response actually contains a number of results. These are decreasingly specific designations for the location you provided; the types values indicate the level of geographic specificity of each result.
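To see that progression for yourself, you can loop over the results and print the types next to each formatted address. This is just a quick sketch, reusing the data dict from the reverse-geocoding call above:

for result in data['results']:
    # each result pairs a set of type labels with an address string
    print("{0}: {1}".format(', '.join(result['types']),
                            result['formatted_address']))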
Let’s create a simple mashup of this data with the apartment listings we built by scraping Craigslist in an earlier exercise.
Open your scraper.py file in your editor and add a new function. Call it add_address. This function should take a single listing as its argument, use the latitude and longitude stored in the listing's location to make a reverse-geocoding request, add the formatted address of the best result to the listing (falling back to 'unavailable' if geocoding fails), and return the updated listing.
Can you write this function without looking at the solution below? Try it.
Here are the changes I made to scraper.py to add this function:
# add an import
import json


# and a function
def add_address(listing):
    api_url = 'http://maps.googleapis.com/maps/api/geocode/json'
    loc = listing['location']
    latlng_tmpl = "{data-latitude},{data-longitude}"
    parameters = {
        'sensor': 'false',
        'latlng': latlng_tmpl.format(**loc),
    }
    resp = requests.get(api_url, params=parameters)
    resp.raise_for_status()  # <- this is a no-op if all is well
    data = json.loads(resp.text)
    if data['status'] == 'OK':
        best = data['results'][0]
        listing['address'] = best['formatted_address']
    else:
        listing['address'] = 'unavailable'
    return listing
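Before you wire it in, you can sanity-check the function by hand. The listing below is a hypothetical stand-in that mirrors the shape our scraper produces; given the coordinates we looked up earlier, you should get the Boren Avenue address back:

# a hypothetical listing, shaped like the ones extract_listings produces
fake_listing = {
    'location': {'data-latitude': u'47.6235481',
                 'data-longitude': u'-122.336212'},
}
print(add_address(fake_listing)['address'])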
You’ll need to bolt the new function into your script so that the results it gives are added to each listing.
Make the following changes to your __main__ block:
if __name__ == '__main__':
    from pprint import pprint
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)     # above here is the same
    for listing in extract_listings(doc):  # change everything below
        listing = add_address(listing)
        pprint(listing)
Give it a whirl, using the test approach so you don’t hit Craigslist while trying it out:
[souptests]
heffalump:souptests cewing$ python scraper.py test
{'address': u'12339-12399 78th Avenue South, Seattle, WA 98178, USA',
 'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'},
 'price': u'$960',
 'size': u'3br'}
{'address': ...
...
[souptests]
heffalump:souptests cewing$
Nifty, eh?
At the moment, all of our results need to be held in memory at the same time. In this case it probably isn’t too big a deal, but it’s good to practice being kind to your resources.
Update the extract_listings function to turn it into a generator. Then we can process a single apartment listing at a time, decreasing the memory requirements of our script:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    # delete the line where you create a list in which to store
    # your listings
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')
        price_span = listing.find('span', class_='price')
        this_listing = {
            'location': location,
            'link': link.attrs['href'],
            'description': link.string.strip(),
            'price': price_span.string.strip(),
            'size': price_span.next_sibling.strip(' \n-/')
        }
        # delete the line where you append this result to a list
        yield this_listing  # This is the only change you need to make
When you make that change, each individual listing will be yielded from the extract_listings generator, and you will be able to add an address to each without building all the rest first.
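If you want to convince yourself that the generator really is lazy, a quick sketch in the interpreter shows it (this assumes doc is a parsed page, as in the __main__ block):

gen = extract_listings(doc)  # nothing runs yet; this is just a generator object
first = next(gen)            # the body executes only as far as the first yield
print(first['description'])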
This would be a great opportunity for using asynchronous processing as well.
Can you think of a way to add the address to each listing asynchronously, using gevent or tornado?
Consider this a standing challenge.
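If you want a nudge in the gevent direction, here is one minimal sketch; it is one possible approach, not the only one, and the function name here is just illustrative. Monkey-patching makes the blocking socket calls inside requests cooperative, and the pool bounds how many geocoding requests are in flight at once:

from gevent import monkey
monkey.patch_all()  # must happen before requests opens any sockets

from gevent.pool import Pool
from pprint import pprint


def print_listings_with_addresses(listings, concurrency=10):
    # geocode up to `concurrency` listings at a time, printing each
    # result as soon as its request completes
    pool = Pool(concurrency)
    for listing in pool.imap_unordered(add_address, listings):
        pprint(listing)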