As an example of a RESTful web service, let’s add some more information to our list of apartment rentals from previous exercises.
We’ll use a common, public API provided by Google.
Geocoding
https://developers.google.com/maps/documentation/geocoding
Open a Python interpreter using your souptests virtualenv:
[souptests]
heffalump:souptests cewing$ python
Then, import the requests library and prepare to make an HTTP request to the Google geocoding service:
>>> import requests
>>> import json
>>> from pprint import pprint
>>> url = 'http://maps.googleapis.com/maps/api/geocode/json'
>>> addr = '511 Boren Ave. N, Seattle, 98109'
>>> parameters = {'address': addr, 'sensor': 'false'}
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [{u'long_name': u'511',
                                        u'short_name': u'511',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Boren Avenue North',
                                        u'short_name': u'Boren Ave N',
                                        u'types': [u'route']},
                                       {u'long_name': u'South Lake Union',
                                        u'short_name': u'SLU',
                                        u'types': [u'neighborhood',
                                                   u'political']},
                                       {u'long_name': u'Seattle',
                                        u'short_name': u'Seattle',
                                        u'types': [u'locality',
                                                   u'political']},
                                       {u'long_name': u'King County',
                                        u'short_name': u'King County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'Washington',
                                        u'short_name': u'WA',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'98109',
                                        u'short_name': u'98109',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {u'location': {u'lat': 47.6235481,
                                           u'lng': -122.336212},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 47.6248970802915,
                                                          u'lng': -122.3348630197085},
                                           u'southwest': {u'lat': 47.6221991197085,
                                                          u'lng': -122.3375609802915}}},
               u'types': [u'street_address']}],
 u'status': u'OK'}
>>>
You can also do the reverse: provide a location as latitude and longitude and receive address information back:
>>> location = data['results'][0]['geometry']['location']
>>> latlng = "{lat},{lng}".format(**location)
>>> parameters = {'latlng': latlng, 'sensor': 'false'}
>>> resp = requests.get(url, params=parameters)
>>> data = json.loads(resp.text)
>>> if data['status'] == 'OK':
...     pprint(data)
...
{u'results': [{u'address_components': [...],
               u'formatted_address': u'511 Boren Avenue North, Seattle, WA 98109, USA',
               u'geometry': {...},
               u'types': [u'street_address']},
              ...
              ],
 u'status': u'OK'}
>>>
Notice that the response actually contains a number of results. These are decreasingly specific designations for the location you provided; the types values indicate the level of geographic specificity of each result.
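To see that progression for yourself, you can loop over the results and print the types next to each formatted address. This is just a quick sketch, reusing the data dict from the reverse-geocoding call above:

for result in data['results']:
    # each result pairs a set of type labels with an address string
    print("{0}: {1}".format(', '.join(result['types']),
                            result['formatted_address']))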
Let’s create a simple mashup of this data with the apartment listings we built by scraping Craigslist in an earlier exercise.
Open your scraper.py file in your editor and add a new function. Call it add_address. This function should take a single listing as its argument, use the latitude and longitude stored in the listing's location to make a reverse-geocoding request, add the formatted address of the best result to the listing (falling back to 'unavailable' if geocoding fails), and return the updated listing.
Can you write this function without looking at the solution below? Try it.
Here are the changes I made to scraper.py to add this function:
# add an import
import json


# and a function
def add_address(listing):
    api_url = 'http://maps.googleapis.com/maps/api/geocode/json'
    loc = listing['location']
    latlng_tmpl = "{data-latitude},{data-longitude}"
    parameters = {
        'sensor': 'false',
        'latlng': latlng_tmpl.format(**loc),
    }
    resp = requests.get(api_url, params=parameters)
    resp.raise_for_status()  # <- this is a no-op if all is well
    data = json.loads(resp.text)
    if data['status'] == 'OK':
        best = data['results'][0]
        listing['address'] = best['formatted_address']
    else:
        listing['address'] = 'unavailable'
    return listing
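Before you wire it in, you can sanity-check the function by hand. The listing below is a hypothetical stand-in that mirrors the shape our scraper produces; given the coordinates we looked up earlier, you should get the Boren Avenue address back:

# a hypothetical listing, shaped like the ones extract_listings produces
fake_listing = {
    'location': {'data-latitude': u'47.6235481',
                 'data-longitude': u'-122.336212'},
}
print(add_address(fake_listing)['address'])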
You’ll need to bolt the new function into your script so that the results it gives are added to each listing.
Make the following changes to your __main__ block:
if __name__ == '__main__':
    from pprint import pprint
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)     # above here is the same
    for listing in extract_listings(doc):  # change everything below
        listing = add_address(listing)
        pprint(listing)
Give it a whirl, using the test approach so you don’t hit Craigslist while trying it out:
[souptests]
heffalump:souptests cewing$ python scraper.py test
{'address': u'12339-12399 78th Avenue South, Seattle, WA 98178, USA',
 'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'},
 'price': u'$960',
 'size': u'3br'}
{'address': ...
...
[souptests]
heffalump:souptests cewing$
Nifty, eh?
At the moment, all of our results need to be held in memory at the same time. In this case it probably isn’t too big a deal, but it’s good to practice being kind to your resources.
Update the extract_listings function to turn it into a generator. Then we can process a single apartment listing at a time, decreasing the memory requirements of our script:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    # delete the line where you create a list in which to store
    # your listings
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')
        price_span = listing.find('span', class_='price')
        this_listing = {
            'location': location,
            'link': link.attrs['href'],
            'description': link.string.strip(),
            'price': price_span.string.strip(),
            'size': price_span.next_sibling.strip(' \n-/')
        }
        # delete the line where you append this result to a list
        yield this_listing  # This is the only change you need to make
When you make that change, each individual listing will be yielded from the extract_listings generator, and you will be able to add an address to each without building all the rest first.
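If you want to convince yourself that the generator really is lazy, a quick sketch in the interpreter shows it (this assumes doc is a parsed page, as in the __main__ block):

gen = extract_listings(doc)  # nothing runs yet; this is just a generator object
first = next(gen)            # the body executes only as far as the first yield
print(first['description'])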
This would be a great opportunity for using asynchronous processing as well.
Can you think of a way to add the address to each listing asynchronously, using gevent or tornado?
Consider this a standing challenge.
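If you want a nudge in the gevent direction, here is one minimal sketch; it is one possible approach, not the only one, and the function name here is just illustrative. Monkey-patching makes the blocking socket calls inside requests cooperative, and the pool bounds how many geocoding requests are in flight at once:

from gevent import monkey
monkey.patch_all()  # must happen before requests opens any sockets

from gevent.pool import Pool
from pprint import pprint


def print_listings_with_addresses(listings, concurrency=10):
    # geocode up to `concurrency` listings at a time, printing each
    # result as soon as its request completes
    pool = Pool(concurrency)
    for listing in pool.imap_unordered(add_address, listings):
        pprint(listing)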