Work through this exercise to create a Python script to extract a list of apartment rentals from Craigslist.
While working, you should use the virtualenv project we created in class for learning about the BeautifulSoup package.
heffalump:~ cewing$ workon souptests
[souptests]
heffalump:souptests cewing$
Begin by creating a new Python file; call it scraper.py and open it in your editor.
The first step is to use the requests library to fetch a set of search results from the Craigslist site.
In order to do so, we will need to assemble a query that fits with the search form present on the Seattle apartment listing page.
Use your browser’s devtools to determine the names of the various inputs available in the search form on that page. You should end up with a list that includes at least the following: query, minAsk, maxAsk, and bedrooms.
You’ll also discover, if you submit a search, that the URL to be used for a search request is in fact this:
http://seattle.craigslist.org/search/apa
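If you are curious how the query string gets built, the requests library will encode a dictionary of parameters onto that URL for you. A hypothetical interpreter session (the parameter order in the resulting URL may vary):

>>> import requests
>>> resp = requests.get(
...     'http://seattle.craigslist.org/search/apa',
...     params={'bedrooms': 2, 'maxAsk': 1000},
... )
>>> print resp.url
http://seattle.craigslist.org/search/apa?bedrooms=2&maxAsk=1000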
Our goal is to write a Python function that will return the search results HTML from a query to Craigslist. This function should accept keyword arguments for each of the search fields identified above, raise an error if none are provided, make an HTTP request to the search URL with those parameters, raise an error if the response is not OK, and return the body and encoding of the response.
Here is one possible solution for this query:
import requests


def fetch_search_results(
    query=None, minAsk=None, maxAsk=None, bedrooms=None
):
    # at this point locals() holds only the keyword arguments, so this
    # builds a dict of just the parameters the caller actually provided
    search_params = {
        key: val for key, val in locals().items() if val is not None
    }
    if not search_params:
        raise ValueError("No valid keywords")
    base = 'http://seattle.craigslist.org/search/apa'
    resp = requests.get(base, params=search_params, timeout=3)
    resp.raise_for_status()  # <- raises for 4xx/5xx responses; a no-op on success
    return resp.content, resp.encoding
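A quick, hypothetical interpreter session shows the guard clause at work; with no arguments, search_params is empty and the function refuses to query Craigslist:

>>> from scraper import fetch_search_results
>>> fetch_search_results()
Traceback (most recent call last):
  ...
ValueError: No valid keywords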
Write the results of your search to a file, apartments.html, so that you can work on it without hammering the Craigslist servers.
Write a read_search_results function that reads this file from disk and returns the content and encoding in the same shape as the function above. Then you can switch between the two without altering the API. I leave this exercise to you.
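If you get stuck, here is one minimal sketch; the write_search_results name and the assumption that the cached page is utf-8 encoded are mine, not prescribed by the exercise:

def write_search_results(html, filename='apartments.html'):
    # cache the raw bytes returned by fetch_search_results
    with open(filename, 'w') as outfile:
        outfile.write(html)


def read_search_results(filename='apartments.html'):
    # mirror the (content, encoding) return shape of fetch_search_results;
    # we assume here that the cached page was saved as utf-8
    with open(filename) as infile:
        return infile.read(), 'utf-8'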
Next, we need a function parse_source to set up the HTML as DOM nodes for scraping. It will need to take the HTML and encoding returned by our fetching functions, parse the HTML using BeautifulSoup, and return the parsed object.
This function can be quite simple. Add it to scraper.py:
# add this import at the top
from bs4 import BeautifulSoup


# then add this function lower down
def parse_source(html, encoding='utf-8'):
    # name a parser explicitly so results are consistent across systems
    parsed = BeautifulSoup(html, 'html.parser', from_encoding=encoding)
    return parsed
In order to see the results we have at this point, we’ll need to make our scraper.py executable by adding a __main__ block. For example:
# add another import at the top
import sys


if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    print doc.prettify(encoding=encoding)
Now, you can execute your scraper script in one of two ways: run python scraper.py to fetch live results from Craigslist, or run python scraper.py test to work from the copy you saved in apartments.html.
You are going to build a function that extracts useful information from each of the listings in the parsed HTML search results. From each listing, we should extract the following information: the location (latitude and longitude), the link to the full listing page, the descriptive text, the price, and the size.
You’ll be building this function one step at a time, to simplify the task.
The first job is to find the container that holds each individual listing. Use your browser’s devtools to identify it, then write a function that takes in the parsed HTML and returns a list of the apartment listing container nodes. Call this function extract_listings:
def extract_listings(parsed):
    listings = parsed.find_all('p', class_='row')
    return listings
If you update your __main__ block to use this new function, you can verify the results visually:
if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    listings = extract_listings(doc)  # add this line
    print len(listings)               # and this one
    print listings[0].prettify()      # and this one too
Call your script from the command line (in test mode), to see your results:
[souptests]
heffalump:souptests cewing$ python scraper.py test
100
<p class="row" data-latitude="47.4924143400595" data-longitude="-122.235904626445" data-pid="4345117401">
...
</p>
[souptests]
heffalump:souptests cewing$
If you look through your test listings file using your browser’s devtools, you’ll notice that only some of them actually have latitude and longitude. Because the specs for our scraper require this data, we want to filter out any listings that do not.
BeautifulSoup allows us to filter our searches using HTML attributes with the attrs argument. One way of doing this is to provide a specific value for a given attribute:
doc.find_all('p', attrs={'data-latitude': "47.4924143400595"})
It should be pretty clear, though, that each of our listings is located in a different place, so this type of filtering won’t really help much. Happily, you can also provide True as the value for a given attribute. By doing so, you tell BeautifulSoup that you want to match any node that has that attribute, regardless of its specific value.
Let’s use this to enhance our extract_listings function so that it only returns the listings that have location attributes:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    return listings
Calling the script from the command line now should return a slightly different number of results:
[souptests]
heffalump:souptests cewing$ python scraper.py test
74
<p class="row" data-latitude="47.4924143400595" data-longitude="-122.235904626445" data-pid="4345117401">
...
</p>
[souptests]
heffalump:souptests cewing$
You’ve used the location data to filter results. Now that only those results that have locations are being listed, let’s begin scraping that data out of the HTML page.
In BeautifulSoup, the HTML attributes of a given tag are found in the attrs attribute of the Tag object. This attribute is a dictionary, and it is certain to be present even if it is empty. The names of the HTML attributes are its keys, and their values are its values.
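For example, in a hypothetical interactive session with listing bound to one of the Tag objects found above, you might see values like those from our sample search results:

>>> listing.attrs['data-pid']
u'4345117401'
>>> listing.attrs.get('data-latitude', '')
u'47.4924143400595'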
We’ve already said that there is a certain set of data we want to preserve about each listing. We could create some custom Python class to represent a listing, and perhaps in some situations that would be appropriate, but for this simple script we will just build a dictionary that represents each listing.
Let’s update our extract_listings function to build a dictionary for each listing, and begin by populating it with the location data we extract:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        this_listing = {
            'location': location,
        }
        extracted.append(this_listing)
    return extracted
Since the return value of this function has now changed from a list of Tag objects to a list of dictionaries, we will also need to update our __main__ block:
if __name__ == '__main__':
    import pprint  # add this import
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    listings = extract_listings(doc)
    print len(listings)
    pprint.pprint(listings[0])  # update this line
And now, executing this script at the command line should return the following:
[souptests]
heffalump:souptests cewing$ python scraper.py test
74
{'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'}}
[souptests]
heffalump:souptests cewing$
We used the find_all method of a Tag above to extract all the listings from our parsed document. We can use the find method of each listing to find the first item that matches our search filter.
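find accepts the same filters as find_all; it simply returns the first match (or None) instead of a list. A quick sketch, assuming listing is one of our Tag objects:

links = listing.find_all('a')  # a (possibly empty) list of every match
link = listing.find('a')       # the first match only, or None if there is none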
Use your browser’s devtools to find the element in each listing that contains the descriptive text about the listing. What kind of HTML tag is it? What other useful bit of information is present in that tag?
Use this information to enhance our extract_listings function so that each dictionary it produces also contains a description and link:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')  # add this
        this_listing = {
            'location': location,
            'link': link.attrs['href'],          # add this too
            'description': link.string.strip(),  # and this
        }
        extracted.append(this_listing)
    return extracted
Note that we call the string strip method on the value we get for the description from the string attribute of the Tag.
The most obvious reason is that we don’t want extra whitespace.
The second reason is more subtle, but also more important: the values returned by string are not simple unicode strings.
They are actually instances of the NavigableString class:
>>> listing.find('span', class_='pl').find('a').string.__class__
<class 'bs4.element.NavigableString'>
These class instances contain not only the text, but also instance attributes that connect them to the DOM nodes that surround them. Those attributes take quite a bit of memory. Calling strip, or casting them with the unicode type object, converts them to plain unicode strings, saving memory.
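You can see the conversion for yourself in a hypothetical interpreter session (the name raw is just for illustration):

>>> raw = listing.find('span', class_='pl').find('a').string
>>> raw.__class__
<class 'bs4.element.NavigableString'>
>>> raw.strip().__class__
<type 'unicode'>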
Executing the script from the command line now should show you that you have succeeded:
[souptests]
heffalump:souptests cewing$ python scraper.py test
74
{'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'}}
[souptests]
heffalump:souptests cewing$
Again, use your browser’s devtools to find the container that holds both the price of a listed apartment and the text that describes its size.
What’s different about these two?
The price data is contained inside a convenient container of its own. The size, however, is not. It is just some text found in the main container, after the Tag that holds the price. You can see this by dropping a breakpoint into your extract_listings function and inspecting the DOM.
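If you haven’t used pdb this way before, one way to drop the breakpoint is with the standard library’s pdb module; add this illustrative line inside extract_listings, just before this_listing is built, and delete it when you are done:

import pdb; pdb.set_trace()  # temporary: pause here to inspect the DOM

Running python scraper.py test then lands you in a session like this: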
> /Users/cewing/projects/souptests/scraper.py(39)extract_listings()
-> this_listing = {
(Pdb) l2 = listing.find('span', class_='l2')
(Pdb) print l2.prettify()
<span class="l2">
 <span class="price">
  $960
 </span>
 / 3br
 <span class="pnr">
  <small>
   (Seattle98102)
  </small>
  <span class="px">
   <span class="p">
    pic
    <a class="maptag" data-pid="4345117401" href="#">
     map
    </a>
   </span>
  </span>
 </span>
</span>
(Pdb)
If you try to get at that text by using the string attribute of the l2 span tag, you’ll see that it just isn’t there:
(Pdb) print l2.string
None
(Pdb)
Likewise, if you use the text attribute to get all the text in the tag, you end up with more than you really want:
(Pdb) print l2.text
$960 / 3br - (Seattle98102) pic map
(Pdb)
You could parse this string to extract what you want, but why? There’s an easier way.
All text in a DOM document is really contained in instances of the NavigableString class.
We’ve already talked about how this class contains references to the DOM nodes around it. These references allow us to navigate the DOM, moving from one node to the next directly instead of simply searching for what we want. BeautifulSoup supports navigating from node to node in a number of ways, including the next_sibling and previous_sibling, next_element and previous_element, and parent and children attributes of any element.
In this case, that ability can help us a great deal. Looking carefully, you might notice that the text describing the size of an apartment is located just after the span that contains our price. This means we can use navigation methods starting from the span containing the price to get where we want to be:
(Pdb) price_node = listing.find('span', class_='l2').find('span', class_='price')
(Pdb) price_node
<span class="price">$960</span>
(Pdb) price_node.next_sibling
u' / 3br - '
(Pdb) price_node.next_sibling.strip()
u'/ 3br -'
(Pdb) price_node.next_sibling.strip(' \n-/')
u'3br'
(Pdb)
Type ‘quit’ at your pdb prompt to exit the debugger and then remove the breakpoint from your code.
Now update extract_listings to include the information we’ve just found:
def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')
        price_span = listing.find('span', class_='price')  # add me
        this_listing = {
            'location': location,
            'link': link.attrs['href'],
            'description': link.string.strip(),
            'price': price_span.string.strip(),              # and me
            'size': price_span.next_sibling.strip(' \n-/'),  # me too
        }
        extracted.append(this_listing)
    return extracted
And now executing your script from the command line should show these new elements for a listing:
[souptests]
heffalump:souptests cewing$ python scraper.py test
74
{'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'},
 'price': u'$960',
 'size': u'3br'}
[souptests]
heffalump:souptests cewing$