Scraping Apartment Listings from Craigslist

Work through this exercise to create a Python script to extract a list of apartment rentals from Craigslist.

While working, you should use the virtualenv project we created in class for learning about the BeautifulSoup package.

heffalump:~ cewing$ workon souptests
[souptests]
heffalump:souptests cewing$
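
If you no longer have that environment handy, something equivalent can be recreated like so (this assumes virtualenvwrapper is installed; requests and beautifulsoup4 are the only packages the exercise needs):

heffalump:~ cewing$ mkvirtualenv souptests
[souptests]
heffalump:souptests cewing$ pip install requests beautifulsoup4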

Begin by creating a new Python file called scraper.py and opening it in your editor.

Step 1: Fetch Search Results

The first step is to use the requests library to fetch a set of search results from the Craigslist site.

In order to do so, we will need to assemble a query that fits with the search form present on the Seattle apartment listing page.

Use your browser’s devtools to determine the name of the various inputs available in the search form on that page. You should end up with a list that includes at least the following:

  • keywords: query=keyword+values+here
  • price: minAsk=NNN maxAsk=NNN
  • bedrooms: bedrooms=N (N in range 1-8)

You’ll also discover, if you submit a search, that the URL used for a search request is in fact this:

http://seattle.craigslist.org/search/apa

Our goal is to write a Python function that will return the search results HTML from a query to craigslist. This function should:

  • accept one keyword argument for each of the possible query values
  • build a dictionary of request query parameters from the incoming keywords
  • make a request to the craigslist server using this query
  • return the body of the response if there is no error
  • raise an error if there is a problem with the response

Here is one possible solution for this query:

import requests

def fetch_search_results(
    query=None, minAsk=None, maxAsk=None, bedrooms=None
):
    # locals() at this point holds exactly the keyword arguments above,
    # so this keeps only the parameters the caller actually provided
    search_params = {
        key: val for key, val in locals().items() if val is not None
    }
    if not search_params:
        raise ValueError("No valid keywords")

    base = 'http://seattle.craigslist.org/search/apa'
    resp = requests.get(base, params=search_params, timeout=3)
    resp.raise_for_status()  # <- no-op unless we got an HTTP error status
    return resp.content, resp.encoding
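
Note that requests encodes the params dictionary into the URL's query string for you; the url attribute of the response shows exactly what was requested. A quick illustration (the parameter values here are arbitrary):

import requests

resp = requests.get(
    'http://seattle.craigslist.org/search/apa',
    params={'minAsk': 500, 'maxAsk': 1000, 'bedrooms': 2},
    timeout=3,
)
# prints something like (parameter order may vary):
# http://seattle.craigslist.org/search/apa?minAsk=500&maxAsk=1000&bedrooms=2
print resp.url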

Write the results of your search to a file, apartments.html, so that you can work on it without needing to hammer the craigslist servers.

Write a read_search_results function which reads this file from disk and returns the content and encoding in the same way as the function above. Then you can switch between the two without altering the API. I leave the details to you, but one possible approach is sketched below.
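
For reference, here is a minimal sketch of both halves. The helper name save_search_results is mine rather than part of the exercise, and hard-coding 'utf-8' is an assumption; check the encoding your fetch actually reported:

import io

def save_search_results(body, filename='apartments.html'):
    # body is the raw bytes returned by fetch_search_results
    with io.open(filename, 'wb') as outfile:
        outfile.write(body)

def read_search_results(filename='apartments.html'):
    # mirror fetch_search_results's API: return (body, encoding)
    with io.open(filename, 'rb') as infile:
        return infile.read(), 'utf-8'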

Step 2: Parse Search Results

Next, we need a function parse_source to set up the HTML as DOM nodes for scraping. It will need to:

  • Take the response body from the previous function (or some other source)
  • Parse it using BeautifulSoup
  • Return the parsed object for further processing

This function can be quite simple. Add it to scraper.py:

# add this import at the top
from bs4 import BeautifulSoup

# then add this function lower down
def parse_source(html, encoding='utf-8'):
    # naming a parser explicitly avoids bs4's "no parser specified" warning;
    # the stdlib 'html.parser' works, though any installed parser will do
    parsed = BeautifulSoup(html, 'html.parser', from_encoding=encoding)
    return parsed

In order to see the results we have at this point, we’ll need to make our scraper.py executable by adding a __main__ block:

# add another import at the top
import sys

if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    print doc.prettify(encoding=encoding)

Now, you can execute your scraper script in one of two ways:

  1. python scraper.py will fetch results directly from Craigslist.
  2. python scraper.py test will use your stored results from file.

Step 3: Extract Listing Information

You are going to build a function that extracts useful information from each of the listings in the parsed HTML search results. From each listing, we should extract the following information:

  • Location data (latitude and longitude)
  • Source link (to the craigslist detailed listing)
  • Description text
  • Price and size data

You’ll be building this function one step at a time, to simplify the task.

3a: Find Individual Listings

The first job is to find the container that holds each individual listing. Use your browser’s devtools to identify it. Then, write a function that takes in the parsed HTML and returns a list of the apartment listing container nodes. Call this function extract_listings:

def extract_listings(parsed):
    listings = parsed.find_all('p', class_='row')
    return listings

If you update your __main__ block to use this new function, you can verify the results visually:

if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    listings = extract_listings(doc) # add this line
    print len(listings)              # and this one
    print listings[0].prettify()     # and this one too

Call your script from the command line (in test mode) to see your results:

[souptests]
heffalump:souptests cewing$ python scraper.py test
100
<p class="row" data-latitude="47.4924143400595" data-longitude="-122.235904626445" data-pid="4345117401">
  ...
</p>

[souptests]
heffalump:souptests cewing$

If you look through your test listings file using your browser’s devtools, you’ll notice that only some of them actually have latitude and longitude. Because the specs for our scraper require this data, we want to filter out any listings that do not have it.

BeautifulSoup allows us to filter our searches using HTML attributes with the attrs argument. One way of doing this is to provide a specific value for a given attribute:

doc.find_all('p', attrs={'data-latitude': "47.4924143400595"})

It should be pretty clear, though, that each of our listings is located in a different place, so this type of filtering won’t really help much. Happily, you can also provide True as the value for a given attribute. By doing so, you are telling BeautifulSoup that you want to match any node that has that attribute, regardless of the specific value, as in the example below.
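
For example, this finds every p tag that carries a data-latitude attribute, whatever its value:

doc.find_all('p', attrs={'data-latitude': True})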

Let’s use this to enhance our extract_listings function so that it only returns the listings that have location attributes:

def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    return listings

Calling the script from the command line should now return fewer results:

[souptests]
heffalump:souptests cewing$ python scraper.py test
74
<p class="row" data-latitude="47.4924143400595" data-longitude="-122.235904626445" data-pid="4345117401">
  ...
</p>

[souptests]
heffalump:souptests cewing$

3b: Extract Location Data

You’ve used the location data to filter results. Now that only those results that have locations are being listed, let’s begin scraping that data out of the HTML page.

In BeautifulSoup, the HTML attributes of a given tag are found as the attrs attribute of the Tag object. This attribute is a dictionary and it is certain to be present, even if it is empty. The names of the attributes are the keys of this dictionary, and the HTML values are the values.
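
For example, given one of the listing tags found above (the values in the comments come from the earlier transcript):

latitude = listing.attrs['data-latitude']   # u'47.4924143400595'
pid = listing.attrs.get('data-pid', '')     # u'4345117401'; '' if absent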

We’ve already said that there is a certain set of data we want to preserve about each listing. We could create some custom Python class to represent a listing, and perhaps in some situations that would be appropriate, but for this simple script we will just build a dictionary that represents each listing.

Let’s update our extract_listings function to build a dictionary for each listing, and begin by populating it with the location data we extract:

def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        this_listing = {
            'location': location,
        }
        extracted.append(this_listing)
    return extracted

Since the return value of this function has now changed from a list of Tag objects to a list of dictionaries, we will also need to update our __main__ block:

if __name__ == '__main__':
    import pprint                                  # add this import
    if len(sys.argv) > 1 and sys.argv[1] == 'test':
        html, encoding = read_search_results()
    else:
        html, encoding = fetch_search_results(
            minAsk=500, maxAsk=1000, bedrooms=2
        )
    doc = parse_source(html, encoding)
    listings = extract_listings(doc)
    print len(listings)
    pprint.pprint(listings[0])                     # update this line

And now, executing this script at the command line should return the following:

[souptests]
heffalump:souptests cewing$ python scraper.py test
74
{'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'}}
[souptests]
heffalump:souptests cewing$

3c: Extract Price and Size

Again, use your browser devtools to find the container that holds both the price of a listed apartment, and the text that describes its size.

What’s different about these two?

The price data is contained inside a convenient container of its own. The size, however, is not: it is just some text found in the main container after the Tag that holds the price. You can see this by dropping a breakpoint into your extract_listings function (import pdb; pdb.set_trace() just inside the for loop) and inspecting the DOM:

> /Users/cewing/projects/souptests/scraper.py(39)extract_listings()
-> this_listing = {
(Pdb) l2 = listing.find('span', class_='l2')
(Pdb) print l2.prettify()
<span class="l2">
 <span class="price">
  $960
 </span>
 / 3br
 <span class="pnr">
  <small>
   (Seattle98102)
  </small>
  <span class="px">
   <span class="p">
    pic
    <a class="maptag" data-pid="4345117401" href="#">
     map
    </a>
   </span>
  </span>
 </span>
</span>

(Pdb)

If you try to get at that text by using the string attribute of the l2 span tag, you’ll see that it just isn’t there:

(Pdb) print l2.string
None
(Pdb)

Likewise, if you use the text attribute to get all the text in the tag, you end up with more than you really want:

(Pdb) print l2.text
  $960 / 3br -    (Seattle98102)   pic map
(Pdb)

You could parse this string to extract what you want, but why? There’s an easier way.

All text in a DOM document is really contained in instances of the NavigableString class.

We’ve already talked about how this class contains references to the DOM nodes around it. These references allow us to navigate the DOM, moving from one node to the next directly instead of simply searching for what we want. BeautifulSoup supports navigating from node to node in a number of ways (a short example follows this list):

  • into (or down to the next DOM tree level):
    • Tag.children (iterator with immediately contained elements)
    • Tag.descendants (generator returning all contained elements)
  • out (or up to the next DOM tree level):
    • Tag.parent (the tag containing this tag)
    • Tag.parents (generator returning all containers above this tag, closest first)
  • across (or within the same DOM tree level):
    • Tag.next_sibling (the node immediately following this one)
    • Tag.next_siblings (generator returning all nodes at this level after this one)
    • Tag.previous_sibling (the node immediately before this one)
    • Tag.previous_siblings (generator returning all nodes at this level before this one)
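
Here is a minimal, self-contained illustration of sibling and parent navigation (the markup is made up, not taken from craigslist):

from bs4 import BeautifulSoup

snippet = '<p><span class="price">$960</span> / 3br - <small>(area)</small></p>'
doc = BeautifulSoup(snippet, 'html.parser')
price = doc.find('span', class_='price')
print repr(price.next_sibling)  # u' / 3br - '  (a NavigableString)
print price.parent.name         # p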

In this case, that ability can help us a great deal. Looking carefully, you might notice that the text describing the size of an apartment is located just after the span that contains our price. This means we can use navigation methods starting from the span containing the price to get where we want to be:

(Pdb) price_node = listing.find('span', class_='l2').find('span', class_='price')
(Pdb) price_node
<span class="price">$960</span>
(Pdb) price_node.next_sibling
u' / 3br -  '
(Pdb) price_node.next_sibling.strip()
u'/ 3br -'
(Pdb) price_node.next_sibling.strip(' \n-/')
u'3br'
(Pdb)

Type ‘quit’ at your pdb prompt to exit the debugger and then remove the breakpoint from your code.

Now update extract_listings to include the information we’ve just found. Notice that we also pick up each listing’s link and description along the way, by finding the anchor inside the span with class 'pl':

def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        location = {key: listing.attrs.get(key, '') for key in location_attrs}
        link = listing.find('span', class_='pl').find('a')
        price_span = listing.find('span', class_='price')   # add me
        this_listing = {
            'location': location,
            'link': link.attrs['href'],
            'description': link.string.strip(),
            'price': price_span.string.strip(),             # and me
            'size': price_span.next_sibling.strip(' \n-/')  # me too
        }
        extracted.append(this_listing)
    return extracted
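
One caveat: find returns None when nothing matches, so a listing missing the 'pl' link or the price span would crash the loop above with an AttributeError. If you run into that, here is one defensive variant; skipping incomplete listings is my own choice, not part of the exercise:

def extract_listings(parsed):
    location_attrs = {'data-latitude': True, 'data-longitude': True}
    listings = parsed.find_all('p', class_='row', attrs=location_attrs)
    extracted = []
    for listing in listings:
        pl_span = listing.find('span', class_='pl')
        price_span = listing.find('span', class_='price')
        link = pl_span.find('a') if pl_span is not None else None
        if link is None or price_span is None:
            continue  # skip listings missing the pieces we need
        extracted.append({
            'location': {key: listing.attrs.get(key, '')
                         for key in location_attrs},
            'link': link.attrs['href'],
            'description': (link.string or u'').strip(),
            'price': (price_span.string or u'').strip(),
            'size': (price_span.next_sibling or u'').strip(' \n-/'),
        })
    return extracted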

And now executing your script from the command line should show these new elements for a listing:

[souptests]
heffalump:souptests cewing$ python scraper.py test
74
{'description': u'2 BEDROOM 2 BATHROOM Zero Down   Rent with Option to Buy',
 'link': u'/oly/apa/4345117401.html',
 'location': {'data-latitude': u'47.4924143400595',
              'data-longitude': u'-122.235904626445'},
 'price': u'$960',
 'size': u'3br'}
[souptests]
heffalump:souptests cewing$