*******************************************
Scraping Apartment Listings from Craigslist
*******************************************

Work through this exercise to create a Python script that extracts a list of
apartment rentals from Craigslist. While working, use the virtualenv project
we created in class for learning about the BeautifulSoup package:

.. code-block:: bash

    heffalump:~ cewing$ workon souptests
    [souptests] heffalump:souptests cewing$

Begin by creating a new Python file called ``scraper.py``. Open it in your
editor.

Step 1: Fetch Search Results
============================

The first step is to use the ``requests`` library to fetch a set of search
results from the Craigslist site. In order to do so, we will need to assemble
a *query* that fits the search form present on the `Seattle apartment listing
page`_.

.. _Seattle apartment listing page: https://seattle.craigslist.org/apa/

Use your browser's devtools to determine the names of the various inputs
available in the search form on that page. You should end up with a list that
includes at least the following:

* keywords: ``query=keyword+values+here``
* price: ``minAsk=NNN maxAsk=NNN``
* bedrooms: ``bedrooms=N`` (N in range 1-8)

You'll also discover, if you submit a search, that the URL used for a search
request is in fact this:

    http://seattle.craigslist.org/search/apa

Our goal is to write a Python function that will return the search results
HTML from a query to craigslist. This function should:

* accept one keyword argument for each of the possible query values
* build a dictionary of request query parameters from the incoming keywords
* make a request to the craigslist server using this query
* return the body of the response (and its encoding) if there is no error
* raise an error if there is a problem with the response

Here is one possible solution for this query:

.. code-block:: python

    import requests


    def fetch_search_results(
        query=None, minAsk=None, maxAsk=None, bedrooms=None
    ):
        search_params = {
            key: val for key, val in locals().items() if val is not None
        }
        if not search_params:
            raise ValueError("No valid keywords")

        base = 'http://seattle.craigslist.org/search/apa'
        resp = requests.get(base, params=search_params, timeout=3)
        resp.raise_for_status()  # <- raises an HTTPError for 4xx/5xx responses
        return resp.content, resp.encoding

Write the results of your search to a file, ``apartments.html``, so that you
can work on it without needing to hammer the craigslist servers. Then write a
``read_search_results`` function which reads this file from disk and returns
the content and encoding in the same way as the function above. That way you
can switch between the two without altering the API. I leave this exercise to
you.

Step 2: Parse Search Results
============================

Next, we need a function ``parse_source`` to set up the HTML as DOM nodes for
scraping. It will need to:

* Take the response body from the previous function (or some other source)
* Parse it using BeautifulSoup
* Return the parsed object for further processing

This function can be quite simple. Add it to ``scraper.py``:

.. code-block:: python

    # add this import at the top
    from bs4 import BeautifulSoup


    # then add this function lower down
    def parse_source(html, encoding='utf-8'):
        parsed = BeautifulSoup(html, from_encoding=encoding)
        return parsed

In order to see the results we have at this point, we'll need to make our
``scraper.py`` executable by adding a ``__main__`` block:

.. code-block:: python
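
    # one possible read_search_results -- a sketch only, assuming you saved
    # the page as apartments.html next to scraper.py; it mirrors the return
    # value of fetch_search_results so the two stay interchangeable
    def read_search_results():
        with open('apartments.html', 'rb') as infile:
            return infile.read(), 'utf-8'
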
    # add another import at the top
    import sys

    if __name__ == '__main__':
        if len(sys.argv) > 1 and sys.argv[1] == 'test':
            html, encoding = read_search_results()
        else:
            html, encoding = fetch_search_results(
                minAsk=500, maxAsk=1000, bedrooms=2
            )
        doc = parse_source(html, encoding)
        print doc.prettify(encoding=encoding)

Now you can execute your scraper script in one of two ways:

1. ``python scraper.py`` will fetch results directly from Craigslist.
2. ``python scraper.py test`` will use your stored results from file.

Step 3: Extract Listing Information
===================================

You are going to build a function that extracts useful information from each
of the listings in the parsed HTML search results. From each listing, we
should extract the following information:

* Location data (latitude and longitude)
* Source link (to the craigslist detailed listing)
* Description text
* Price and size data

You'll be building this function one step at a time, to simplify the task.

3a: Find Individual Listings
----------------------------

The first job is to find the container that holds each individual listing.
Use your browser's devtools to identify it. Then write a function that takes
in the parsed HTML and returns a list of the apartment listing container
nodes. Call this function ``extract_listings``:

.. code-block:: python

    def extract_listings(parsed):
        listings = parsed.find_all('p', class_='row')
        return listings

If you update your ``__main__`` block to use this new function, you can
verify the results visually:

.. code-block:: python

    if __name__ == '__main__':
        if len(sys.argv) > 1 and sys.argv[1] == 'test':
            html, encoding = read_search_results()
        else:
            html, encoding = fetch_search_results(
                minAsk=500, maxAsk=1000, bedrooms=2
            )
        doc = parse_source(html, encoding)
        listings = extract_listings(doc)  # add this line
        print len(listings)               # and this one
        print listings[0].prettify()      # and this one too

Call your script from the command line (in test mode) to see your results:

.. code-block:: bash

    [souptests] heffalump:souptests cewing$ python scraper.py test
    100

    ...

    [souptests] heffalump:souptests cewing$

If you look through your test listings file using your browser's devtools,
you'll notice that only *some* of them actually have latitude and longitude.
Because the specs for our scraper require this data, we want to filter out
any listings that do not.

``BeautifulSoup`` allows us to filter our searches using HTML attributes with
the ``attrs`` argument. One way of doing this is to provide a specific value
for a given attribute:

.. code-block:: python

    doc.find_all('p', attrs={'data-longitude': "47.4924143400595"})

It should be pretty clear, though, that each of our listings is located in a
different place, and this type of filtering won't really help much. Happily,
you can also provide ``True`` as the value for a given attribute. By doing
so, you are telling ``BeautifulSoup`` that you want to match any node that
**has that attribute**, regardless of its specific value.

Let's use this to enhance our ``extract_listings`` function so that it only
returns the listings that have location attributes:

.. code-block:: python

    def extract_listings(parsed):
        location_attrs = {'data-latitude': True, 'data-longitude': True}
        listings = parsed.find_all('p', class_='row', attrs=location_attrs)
        return listings
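
If you prefer CSS selector syntax, the same filter can also be expressed with
``BeautifulSoup``'s ``select`` method. Treat this as a sketch to verify, not
a drop-in: attribute-selector support varies between ``BeautifulSoup``
versions, so check it against your own install:

.. code-block:: python

    def extract_listings(parsed):
        # '[data-latitude]' matches any tag that has that attribute at all,
        # regardless of its value -- the CSS analogue of attrs={...: True}
        return parsed.select('p.row[data-latitude][data-longitude]')

Either way, the result is the same list of listing nodes; the rest of this
exercise sticks with the ``find_all`` version.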

Calling the script from the command line now should return a slightly
different number of results:

.. code-block:: bash

    [souptests] heffalump:souptests cewing$ python scraper.py test
    74
    ...

    [souptests] heffalump:souptests cewing$

3b: Extract Location Data
-------------------------

You've used the location data to filter results. Now that only results with
locations are being returned, let's begin scraping that data out of the HTML
page.

In ``BeautifulSoup``, the HTML attributes of a given tag are found in the
``attrs`` attribute of the ``Tag`` object. This attribute is a dictionary,
and it is certain to be present, even if it is empty. The names of the HTML
attributes are the keys of this dictionary, and their values are the
dictionary's values.

We've already said that there is a certain set of data we want to preserve
about each listing. We could create a custom Python class to represent a
listing, and perhaps in some situations that would be appropriate, but for
this simple script we will just build a dictionary to represent each listing.

Let's update our ``extract_listings`` function to build a dictionary for each
listing, beginning by populating it with the location data we extract:

.. code-block:: python

    def extract_listings(parsed):
        location_attrs = {'data-latitude': True, 'data-longitude': True}
        listings = parsed.find_all('p', class_='row', attrs=location_attrs)
        extracted = []
        for listing in listings:
            location = {key: listing.attrs.get(key, '') for key in location_attrs}
            this_listing = {
                'location': location,
            }
            extracted.append(this_listing)
        return extracted

Since the return value of this function has now changed from a list of
``Tag`` objects to a list of dictionaries, we will also need to update our
``__main__`` block:

.. code-block:: python

    if __name__ == '__main__':
        import pprint  # add this import
        if len(sys.argv) > 1 and sys.argv[1] == 'test':
            html, encoding = read_search_results()
        else:
            html, encoding = fetch_search_results(
                minAsk=500, maxAsk=1000, bedrooms=2
            )
        doc = parse_source(html, encoding)
        listings = extract_listings(doc)
        print len(listings)
        pprint.pprint(listings[0])  # update this line

And now, executing this script at the command line should return the
following:

.. code-block:: bash

    [souptests] heffalump:souptests cewing$ python scraper.py test
    74
    {'location': {'data-latitude': u'47.4924143400595',
                  'data-longitude': u'-122.235904626445'}}
    [souptests] heffalump:souptests cewing$

3c: Extract Description and Link
--------------------------------

We used the ``find_all`` method of a ``Tag`` above to extract *all* the
listings from our parsed document. We can use the ``find`` method of each
listing to find *the first* item that matches our search filter.
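
For example, with ``listing`` as one of the tags returned by our search (the
output is elided here, since yours will differ):

.. code-block:: pycon

    >>> listing.find_all('a')   # every anchor tag inside this listing
    [<a href="...">...</a>, ...]
    >>> listing.find('a')       # only the first match
    <a href="...">...</a>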

Use your browser's devtools to find the element in each listing that contains
the descriptive text about the listing. What kind of HTML tag is it? What
other useful bit of information is present in that tag?

Use this information to enhance our ``extract_listings`` function so that
each dictionary it produces also contains a ``description`` and a ``link``:

.. code-block:: python

    def extract_listings(parsed):
        location_attrs = {'data-latitude': True, 'data-longitude': True}
        listings = parsed.find_all('p', class_='row', attrs=location_attrs)
        extracted = []
        for listing in listings:
            location = {key: listing.attrs.get(key, '') for key in location_attrs}
            link = listing.find('span', class_='pl').find('a')  # add this
            this_listing = {
                'location': location,
                'link': link.attrs['href'],          # add this too
                'description': link.string.strip(),  # and this
            }
            extracted.append(this_listing)
        return extracted

Note that we are calling the string ``strip`` method on the value we get for
the description from the ``string`` attribute of the ``Tag``. The most
obvious reason is that we don't want extra whitespace. The second reason is
more subtle, but also more important. The values returned by ``string`` are
**not** simple unicode strings. They are actually instances of the
``NavigableString`` class:

.. code-block:: pycon

    >>> listing.find('span', class_='pl').find('a').string.__class__
    <class 'bs4.element.NavigableString'>

These class instances contain not only the text, but also instance attributes
that connect them to the DOM nodes that surround them. Those attributes take
up quite a bit of memory. Calling ``strip``, or casting them with the
``unicode`` type object, converts them to plain unicode strings, saving
memory.

Executing the script from the command line now should show you that you have
succeeded:

.. code-block:: bash

    [souptests] heffalump:souptests cewing$ python scraper.py test
    74
    {'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
     'link': u'/oly/apa/4345117401.html',
     'location': {'data-latitude': u'47.4924143400595',
                  'data-longitude': u'-122.235904626445'}}
    [souptests] heffalump:souptests cewing$

3d: Extract Price and Size
--------------------------

Again, use your browser's devtools to find the container that holds *both*
the price of a listed apartment and the text that describes its size.

What's different about these two? The price data is contained inside a
convenient container of its own. The size, however, is not. It is just some
text found in the main container **after** the ``Tag`` that holds the price.
You can see this by dropping a breakpoint into your ``extract_listings``
function and inspecting the DOM:

.. code-block:: pycon

    > /Users/cewing/projects/souptests/scraper.py(39)extract_listings()
    -> this_listing = {
    (Pdb) l2 = listing.find('span', class_='l2')
    (Pdb) print l2.prettify()
    <span class="l2">
     <span class="price">
      $960
     </span>
     / 3br -
     ...
     (Seattle98102) pic map
     ...
    </span>
    (Pdb)

If you try to get at that text by using the ``string`` attribute of the
``l2`` span tag, you'll see that it just isn't there:

.. code-block:: pycon

    (Pdb) print l2.string
    None
    (Pdb)

Likewise, if you use the ``text`` attribute to get *all* the text in the tag,
you end up with more than you really want:

.. code-block:: pycon

    (Pdb) print l2.text
    $960 / 3br -  (Seattle98102)  pic map
    (Pdb)

You *could* parse this string to extract what you want, but why? There's an
easier way.

All text in a DOM document is really contained in instances of the
``NavigableString`` class. We've already talked about how this class contains
references to the DOM nodes around it. These references allow us to
*navigate* the DOM, moving from one node to the next directly instead of
simply searching for what we want.
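
You can see those references right in the debugger. For example (a sketch,
continuing with the ``l2`` tag from the session above):

.. code-block:: pycon

    (Pdb) text = l2.find('span', class_='price').string
    (Pdb) text
    u'$960'
    (Pdb) text.parent
    <span class="price">$960</span>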

``BeautifulSoup`` supports navigating from node to node in a number of ways:

* **into** (or down to the next DOM tree level):

  * ``Tag.children`` (iterator over the immediately contained elements)
  * ``Tag.descendants`` (generator returning **all** contained elements)

* **out** (or up to the enclosing DOM tree level):

  * ``Tag.parent`` (the tag containing this tag)
  * ``Tag.parents`` (generator returning all containers above this tag,
    closest first)

* **across** (or within the same DOM tree level):

  * ``Tag.next_sibling`` (the node immediately following this one)
  * ``Tag.next_siblings`` (generator returning **all** nodes at this level
    after this one)
  * ``Tag.previous_sibling`` (the node immediately before this one)
  * ``Tag.previous_siblings`` (generator returning **all** nodes at this
    level before this one)

In this case, that ability can help us a great deal. Looking carefully, you
might notice that the text describing the size of an apartment is located
just after the ``span`` that contains our price. This means we can use
navigation methods, starting from the span containing the price, to get where
we want to be:

.. code-block:: pycon

    (Pdb) price_node = listing.find('span', class_='l2').find('span', class_='price')
    (Pdb) price_node
    <span class="price">$960</span>
    (Pdb) price_node.next_sibling
    u' / 3br - '
    (Pdb) price_node.next_sibling.strip()
    u'/ 3br -'
    (Pdb) price_node.next_sibling.strip(' \n-/')
    u'3br'
    (Pdb)

Type ``quit`` at your pdb prompt to exit the debugger, and then remove the
breakpoint from your code. Now update ``extract_listings`` to include the
information we've just found:

.. code-block:: python

    def extract_listings(parsed):
        location_attrs = {'data-latitude': True, 'data-longitude': True}
        listings = parsed.find_all('p', class_='row', attrs=location_attrs)
        extracted = []
        for listing in listings:
            location = {key: listing.attrs.get(key, '') for key in location_attrs}
            link = listing.find('span', class_='pl').find('a')
            price_span = listing.find('span', class_='price')  # add me
            this_listing = {
                'location': location,
                'link': link.attrs['href'],
                'description': link.string.strip(),
                'price': price_span.string.strip(),              # and me
                'size': price_span.next_sibling.strip(' \n-/'),  # me too
            }
            extracted.append(this_listing)
        return extracted

And now executing your script from the command line should show these new
elements for a listing:

.. code-block:: bash

    [souptests] heffalump:souptests cewing$ python scraper.py test
    74
    {'description': u'2 BEDROOM 2 BATHROOM Zero Down Rent with Option to Buy',
     'link': u'/oly/apa/4345117401.html',
     'location': {'data-latitude': u'47.4924143400595',
                  'data-longitude': u'-122.235904626445'},
     'price': u'$960',
     'size': u'3br'}
    [souptests] heffalump:souptests cewing$
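
One last observation: every value we've extracted is text (the price is
``u'$960'``, the latitude is ``u'47.4924143400595'``). If whatever you build
next needs numbers, a small post-processing step will do the conversion. The
helper below is purely illustrative and not part of the exercise:

.. code-block:: python

    def listing_numbers(listing):
        """Return (price, latitude, longitude) as numbers for one listing dict."""
        # strip the leading '$' (and any thousands separator) before converting
        price = int(listing['price'].lstrip('$').replace(',', ''))
        lat = float(listing['location']['data-latitude'])
        lon = float(listing['location']['data-longitude'])
        return price, lat, lon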