****************************
Extracting Data from the Web
****************************

Web Scraping
============

The internet makes a vast quantity of data available, but not always in the
form or combination you want. It can be nice to combine data from different
sources to create *meaning*.

Data Sources
------------

Data online comes in many different formats:

* Simple websites with data in HTML
* Web services providing structured data
* Web services providing transformative services (geocoding)
* Web services providing presentation (mapping)

Let's concentrate on that first class of data, HTML.

HTML Data
=========

Ideally, HTML would be well-formed and strictly correct in its structure:

.. code-block:: html

    <p>A nice clean paragraph</p>
    <p>And another nice clean paragraph</p>

But in fact, it usually ends up looking more like this:

.. code-block:: html