ABSTRACT Typical tasks involved in data wrangling are exhibited by applying the techniques of data acquisition, data extraction and parsing, data quality audit, data transformation and data integration on the Orlando Open Street Map data. Eventually, the processed map data is prepared for storage in a persistent SQL data store for further data exploration and analysis.
Data that are used to support inferential models or exploratory and analytical studies usually do not come tidy. That is, raw and messy data do not arrive in a form that can readily be consumed by downstream processes.
This is especially true for data acquired by unconventional means such as web scraping, or whose elements are of unknown provenance. Extra work has to be done on the data to make them easier to work with and useful for whatever purpose we may need them for later.
The OpenStreetMap (OSM) data is an example of a likely messy dataset. OpenStreetMap makes available user-contributed map data about local roads and businesses. Because information about roads, highways and other map objects is entered by individuals, accuracy issues can occur and can reasonably be expected. This makes an OSM file a very suitable candidate for illustrating the data wrangling process.
Let us begin by loading the Python modules which we will be using in this project.
import codecs
import os
import os.path
import pandas as pd
import pprint
import pymongo
import re
import requests
import sys
import time
import xml.etree.cElementTree as ET
import json
from collections import defaultdict
from IPython.display import display, HTML
The OpenStreetMap data which we will be working on in this project is taken from Orlando, Florida. I chose this place for personal reasons; it is a city with which I am quite familiar.
The Orlando City OSM map file is acquired from the OpenStreetMap Project interactive web application where a user can manually search and select an area of the city for export and download.
There are several ways to acquire the XML OSM map file for a particular location. One method is to download a pre-selected area from MapZen and another method is to use the Overpass API to retrieve a map area tailored for your needs.
I decided to use the latter method in order to control the boundaries of the custom map area within the city of Orlando on which I would like to focus my study.
Moreover, I wanted the freedom to choose a map area large enough to generate a sizeable file, preferably more than 50 MB. A large yet manageable OSM XML file will yield more data elements for analysis.
The get_map_file()
function is used to interface with the Overpass web API application to generate the OSM XML map file that we need for this project.
def get_map_file(api_url, local_filename):
"""
Use the Overpass API to download a custom OSM XML map file
"""
total_size = 79567834 # expected file size in bytes (about 76 MB)
if os.path.exists(local_filename):
if os.path.getsize(local_filename) == total_size:
print 'The file {} already exists and verified. File size: {}'.format( \
local_filename, os.path.getsize(local_filename))
return None
r = requests.get(api_url, stream=True)
with open(local_filename, 'wb') as f:
print "Downloading %s" % local_filename, '...'
dl_size = 0
for chunk in r.iter_content(chunk_size=4096):
if chunk:
f.write(chunk)
dl_size += len(chunk)
done = int(100 * dl_size / float(total_size))
sys.stdout.write("\r[%s%s]" % ('-' * done, ' ' * (100 - done)))
sys.stdout.flush()
print '\nDownload finished. {} is ready.'.format(local_filename)
To use the get_map_file() function, we pass in two parameters: the API_URL parameter is the request URL that contains the bounding box information the Overpass API will use to generate the map file, and the OSM_FILE parameter is a custom file name (and path) that will be assigned to the downloaded map file.
API_URL = 'https://overpass-api.de/api/map?bbox=-81.5120,28.4558,-81.2205,28.6135'
OSM_FILE = 'orlando.osm'
get_map_file(API_URL, OSM_FILE)
Here is a description of the raw XML OSM data file that we acquired from the OpenStreetMap web application.
File Property | Value |
---|---|
Data Source | OpenStreetMap Project API |
Data Format | OSM XML |
File Size | 77.7 MB |
A snapshot of the raw data file, as displayed by the OpenStreetMap Project web application, shows that the file uses UTF-8 character encoding, the dominant internet encoding standard, and that it is marked up according to standard XML specifications. The presence of the osm element's version attribute also hints that OpenStreetMap adheres to its own specialized set of document standards and specifications.
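One quick way to take such a snapshot ourselves is to print the first few lines of the downloaded file (a minimal sketch; the exact output depends on the download):
# Peek at the XML declaration and the root <osm> element of the downloaded file.
with open('orlando.osm') as f:
    for _ in range(4):
        print(f.readline().rstrip())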
Now that we have acquired the data in the form of an XML file we can begin to examine the file superficially paying particular attention to its makeup. By doing this, we can start forming some questions that enhance our understanding of its contents which might then lead to more curious questions later.
def count_tags(filename):
"""
Returns a dictionary of name and count of top-level XML tags
"""
tags_dict = {} # initialize an empty dictionary
# SAX parsing of the file
for event, elem in ET.iterparse(filename, events=('start', )):
if elem.tag not in tags_dict:
tags_dict[elem.tag] = 1
else:
tags_dict[elem.tag] += 1
return tags_dict
Since we know that the data file is in XML format, let us try to discover the top-level elements that comprise the dataset.
The count_tags() function above uses ElementTree's iterparse, a SAX-like (event-driven) way of parsing, to process the map file iteratively. The function returns a dictionary with the unique OSM element names as keys and the number of occurrences of each element as values.
This will give us information about the type and volume of data we can expect to encounter in the map file.
OSM_FILE = 'orlando.osm'
tags = count_tags(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in tags.items()], reverse=True)]
df = pd.DataFrame(sorted_by_count)
df.columns = ['Element', 'Occurrences']
display(HTML("<h4>Types of elements and occurrences</h4>"))
display(HTML(df.to_html(index=False)))
The sorted report allowed us to discover the top-level elements that make up the OSM map data. They are the elements which have only one occurrence: bounds
, meta
, note
, and osm
.
Element | Description |
---|---|
osm | Information about the API application or service that generated the OSM file. |
note | Additional information about the document. |
meta | Usually holds timestamp information about the generated file. |
bounds | Contains the map file bounding box longitude and latitude information. |
The next set of elements worth paying attention to are the node
, way
and relation
elements. OpenStreetMap's conceptual data model of the physical world makes use of these basic components and it is worth knowing what constitutes each of them.
Some of the other primary elements from OpenStreetMap's documentation are briefly described below, followed by a small illustrative XML fragment. It is worthwhile to be familiar with these data elements so we know the types of data that are available to us and where to find them.
Element | Description |
---|---|
Node (node) | A single point on the earth's surface defined by its latitude and longitude. |
Way (way) | An ordered list of nodes that describes a linear feature (such as a road) or, when closed, an area. |
Relation (relation) | A group of nodes, ways and/or other relations that models a logical or geographic relationship between them. |
Tag (tag) | A key-value pair attached to a node, way or relation that describes a specific feature of that element. |
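To make these definitions concrete, here is a small made-up fragment in the OSM XML style, parsed with the same ElementTree module we use throughout; the IDs and values are purely illustrative and are not taken from the Orlando file:
import xml.etree.cElementTree as ET

# A made-up fragment: a node carrying descriptive tags, and a way that
# references two nodes. IDs and values are illustrative only.
sample = '''
<osm version="0.6">
  <node id="1" lat="28.5383" lon="-81.3792">
    <tag k="amenity" v="restaurant"/>
    <tag k="name" v="Example Diner"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>
'''

root = ET.fromstring(sample)
for elem in root:
    print('{} {}'.format(elem.tag, elem.attrib))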
XML attributes provide additional information about the particular element to which they are associated.
The get_attributes()
function will help us discover what and how many attributes exist in our dataset. We shall call this function and sort the results to determine the most frequent attributes that are in use in our data file.
def get_attributes(filename):
"""
Identify and bin each attribute using a dictionary
"""
attrs = {}
for event, elem in ET.iterparse(filename, events=('start', 'end')):
if event == 'end':
for attr in elem.attrib:
if attr not in attrs:
attrs[attr] = 1
else:
attrs[attr] += 1
return attrs
OSM_FILE = 'orlando.osm'
start_time = time.time()
attrs = get_attributes(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in attrs.items()], reverse=True)]
df = pd.DataFrame(sorted_by_count)
df.columns = ['Attribute', 'Occurrences']
display(HTML("<h4>Element Attributes and Occurrences</h4>"))
display(HTML(df.to_html(index=False)))
# print('\n--- %s seconds ---' % (time.time() - start_time))
Of particular interest to us are the k
(key) and v
(value) attributes which are displayed above as appearing 247,303 times in our dataset.
The key and value attributes are members of tag elements and, as described before, a tag describes specific map features of an OSM map element, be it a node, way or relation. A key is unique within each tag element and indicates a topic, category or type of map feature (such as a highway or a name). The value, on the other hand, provides the details of the feature classified by its associated key.
def get_keys(filename):
"""
Collect key attributes and occurrences in a dictionary
"""
keys = {}
for event, elem in ET.iterparse(filename, events=('start', 'end')):
if event == 'end':
key = elem.attrib.get('k')
if key:
if key not in keys:
keys[key] = 1
else:
keys[key] += 1
return keys
The get_keys() function above returns a dictionary mapping each key to its frequency of occurrence in the OSM data file.
# Returns the top five key-occurrences combinations from the OSM file
OSM_FILE = 'orlando.osm'
start_time = time.time()
keys = get_keys(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]
df = pd.DataFrame(sorted_by_count)
df.columns = ['Key', 'Occurrences']
display(HTML("<h4>Key Attributes and Occurrences</h4>"))
display(HTML(df[:5].to_html(index=False)))
# print('\n--- %s seconds ---' % (time.time() - start_time))
While listing the different types of keys or map features available, we notice that the keys come in a variety of forms: simple lowercase keys (e.g. old_name), keys qualified with a colon (e.g. addr:city), and keys using other separators (e.g. addr.place).
)According to the OpenStreetMap documentation regarding keys and values:
keys can be qualified with prefixes, infixes, or suffixes (usually, separated with a colon), forming super- or sub-categories or namespace.
It is useful to know that key attributes can be hierarchical and that they themselves may contain other data structures.
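As a small illustration, a namespaced key can be split on the colon to separate the broad category from its qualifier (a sketch using keys seen in our audit):
# Split a namespaced key into its category and qualifier parts.
for key in ['addr:street', 'addr:city', 'old_name']:
    parts = key.split(':', 1)
    category = parts[0]
    qualifier = parts[1] if len(parts) > 1 else None
    print('{} -> category: {}, qualifier: {}'.format(key, category, qualifier))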
Let us remind ourselves that the purpose of this data audit is to discover potential problems with the raw data with an eye towards improving the data quality before ultimately putting them into persistent storage.
The key_type()
function will check the contents of each tag
element's key attribute (k
) through the use of three regular expressions to verify certain patterns. It will return the number of matches that it finds which will inform us of any potential tags that might require further examination.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
def key_type(element, keys):
"""
Categorize and count each tag element
"""
if element.tag == "tag":
if lower.search(element.attrib['k']):
keys['lower'] += 1
elif lower_colon.search(element.attrib['k']):
keys['lower_colon'] += 1
        elif problem_chars.search(element.attrib['k']):
keys['problem_chars'] += 1
else:
keys['other'] += 1
return keys
def process_map(filename):
keys = {"lower": 0, "lower_colon": 0, "problem_chars": 0, "other": 0}
for _, element in ET.iterparse(filename):
keys = key_type(element, keys)
return keys
We now pass in our Orlando OSM data file as a parameter to the process_map()
function which will iteratively parse through the XML file applying the key_type()
function over the file's records.
OSM_FILE = 'orlando.osm'
keys = process_map(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]
df = pd.DataFrame(sorted_by_count)
df.columns = ['Key Type', 'Occurrences']
display(HTML("<h4>Type of Key Formats and Occurrences</h4>"))
display(HTML(df.to_html(index=False)))
It looks like we are safe because the report did not show any occurrence of problematic keys. We can continue with our data wrangling tasks trusting that we can safely use the key attributes going forward.
Among the key attributes that may be commonly expected in a map file are those that pertain to street names. The tag
element holds the street name information in the value associated with its addr:street
key attribute.
Let us list several addr:street
values to identify errors and untidy data and to get an idea of what type of processing or cleaning we might have to do with those values. The code below helps us do this.
OSM_FILE = 'orlando.osm'
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
def audit_street_type(street_types, street_name):
"""
Use the regex pattern to match street type part of the street name
Args:
street_types(dictionary): Key-value collection of street types and occurrences
street_name(string): The value of a tag pertaining to an address key
"""
m = street_type_re.search(street_name)
if m:
street_type = m.group()
street_types[street_type] += 1
def is_street_name(elem):
"""
Check if the element is a tag and the key attribute is a street information
Args:
elem(dictionary): OSM tag object in key-value pair format
Returns:
bool: Indicates whether contents of dictionary item pertains to a street
"""
return (elem.attrib['k'] == "addr:street")
def print_sorted_dict(d):
"""
Displays the results from iterating through street type dictionary
Args:
d(dictionary): Key-value collection of street types and occurrences
"""
sorted_by_count = sorted([(key, value) for (key, value) in d.items()], reverse=False)
df = pd.DataFrame(sorted_by_count)
df.columns = ['Street Type', 'Occurrences']
display(HTML("<h4>Sample of Street Types and Occurrences</h4>"))
display(HTML(df[:5].to_html(index=False)))
def street_type_audit(file_in):
"""
Wrapper file iterator for the main street type audit routine
Args:
file_in(string): Input map file
Returns:
street_types(dictionary): Key-value collection of street types and occurrences
"""
osm_file = open(file_in, "r")
street_types = defaultdict(int)
for event, elem in ET.iterparse(osm_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_street_name(tag):
audit_street_type(street_types, tag.attrib['v'])
return street_types
if __name__ == '__main__':
start_time = time.time()
street_types = street_type_audit(OSM_FILE)
print_sorted_dict(street_types)
# print('\n--- %s seconds ---' % (time.time() - start_time))
The output of our street type audit shows that some street names end in numbers or number-character patterns (e.g. #303, #F, 136) whose meaning cannot readily be determined. These can be further investigated in a different study.
What we are interested in for this project is noticing the variety of ways a particular street type is expressed in the raw OSM data file. For example, Avenue
which is predominantly used (190 occurrences) has two other variants: Ave
(used 6 times in the file) and Ave.
(with a period, used once).
The discovery of this characteristic in our dataset provides a motivation for our first data cleaning task — we shall seek and update street type data in the street names and make certain street types conform to a uniform set of street type labels that we will specify. For instance, we will replace all occurrences of the terms Ave
and Ave.
with the standard term Avenue
.
The data audit of the street types reveals the need for some work to be done in shaping the raw data into a form that we can use. But we should not undertake the cleaning of the data haphazardly. This is where having a data cleaning plan becomes very useful.
A data cleaning plan pushes us to be organized and methodical in our approach towards data cleaning.
I have prepared the following data cleaning plan for tidying the street names in the Orlando OSM dataset:
1. Audit the street name data to identify errors and inconsistent street type variants.
2. Identify the likely causes of the problems found.
3. Develop and apply operations that correct the data.
4. Re-audit the cleaned output to verify that the corrections were applied.
Procedure number 1 has been accomplished by our recent data audit work on the Orlando map file street names. As for procedure number 2, we can be sure that the variations and other possible problems are mostly attributable to human error.
What follows then is the need to develop some operations that will correct the data (procedure number 3).
The code below defines some helper functions that will verify the street type used in each street name. The idea is to first test the street type if it matches any item in the list of expected street types that we have prepared. A mismatch marks the street type as a candidate for possible replacement.
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
expected = ["Avenue", "Boulevard", "Circle", "Court", "Drive", "Lane",
"Loop", "Parkway", "Place", "Plaza", "Road", "Run", "Street",
"Trail", "Turnpike", "Way"]
mapping = { "pl": "Place",
"st": "Street",
"ave": "Avenue",
"rd": "Road",
"w": "West",
"n": "North",
"s": "South",
"e": "East",
"blvd":"Boulevard",
"sr": "Drive",
"ct": "Court",
"ne": "Northeast",
"se": "Southeast",
"nw": "Northwest",
"sw": "Southwest",
"dr": "Drive",
"sq": "Square",
"ln": "Lane",
"trl": "Trail",
"pkwy": "Parkway",
"ste": "Suite",
"lp": "Loop",
"hwy": "Highway"}
# updated version of audit_street_type
def audit_street_type(street_types, street_name):
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)
def is_street_name(elem):
"""
Check if the element is a tag and the key attribute is a street information
"""
return (elem.attrib['k'] == "addr:street")
def audit_street_names(osmfile):
osm_file = open(osmfile, "r")
street_types = defaultdict(set)
for event, elem in ET.iterparse(osm_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_street_name(tag):
audit_street_type(street_types, tag.attrib['v'])
return street_types
def get_valid_name(name, mapping):
name_list = re.findall(r"[\w']+", name)
end_of_street_name = len(name_list)
for i in range(len(name_list)):
word = name_list[i].lower()
if word in mapping:
end_of_street_name = i
name_list[i] = mapping[word]
name_list = name_list[:(end_of_street_name + 1)]
better_name = ' '.join(name_list)
return better_name
Our primary goal in auditing and cleaning the street type data is to homogenize the frequently used terms, that is, to make them uniform.
Based on the street type audit report we produced, we came up with a set of standard street type terms against which street names will be checked, stored in a Python list called expected.
We also prepared a dictionary (mapping) whose keys are commonly used street type contractions, each paired with the corrected version that we would like to use in the update.
Let us put our helper functions to work. The code below iterates through the OSM map file, gathering every street name whose street type is not in our list of expected street types and storing it in a dictionary keyed on the street type.
The dictionary (street_types) is then iterated over, and each collected street name is passed to get_valid_name(), which produces a corrected version based on our mapping of how each street type should appear. The proposed street name conversion is then displayed for each entry.
start_time = time.time()
OSM_FILE = 'orlando.osm'
street_types = audit_street_names(OSM_FILE)
display(HTML("<h4>Proposed street name conversions</h4>"))
for st_type, ways in street_types.iteritems():
for street_name_in_file in ways:
valid_name = get_valid_name(street_name_in_file, mapping)
if street_name_in_file != valid_name:
print street_name_in_file, "===>", valid_name
print('\n--- %s seconds ---' % (time.time() - start_time))
The output above shows the proposed street name conversions that will take place once we execute the data cleaning operations we have defined. Our efforts to ensure uniformity among the street types identified during the data audit appear to be effective.
The clean_street_names() function that follows implements the data cleaning tasks on the street names when we pass it our Orlando OSM file. The function also generates a separate output file, cleaned_street_name.xml, which is a version of the original OSM file in which the street names have undergone the data transformation.
def clean_street_names(filename, cleaned_filename):
"""
Apply data cleaning on street names
"""
tree = ET.parse(filename)
root = tree.getroot()
for tag in root.findall('*/tag'):
if is_street_name(tag):
name = tag.get('v')
better_name = get_valid_name(name, mapping)
tag.set('v', better_name)
return tree.write(cleaned_filename)
start_time = time.time()
OSM_FILE = 'orlando.osm'
cleaned_street_name = 'cleaned_street_name.xml'
clean_street_names(OSM_FILE, cleaned_street_name)
print('\n--- %s seconds ---' % (time.time() - start_time))
As the last step in this iteration of the street name data cleaning plan (specifically procedure 4), we shall run the get_valid_name()
function on our dataset. However, we will be using our newly-created output file cleaned_street_name.xml
to check whether some street type related issues still persist in our dataset.
I mentioned the word iteration because data cleaning is usually an iterative process. We have to repeat our data cleaning process not only to (1) verify the effectiveness of our data cleaning efforts but also (2) to examine if there is still more cleaning left to be done.
cleaned_street_name = 'cleaned_street_name.xml'
street_types = audit_street_names(cleaned_street_name)
vcount = 0
for st_type, ways in street_types.iteritems():
for name in ways:
better_name = get_valid_name(name, mapping)
if name != better_name:
vcount = vcount + 1
print name, "=>", better_name
if vcount == 0:
display(HTML("<br/><p style=\"color:green;\">SUCCESS: There were no street names to convert.</p>"))
The general audit that we made on the entire Orlando OSM map file informed us of the existence of certain tag elements bearing the phone key label.
Let us explore a sample of the values contained in these phone tag elements to get an idea of whether or not they require further processing or cleaning. Again, we are doing this with the goal of homogenizing our data whenever it is warranted and applicable.
OSM_FILE = 'cleaned_street_name.xml'
display(HTML("<h4>Sample Phone Numbers</h4>"))
count_max = 10
count_number = 0
for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
if element.tag == "tag":
if element.attrib['k'] == 'phone' \
and count_number < count_max:
print(element.attrib['v'])
count_number += 1
print '\nTotal count of sample phone numbers:', count_number
The phone numbers take on different formats. This is expected because the individuals who volunteered to input the data may have different ideas on how a phone number should be entered.
Our data stores the phone numbers in the following common ways:
+1-234-567-8901
123-456-7890
1234567890
(123) 456-7890
The design for dealing with the tag elements containing the phone numbers in the OSM file is as follows:
1. Count the phone number entries in the file to gauge the scope of the work.
2. Define a uniform target format for the phone numbers.
3. Audit how many entries already adhere to the target format.
4. Transform the non-conforming entries and re-audit the output to verify the result.
To give us an idea of the number of phone numbers the map file actually contains, we will re-use the code we ran earlier.
OSM_FILE = 'cleaned_street_name.xml'
count_number = 0
for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
if element.tag == "tag":
if element.attrib['k'] == 'phone':
count_number += 1
print '\nTotal count of phone numbers:', count_number
Next, we will define our goal for cleaning the portion of the data that pertains to phone numbers. Cleaning the phone numbers essentially means making all the values adhere to a uniform phone number format. For our situation, that format retains only the digits, stripping all symbols and inner white space (e.g. (407) 123-4567 becomes 4071234567).
This policy is operationalized through the use of regular expression patterns.
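As a quick illustration of the intended transformation (a sketch only; the actual cleaning function appears later), stripping every non-digit character and keeping the last ten digits yields the uniform format:
import re

# Strip all non-digit characters and keep the last ten digits.
samples = ['+1-234-567-8901', '123-456-7890', '1234567890', '(123) 456-7890']
for phone in samples:
    print('{} -> {}'.format(phone, re.sub(r'\D', '', phone)[-10:]))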
Let us now find out how many phone numbers actually adhere to our uniform format. The phone numbers can be classified into two groups, numbers_only and other_format (shown below with examples):
Group | Example |
---|---|
numbers_only | 1234567890 |
other_format | +1-234-567-8901, (123) 456-7890 |
strictly_numbers = re.compile(r'^\d+$')  # anchored so only values made up entirely of digits match
def count_phone_types(element, phone_types):
if element.tag == "tag":
if element.attrib['k'] == 'phone':
if strictly_numbers.match(element.attrib['v']):
phone_types['numbers_only'] += 1
else:
phone_types['other_format'] += 1
return phone_types
def process_map_phone_types(filename):
phone_types = {"numbers_only": 0, "other_format": 0}
for _, element in ET.iterparse(filename, ('start', 'end')):
phone_types = count_phone_types(element, phone_types)
return phone_types
display(HTML("<h4>Phone Types and Occurrences</h4>"))
pprint.pprint(process_map_phone_types(cleaned_street_name))
We will now clean the telephone data, transforming the values of the phone number elements into the numbers-only format. This format lends itself well to building a uniformly formatted set of phone numbers.
The clean_phone_number()
function implements the data cleaning task required to standardize the telephone number format by applying a regular expression substitution method on the OSM file that has recently undergone street type data cleaning from a previous cleaning plan.
def clean_phone_number(filename, cleaned_filename):
"""
Apply the data cleaning process on OSM file
Args:
filename(string): Name of uncleaned file
cleaned_filename(string): Name of file after being processed
"""
tree = ET.parse(filename)
root = tree.getroot()
for tag in root.findall('*/tag'):
if tag.attrib['k'] == 'phone':
phone_number = tag.get('v')
if not strictly_numbers.match(phone_number):
                update_phone_number = re.sub(r'\D+', '', phone_number)[-10:]  # strip non-digits, keep the last ten digits
tag.set('v', update_phone_number)
return tree.write(cleaned_filename)
Let us use the cleaned_street_name.xml
OSM file and the name of the file that will hold the results of the data cleaning process (cleaned_phone_number.xml
) as arguments to the clean_phone_number()
function.
Right after generating the cleaned_phone_number.xml output file, we will pass this file as an argument to the process_map_phone_types() function to check the results of our data cleaning efforts.
cleaned_street_name = 'cleaned_street_name.xml'
start_time = time.time()
cleaned_phone_number = 'cleaned_phone_number.xml'
clean_phone_number(cleaned_street_name, cleaned_phone_number)
display(HTML("<h4>Data cleaning of phone number results</h4>"))
pprint.pprint(process_map_phone_types(cleaned_phone_number))
# print('\n--- %s seconds ---' % (time.time() - start_time))
The function returns a zero under the other_format
group which means that the phone numbers have been properly transformed to only hold numbers.
The output of the telephone data cleaning process, cleaned_phone_number.xml
, is going to be used in the next phase of our data cleaning efforts.
When I set out to get the Orlando dataset from the Open Street Map project's interactive web application, I did so by arbitrarily drawing a square region on top of the area that encompasses much of Orlando. I understand that this may not be precise which means that I may have grabbed some locations that are actually outside the jurisdiction of Orlando.
In this section of the data cleaning project, we will have the opportunity to be more precise by detecting and keeping only those data whose location information belongs to the city of Orlando. We are going to do this using the ZIP or postal code information that is stored in the file.
An interesting part of the Open Street Map file is the set of postal code values associated with the addr:postcode key. To get an idea of the various postal codes that are in the data, as well as the number of data elements associated with each unique postal code, we have developed the count_postcodes() function.
OSM_FILE = 'cleaned_street_name.xml'
def count_postcodes(filename):
"""
Extract postal codes and generate frequencies
Args:
filename(string): Current OSM XML file that is currently
going through the data cleaning pipeline
Returns:
postcodes(dictionary): Post code and frequency dictionary
"""
postcodes = {}
for event, elem in ET.iterparse(filename, events=('start', 'end')):
if event == 'end':
key = elem.attrib.get('k')
if key == 'addr:postcode':
postcode = elem.attrib.get('v')
if postcode not in postcodes:
postcodes[postcode] = 1
else:
postcodes[postcode] += 1
return postcodes
start_time = time.time()
postcodes = count_postcodes(OSM_FILE)
sorted_by_occurrence = [(k, v) for (v, k) in sorted([(value, key)
for (key, value)
in postcodes.items()], reverse=True)]
df = pd.DataFrame(sorted_by_occurrence)
df.columns = ['Postal Code', 'Frequency']
display(HTML("<h4>Sample Postal Codes and Frequencies</h4>"))
display(HTML(df[:5].to_html(index=False)))
The preliminary audit results tell us that most of the postal codes tend to cluster around a small group of numbers (32835, 32789, 32803, 32806, 32801, 32839, 32804, 32822, 32819, 32792 and 32809).
It also shows that there are a few postcodes that were entered using a different format (e.g. 32819-7827 and the like) as well as erroneous entries (e.g. 'FL 32803'). Data cleaning sometimes involves manual corrections and we can deal with these few errant entries programmatically using the fix_select_zipcode()
function that we have prepared. The function will replace these select post codes with the proper format or data based on the original values.
Looking into the list of postal codes which our data audit revealed, it is possible that some of the postal codes (and their associated entries) are not actually valid for our purposes. We therefore need to distinguish the codes that are within the jurisdiction of the city of Orlando from those that lie outside it.
In order to audit the addr:postcode
field for accuracy, we are going to need some means of programmatically processing it and comparing the values that we encounter to some set of data that we can reliably assume is correct.
The gold standard data that we are going to use for this project is the set of postal codes from USPS which is corroborated by zip-codes.com. There is a total of 59 postal codes ranging from 32801 through 32899.
We are now able to perform another data audit given this gold standard data in hand. The following code will tell us how many node
, way
and relation
elements are associated with postal codes (addr:postcode
) that belong to the city of Orlando (orlando_postcodes
) and those that do not (other_postcodes
). It does so by iterating over the addr:postcode
key values and comparing each with a list of official Orlando postal codes retrieved from a reliable source.
OSM_FILE = 'cleaned_street_name.xml'
valid_orlando_postcodes = [
'32801','32802','32803','32804','32805','32806','32807','32808','32809',
'32810','32811','32812','32814','32815','32816','32817','32818','32819',
'32820','32821','32822','32824','32825','32826','32827','32828','32829',
'32830','32831','32832','32833','32834','32835','32836','32837','32839',
'32853','32854','32855','32856','32857','32858','32859','32860','32861',
'32862','32867','32868','32869','32872','32877','32878','32885','32886',
'32887','32891','32896','32897','32899']
def get_postcode(elem):
"""
Extracts the postal code value from the dictionary item
Args:
elem(Object): OSM element object
Returns"
bool: The element is a postal code element
value(string): The postal code that is inside the dictionary
"""
if elem.tag in ['node', 'way', 'relation']:
for tag in elem.iter():
if tag.get('k') == 'addr:postcode':
return True, tag.get('v')
return False, None
return False, None
def fix_select_zipcode(zipcode):
"""
Converts the postal code to the correct format
Args:
zipcode(string): postal code value
Returns"
new_zip(string): corrected postal code value
"""
new_zip = ''
if zipcode == 'FL 32803':
new_zip = '32803'
elif zipcode == 'FL 32792':
new_zip = '32792'
elif zipcode == '32819-7827':
new_zip = '32819'
elif zipcode == '32817-4427':
new_zip = '32817'
elif zipcode == '32805-1100':
new_zip = '32805'
    elif zipcode == '32803-4806':
        new_zip = '32803'
    elif zipcode == '32792-6801':
        new_zip = '32792'
else:
new_zip = zipcode
return new_zip
def process_map(filename):
"""
Accumulates postal codes that belong or do not belong to Orlando
Args:
filename(string): OSM map file name
Returns:
number_types(dictionary): Dictionary bins postal codes and counts
"""
number_types = {"orlando_postcodes": 0, "other_postcodes": 0}
tree = ET.parse(filename)
root = tree.getroot()
for child in ['node', 'way', 'relation']:
for element in root.findall(child):
has_postcode, postcode_value = get_postcode(element)
if has_postcode:
postcode_value = fix_select_zipcode(postcode_value)
if postcode_value in valid_orlando_postcodes:
number_types['orlando_postcodes'] += 1
else:
number_types['other_postcodes'] += 1
return number_types
display(HTML("<h4>Postal codes by jurisdiction</h4>"))
pprint.pprint(process_map(OSM_FILE))
Our project's stated scope and limitations permit us to discard the node elements (64 of them) which hold postal codes outside the city of Orlando.
The design for dealing with the tag elements containing the postal codes in the workflow OSM file is as follows:
1. Audit the postal codes to discover their formats and frequencies.
2. Correct the small number of malformed entries using the fix_select_zipcode() function.
3. Remove the node, way and relation elements whose postal codes are not valid Orlando postal codes.
4. Re-audit the resulting file to verify that only Orlando postal codes remain.
Continuing on with procedure number 3 from above, we will take the workflow OSM data file that had undergone street type data cleaning (cleaned_street_name.xml
) and pass it as one of the arguments to the clean_postcode()
function. The other argument would be the cleaned_postcode
variable which contains the arbitrary name of the file that will hold the results of our postal code data cleaning process.
OSM_FILE = 'cleaned_street_name.xml'
def clean_postcode(filename, cleaned_filename):
tree = ET.parse(filename)
root = tree.getroot()
for child in ['node', 'way', 'relation']:
for elem in root.findall(child):
            has_postcode, postcode_value = get_postcode(elem)
            if has_postcode:
                # Apply the same manual corrections used during the audit
                # before deciding whether to discard the element
                postcode_value = fix_select_zipcode(postcode_value)
                if postcode_value not in valid_orlando_postcodes:
                    root.remove(elem)
return tree.write(cleaned_filename)
start_time = time.time()
cleaned_postcode = 'cleaned_postcode.xml'
clean_postcode(OSM_FILE, cleaned_postcode)
print('\n--- %s seconds ---' % (time.time() - start_time))
The clean_postcode()
function visits each set of node
, way
or relation
nodes from the OSM file and tries to find and extract any available addr:postcode
value. It then performs a test and decides whether to discard the node that contains the postal code. This decision is based on whether or not the postal code can be found in the standard Orlando postal code list.
CLEANED_POSTCODE_FILE = 'cleaned_postcode.xml'
display(HTML("<h4>Postal codes by jurisdiction (after data cleaning)</h4>"))
pprint.pprint(process_map(CLEANED_POSTCODE_FILE))
Finally, the code above implements the last step of the postal code data cleaning plan when it resubmits the newly generated OSM file as an argument to the process_map()
function. This task will show the number of nodes that contain invalid postal codes. If the function reports no remaining invalid postal codes, then we are through with the data cleaning tasks that we set out to tackle for the purposes of this project.
Our OSM file has undergone three data cleaning processes producing the final file, cleaned_postcode.xml
, which we load into persistent storage to facilitate further data analysis.
This section of the project would like to focus on the variety of data analysis options which become available once the dataset has been loaded into a relational database like SQLite. Thus, I shall briefly outline the work that I have done to import the cleaned OSM file into the database.
1. The final cleaned map file (cleaned_postcode.xml) is used as input to a Python application that I developed, which parses and segregates the primary nodes and tags into their own CSV files, namely: nodes.csv, nodes_tags.csv, ways.csv, ways_nodes.csv and ways_tags.csv.
2. I then created the SQLite database (orlando_osm.db) and the tables that will hold the OSM data. The tables were given names that match the CSV files with which they are populated. A sketch of two of the table definitions follows this list.
3. The CSV files were imported into their corresponding tables from the sqlite3 command shell:
.mode csv
.import nodes.csv nodes
.import nodes_tags.csv nodes_tags
.import ways.csv ways
.import ways_nodes.csv ways_nodes
.import ways_tags.csv ways_tags
4. Stray carriage return characters were then removed from the imported timestamp attributes:
UPDATE nodes SET timestamp = REPLACE(timestamp, X'0D', '');
UPDATE ways SET timestamp = REPLACE(timestamp, X'0D', '');
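As referenced in the list above, here is a sketch of what two of the table definitions might look like. The column names are inferred from the queries used later in this section, so treat the exact types and layout as an assumption rather than the precise DDL that was run:
import sqlite3

# Create two of the tables; the columns mirror those queried later on.
conn = sqlite3.connect('orlando_osm.db')
cur = conn.cursor()
cur.execute('''
    CREATE TABLE IF NOT EXISTS nodes (
        id INTEGER PRIMARY KEY,
        lat REAL,
        lon REAL,
        user TEXT,
        uid INTEGER,
        version TEXT,
        changeset INTEGER,
        timestamp TEXT
    )
''')
cur.execute('''
    CREATE TABLE IF NOT EXISTS nodes_tags (
        id INTEGER REFERENCES nodes (id),
        key TEXT,
        value TEXT,
        type TEXT
    )
''')
conn.commit()
conn.close()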
Let us verify the results of the CSV file imports. The sqlite3
Python module allows us to connect and query our SQLite orlando_osm.db
database file.
File Name | File Size |
---|---|
Original Orlando OSM file | 75.8 MB |
nodes.csv | 24.4 MB |
nodes_tags.csv | 1.26 MB |
ways.csv | 2.98 MB |
ways_nodes.csv | 9.29 MB |
ways_tags.csv | 8.00 MB |
Notice that in the code that follows we will be using the execute_query() function throughout as a convenient way of executing the various SQL queries. In this fashion, we only need to alter the query itself, which is contained in the string variable called QUERY.
Among the first things we would like to know right after loading our file into the database is how many records were loaded into each table. The following code will provide us this information.
import sqlite3
import pandas as pd
from collections import defaultdict
from IPython.display import display, HTML
DATABASE = 'orlando_osm.db'
QUERY = '''
SELECT 'nodes' AS nodes, COUNT(*)
FROM nodes
UNION
SELECT 'nodes_tags' AS nodes_tags, COUNT(*)
FROM nodes_tags
UNION
SELECT 'ways' AS ways, COUNT(*)
FROM ways
UNION
SELECT 'ways_nodes' AS ways_nodes, COUNT(*)
FROM ways_nodes
UNION
SELECT 'ways_tags' AS ways_tags, COUNT(*)
FROM ways_tags
'''
def create_connection(db_file):
"""
Create a database connection to the SQLite database
"""
try:
conn = sqlite3.connect(db_file)
return conn
    except sqlite3.Error as e:
print(e)
return None
def execute_query(db, query):
"""
Returns the query results as a list
"""
conn = create_connection(db)
with conn:
cur = conn.cursor()
cur.execute(query)
return cur.fetchall()
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['Table Name', 'Count']
display(HTML(df.to_html(index=False, header=True)))
The tables that have the greatest share of the Open Street Map records are the nodes
, ways_nodes
and ways_tags
tables.
The query below returns the first five records of the nodes table. The interesting attributes of this table are the longitude and latitude coordinates, which confirm what a node stands for: an Open Street Map node is a particular point on the earth's surface. The node also contains the user name of the contributing person.
import sqlite3
import pandas as pd
DATABASE = 'orlando_osm.db'
QUERY = '''
SELECT id, lat, lon, user, uid, timestamp
FROM nodes
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['nodes_id', 'latitude', 'longitude', 'user', 'user_id', 'timestamp']
display(HTML(df.to_html(index=False, header=True)))
The nodes_tags table provides additional information about the node elements of the Open Street Map dataset. The information consists of key-value pairs describing road structures such as highway traffic signals, crossings, etc.
QUERY = '''
SELECT id, key, value
FROM nodes_tags
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['nodes_id', 'key', 'value']
display(HTML(df.to_html(index=False, header=True)))
The Open Street Map data dictionary specification for the ways element describes it as an ordered list of nodes. By itself, our ways table consists of metadata pertaining to the element. The records in this table become useful when paired with the ways_nodes junction table, which lets us reach the many side of the one-to-many relationship between the ways entities and the various nodes entities with which they are associated.
QUERY = '''
SELECT id, user, uid, version, changeset, timestamp
FROM ways
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'user', 'user_id', 'version', 'changeset', 'timestamp']
display(HTML(df.to_html(index=False, header=True)))
As introduced earlier, the ways_nodes
table concretizes the one-many relationship between the ways
and nodes
entities. A way
is seen as a collection of node elements.
QUERY = '''
SELECT id, node_id, position
FROM ways_nodes
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'nodes_id', 'position']
display(HTML(df.to_html(index=False, header=True)))
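For instance, joining ways_nodes with the nodes table recovers the ordered coordinates of the nodes that make up each way. A small sketch of such a query (limited to the first few rows, and assuming the tables are populated as above) follows:
QUERY = '''
SELECT wn.id, wn.position, n.lat, n.lon
FROM ways_nodes wn
JOIN nodes n ON n.id = wn.node_id
ORDER BY wn.id, wn.position
LIMIT 10;
'''
rows = execute_query(DATABASE, QUERY)
# Each row gives the way, the node's position within the way, and the node's coordinates
df = pd.DataFrame(rows)
df.columns = ['ways_id', 'position', 'latitude', 'longitude']
display(HTML(df.to_html(index=False, header=True)))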
Just like the nodes_tags
record, the ways_tags
record keeps descriptive information about its parent element in the form of key-value pairs (e.g. {building: apartments}
, {highway: pedestrian}
, {landuse: grass}
).
QUERY = '''
SELECT id, key, value, type
FROM ways_tags
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'key', 'value', 'type']
display(HTML(df.to_html(index=False, header=True)))
Now that we have been briefly acquainted with the tables in our OSM database, we can start exploring it to glean some insights from the dataset.
The Open Street Map Project is what can be called a living dataset in that it relies on its users who create, edit and update the data that make up the map system. The query that follows would provide information about the number of unique users who have volunteered to generate and maintain our special OSM map file for Orlando.
QUERY = '''
SELECT COUNT(DISTINCT(U1.uid))
FROM (SELECT uid
FROM nodes
UNION ALL
SELECT uid
FROM ways) U1
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['unique_uid_count']
display(HTML("<h4>Number of Unique Contributors</h4>"))
display(HTML(df.to_html(index=False, header=True)))
There are 709 unique users; who among them has contributed the most? The query below lists the ten most prolific contributors along with their number of contributions and their percentage share of the total contributions in our dataset.
QUERY = '''
SELECT x2.user, x2.cnt, round(((x2.cnt*1.0/x2.tot)*100.0),0) as pct
FROM (
SELECT
user,
cnt,
(SELECT COUNT(*)
FROM (
SELECT user
FROM nodes
UNION ALL
SELECT user
FROM ways
)
) as tot
FROM (
SELECT
U2.user,
COUNT(*) AS cnt
FROM (SELECT user
FROM nodes
UNION ALL
SELECT user
FROM ways) U2
GROUP BY U2.user
)
) x2
order by x2.cnt DESC
LIMIT 10;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['user_id','num_contributions','pct_of_total']
display(HTML("<h4>Top 10 Orlando Open Street Map Contributors</h4>"))
display(HTML(df.to_html(index=False, header=True)))
I would like to revisit the postal codes which were the subject of our data audit and data cleaning tasks a brief while ago.
Which postal codes had the highest frequency of use in our Orlando Open Street Map file?
QUERY = '''
SELECT A1.value, COUNT(*)
FROM
(
SELECT X1.value
FROM ways_tags X1
WHERE X1.key LIKE '%postcode%'
UNION ALL
SELECT Y1.value
FROM nodes_tags Y1
WHERE Y1.key LIKE '%postcode%'
) A1
GROUP BY A1.value
ORDER BY COUNT(*) DESC
'''
rows = execute_query(DATABASE, QUERY)
pcodes1_df = pd.DataFrame(rows)
pcodes1_df.columns = ['postal_code', 'count']
display(HTML(pcodes1_df.to_html(index=False, header=True)))
The top five open street map contributions where a postal code was used came from the following areas:
Postal Code | Record Count | Area Description |
---|---|---|
32835 | 36 | The southwest / MetroWest area |
32803 | 31 | Colonialtown-Audubon Park area |
32806 | 29 | Delaney Park area |
32801 | 24 | Downtown Orlando area |
32839 | 24 | Millenia Mall, Americana Blvd area |
These areas are among the busiest places in the city so it makes sense that the OSM contributions reflect this reality. The bar plot below provides a visualization of the post code frequencies that can be found in our dataset.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
QUERY = '''
SELECT X1.value
FROM ways_tags X1
WHERE X1.key LIKE '%postcode%'
UNION ALL
SELECT Y1.value
FROM nodes_tags Y1
WHERE Y1.key LIKE '%postcode%'
'''
rows = execute_query(DATABASE, QUERY)
pcodes2_df = pd.DataFrame(rows)
pcodes2_df.columns = ['postal_code']
fig, ax = plt.subplots(figsize=(14,6))
plt.title("Number of OSM Elements with Post Code Data")
sns.set(style="whitegrid", color_codes=True)
g = sns.countplot(x="postal_code",
data=pcodes2_df,
palette="RdYlBu",
order = pcodes2_df['postal_code'].value_counts().index)
g.set_xticklabels(g.get_xticklabels(),rotation=90);
Among the things that I would like to explore from the Orlando Open Street Map dataset is information that would interest a foodie like me.
I would like to know how many nodes
and ways
records hold information about restaurants.
QUERY = '''
SELECT 'nodes_tags', COUNT(*)
FROM nodes_tags
WHERE key == 'amenity' AND value LIKE '%restaurant%'
UNION ALL
SELECT 'ways_tags', COUNT(*)
FROM ways_tags
WHERE key == 'amenity' AND value LIKE '%restaurant%'
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['Table', 'Restaurants']
display(HTML(df.to_html(index=False, header=True)))
Building on what we have so far, let us associate each node or way tag element with its postal code. We intend to gather some information about restaurants from each of these collections for use later.
QUERY = '''
SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
FROM ways_tags X1
JOIN ways_tags X2 on X1.id = X2.id
WHERE X1.key LIKE '%postcode%'
AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
UNION
SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
FROM nodes_tags Y1
JOIN nodes_tags Y2 ON Y1.id = Y2.id
WHERE Y1.key LIKE '%postcode%'
AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%');
'''
rows = execute_query(DATABASE, QUERY)
rest1_df = pd.DataFrame(rows)
rest1_df.columns = ['source', 'id', 'type_1', 'postal_code', 'type_2', 'value_2']
display(HTML(rest1_df.to_html(index=False, header=True)))
The code above identifies the nodes and ways tag elements that pertain to restaurants as well as their associated postal codes. We could use this information to determine the number of restaurants that are located within a particular postal code.
QUERY = '''
SELECT A1.value, COUNT(*)
FROM
(
SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
FROM ways_tags X1
JOIN ways_tags X2 on X1.id = X2.id
WHERE X1.key LIKE '%postcode%'
AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
UNION ALL
SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
FROM nodes_tags Y1
JOIN nodes_tags Y2 ON Y1.id = Y2.id
WHERE Y1.key LIKE '%postcode%'
AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%')
) A1
GROUP BY value;
'''
rows = execute_query(DATABASE, QUERY)
rest2_df = pd.DataFrame(rows)
rest2_df.columns = ['postal_code', 'restaurants']
display(HTML(rest2_df.to_html(index=False, header=True)))
It is quite interesting to notice that our query resulted in only 18 restaurants and yet we know from a previous query that there were 147 ways and nodes entries pertaining to restaurant
.
This is because these are the only node and ways elements that had the accompanying postal code information when the OSM record was created. Out of the 147 OSM contributions that pertain to restaurant
, OSM users gave additional postal code information on only 18 of them which leaves us with 129 restaurant
records with missing postal codes.
This situation is understandable, and not unexpected, owing to the voluntary nature of the Open Street Map system. Users are not required to enter additional information that might otherwise be relevant.
Let's return to our discussion of the restaurant dataset. It is more useful to visualize what is going on with the restaurant and postal code dataset by putting them in a bar plot.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(11,6))
plt.title("Number of Restaurants by Postal Code", fontsize=16)
plt.ylabel('Count',fontsize=16)
plt.xlabel('Postal Code', fontsize=16)
sns.set(style="whitegrid", color_codes=True)
sns.countplot(x="postal_code", data=rest1_df,
palette="muted",
order = rest1_df['postal_code'].value_counts().index);
It is quite revealing that, even with our limited number of query results, the data still supports a reasonable picture of which areas in Orlando hold a bigger share of the restaurant market.
Postal code 32806 represents the Delaney Park area which is about a mile southeast of downtown Orlando. Numerous restaurants in the area have taken advantage of workers coming from the three big businesses there: the Orlando Regional Medical Center, Winnie Palmer Hospital for Children and Orlando's Amtrak Station.
Postal code 32801 is the main downtown Orlando area. Perhaps it does not need much to explain how restaurants like Le Gourmet Break, The Boheme, Fratelli's, and numerous others exist to cater to the customers and workers who work and do leisure downtown.
Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.
Next, I would like to explore the variety of dishes being served in restaurants. The Orlando Open Street Map dataset has entries with cuisine as the key, which is used in both nodes_tags and ways_tags elements.
I encountered some values that had a cuisine
key that I thought would need further data cleaning. For example, there were multivalued field items that have been put into one field like mediterranean;mexican;lebanese
and oriental;thai;sushi
. For the purposes of this study, I have excluded these ambiguous data items.
We would like to further qualify our query to exclude fast food items such as burger, donut, pizza, chicken sandwich, etc.
QUERY = '''
SELECT tags.value, COUNT(*) as count
FROM (SELECT key,
CASE value
WHEN 'chinese' then 'asian'
ELSE value
END as value
FROM nodes_tags
UNION ALL
SELECT key,
CASE value
WHEN 'chinese' then 'asian'
ELSE value
END as value
FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
AND tags.value NOT IN ('coffee_shop', 'sandwich',
'pizza','barbecue,_american', 'chicken_sandwich',
'chocolate', 'coffee_and_donut_shop', 'doughnut',
'ice_cream;italian_pizza;noodles', 'subs','regional',
'chicken', 'burger', 'mediterranean;mexican;lebanese',
'oriental;thai;sushi')
GROUP BY tags.value
ORDER BY count DESC;
'''
rows = execute_query(DATABASE, QUERY)
cuis1_df = pd.DataFrame(rows)
cuis1_df.columns = ['Cuisine', 'Frequency']
display(HTML(cuis1_df.to_html(index=False, header=True)))
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
QUERY = '''
SELECT value
FROM (SELECT key,
CASE value
WHEN 'chinese' then 'asian'
ELSE value
END as value
FROM nodes_tags
UNION ALL
SELECT key,
CASE value
WHEN 'chinese' then 'asian'
ELSE value
END as value
FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
AND tags.value NOT IN ('coffee_shop', 'sandwich',
'pizza','barbecue,_american', 'chicken_sandwich',
'chocolate', 'coffee_and_donut_shop', 'doughnut',
'ice_cream;italian_pizza;noodles', 'subs','regional',
'chicken','burger', 'mediterranean;mexican;lebanese',
'oriental;thai;sushi');
'''
rows = execute_query(DATABASE, QUERY)
cuis2_df = pd.DataFrame(rows)
cuis2_df .columns = ['value']
fig, ax = plt.subplots(figsize=(12,6))
plt.title("Number of OSM elements by Cuisine")
sns.set(style="whitegrid", color_codes=True)
g2 = sns.countplot(x="value", data=cuis2_df, palette="Set2"
,order = cuis2_df['value'].value_counts().index)
g2.set_xticklabels(g2.get_xticklabels(),rotation=90);
The American cuisine tops the list from among the captured cuisine
data in the Orlando OSM dataset. This is closely followed by the Mexican and Asian cuisines.
We tried to retrieve and present information about restaurants and their postal codes earlier, but that work did not produce many results because of the scarcity of user-provided postal code data for the associated restaurants.
However, we will not give up so easily because there is another way of determining a similar feature such as the location for each restaurant. And this one is more precise.
It involves using the longitude and latitude values which we know are present for each node element. I start off with our basic nodes_tags and ways_tags tables, which both contain information about which node and way elements pertain to restaurants (i.e. {key: amenity, value: restaurant}), and join them with the main nodes database table via the id attribute. Each nodes record contains the latitude and longitude data for a node.
import sqlite3
import pandas as pd
import pprint
from collections import defaultdict
DATABASE = 'orlando_osm.db'
def convert_to_restaurant_list(df):
    resto_list = []
    for index, row in df.iterrows():
        # folium expects coordinates as [latitude, longitude]
        location = [row['lat'], row['lon']]
        rname = row['name'].encode("utf-8")
        rname = rname.replace("'", "")
        resto_list.append({'loc': location, 'label': rname})
    return resto_list
QUERY = '''
SELECT DISTINCT nodes.lat, nodes.lon, N1.value
FROM (
SELECT *
FROM nodes_tags
UNION ALL
SELECT *
FROM ways_tags
) b
JOIN nodes ON b.id = nodes.id
JOIN (
SELECT id, key, value
FROM nodes_tags
WHERE key = 'name'
) N1 ON N1.id = nodes.id
WHERE b.key = 'amenity' AND b.value = 'restaurant';
'''
rows = execute_query(DATABASE, QUERY)
resto_df = pd.DataFrame(rows)
resto_df.columns = ['lat','lon','name']  # match the SELECT order: lat, lon, name
display(HTML(resto_df[:10].to_html(index=False, header=True)))
resto_list = convert_to_restaurant_list(resto_df)
Notice that the name of the restaurant is also displayed in the query result set. I was able to include this by making another join to the nodes_tags records using the node ID reference keys.
Our query returned a dataframe that contains the map coordinates and names of the restaurants in our dataset. These are the components that we can use to visualize the data points on an interactive map using the folium module. But before we can do so, I developed the convert_to_restaurant_list() function to convert our dataframe into a structured collection (a list of dictionaries) that folium needs to generate an interactive map.
%matplotlib inline
import folium
width, height = 650, 650
restaurant_map = folium.Map(location=[28.539714, -81.379414], zoom_start=11,
tiles='OpenStreetMap')
# Add markers to the map
for restaurant in resto_list:
marker = folium.Marker(location=restaurant['loc'], popup=restaurant['label'])
marker.add_to(restaurant_map)
restaurant_map
It is quite satisfying to see how our data visualization reveals and corroborates the informed assumption that I made earlier when we explored the number of restaurants within the jurisdiction of various postal codes, to wit:
Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.
The southwest portion of the interactive map of Orlando is densely dotted by restaurant markers and we can safely infer that this area most probably falls within the postal code 32819 jurisdiction - Orlando's Restaurant Belt.
The Open Street Map Project is a very commendable undertaking in that it satisfies a public need (providing map information) while requiring very little upkeep (through the voluntary participation of its users). Its effectiveness and its ability to stay useful and relevant may be attributed to its flexibility and ease of use. Users appear to be allowed to edit and update the map file data arbitrarily, and this freedom may come at a cost in terms of data integrity and consistency.
I will explain what I meant by consistency using a particular experience that I had while doing data analysis. During the part when I was exploring the data for culinary cuisines, I came across some data that were problematic.
QUERY = '''
SELECT tags.value, COUNT(*) as count
FROM (SELECT key,
value
FROM nodes_tags
UNION ALL
SELECT key,
value
FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
AND tags.value NOT IN ('american', 'mexican',
'italian','Tex-Mex', 'barbecue', 'cuban','vegetarian')
GROUP BY tags.value
ORDER BY count DESC;
'''
rows = execute_query(DATABASE, QUERY)
cuis1_df = pd.DataFrame(rows)
cuis1_df.columns = ['Cuisine', 'Frequency']
display(HTML(cuis1_df.to_html(index=False, header=True)))
I would like to discuss the issues posed by these entries that were meant to satisfy the concept of a cuisine.
Data items such as chicken and chocolate that were entered as values for the cuisine key do not seem to fall within the definition of a cuisine. According to Wikipedia:
A cuisine is a style of cooking characterized by distinctive ingredients, techniques and dishes, and usually associated with a specific culture or geographic region.
Data items chicken
and chocolate
are not cuisines in themselves, but they can be used as ingredients in a particular cuisine (e.g., chocolate and chicken are combined in chicken mole, a Mexican dish).
Data items like burger
, sandwich
, pizza
, doughnut
, and subs
can properly be considered as dishes or menu items which cater to a particular cuisine. Thus, a burger
and donut
may instead be classified as belonging to the American cuisine; pizza
under Italian, and sandwich
and subs
may perhaps be included in some catch-all cuisine, as every geographic region has its own version of these common foods.
There are data items that perhaps try to describe the type of business rather than the cuisine. Items like coffee_shop
and coffee_and_donut_shop
may need to be re-entered or manually cleaned.
Some cuisines that were entered can reasonably be classified into one common cuisine. For example asian
, chinese
and japanese
can all be consolidated under just the asian
cuisine.
Perhaps a rather more problematic issue is the case when a field is allowed to contain multiple values. Some examples of this that can be found in the Orlando map dataset are barbecue,_american
, ice_cream;italian_pizza;noodles
, mediterranean;mexican;lebanese
and oriental; thai; sushi
. Queries made when these values are present are liable to produce ambiguous and inconsistent results. There are also cuisine data items like regional
that are inherently ambiguous by themselves (what particular region pertains to a regional
cuisine?)
A data field that can be filled with arbitrary values and/or multiple (non-atomic) values runs the risk of data quality (DQ) issues. Such DQ issues often affect the results of SQL queries, yielding missing or even false results.
One way of dealing with multiple values is to disallow users from entering them by the use of option or dropdown elements where users are constrained to select from a predetermined set of options as inputs. However, this may be difficult to implement because the key part of the key-value relationship (cuisine
) in the node or way elements is itself arbitrary. A user may or may not choose to add a cuisine-type key-value data item pair.
If this is the case, perhaps the Open Street Map project may be able to add a feature that allows users to tell the system about the data item type that they are thinking of creating. If the user intends to augment a node or way element with a new multi-value key-value type entry, then there should be a toggle or check box on the screen for the user to specify the multi-valued characteristic of the data entry. Along with this, the system should keep information of the most often used multi-valued data item sets (i.e. cuisine
) along with their proper predetermined value selection set.
The system should then be able to warn the user if they attempt to include multiple items in an input field that has not been configured to be multi-valued, or it can offer the predetermined value selection set to help the user choose the correct value.
Moreover, the "cuisine-aware" system can also maintain an internal list of value items that can fall under the common fast food
cuisine. Thus, items like chicken sandwich
, burger
, ice cream
, subs
, pizza
and the like which can be found in our Orlando map dataset can appropriately and automatically be assigned to this category.
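To make these suggestions concrete, here is a small sketch of how multi-valued and synonymous cuisine values could be normalized programmatically; the category mapping below is purely illustrative and is not an official OSM vocabulary:
import re

# Illustrative mapping of synonyms and menu items to broader categories.
cuisine_map = {
    'chinese': 'asian', 'japanese': 'asian', 'thai': 'asian', 'sushi': 'asian',
    'burger': 'fast_food', 'chicken_sandwich': 'fast_food', 'subs': 'fast_food',
    'doughnut': 'fast_food', 'ice_cream': 'fast_food', 'pizza': 'italian',
}

def normalize_cuisine(value):
    """Split a possibly multi-valued cuisine string and consolidate each part."""
    parts = [p.strip(' _') for p in re.split(r'[;,]', value)]
    return sorted(set(cuisine_map.get(p, p) for p in parts if p))

for raw in ['mediterranean;mexican;lebanese', 'barbecue,_american', 'chinese']:
    print('{} -> {}'.format(raw, normalize_cuisine(raw)))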
We were able to illustrate the typical tasks involved in wrangling data from a raw data file into a form that can be used efficiently for data analysis.
The data cleaning process began with gathering as much information as possible about the components of the map file. This involved loading and parsing the XML map file.
Using this information, data on the street names, telephone numbers and postal codes were audited in order to evaluate their accuracy, completeness, consistency and validity. The data audits were performed with particular emphasis on the importance of adhering to a data cleaning plan to ensure a systematic approach to working with the data.
Afterwards, data cleaning was performed programmatically, and at the end of each phase of the data cleaning exercise a clean version of the original map file was generated and tested to serve as input to the succeeding phase of the data cleaning process.
We also discussed how we parsed the cleaned map file into several CSV files. These flat files were then loaded into a SQLite database where we performed various SQL queries to support and facilitate our data exploration tasks.
Our primary takeaway from all of these exercises is an appreciation of the data wrangling processes that must take place before a proper data analysis can happen.