ABSTRACT Typical tasks involved in data wrangling are exhibited by applying the techniques of data acquisition, data extraction and parsing, data quality audit, data transformation and data integration on the Orlando Open Street Map data. Eventually, the processed map data is prepared for storage in a persistent SQL data store for further data exploration and analysis.

Introduction¶

Data that is used to support inferential models or exploratory and analytical studies usually do not come tidy. That is, raw and messy data do not come in a form that can readily be consumed by downstream processes.

This is especially true for data that were acquired by unconventional means such as web scraping or whose data elements were collected from an unknown provenance. Extra work has to be done on the data to make them easier to work with and useful for whatever purpose we may need them for later.

The OpenStreetMap (OSM) data is an example of a likely messy dataset. OpenStreetMap makes available user-contributed map data about local roads and businesses. Because information about roads, highways and other map objects are done by individuals, issues about the accuracy of the data may occur or at most be reasonably expected. This makes an OSM file a very suitable candidate for illustrating the data wrangling process.

Let us begin by loading the Python modules which we will be using in this project.

import codecs
import os
import os.path
import pandas as pd
import pprint
import pymongo
import re
import requests
import sys
import time
import xml.etree.cElementTree as ET
import json
from collections import defaultdict
from IPython.display import display, HTML

Data Acquisition¶

The OpenStreetMap data which we will be working on in this project is taken from Orlando, Florida. I chose this place for personal reasons; it is a city with which I am quite familiar.

The Orlando City OSM map file is acquired from the OpenStreetMap Project interactive web application where a user can manually search and select an area of the city for export and download.

Using an API to acquire the OpenStreetMap XML dataset¶

There are several ways to acquire the XML OSM map file for a particular location. One method is to download a pre-selected area from MapZen and another method is to use the Overpass API to retrieve a map area tailored for your needs.

I decided to use the latter method in order to have control of the boundaries of the custom map area within the city of Orlando within which I would like to focus my study.

Moreover, I like to have the freedom to choose a map area large enough that will generate a sizeable file, preferably more than 50 MB. A large yet manageable OSM XML file will yield more data elements for analysis.

The get_map_file() function is used to interface with the Overpass web API application to generate the OSM XML map file that we need for this project.

def get_map_file(api_url, local_filename):
    """
    Use the Overpass API to download a custom OSM XML map file
    """
    total_size = 79567834  # 77.703 KB
    
    if os.path.exists(local_filename):
        if os.path.getsize(local_filename) == total_size:
            print 'The file {} already exists and verified. File size: {}'.format( \
                local_filename, os.path.getsize(local_filename))
            return None

    r = requests.get(api_url, stream=True)
    with open(local_filename, 'wb') as f:
        print "Downloading %s" % local_filename, '...'
        
        dl_size = 0
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
                dl_size += len(chunk)
                done = int(100 * dl_size / float(total_size))
                sys.stdout.write("\r[%s%s]" % ('-' * done, ' ' * (100 - done)))    
                sys.stdout.flush()

    print '\nDownload finished. {} is ready.'.format(local_filename)

To use the get_map_file() function, we pass in a couple of parameters: API_URL parameter is the URL signature that contains the bounding box information that will be used by the Overpass API to generate the map file. OSM_FILE parameter is a custom file name (and path) that will be assigned to the downloaded map file.

API_URL  = 'https://overpass-api.de/api/map?bbox=-81.5120,28.4558,-81.2205,28.6135'
OSM_FILE = 'orlando.osm'

get_map_file(API_URL, OSM_FILE)

The file orlando.osm already exists and verified. File size: 79567834

Here is a description of the raw XML OSM data file that we acquired from the OpenStreetMap web application.

File Property	Description
Data Source	OpenStreetMap Project API
Data Format	OSM XML
File Size	77.7 MB

A snapshot of the raw data file follows. This is generated by the OpenStreetMap Project web application.

It shows that the OSM data file makes use of UTF-8 character encoding, a dominant internet encoding standard. It also tells us that the file is marked up using the standard specifications for XML data format. We also get a hint of the presence of the osm version tag and attribute that OpenStreetMap also adheres to a specialized set of document standards and specifications.

Data Cleaning¶

Now that we have acquired the data in the form of an XML file we can begin to examine the file superficially paying particular attention to its makeup. By doing this, we can start forming some questions that enhance our understanding of its contents which might then lead to more curious questions later.

Getting To Know The Dataset¶

def count_tags(filename):
    """
    Returns a dictionary of name and count of top-level XML tags
    """
    tags_dict = {}  # initialize an empty dictionary

    # SAX parsing of the file
    for event, elem in ET.iterparse(filename, events=('start', )):
        if elem.tag not in tags_dict:
            tags_dict[elem.tag] = 1
        else:
            tags_dict[elem.tag] += 1

    return tags_dict

Since we know that the data file is in XML format, let us try to discover the top-level elements that comprise the dataset.

The function count_tags() above uses the Simple API for XML or SAX method of parsing to iteratively process the map file. The function returns a dictionary of unique OSM element names as keys and the cardinality of each element as values.

This will give us information about the type and volume of data we can expect to encounter in the map file.

OSM_FILE = 'orlando.osm'

tags = count_tags(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in tags.items()], reverse=True)]

df = pd.DataFrame(sorted_by_count)
df.columns = ['Element', 'Occurrences']
display(HTML("<h4>Types of elements and occurrences</h4>"))
display(HTML(df.to_html(index=False)))

The sorted report allowed us to discover the top-level elements that make up the OSM map data. They are the elements which have only one occurrence: bounds, meta, note, and osm.

Element	Description
osm	Information about the API application or service that generated the OSM file.
note	Additional information about the document.
meta	Usually holds timestamp information about the generated file.
bounds	Contains the map file bounding box longitude and latitude information.

The next set of elements worth paying attention to are the node, way and relation elements. OpenStreetMap's conceptual data model of the physical world makes use of these basic components and it is worth knowing what constitutes each of them.

OpenStreetMap's documentation on some of the other primary elements are briefly described below. It is worthwhile to be familiar with these data elements so we know the types of data that are available to us and where to find them.

Element	Description
Node (nd)	Specific point on the surface defined by latitude and longitude. They can be used to define the shape of a way. They can also be included as a member of a relation.
Way (way)	An ordered list of between 2 and 2,000 nodes Can represent area boundaries like buildings or forests
Relation (relation)	A multi-purpose data structure that describes a relationship between two or more data elements like nodes, ways and other relations. Primarily an ordered list of nodes, ways or other relations.
Tag (tag)	Tags can be a member of all types of data elements. They describe the meaning of the particular element to which they are attached. A tag is composed of two text fields: a unique key and a value. An element cannot have two tags with the same key.

XML atttributes provide additional information about the particular element to which they are associated.

The get_attributes() function will help us discover what and how many attributes exist in our dataset. We shall call this function and sort the results to determine the most frequent attributes that are in use in our data file.

def get_attributes(filename):
    """
    Identify and bin each attribute using a dictionary
    """
    attrs = {}
    for event, elem in ET.iterparse(filename, events=('start', 'end')):
        if event == 'end':
            for attr in elem.attrib:
                if attr not in attrs:
                    attrs[attr] = 1
                else:
                    attrs[attr] += 1
    return attrs

OSM_FILE = 'orlando.osm'

start_time = time.time()
attrs = get_attributes(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in attrs.items()], reverse=True)]

df = pd.DataFrame(sorted_by_count)
df.columns = ['Attribute', 'Occurrences']
display(HTML("<h4>Element Attributes and Occurrences</h4>"))
display(HTML(df.to_html(index=False)))

# print('\n--- %s seconds ---' % (time.time() - start_time))

Of particular interest to us are the k (key) and v (value) attributes which are displayed above as appearing 247,303 times in our dataset.

The key and value attributes are members of tag elements and as was described before, a tag describes specific map features of an OSM map element be they nodes, ways or relations. A key is unique within each tag element and it particularly shows a topic, category or type of a map feature (such as a highway or a name). The value, on the other hand, provides the details of the feature classified by the key associated with it.

def get_keys(filename):
    """
    Collect key attributes and occurrences in a dictionary
    """
    keys = {}
    for event, elem in ET.iterparse(filename, events=('start', 'end')):
        if event == 'end':
            key = elem.attrib.get('k')
            if key:
                if key not in keys:
                    keys[key] = 1
                else:
                    keys[key] += 1
    return keys

The get_keys() function above returns a list of dictionaries containing each key and associated frequency in the OSM data file.

# Returns the top five key-occurrences combinations from the OSM file
OSM_FILE = 'orlando.osm'

start_time = time.time()
keys = get_keys(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]

df = pd.DataFrame(sorted_by_count)
df.columns = ['Key', 'Occurrences']
display(HTML("<h4>Key Attributes and Occurrences</h4>"))
display(HTML(df[:5].to_html(index=False)))

# print('\n--- %s seconds ---' % (time.time() - start_time))

While listing the different types of keys or map features available, we notice that there are a variety of different keys:

Lower case letters joined by an underscore symbol (old_name)
Lower case letters joined by a colon symbol (addr:city)
Lower case letters joined by a period (addr.place)

According to the OpenStreetMap documentation regarding keys and values:

keys can be qualified with prefixes, infixes, or suffixes (usually, separated with a colon), forming super- or sub-categories or namespace.

It is useful to know that key attributes can be hierarchical and that they themselves may contain other data structures.

Let us remind ourselves that the purpose of this data audit is to discover potential problems with the raw data with an eye towards improving the data quality before ultimately putting them into persistent storage.

The key_type() function will check the contents of each tag element's key attribute (k) through the use of three regular expressions to verify certain patterns. It will return the number of matches that it finds which will inform us of any potential tags that might require further examination.

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    """
    Categorize and count each tag element
    """
    if element.tag == "tag":
        if lower.search(element.attrib['k']):
            keys['lower'] += 1
        elif lower_colon.search(element.attrib['k']):
            keys['lower_colon'] += 1
        elif problem_chars .search(element.attrib['k']):
            keys['problem_chars'] += 1
        else:
            keys['other'] += 1
        
    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problem_chars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

We now pass in our Orlando OSM data file as a parameter to the process_map() function which will iteratively parse through the XML file applying the key_type() function over the file's records.

OSM_FILE = 'orlando.osm'
keys = process_map(OSM_FILE)

sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]

df = pd.DataFrame(sorted_by_count)
df.columns = ['Key Type', 'Occurrences']
display(HTML("<h4>Type of Key Formats and Occurrences</h4>"))
display(HTML(df.to_html(index=False)))

It looks like we are safe because the report did not show any occurrence of problematic keys. We can continue with our data wrangling tasks trusting that we can safely use the key attributes going forward.

Auditing Street Names¶

Among the key attributes that may be commonly expected in a map file are those that pertain to street names. The tag element holds the street name information in the value associated with its addr:street key attribute.

Let us list several addr:street values to identify errors and untidy data and to get an idea of what type of processing or cleaning we might have to do with those values. The code below helps us do this.

OSM_FILE = 'orlando.osm'
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)


def audit_street_type(street_types, street_name):
    """
    Use the regex pattern to match street type part of the street name
    
    Args: 
        street_types(dictionary): Key-value collection of street types and occurrences
        street_name(string): The value of a tag pertaining to an address key
    """
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        street_types[street_type] += 1


def is_street_name(elem):
    """
    Check if the element is a tag and the key attribute is a street information
    
    Args:
        elem(dictionary): OSM tag object in key-value pair format
    Returns:
        bool: Indicates whether contents of dictionary item pertains to a street
    """
    return (elem.attrib['k'] == "addr:street")

    
def print_sorted_dict(d):
    """
    Displays the results from iterating through street type dictionary
    
    Args:
        d(dictionary): Key-value collection of street types and occurrences
    """
    sorted_by_count = sorted([(key, value) for (key, value) in d.items()], reverse=False)
    df = pd.DataFrame(sorted_by_count)
    df.columns = ['Street Type', 'Occurrences']
    display(HTML("<h4>Sample of Street Types and Occurrences</h4>"))
    display(HTML(df[:5].to_html(index=False)))
    
def street_type_audit(file_in):
    """
    Wrapper file iterator for the main street type audit routine
    
    Args:
        file_in(string): Input map file
    Returns:
        street_types(dictionary): Key-value collection of street types and occurrences
    """
    osm_file = open(file_in, "r")
    street_types = defaultdict(int)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
                    
    return street_types
            

if __name__ == '__main__':
    start_time = time.time()
    street_types = street_type_audit(OSM_FILE)
    print_sorted_dict(street_types)
    # print('\n--- %s seconds ---' % (time.time() - start_time))

The output of our street type audit shows that there are some indeterminable number and number-character patterns that were used (e. g. #303, #F, 136, etc.) in the street names. These can be further investigated in a different study.

What we are interested in for this project is noticing the variety of ways a particular street type is expressed in the raw OSM data file. For example, Avenue which is predominantly used (190 occurrences) has two other variants: Ave (used 6 times in the file) and Ave. (with a period, used once).

The discovery of this characteristic in our dataset provides a motivation for our first data cleaning task — we shall seek and update street type data in the street names and make certain street types conform to a uniform set of street type labels that we will specify. For instance, we will replace all occurrences of the terms Ave and Ave. with the standard term Avenue.

Data Cleaning Plan for Street Names¶

The data audit of the street types reveals the need for some work to be done in shaping the raw data into a form that we can use. But we should not undertake the cleaning of the data haphazardly. This is where having a data cleaning plan becomes very useful.

A data cleaning plan pushes us to be organized and methodical in our approach towards data cleaning.

I have prepared the following data cleaning plan for tidying the street names in the Orlando OSM dataset:

Use the information obtained from the data audit
Try to identify the causes of the dirty data
Define a set of operations that will correct the data
Test and verify the results of the cleaning operation.

Procedure number 1 has been accomplished by our recent data audit work on the Orlando map file street names. As for procedure number 2, we can be sure that the cause of variations and other possible problems are mostly attributable to human error.

What follows then is the need to develop some operations that will correct the data (procedure number 3).

The code below defines some helper functions that will verify the street type used in each street name. The idea is to first test the street type if it matches any item in the list of expected street types that we have prepared. A mismatch marks the street type as a candidate for possible replacement.

street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)

expected = ["Avenue", "Boulevard", "Circle", "Court", "Drive", "Lane",
           "Loop", "Parkway", "Place", "Plaza", "Road", "Run", "Street",
           "Trail", "Turnpike", "Way"]

mapping = { "pl": "Place",
            "st": "Street",
            "ave": "Avenue",
            "rd": "Road",
            "w": "West",
            "n": "North",
            "s": "South",
            "e": "East",
            "blvd":"Boulevard",
            "sr": "Drive",
            "ct": "Court",
            "ne": "Northeast",
            "se": "Southeast",
            "nw": "Northwest",
            "sw": "Southwest",
            "dr": "Drive",
            "sq": "Square",
            "ln": "Lane",
            "trl": "Trail",
            "pkwy": "Parkway",
            "ste": "Suite",
            "lp": "Loop",
            "hwy": "Highway"}

# updated version of audit_street_type
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    """
    Check if the element is a tag and the key attribute is a street information
    """
    return (elem.attrib['k'] == "addr:street")
            
def audit_street_names(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])

    return street_types

def get_valid_name(name, mapping):
    name_list = re.findall(r"[\w']+", name)
    end_of_street_name = len(name_list)
    for i in range(len(name_list)):
        word = name_list[i].lower()
        if word in mapping:
            end_of_street_name = i
            name_list[i] = mapping[word]
        
    name_list = name_list[:(end_of_street_name + 1)]
    better_name = ' '.join(name_list)
    return better_name

Our primary goal in auditing and cleaning the street type data is to homogenize the frequently used terms, that is, to make them uniform.

Because of the street type audit report that we produced, we are able to come up with a set of street type terms which we know will need to be checked and corrected if necessary and include them in a python list called expected.

We also prepared a dictionary of commonly used street type contractions (mapping) as keys and paired each of them with a corrected version that we like to use in the update.

Let us put our helper functions to work. The code below loads and iterates through the OSM map file gathering all street types and keeping each uncommon street name that it finds into a dictionary keyed on the street type (the street name contains a street type that is not in our list of expected street types).

The dictionary (street_types) is then put into another iterative process whereby the street name finds its corrected version based on a mapping of how we want a street type to appear when it finds an incorrect street type (get_valid_name). The proposed street name conversion is then displayed for each visit into the dictionary elements.

start_time = time.time()

OSM_FILE = 'orlando.osm'
street_types = audit_street_names(OSM_FILE)

display(HTML("<h4>Proposed street name conversions</h4>"))

for st_type, ways in street_types.iteritems():
    for street_name_in_file in ways:
        valid_name = get_valid_name(street_name_in_file, mapping)
        if street_name_in_file != valid_name:
            print street_name_in_file, "===>", valid_name

print('\n--- %s seconds ---' % (time.time() - start_time))

E Colonial Dr Orlando, FL 32801 ===> East Colonial Drive
International Dr.  Store 6G ===> International Drive
N Lakemonte Ave. ===> North Lakemonte Avenue
e. colonial dr ===> East colonial Drive
East Colonial Drive #303 ===> East Colonial Drive 303
E Robinson St ===> East Robinson Street
Livingston  St ===> Livingston Street
Curry Ford Rd ===> Curry Ford Road
S. Kirkman Rd ===> South Kirkman Road
Aloma Avenue #F ===> Aloma Avenue F
S Goldenrod Rd Ste 136 ===> South Goldenrod Road Suite
Central Blvd ===> Central Boulevard
S Ivanhole Blvd ===> South Ivanhole Boulevard
N Semoran Blvd ===> North Semoran Boulevard
Lee Vista Blvd ===> Lee Vista Boulevard
North Orange Ave ===> North Orange Avenue
South Park Ave ===> South Park Avenue
S Division Ave ===> South Division Avenue
W Fairbanks Ave ===> West Fairbanks Avenue
Seminole Ave ===> Seminole Avenue
South Rio Grande Ave ===> South Rio Grande Avenue
420-E STATE ROAD 434 ===> 420 East
Lake Ellenor Dr ===> Lake Ellenor Drive
New York Avenue ste 250 ===> New York Avenue Suite
N. Semoran blvd. ===> North Semoran Boulevard

--- 6.89700007439 seconds ---

The output shown above shows us the results of the proposed street name conversions that will happen should we execute the data cleaning operations that we have defined. Our efforts to ensure uniformity among certain street types which we identified during the data audit were effective.

The clean_street_names() function that follows will implement the data cleaning tasks on the street names by passing it our Orlando OSM file. The function also generates a seperate output file, cleaned_street_name.xml, which will be a version of the original OSM file in which the street names have undergone data transformation.

def clean_street_names(filename, cleaned_filename):
    """
    Apply data cleaning on street names
    """
    tree = ET.parse(filename)
    root = tree.getroot()

    for tag in root.findall('*/tag'):
        if is_street_name(tag):
            name = tag.get('v')
            better_name = get_valid_name(name, mapping)
            tag.set('v', better_name)

    return tree.write(cleaned_filename)


start_time = time.time()

OSM_FILE = 'orlando.osm'
cleaned_street_name = 'cleaned_street_name.xml'
clean_street_names(OSM_FILE, cleaned_street_name)

print('\n--- %s seconds ---' % (time.time() - start_time))

--- 34.6669998169 seconds ---

As the last step in this iteration of the street name data cleaning plan (specifically procedure 4), we shall run the get_valid_name() function on our dataset. However, we will be using our newly-created output file cleaned_street_name.xml to check whether some street type related issues still persist in our dataset.

I mentioned the word iteration because data cleaning is usually an iterative process. We have to repeat our data cleaning process not only to (1) verify the effectiveness of our data cleaning efforts but also (2) to examine if there is still more cleaning left to be done.

cleaned_street_name = 'cleaned_street_name.xml'
street_types = audit_street_names(cleaned_street_name)
vcount = 0
for st_type, ways in street_types.iteritems():
    for name in ways:
        better_name = get_valid_name(name, mapping)
        if name != better_name:
            vcount = vcount + 1
            print name, "=>", better_name
            
if vcount == 0:
    display(HTML("<br/><p style=\"color:green;\">SUCCESS: There were no street names to convert.</p>"))

Auditing Telephone Numbers¶

The general audit that we made on the entire Orlando OSM map file informed us of the existence of certain tag elements bearing the phone key label.

Let us explore a sample of 52 values that are contained in phone tag elements to get an idea of whether or not they require further processing or cleaning. Again, we are doing this within the context of the desire to homogenize our data whenever it is warranted and applicable.

OSM_FILE = 'cleaned_street_name.xml'

display(HTML("<h4>Sample Phone Numbers</h4>"))

count_max = 10
count_number = 0
for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
    if  element.tag == "tag":
        if element.attrib['k'] == 'phone' \
        and count_number < count_max:
            print(element.attrib['v'])
            count_number += 1
            
print '\nTotal count of sample phone numbers:', count_number

+1-407-380-9971
+1-407-380-9971
+1-407-290-6977
+1-407-290-6977
+1 407 303 1700
+1 407 303 1700
407-623-3300
407-623-3300
+1-407-645-5311
+1-407-645-5311

Total count of sample phone numbers: 10

The phone numbers take on different formats. This is expected because the individuals who volunteered to input the data may have different ideas on how a phone number should be entered.

Our data stores the phone numbers in the following common ways:

+1-234-567-8901
123-456-7890
1234567890
(123) 456-7890

Data Cleaning Plan for Telephone Numbers¶

The design for dealing with the tag elements containing the phone numbers in the OSM file is as follows:

Conduct a data audit of the telephone numbers
- At this point we have already used and examined the information contained in the map file.
Identify the cause of the issue that we would like to address.
- We pointed out that the multi-formatted values is the result of the differences in the style in which individuals prefer to enter the phone numbers.
Define a set of operations that will correct the data.
- We will define some functions that will iterate through the map file and update the phone numbers according to a predetermined standard format.
Test and verify the results of the cleaning operation
- The results of the data cleaning effors will be re-examined to check for lingering problems or discover new issues.

To give us an idea of the number of phone numbers the map file actually contains, we will re-use the code we ran earlier.

OSM_FILE = 'cleaned_street_name.xml'

count_number = 0
for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
    if  element.tag == "tag":
        if element.attrib['k'] == 'phone':
            count_number += 1
            

print '\nTotal count of phone numbers:', count_number

Total count of phone numbers: 236

Next, we will define our goal of cleaning that portion of the data that pertains to phone numbers. By this, cleaning the phone numbers basically means making all the values adhere to a uniform phone number format. And for our situation, the format would be to just retain all the digits and strip all symbols and inner white space (e.g. (407) 123-4567 will become 4071234567).

This policy is operationalized through the use of a regular expression patterns.

Let us now find out how many phone numbers actually adhere to our uniform format. The phone numbers can be classified into two groups: numbers_only and other_format (shown below with examples):

numbers_only	4077309919 407-879-7763
other_format	+014077303212 +1 407 3730622 (407) 622-2010

strictly_numbers = re.compile(r'\d+')

def count_phone_types(element, phone_types):
    if element.tag == "tag":
        if element.attrib['k'] == 'phone':
            if strictly_numbers.match(element.attrib['v']):
                phone_types['numbers_only'] += 1
            else:
                phone_types['other_format'] += 1
       
    return phone_types


def process_map_phone_types(filename):
    phone_types = {"numbers_only": 0, "other_format": 0}
    for _, element in ET.iterparse(filename, ('start', 'end')):
        phone_types = count_phone_types(element, phone_types)

    return phone_types

display(HTML("<h4>Phone Types and Occurrences</h4>"))
pprint.pprint(process_map_phone_types(cleaned_street_name))

{'numbers_only': 80, 'other_format': 156}

We will endeavor to clean the telephone data and expect to transform the values of the phone number elements into a numbers-only format. This format lends itself conveniently well for building a uniformly formatted set of phone numbers.

The clean_phone_number() function implements the data cleaning task required to standardize the telephone number format by applying a regular expression substitution method on the OSM file that has recently undergone street type data cleaning from a previous cleaning plan.

def clean_phone_number(filename, cleaned_filename):
    """
    Apply the data cleaning process on OSM file
    
    Args:
        filename(string): Name of uncleaned file
        cleaned_filename(string): Name of file after being processed
    """
    tree = ET.parse(filename)
    root = tree.getroot()

    for tag in root.findall('*/tag'):
        if tag.attrib['k'] == 'phone':
            phone_number = tag.get('v')
            if not strictly_numbers.match(phone_number):
                update_phone_number = re.sub(r'[\D|\s]+', '', phone_number)[-10: ]
                tag.set('v', update_phone_number)

    return tree.write(cleaned_filename)

Let us use the cleaned_street_name.xml OSM file and the name of the file that will hold the results of the data cleaning process (cleaned_phone_number.xml) as arguments to the clean_phone_number() function.

Right after generating the cleaned_phone_number.xml output file, we will use this file again as an attribute to the process_map_phone_types() function to check the results of our data cleaning efforts.

cleaned_street_name = 'cleaned_street_name.xml'

start_time = time.time()

cleaned_phone_number = 'cleaned_phone_number.xml'
clean_phone_number(cleaned_street_name, cleaned_phone_number)

display(HTML("<h4>Data cleaning of phone number results</h4>"))
pprint.pprint(process_map_phone_types(cleaned_phone_number))

# print('\n--- %s seconds ---' % (time.time() - start_time))

{'numbers_only': 236, 'other_format': 0}

The function returns a zero under the other_format group which means that the phone numbers have been properly transformed to only hold numbers.

The output of the telephone data cleaning process, cleaned_phone_number.xml, is going to be used in the next phase of our data cleaning efforts.

Auditing Postal Codes¶

When I set out to get the Orlando dataset from the Open Street Map project's interactive web application, I did so by arbitrarily drawing a square region on top of the area that encompasses much of Orlando. I understand that this may not be precise which means that I may have grabbed some locations that are actually outside the jurisdiction of Orlando.

In this section of the data cleaning project, we will have the opportunity to be more precise by detecting and keeping those data whose location information only belongs to city of Orlando. And we are going to do this using the zip or postal code information that are stored in the file.

An interesting element of the Open Street Map file are the postal code values associated with the key: addr:postcode. To get an idea of the various postal codes that are in the data as well as the number of data elements that are associated with each unique postal code, we have developed the count_postcodes() function.

OSM_FILE = 'cleaned_street_name.xml'

def count_postcodes(filename):
    """
    Extract postal codes and generate frequencies
    
    Args:
        filename(string): Current OSM XML file that is currently
                          going through the data cleaning pipeline
    Returns:
        postcodes(dictionary): Post code and frequency dictionary
    """
    postcodes = {}
    for event, elem in ET.iterparse(filename, events=('start', 'end')):
        if event == 'end':
            key = elem.attrib.get('k')
            if key == 'addr:postcode':
                postcode = elem.attrib.get('v')
                if postcode not in postcodes:
                    postcodes[postcode] = 1
                else:
                    postcodes[postcode] += 1
    return postcodes


start_time = time.time()

postcodes = count_postcodes(OSM_FILE)
sorted_by_occurrence = [(k, v) for (v, k) in sorted([(value, key) 
                                                     for (key, value) 
                                                     in postcodes.items()], reverse=True)]


df = pd.DataFrame(sorted_by_occurrence)
df.columns = ['Postal Code', 'Frequency']
display(HTML("<h4>Sample Postal Codes and Frequencies</h4>"))
display(HTML(df[:5].to_html(index=False)))

The preliminary audit results tells us that most of the postal codes tend to cluster around a small group of numbers (32835, 32789, 32803, 32806, 32801, 32839, 32804, 32822, 32819, 32792 and 32809).

It also shows that there are a few postcodes that were entered using a different format (e.g. 32819-7827 and the like) as well as erroneous entries (e.g. 'FL 32803'). Data cleaning sometimes involves manual corrections and we can deal with these few errant entries programmatically using the fix_select_zipcode() function that we have prepared. The function will replace these select post codes with the proper format or data based on the original values.

Data Accuracy¶

Looking into list of postal codes which our data audit revealed, it is possible that within our data, the postal code (and its associated entry) is not actually valid for our purposes. We therefore need to distinguish those codes that are within the jurisdiction of the city of Orlando and those which lie without.

In order to audit the addr:postcode field for accuracy, we are going to need some means of programmatically processing it and comparing the values that we encounter to some set of data that we can reliably assume is correct.

The gold standard data that we are going to use for this project is the set of postal codes from USPS which is corroborated by zip-codes.com. There is a total of 59 postal codes ranging from 32801 through 32899.

We are now able to perform another data audit given this gold standard data in hand. The following code will tell us how many node, way and relation elements are associated with postal codes (addr:postcode) that belong to the city of Orlando (orlando_postcodes) and those that do not (other_postcodes). It does so by iterating over the addr:postcode key values and comparing each with a list of official Orlando postal codes retrieved from a reliable source.

OSM_FILE = 'cleaned_street_name.xml'
valid_orlando_postcodes = [
    '32801','32802','32803','32804','32805','32806','32807','32808','32809',
    '32810','32811','32812','32814','32815','32816','32817','32818','32819',
    '32820','32821','32822','32824','32825','32826','32827','32828','32829',
    '32830','32831','32832','32833','32834','32835','32836','32837','32839',
    '32853','32854','32855','32856','32857','32858','32859','32860','32861',
    '32862','32867','32868','32869','32872','32877','32878','32885','32886',
    '32887','32891','32896','32897','32899']

def get_postcode(elem):
    """
    Extracts the postal code value from the dictionary item
    
    Args:
        elem(Object): OSM element object
    Returns"
        bool: The element is a postal code element
        value(string): The postal code that is inside the dictionary
    """
    if elem.tag in ['node', 'way', 'relation']:
        for tag in elem.iter():
            if tag.get('k') == 'addr:postcode':
                return True, tag.get('v')
        return False, None
    return False, None

def fix_select_zipcode(zipcode):
    """
    Converts the postal code to the correct format
    
    Args:
        zipcode(string): postal code value
    Returns"
        new_zip(string): corrected postal code value
    """
    new_zip = ''
    if zipcode == 'FL 32803':
        new_zip = '32803'
    elif zipcode == 'FL 32792':
        new_zip = '32792'
    elif zipcode == '32819-7827':
        new_zip = '32819'
    elif zipcode == '32817-4427':
        new_zip = '32817'
    elif zipcode == '32805-1100':
        new_zip = '32805'
    elif zipcode == '32803-4806':
        new_zip = 32803
    elif zipcode == '32792-6801':
        new_zip = 32792
    else:
        new_zip = zipcode

    return new_zip


def process_map(filename):
    """
    Accumulates postal codes that belong or do not belong to Orlando
    
    Args:
        filename(string): OSM map file name
    Returns:
        number_types(dictionary): Dictionary bins postal codes and counts
    """    
    number_types = {"orlando_postcodes": 0, "other_postcodes": 0}
    tree = ET.parse(filename)
    root = tree.getroot()
    
    for child in ['node', 'way', 'relation']:
        for element in root.findall(child):
            has_postcode, postcode_value = get_postcode(element)
            if has_postcode:
                postcode_value = fix_select_zipcode(postcode_value)
                if postcode_value in valid_orlando_postcodes:
                    number_types['orlando_postcodes'] += 1
                else:
                    number_types['other_postcodes'] += 1
    return number_types
    
display(HTML("<h4>Postal codes by jurisdiction</h4>"))
pprint.pprint(process_map(OSM_FILE))

{'orlando_postcodes': 283, 'other_postcodes': 64}

Our project's stated scope and limitations permits us to discard the node elements (64) which hold non-Orlando city postal codes.

Data Cleaning Plan for Orlando Postal Codes¶

The design for dealing with the tag elements containing the postal numbers in the workflow OSM file is as follows:

Conduct a data audit of the postal codes
- This has been done. A preliminary count of all available postal codes present in the dataset was reported which also revealed the presence of values that may need manual or programmatic correction.
Identify the cause of the issue that we would like to address.
- A data accuracy audit identifies those postal codes that will be considered as invalid and unusable for the purposes of this project because they are situated outside the jurisdiction of the city of Orlando.
Define a set of operations that will correct the data.
- We will define a process using code that will iterate through the workflow map file and discard node elements which have failed our data accuracy test.
Test and verify the results of the cleaning operation
- We will re-apply the data audit process on the output OSM file to verify the results of the postal code data cleaning program.

Continuing on with procedure number 3 from above, we will take the workflow OSM data file that had undergone street type data cleaning (cleaned_street_name.xml) and pass it as one of the arguments to the clean_postcode() function. The other argument would be the cleaned_postcode variable which contains the arbitrary name of the file that will hold the results of our postal code data cleaning process.

OSM_FILE = 'cleaned_street_name.xml'

def clean_postcode(filename, cleaned_filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    
    for child in ['node', 'way', 'relation']:
        for elem in root.findall(child):
            has_postcode, postcode_value = get_postcode(elem)
            if has_postcode:
                if postcode_value not in valid_orlando_postcodes:
                    root.remove(elem)
    
    return tree.write(cleaned_filename)

start_time = time.time()

cleaned_postcode = 'cleaned_postcode.xml'
clean_postcode(OSM_FILE, cleaned_postcode)


print('\n--- %s seconds ---' % (time.time() - start_time))

--- 38.001999855 seconds ---

The clean_postcode() function visits each set of node, way or relation nodes from the OSM file and tries to find and extract any available addr:postcode value. It then performs a test and decides whether to discard the node that contains the postal code. This decision is based on whether or not the postal code can be found in the standard Orlando postal code list.

CLEANED_POSTCODE_FILE = 'cleaned_postcode.xml'
display(HTML("<h4>Postal codes by jurisdiction (after data cleaning)</h4>"))
pprint.pprint(process_map(CLEANED_POSTCODE_FILE))

{'orlando_postcodes': 278, 'other_postcodes': 0}

Finally, the code above implements the last step of the postal code data cleaning plan when it resubmits the newly generated OSM file as an argument to the process_map() function. This task will show the number of nodes that contain invalid postal codes. If the the function reports no remaining invalid post codes, then we are through with the data cleaning tasks that we set out to tackle for the purposes of this project.

Our OSM file has undergone three data cleaning processes producing the final file, cleaned_postcode.xml, which we load into persistent storage to facilitate further data analysis.

SQL Database Queries¶

This section of the project would like to focus on the variety of data analysis options which become available once the dataset has been loaded into a relational database like SQLite. Thus, I shall briefly outline the work that I have done to import the cleaned OSM file into the database.

Generate the input CSV files. The cleaned Orlando OSM file (cleaned_postcode.xml) is used as input to a python application that I developed which parses and segregates the primary nodes and tags into their own CSV files, namely: nodes.csv, nodes_tags.csv, ways.csv, ways_nodes.csv and ways_tags.csv.

Create the SQLite database and tables. Several Data Definition Language (DDL) queries describing the schema are constructed and executed to create the SQLite database (orlando_osm.db) and the tables that will hold the OSM data. The tables were given names that match the CSV files with which they are populated. A few of them are shown below.

Load the CSV files into SQLite Database. The SQLite shell commands below were used to import all the CSV files as well as to remove the carriage return characters that I have found on the timestamp attributes.

.mode csv
.import nodes.csv nodes
.import nodes_tags.csv nodes_tags
.import ways.csv ways
.import ways_nodes.csv ways_nodes
.import ways_tags.csv ways_tags 

UPDATE nodes SET timestamp = REPLACE(timestamp, X'0D', '');
UPDATE ways SET timestamp = REPLACE(timestamp, X'0D', '');

The SQLite Database Tables¶

Let us verify the results of the CSV file imports. The sqlite3 Python module allows us to connect and query our SQLite orlando_osm.db database file.

File Name	File Size
Original Orlando OSM file	75.8 MB
nodes.csv	24.4 MB
nodes_tags.csv	1.26 MB
ways.csv	2.98 MB
ways_nodes.csv	9.29 MB
ways_tags.csv	8.00 MB

Code Re-use¶

Notice that in the code that follows we will be using the execute_query() function every now and then as a convenient way of executing the various SQL queries. In this fashion, we will only be altering the query itself, which is contained in the static literal variable called QUERY.

Data Table Record Counts¶

Among the first things we would like to know right after loading our file into our database is to get an idea of how many records were loaded in each database table. The following code wil provide us this information.

import sqlite3
import pandas as pd
from collections import defaultdict
from IPython.display import display, HTML

DATABASE = 'orlando_osm.db'

QUERY = '''
SELECT 'nodes' AS nodes, COUNT(*) 
FROM nodes
UNION
SELECT 'nodes_tags' AS nodes_tags, COUNT(*) 
FROM nodes_tags
UNION
SELECT 'ways' AS ways, COUNT(*) 
FROM ways
UNION
SELECT 'ways_nodes' AS ways_nodes, COUNT(*) 
FROM ways_nodes
UNION
SELECT 'ways_tags' AS ways_tags, COUNT(*) 
FROM ways_tags
'''

def create_connection(db_file):
    """
    Create a database connection to the SQLite database
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None

def execute_query(db, query):
    """
    Returns the query results as a list
    """
    conn = create_connection(db)
    with conn:
        cur = conn.cursor()
        cur.execute(query)
        return cur.fetchall()

rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['Table Name', 'Count']
display(HTML(df.to_html(index=False, header=True)))

The tables that have the greatest share of the Open Street Map records are the nodes, ways_nodes and ways_tags tables.

nodes¶

The query below should return the first five records of the nodes table. The interesting attributes of this table are the longitude and latitude coordinates which confirm what a node stands for, an Open Street Map node is a particular point on the earth's surface. The node also contains the the user name of the contributing person.

import sqlite3
import pandas as pd

DATABASE = 'orlando_osm.db'

QUERY = '''
SELECT id, lat, lon, user, uid, timestamp 
FROM nodes 
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['nodes_id', 'latitude', 'longitude', 'user', 'user_id', 'timestamp']
display(HTML(df.to_html(index=False, header=True)))

nodes tags¶

The nodes_tags provides additional information about the node elements of the Open Street Map dataset. The information consists of key-value pairs describing information about road structures such as highway traffic signals, crossings, etc.

QUERY = '''
SELECT id, key, value
FROM nodes_tags
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['nodes_id', 'key', 'value']
display(HTML(df.to_html(index=False, header=True)))

ways¶

The Open Street Map data dictionary specification for the ways element describes it as a an ordered list of nodes. By itself, our ways table consists of metadata pertaining to the element. The records in this table will be useful when paired with the ways_nodes intervening table as it will allow us to access the many side of the one-many relationship between the ways entities and the various nodes entities with which they are associated.

QUERY = '''
SELECT id, user, uid, version, changeset, timestamp
FROM ways
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'user', 'user_id', 'version', 'changeset', 'timestamp']
display(HTML(df.to_html(index=False, header=True)))

ways nodes¶

As introduced earlier, the ways_nodes table concretizes the one-many relationship between the ways and nodes entities. A way is seen as a collection of node elements.

QUERY = '''
SELECT id, node_id, position
FROM ways_nodes
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'nodes_id', 'position']
display(HTML(df.to_html(index=False, header=True)))

ways tags¶

Just like the nodes_tags record, the ways_tags record keeps descriptive information about its parent element in the form of key-value pairs (e.g. {building: apartments}, {highway: pedestrian}, {landuse: grass}).

QUERY = '''
SELECT id, key, value, type
FROM ways_tags
LIMIT 6;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows[1:])
df.columns = ['ways_id', 'key', 'value', 'type']
display(HTML(df.to_html(index=False, header=True)))

Now that we have been briefly acquainted with the tables in our OSM database, we can start exploring it to glean some insights from the dataset.

Querying unique contributors¶

The Open Street Map Project is what can be called a living dataset in that it relies on its users who create, edit and update the data that make up the map system. The query that follows would provide information about the number of unique users who have volunteered to generate and maintain our special OSM map file for Orlando.

QUERY = '''
SELECT COUNT(DISTINCT(U1.uid))
FROM (SELECT uid 
      FROM nodes 
      UNION ALL 
      SELECT uid 
      FROM ways) U1
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['unique_uid_count']
display(HTML("<h4>Number of Unique Contributors</h4>"))
display(HTML(df.to_html(index=False, header=True)))

There are 709 users, who among them have contributed the most? The query below lists ten of the most prolific contributors along with their number of contributions as well as their contribution's percentage share of the total contributions in our dataset.

QUERY = '''
SELECT x2.user, x2.cnt, round(((x2.cnt*1.0/x2.tot)*100.0),0) as pct
FROM (
  SELECT
    user,
    cnt,
    (SELECT COUNT(*)
    FROM (
      SELECT user
      FROM nodes
      UNION ALL
      SELECT user
      FROM ways
      )
    ) as tot
  FROM (
    SELECT
      U2.user,
      COUNT(*) AS cnt
    FROM (SELECT user
          FROM nodes
          UNION ALL
          SELECT user
          FROM ways) U2
    GROUP BY U2.user
  )
) x2
order by x2.cnt DESC
LIMIT 10;
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['user_id','num_contributions','pct_of_total']
display(HTML("<h4>Top 10 Orlando Open Street Map Contributors</h4>"))
display(HTML(df.to_html(index=False, header=True)))

Querying about postal codes¶

I would like to revisit the postal codes which were the subject of our data audit and data cleaning tasks a brief while ago.

Which postal codes had the highest frequency of use in our Orlando Open Street Map file?

QUERY = '''
SELECT A1.value, COUNT(*)
FROM
(
    SELECT X1.value
    FROM ways_tags X1
    WHERE X1.key LIKE '%postcode%' 
    UNION ALL
    SELECT Y1.value
    FROM nodes_tags Y1
    WHERE Y1.key LIKE '%postcode%' 
) A1
GROUP BY A1.value
ORDER BY COUNT(*) DESC
'''
rows = execute_query(DATABASE, QUERY)
pcodes1_df = pd.DataFrame(rows)
pcodes1_df.columns = ['postal_code', 'count']
display(HTML(pcodes1_df.to_html(index=False, header=True)))

The top five open street map contributions where a postal code was used came from the following areas:

Postal Code	Record Count	Area Description
32835	36	The southwest,Metrowest area
32803	31	Colonialtown-Audubon Park area
32806	29	Delaney Park area
32801	24	Downtown Orlando area
32839	24	Millenia Mall, Americana Blvd area

These areas are among the busiest places in the city so it makes sense that the OSM contributions reflect this reality. The bar plot below provides a visualization of the post code frequencies that can be found in our dataset.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

QUERY = '''
    SELECT X1.value
    FROM ways_tags X1
    WHERE X1.key LIKE '%postcode%' 
    UNION ALL
    SELECT Y1.value
    FROM nodes_tags Y1
    WHERE Y1.key LIKE '%postcode%' 
'''
rows = execute_query(DATABASE, QUERY)
pcodes2_df = pd.DataFrame(rows)
pcodes2_df.columns = ['postal_code']

fig, ax = plt.subplots(figsize=(14,6))
plt.title("Number of OSM Elements with Post Code Data")
sns.set(style="whitegrid", color_codes=True)
g = sns.countplot(x="postal_code", 
                  data=pcodes2_df, 
                  palette="RdYlBu",
                  order = pcodes2_df['postal_code'].value_counts().index)
g.set_xticklabels(g.get_xticklabels(),rotation=90);

Querying about restaurants¶

Among the things that I would like to explore from the Orlando Open Street Map dataset is information that would interest a foodie like me.

I would like to know how many nodes and ways records hold information about restaurants.

QUERY = '''
SELECT 'nodes_tags', COUNT(*)
FROM nodes_tags
WHERE key == 'amenity' AND value LIKE '%restaurant%'
UNION ALL
SELECT 'ways_tags', COUNT(*)
FROM ways_tags
WHERE key == 'amenity' AND value LIKE '%restaurant%'
'''
rows = execute_query(DATABASE, QUERY)
df = pd.DataFrame(rows)
df.columns = ['Table', 'Restaurants']
display(HTML(df.to_html(index=False, header=True)))

Building up on what we have so far, let us associate each node or way tag elements with its postal code. We intend to gather some information about restaurants from each of these collections for use later.

QUERY = '''
SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
FROM ways_tags X1
     JOIN ways_tags X2 on X1.id = X2.id
WHERE X1.key LIKE '%postcode%' 
      AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
UNION 
SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
FROM nodes_tags Y1
     JOIN nodes_tags Y2 ON Y1.id = Y2.id
WHERE Y1.key LIKE '%postcode%' 
      AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%');
'''
rows = execute_query(DATABASE, QUERY)
rest1_df = pd.DataFrame(rows)
rest1_df.columns = ['source', 'id', 'type_1', 'postal_code', 'type_2', 'value_2']
display(HTML(rest1_df.to_html(index=False, header=True)))

The code above identifies the nodes and ways tag elements that pertain to restaurants as well as their associated postal codes. We could use this information to determine the number of restaurants that are located within a particular postal code.

QUERY = '''
SELECT A1.value, COUNT(*)
FROM
(
    SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
    FROM ways_tags X1
    JOIN ways_tags X2 on X1.id = X2.id
    WHERE X1.key LIKE '%postcode%' 
        AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
    UNION ALL
    SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
    FROM nodes_tags Y1
    JOIN nodes_tags Y2 ON Y1.id = Y2.id
    WHERE Y1.key LIKE '%postcode%' 
        AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%')
) A1
GROUP BY value;
'''
rows = execute_query(DATABASE, QUERY)
rest2_df = pd.DataFrame(rows)
rest2_df.columns = ['postal_code', 'restaurants']
display(HTML(rest2_df.to_html(index=False, header=True)))

Limitation: Missing postal codes in restaurants¶

It is quite interesting to notice that our query resulted in only 18 restaurants and yet we know from a previous query that there were 147 ways and nodes entries pertaining to restaurant.

This is because these are the only node and ways elements that had the accompanying postal code information when the OSM record was created. Out of the 147 OSM contributions that pertain to restaurant, OSM users gave additional postal code information on only 18 of them which leaves us with 129 restaurant records with missing postal codes.

This situation is understandable and albeit not unexpected owing to the voluntary nature of the Open Street Map system. Users are not constrained to enter some additional information that might be otherwise be relevant.

Let's return to our discussion of the restaurant dataset. It is more useful to visualize what is going on with the restaurant and postal code dataset by putting them in a bar plot.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
    
fig, ax = plt.subplots(figsize=(11,6))
plt.title("Number of Restaurants by Postal Code", fontsize=16)
plt.ylabel('Count',fontsize=16)
plt.xlabel('Postal Code', fontsize=16)
sns.set(style="whitegrid", color_codes=True)
sns.countplot(x="postal_code", data=rest1_df,
              palette="muted",
              order = rest1_df['postal_code'].value_counts().index);

It is quite revealing to show how, given our limited number of query results, the data still supports the notion of which areas in Orlando can be assumed to have a bigger share of the restaurant market.

Postal code 32806 represents the Delaney Park area which is about a mile southeast of downtown Orlando. Numerous restaurants in the area have taken advantage of workers coming from the three big businesses there: the Orlando Regional Medical Center, Winnie Palmer Hospital for Children and Orlando's Amtrak Station.

Postal code 32801 is the main downtown Orlando area. Perhaps it does not need much to explain how restaurants like Le Gourmet Break, The Boheme, Fratelli's, and numerous others exist to cater to the customers and workers who work and do leisure downtown.

Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.

Cuisines¶

Next, I would like to explore the variety of dishes that are being served in restaurants. The Orlando Open Street Map dataset have entries that have cuisine as key which is being used in both node_tags and ways_tags elements.

I encountered some values that had a cuisine key that I thought would need further data cleaning. For example, there were multivalued field items that have been put into one field like mediterranean;mexican;lebanese and oriental;thai;sushi. For the purposes of this study, I have excluded these ambiguous data items.

We would like to further qualify our query to exclude fast food items such as burger, donut, pizza, chicken sandwich, etc.

QUERY = '''
SELECT tags.value, COUNT(*) as count 
FROM (SELECT key,
             CASE value
               WHEN 'chinese' then 'asian'
               ELSE value
             END as value
      FROM nodes_tags
      UNION ALL
      SELECT key,
             CASE value
               WHEN 'chinese' then 'asian'
               ELSE value
             END as value
      FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
    AND tags.value NOT IN ('coffee_shop', 'sandwich',
    'pizza','barbecue,_american', 'chicken_sandwich',
    'chocolate', 'coffee_and_donut_shop', 'doughnut',
    'ice_cream;italian_pizza;noodles', 'subs','regional',
    'chicken', 'burger', 'mediterranean;mexican;lebanese',
    'oriental;thai;sushi')
GROUP BY tags.value
ORDER BY count DESC;
'''
rows = execute_query(DATABASE, QUERY)
cuis1_df = pd.DataFrame(rows)
cuis1_df.columns = ['Cuisine', 'Frequency']
display(HTML(cuis1_df.to_html(index=False, header=True)))

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

QUERY = '''
SELECT value
FROM (SELECT key,
             CASE value
               WHEN 'chinese' then 'asian'
               ELSE value
             END as value
      FROM nodes_tags
      UNION ALL
      SELECT key,
             CASE value
               WHEN 'chinese' then 'asian'
               ELSE value
             END as value
      FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
    AND tags.value NOT IN ('coffee_shop', 'sandwich',
    'pizza','barbecue,_american', 'chicken_sandwich',
    'chocolate', 'coffee_and_donut_shop', 'doughnut',
    'ice_cream;italian_pizza;noodles', 'subs','regional',
    'chicken','burger', 'mediterranean;mexican;lebanese',
    'oriental;thai;sushi');
'''
rows = execute_query(DATABASE, QUERY)
cuis2_df = pd.DataFrame(rows)
cuis2_df .columns = ['value']

fig, ax = plt.subplots(figsize=(12,6))
plt.title("Number of OSM elements by Cuisine")
sns.set(style="whitegrid", color_codes=True)
g2 = sns.countplot(x="value", data=cuis2_df, palette="Set2"
             ,order = cuis2_df['value'].value_counts().index)
g2.set_xticklabels(g2.get_xticklabels(),rotation=90);

The American cuisine tops the list from among the captured cuisine data in the Orlando OSM dataset. This is closely followed by the Mexican and Asian cuisines.

Querying exact locations of restaurants¶

We tried to retrieve and present information about restaurants and their postal codes earlier. But that work did not produce much result because of the insufficiency of user-provided postal code data for each associated restaurant.

However, we will not give up so easily because there is another way of determining a similar feature such as the location for each restaurant. And this one is more precise.

It involves getting at the longitude and latitude values which we know are present for each node and way element. I start off with our basic nodes_tags and ways_tags tables which both contain information about which node and way elements pertain to restaurants (i.e. {key: amenity, value: restaurant}) and do a natural join with the main nodes database table via the id attribute. The nodes tuple or record contain informaton about the latitude and longitude data for each node.

import sqlite3
import pandas as pd
import pprint
from collections import defaultdict
    
DATABASE = 'orlando_osm.db'

def convert_to_restaurant_list(df):
    resto_list = []
    
    for index, row in resto_df.iterrows():
        location = []
        dict_item = {}
        location.append(row['lon'])
        location.append(row['lat'])
        rname = row['name'].encode("utf-8")
        rname = rname.replace("'", "")
        dict_item = {'loc': location, 
                     'label': rname}
        resto_list.append(dict_item)
    
    return resto_list
    
QUERY = '''
SELECT DISTINCT nodes.lat, nodes.lon, N1.value
FROM (
    SELECT *
    FROM nodes_tags
    UNION ALL
    SELECT *
    FROM ways_tags
) b
JOIN nodes ON b.id = nodes.id
JOIN (
    SELECT id, key, value
    FROM nodes_tags
    WHERE key = 'name'
) N1 ON N1.id = nodes.id
WHERE b.key = 'amenity' AND b.value = 'restaurant';
'''
rows = execute_query(DATABASE, QUERY)
resto_df = pd.DataFrame(rows)
resto_df.columns = ['lon','lat','name']
display(HTML(resto_df[:10].to_html(index=False, header=True)))
resto_list = convert_to_restaurant_list(resto_df)

Notice that the name of the restaurant is also displayed in the query result set. I was able to include this by making another natural join to the nodes_tags records using the node ID reference keys.

The outcome of our query returned a dataframe that contains the map coordinates and names of the restaurants in our dataset. These are the components that we can use to visualize the data points on an interactive map using the folium module. But before we can do so, I developed the convert_to_restaurant_list() function to convert our dataframe into a structured collection (list of dictionaries) that folium neads to generate an interactive map.

%matplotlib inline
import folium

width, height = 650, 650
restaurant_map = folium.Map(location=[28.539714, -81.379414], zoom_start=11,
                            tiles='OpenStreetMap')
# Add markers to the map
for restaurant in resto_list:
    marker = folium.Marker(location=restaurant['loc'], popup=restaurant['label'])
    marker.add_to(restaurant_map)
    
restaurant_map

It is quite satisfying whenever we see how our data visualization reveals and corraborates the informed assumption that I made earlier when we explored the number of restaurants that are within the jurisdiction of various postal codes, to wit:

Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.

The southwest portion of the interactive map of Orlando is densely dotted by restaurant markers and we can safely infer that this area most probably falls within the postal code 32819 jurisdiction - Orlando's Restaurant Belt.

Suggestions for Improvement of the Data and Its Analysis¶

The Open Street Map Project is a very commendable undertaking in that it satisfies a public need (providing map information) while doing so at the least amount of upkeep (through the voluntary participation of its users). Its effectiveness and its ability to stay useful and relevant may be attributed to its flexibility and ease of use. Users seem to be allowed to edit and update the map file data arbitrarily. This modest feature may come at a cost in terms of data integrity and consistency.

I will explain what I meant by consistency using a particular experience that I had while doing data analysis. During the part when I was exploring the data for culinary cuisines, I came across some data that were problematic.

QUERY = '''
SELECT tags.value, COUNT(*) as count
FROM (SELECT key,
             value
      FROM nodes_tags
      UNION ALL
      SELECT key,
            value
      FROM ways_tags) tags
WHERE tags.key LIKE 'cuisine'
    AND tags.value NOT IN ('american', 'mexican',
    'italian','Tex-Mex', 'barbecue', 'cuban','vegetarian')
GROUP BY tags.value
ORDER BY count DESC;
'''
rows = execute_query(DATABASE, QUERY)
cuis1_df = pd.DataFrame(rows)
cuis1_df.columns = ['Cuisine', 'Frequency']
display(HTML(cuis1_df.to_html(index=False, header=True)))

I would like to discuss the issues posed by these entries that were meant to satisfy the concept of a cuisine.

What's a cuisine?¶

The data items such as chicken and chocolate that were entered as values for the key of cuisine does not seem to go within the purview of the definition of a cuisine. According to Wikipedia:

A cuisine is a style of cooking characterized by distinctive ingredients, techniques and dishes, and usually associated with a specific culture or geographic region.

Data items chicken and chocolate are not cuisines in themselves, but they can be used as ingredients for a particular cuisine (i.e, chocolate and chicken are used in a dish called Chocolate Chicken Mole, a Mexican cuisine dish).

The part is not the whole¶

Data items like burger, sandwich, pizza, doughnut, and subs can properly be considered as dishes or menu items which cater to a particular cuisine. Thus, a burger and donut may instead be classified as belonging to the American cuisine; pizza under Italian, and sandwich and subs may perhaps be included in some catch-all cuisine as every geographic reqion has its version of these common foods.

The shop is not a cuisine¶

There are data items that perhaps try to describe the type of business rather than the cuisine. Items like coffee_shop and coffee_and_donut_shop may need to be re-entered or manually cleaned.

A common thread¶

Some cuisines that were entered can reasonably be classified into one common cuisine. For example asian, chinese and japanese can all be consolidated under just the asian cuisine.

Multivalued Data¶

Perhaps a rather more problematic issue is the case when a field is allowed to contain multiple values. Some examples of this that can be found in the Orlando map dataset are barbecue,_american, ice_cream;italian_pizza;noodles, mediterranean;mexican;lebanese and oriental; thai; sushi. Queries made when these values are present are possible to produce ambiguous and inconsistent results. There are also cuisine data items like regional that are inherently ambiguous by themselves (what particular region pertains to a regional cuisine?)

How Can We Do Better?¶

A data field that is prone to be filled with arbitrary values and/or even multple (non-atomic) values run the risk of data quality (DQ) issues. Such DQ issues often affect the results of SQL queries offering non-results or even false results.

One way of dealing with multiple values is to disallow users from entering them by the use of option or dropdown elements where users are constrained to select from a predetermined set of options as inputs. However, this may be difficult to implement because the key part of the key-value relationship (cuisine) in the node or way elements are themselves arbitrary. A user may or may not choose to add a cuisine-type key-value data item pair.

If this is the case, perhaps the Open Street Map project may be able to add a feature that allows users to tell the system about the data item type that they are thinking of creating. If the user intends to augment a node or way element with a new multi-value key-value type entry, then there should be a toggle or check box on the screen for the user to specify the multi-valued characteristic of the data entry. Along with this, the system should keep information of the most often used multi-valued data item sets (i.e. cuisine) along with their proper predetermined value selection set.

The system should then be able to provide edit notifications to the user should the user attempt to include multiple items in the input field which has been configured to be multi-valued or the system can also offer the predetermined value selection set to help the user choose the correct value data item.

Moreover, the "cuisine-aware" system can also maintain an internal list of value items that can fall under the common fast food cuisine. Thus, items like chicken sandwich, burger, ice cream, subs, pizza and the like which can be found in our Orlando map dataset can appropriately and automatically be assigned to this category.

Conclusion¶

We are able to illustrate the typical tasks involved with wrangling data from a raw data file to a form that can be efficiently used for data analysis.

The major process of data cleaning was begun with gathering of as much information about the components of the map file. This involved some demonstrations of the loading and parsing of various XML files.

Using this information, data on the street names, telephone numbers and postal codes were audited in order to evaluate their accuracy, completeness, consistency and validity. The data audits were performed giving particular emphasis on the importance of adhering to a data cleaning plan to ensure a systematic approach at working with the data.

Afterwards, data cleaning were performed programmatically and at then end of each phase of the data cleaning exercise, a clean version of the original map file was generated and tested to serve as input to the succeeding phase of the data cleaning process.

We also discussed how we parsed the cleaned map file into several CSV files. These flat files were then loaded into a SQLite database where we performed various SQL queries to support and facilitate our data exploration tasks.

Our primary takeaway from all of these exercises is to demonstrate the necessary data wrangling processes that have to take place before a proper data analysis can happen.

Attribute	Occurrences
ref	451669
version	368096
user	368095
uid	368095
timestamp	368095
id	368095
changeset	368095
lon	309681
lat	309681
v	247303
k	247303
type	54731
role	54731
osm_base	1
minlon	1
minlat	1
maxlon	1
maxlat	1
generator	1

Key	Occurrences
highway	42180
name	23953
sidewalk	19416
boundary	12806
surface	12551

nodes_id	latitude	longitude	user	user_id	timestamp
26771646	28.456316	-81.309911	KindredCoda	14293	2009-09-19T12:48:58Z
26787543	28.466059	-81.276102	Mongo Poker	4256778	2018-01-30T19:48:25Z
26787546	28.469043	-81.309970	NE2	207745	2013-03-11T03:10:23Z
26787551	28.467188	-81.304381	NE2	207745	2012-01-17T13:28:51Z
26787552	28.468235	-81.306048	NE2	207745	2010-09-01T18:07:39Z

nodes_id	key	value
26787543	highway	traffic_signals
26787546	highway	traffic_signals
26915727	highway	traffic_signals
26927708	railway	switch
26928881	railway	crossing

ways_id	user	user_id	version	changeset	timestamp
4450379	NE2	207745	9	13900685	2012-11-16T23:56:13Z
4450380	carciofo	2533093	8	46601436	2017-03-05T17:19:14Z
4450381	NE2	207745	7	13900685	2012-11-16T23:56:04Z
4450382	Mongo Poker	4256778	14	54913285	2017-12-25T18:38:04Z
4450385	NE2	207745	7	15348216	2013-03-13T07:43:17Z

Element	Occurrences
nd	396938
node	309681
tag	247303
member	54731
way	54250
relation	4164
osm	1
note	1
meta	1
bounds	1

Table Name	Count
nodes	309637
nodes_tags	34188
ways	54231
ways_nodes	396734
ways_tags	194130

ways_id	nodes_id	position
4450379	27336999	0
4450379	498745566	1
4450379	498745568	2
4450379	498745571	3
4450379	498745577	4

ways_id	key	value	type
4450379	bicycle	no	regular\r
4450379	bridge	yes	regular\r
4450379	bridge_number	750486	regular\r
4450379	foot	no	regular\r
4450379	highway	motorway_link	regular\r

user_id	num_contributions	pct_of_total
NE2	148676	41.0
Mongo Poker	34089	9.0
crystalwalrein	24348	7.0
RuckRunnerGRT	20705	6.0
epcotfan	18929	5.0
dale_p	10622	3.0
3yoda	9489	3.0
Brian@Brea	8876	2.0
pyram	5240	1.0
Rub21	5211	1.0

source	id	type_1	postal_code	type_2	value_2
nodes_tags	605014875	postcode	32835	amenity	restaurant
nodes_tags	3059662260	postcode	32801	amenity	restaurant
nodes_tags	3059667206	postcode	32801	amenity	restaurant
nodes_tags	3330802043	postcode	32819	amenity	restaurant
nodes_tags	3846680191	postcode	32803	amenity	restaurant
nodes_tags	4142734111	postcode	32801	amenity	restaurant
nodes_tags	4161915546	postcode	32819	amenity	restaurant
nodes_tags	4613278490	postcode	32819	amenity	restaurant
nodes_tags	4834972122	postcode	32819	amenity	restaurant
nodes_tags	4834976721	postcode	32806	amenity	restaurant
nodes_tags	4835237321	postcode	32804	amenity	restaurant
nodes_tags	5194583683	postcode	32803	amenity	restaurant
ways_tags	183282408	postcode	32822	amenity	restaurant
ways_tags	306611107	postcode	32801	amenity	restaurant
ways_tags	313780404	postcode	32818	amenity	restaurant
ways_tags	406324418	postcode	32806	amenity	restaurant
ways_tags	406324420	postcode	32806	amenity	restaurant
ways_tags	406324425	postcode	32806	amenity	restaurant

Cuisine	Frequency
american	10
mexican	9
asian	6
italian	2
Tex-Mex	1
barbecue	1
cuban	1
japanese	1
vegetarian	1

lon	lat	name
28.488172	-81.428251	Olive Garden
28.488533	-81.428385	BJs Bar and Grill
28.489898	-81.430359	TGI Fridays
28.503706	-81.458043	Wing House
28.503323	-81.458029	Ruby Tuesdays
28.502807	-81.458022	Dennys
28.593953	-81.306419	Wing House
28.472068	-81.469385	Confisco Grille
28.472419	-81.471485	Mythos
28.516199	-81.481889	Tavern on the Lake

Cuisine	Frequency
burger	76
chicken	8
coffee_shop	7
sandwich	6
pizza	4
asian	3
chinese	3
barbecue,_american	1
chicken_sandwich	1
chocolate	1
coffee_and_donut_shop	1
doughnut	1
ice_cream;italian_pizza;noodles	1
japanese	1
mediterranean;mexican;lebanese	1
oriental;thai;sushi	1
regional	1
subs	1