ABSTRACT   Typical tasks involved in data wrangling are exhibited by applying the techniques of data acquisition, data extraction and parsing, data quality audit, data transformation and data integration on the Orlando Open Street Map data. Eventually, the processed map data is prepared for storage in a persistent SQL data store for further data exploration and analysis.

Introduction

Data that is used to support inferential models or exploratory and analytical studies usually do not come tidy. That is, raw and messy data do not come in a form that can readily be consumed by downstream processes.

This is especially true for data that were acquired by unconventional means such as web scraping or whose data elements were collected from an unknown provenance. Extra work has to be done on the data to make them easier to work with and useful for whatever purpose we may need them for later.


The OpenStreetMap (OSM) data is an example of a likely messy dataset. OpenStreetMap makes available user-contributed map data about local roads and businesses. Because information about roads, highways and other map objects are done by individuals, issues about the accuracy of the data may occur or at most be reasonably expected. This makes an OSM file a very suitable candidate for illustrating the data wrangling process.

Let us begin by loading the Python modules which we will be using in this project.

In [53]:
import codecs
import os
import os.path
import pandas as pd
import pprint
import pymongo
import re
import requests
import sys
import time
import xml.etree.cElementTree as ET
import json
from collections import defaultdict
from IPython.display import display, HTML


Data Acquisition

The OpenStreetMap data which we will be working on in this project is taken from Orlando, Florida. I chose this place for personal reasons; it is a city with which I am quite familiar.

The Orlando City OSM map file is acquired from the OpenStreetMap Project interactive web application where a user can manually search and select an area of the city for export and download.

Using an API to acquire the OpenStreetMap XML dataset

There are several ways to acquire the XML OSM map file for a particular location. One method is to download a pre-selected area from MapZen and another method is to use the Overpass API to retrieve a map area tailored for your needs.

I decided to use the latter method in order to have control of the boundaries of the custom map area within the city of Orlando within which I would like to focus my study.

Moreover, I like to have the freedom to choose a map area large enough that will generate a sizeable file, preferably more than 50 MB. A large yet manageable OSM XML file will yield more data elements for analysis.

The get_map_file() function is used to interface with the Overpass web API application to generate the OSM XML map file that we need for this project.

In [54]:
def get_map_file(api_url, local_filename):
    """
    Use the Overpass API to download a custom OSM XML map file
    """
    total_size = 79567834  # 77.703 KB
    
    if os.path.exists(local_filename):
        if os.path.getsize(local_filename) == total_size:
            print 'The file {} already exists and verified. File size: {}'.format( \
                local_filename, os.path.getsize(local_filename))
            return None

    r = requests.get(api_url, stream=True)
    with open(local_filename, 'wb') as f:
        print "Downloading %s" % local_filename, '...'
        
        dl_size = 0
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
                dl_size += len(chunk)
                done = int(100 * dl_size / float(total_size))
                sys.stdout.write("\r[%s%s]" % ('-' * done, ' ' * (100 - done)))    
                sys.stdout.flush()

    print '\nDownload finished. {} is ready.'.format(local_filename)

To use the get_map_file() function, we pass in a couple of parameters: API_URL parameter is the URL signature that contains the bounding box information that will be used by the Overpass API to generate the map file. OSM_FILE parameter is a custom file name (and path) that will be assigned to the downloaded map file.

In [55]:
API_URL  = 'https://overpass-api.de/api/map?bbox=-81.5120,28.4558,-81.2205,28.6135'
OSM_FILE = 'orlando.osm'

get_map_file(API_URL, OSM_FILE)
The file orlando.osm already exists and verified. File size: 79567834

Here is a description of the raw XML OSM data file that we acquired from the OpenStreetMap web application.

File Property Description
Data Source OpenStreetMap Project API
Data Format OSM XML
File Size 77.7 MB

A snapshot of the raw data file follows. This is generated by the OpenStreetMap Project web application.

It shows that the OSM data file makes use of UTF-8 character encoding, a dominant internet encoding standard. It also tells us that the file is marked up using the standard specifications for XML data format. We also get a hint of the presence of the osm version tag and attribute that OpenStreetMap also adheres to a specialized set of document standards and specifications.

Data Cleaning

Now that we have acquired the data in the form of an XML file we can begin to examine the file superficially paying particular attention to its makeup. By doing this, we can start forming some questions that enhance our understanding of its contents which might then lead to more curious questions later.

Getting To Know The Dataset

In [56]:
def count_tags(filename):
    """
    Returns a dictionary of name and count of top-level XML tags
    """
    tags_dict = {}  # initialize an empty dictionary

    # SAX parsing of the file
    for event, elem in ET.iterparse(filename, events=('start', )):
        if elem.tag not in tags_dict:
            tags_dict[elem.tag] = 1
        else:
            tags_dict[elem.tag] += 1

    return tags_dict

Since we know that the data file is in XML format, let us try to discover the top-level elements that comprise the dataset.

The function count_tags() above uses the Simple API for XML or SAX method of parsing to iteratively process the map file. The function returns a dictionary of unique OSM element names as keys and the cardinality of each element as values.

This will give us information about the type and volume of data we can expect to encounter in the map file.

In [57]:
OSM_FILE = 'orlando.osm'

tags = count_tags(OSM_FILE)
sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in tags.items()], reverse=True)]

df = pd.DataFrame(sorted_by_count)
df.columns = ['Element', 'Occurrences']
display(HTML("<h4>Types of elements and occurrences</h4>"))
display(HTML(df.to_html(index=False)))

Types of elements and occurrences

Element Occurrences
nd 396938
node 309681
tag 247303
member 54731
way 54250
relation 4164
osm 1
note 1
meta 1
bounds 1

The sorted report allowed us to discover the top-level elements that make up the OSM map data. They are the elements which have only one occurrence: bounds, meta, note, and osm.

Element Description
osm Information about the API application or service that generated the OSM file.
note Additional information about the document.
meta Usually holds timestamp information about the generated file.
bounds Contains the map file bounding box longitude and latitude information.

The next set of elements worth paying attention to are the node, way and relation elements. OpenStreetMap's conceptual data model of the physical world makes use of these basic components and it is worth knowing what constitutes each of them.

OpenStreetMap's documentation on some of the other primary elements are briefly described below. It is worthwhile to be familiar with these data elements so we know the types of data that are available to us and where to find them.

Element Description
Node (nd)
  • Specific point on the surface defined by latitude and longitude.
  • They can be used to define the shape of a way.
  • They can also be included as a member of a relation.
  • Way (way)
  • An ordered list of between 2 and 2,000 nodes
  • Can represent area boundaries like buildings or forests
  • Relation (relation)
  • A multi-purpose data structure that describes a relationship between two or more data elements like nodes, ways and other relations.
  • Primarily an ordered list of nodes, ways or other relations.
  • Tag (tag)
  • Tags can be a member of all types of data elements.
  • They describe the meaning of the particular element to which they are attached.
  • A tag is composed of two text fields: a unique key and a value.
  • An element cannot have two tags with the same key.
  • XML atttributes provide additional information about the particular element to which they are associated.

    The get_attributes() function will help us discover what and how many attributes exist in our dataset. We shall call this function and sort the results to determine the most frequent attributes that are in use in our data file.

    In [58]:
    def get_attributes(filename):
        """
        Identify and bin each attribute using a dictionary
        """
        attrs = {}
        for event, elem in ET.iterparse(filename, events=('start', 'end')):
            if event == 'end':
                for attr in elem.attrib:
                    if attr not in attrs:
                        attrs[attr] = 1
                    else:
                        attrs[attr] += 1
        return attrs
    
    OSM_FILE = 'orlando.osm'
    
    start_time = time.time()
    attrs = get_attributes(OSM_FILE)
    sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in attrs.items()], reverse=True)]
    
    df = pd.DataFrame(sorted_by_count)
    df.columns = ['Attribute', 'Occurrences']
    display(HTML("<h4>Element Attributes and Occurrences</h4>"))
    display(HTML(df.to_html(index=False)))
    
    # print('\n--- %s seconds ---' % (time.time() - start_time))
    

    Element Attributes and Occurrences

    Attribute Occurrences
    ref 451669
    version 368096
    user 368095
    uid 368095
    timestamp 368095
    id 368095
    changeset 368095
    lon 309681
    lat 309681
    v 247303
    k 247303
    type 54731
    role 54731
    osm_base 1
    minlon 1
    minlat 1
    maxlon 1
    maxlat 1
    generator 1

    Of particular interest to us are the k (key) and v (value) attributes which are displayed above as appearing 247,303 times in our dataset.

    The key and value attributes are members of tag elements and as was described before, a tag describes specific map features of an OSM map element be they nodes, ways or relations. A key is unique within each tag element and it particularly shows a topic, category or type of a map feature (such as a highway or a name). The value, on the other hand, provides the details of the feature classified by the key associated with it.

    In [59]:
    def get_keys(filename):
        """
        Collect key attributes and occurrences in a dictionary
        """
        keys = {}
        for event, elem in ET.iterparse(filename, events=('start', 'end')):
            if event == 'end':
                key = elem.attrib.get('k')
                if key:
                    if key not in keys:
                        keys[key] = 1
                    else:
                        keys[key] += 1
        return keys
    

    The get_keys() function above returns a list of dictionaries containing each key and associated frequency in the OSM data file.

    In [60]:
    # Returns the top five key-occurrences combinations from the OSM file
    OSM_FILE = 'orlando.osm'
    
    start_time = time.time()
    keys = get_keys(OSM_FILE)
    sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]
    
    df = pd.DataFrame(sorted_by_count)
    df.columns = ['Key', 'Occurrences']
    display(HTML("<h4>Key Attributes and Occurrences</h4>"))
    display(HTML(df[:5].to_html(index=False)))
    
    # print('\n--- %s seconds ---' % (time.time() - start_time))
    

    Key Attributes and Occurrences

    Key Occurrences
    highway 42180
    name 23953
    sidewalk 19416
    boundary 12806
    surface 12551

    While listing the different types of keys or map features available, we notice that there are a variety of different keys:

    • Lower case letters joined by an underscore symbol (old_name)
    • Lower case letters joined by a colon symbol (addr:city)
    • Lower case letters joined by a period (addr.place)

    According to the OpenStreetMap documentation regarding keys and values:

    keys can be qualified with prefixes, infixes, or suffixes (usually, separated with a colon), forming super- or sub-categories or namespace.

    It is useful to know that key attributes can be hierarchical and that they themselves may contain other data structures.

    Let us remind ourselves that the purpose of this data audit is to discover potential problems with the raw data with an eye towards improving the data quality before ultimately putting them into persistent storage.

    The key_type() function will check the contents of each tag element's key attribute (k) through the use of three regular expressions to verify certain patterns. It will return the number of matches that it finds which will inform us of any potential tags that might require further examination.

    In [61]:
    lower = re.compile(r'^([a-z]|_)*$')
    lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
    problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
    
    
    def key_type(element, keys):
        """
        Categorize and count each tag element
        """
        if element.tag == "tag":
            if lower.search(element.attrib['k']):
                keys['lower'] += 1
            elif lower_colon.search(element.attrib['k']):
                keys['lower_colon'] += 1
            elif problem_chars .search(element.attrib['k']):
                keys['problem_chars'] += 1
            else:
                keys['other'] += 1
            
        return keys
    
    
    def process_map(filename):
        keys = {"lower": 0, "lower_colon": 0, "problem_chars": 0, "other": 0}
        for _, element in ET.iterparse(filename):
            keys = key_type(element, keys)
    
        return keys
    

    We now pass in our Orlando OSM data file as a parameter to the process_map() function which will iteratively parse through the XML file applying the key_type() function over the file's records.

    In [62]:
    OSM_FILE = 'orlando.osm'
    keys = process_map(OSM_FILE)
    
    sorted_by_count = [(k, v) for (v, k) in sorted([(value, key) for (key, value) in keys.items()], reverse=True)]
    
    df = pd.DataFrame(sorted_by_count)
    df.columns = ['Key Type', 'Occurrences']
    display(HTML("<h4>Type of Key Formats and Occurrences</h4>"))
    display(HTML(df.to_html(index=False)))
    

    Type of Key Formats and Occurrences

    Key Type Occurrences
    lower 198838
    lower_colon 39379
    other 9086
    problem_chars 0

    It looks like we are safe because the report did not show any occurrence of problematic keys. We can continue with our data wrangling tasks trusting that we can safely use the key attributes going forward.

    Auditing Street Names

    Among the key attributes that may be commonly expected in a map file are those that pertain to street names. The tag element holds the street name information in the value associated with its addr:street key attribute.

    Let us list several addr:street values to identify errors and untidy data and to get an idea of what type of processing or cleaning we might have to do with those values. The code below helps us do this.

    In [63]:
    OSM_FILE = 'orlando.osm'
    street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
    
    
    def audit_street_type(street_types, street_name):
        """
        Use the regex pattern to match street type part of the street name
        
        Args: 
            street_types(dictionary): Key-value collection of street types and occurrences
            street_name(string): The value of a tag pertaining to an address key
        """
        m = street_type_re.search(street_name)
        if m:
            street_type = m.group()
            street_types[street_type] += 1
    
    
    def is_street_name(elem):
        """
        Check if the element is a tag and the key attribute is a street information
        
        Args:
            elem(dictionary): OSM tag object in key-value pair format
        Returns:
            bool: Indicates whether contents of dictionary item pertains to a street
        """
        return (elem.attrib['k'] == "addr:street")
    
        
    def print_sorted_dict(d):
        """
        Displays the results from iterating through street type dictionary
        
        Args:
            d(dictionary): Key-value collection of street types and occurrences
        """
        sorted_by_count = sorted([(key, value) for (key, value) in d.items()], reverse=False)
        df = pd.DataFrame(sorted_by_count)
        df.columns = ['Street Type', 'Occurrences']
        display(HTML("<h4>Sample of Street Types and Occurrences</h4>"))
        display(HTML(df[:5].to_html(index=False)))
        
    def street_type_audit(file_in):
        """
        Wrapper file iterator for the main street type audit routine
        
        Args:
            file_in(string): Input map file
        Returns:
            street_types(dictionary): Key-value collection of street types and occurrences
        """
        osm_file = open(file_in, "r")
        street_types = defaultdict(int)
        for event, elem in ET.iterparse(osm_file, events=("start",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
                        
        return street_types
                
    
    if __name__ == '__main__':
        start_time = time.time()
        street_types = street_type_audit(OSM_FILE)
        print_sorted_dict(street_types)
        # print('\n--- %s seconds ---' % (time.time() - start_time))
    

    Sample of Street Types and Occurrences

    Street Type Occurrences
    #303 1
    #F 1
    136 1
    250 1
    32801 1

    The output of our street type audit shows that there are some indeterminable number and number-character patterns that were used (e. g. #303, #F, 136, etc.) in the street names. These can be further investigated in a different study.

    What we are interested in for this project is noticing the variety of ways a particular street type is expressed in the raw OSM data file. For example, Avenue which is predominantly used (190 occurrences) has two other variants: Ave (used 6 times in the file) and Ave. (with a period, used once).

    The discovery of this characteristic in our dataset provides a motivation for our first data cleaning task — we shall seek and update street type data in the street names and make certain street types conform to a uniform set of street type labels that we will specify. For instance, we will replace all occurrences of the terms Ave and Ave. with the standard term Avenue.

    Data Cleaning Plan for Street Names

    The data audit of the street types reveals the need for some work to be done in shaping the raw data into a form that we can use. But we should not undertake the cleaning of the data haphazardly. This is where having a data cleaning plan becomes very useful.

    A data cleaning plan pushes us to be organized and methodical in our approach towards data cleaning.

    I have prepared the following data cleaning plan for tidying the street names in the Orlando OSM dataset:

    1. Use the information obtained from the data audit
    2. Try to identify the causes of the dirty data
    3. Define a set of operations that will correct the data
    4. Test and verify the results of the cleaning operation.

    Procedure number 1 has been accomplished by our recent data audit work on the Orlando map file street names. As for procedure number 2, we can be sure that the cause of variations and other possible problems are mostly attributable to human error.

    What follows then is the need to develop some operations that will correct the data (procedure number 3).

    The code below defines some helper functions that will verify the street type used in each street name. The idea is to first test the street type if it matches any item in the list of expected street types that we have prepared. A mismatch marks the street type as a candidate for possible replacement.

    In [64]:
    street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
    
    expected = ["Avenue", "Boulevard", "Circle", "Court", "Drive", "Lane",
               "Loop", "Parkway", "Place", "Plaza", "Road", "Run", "Street",
               "Trail", "Turnpike", "Way"]
    
    mapping = { "pl": "Place",
                "st": "Street",
                "ave": "Avenue",
                "rd": "Road",
                "w": "West",
                "n": "North",
                "s": "South",
                "e": "East",
                "blvd":"Boulevard",
                "sr": "Drive",
                "ct": "Court",
                "ne": "Northeast",
                "se": "Southeast",
                "nw": "Northwest",
                "sw": "Southwest",
                "dr": "Drive",
                "sq": "Square",
                "ln": "Lane",
                "trl": "Trail",
                "pkwy": "Parkway",
                "ste": "Suite",
                "lp": "Loop",
                "hwy": "Highway"}
    
    # updated version of audit_street_type
    def audit_street_type(street_types, street_name):
        m = street_type_re.search(street_name)
        if m:
            street_type = m.group()
            if street_type not in expected:
                street_types[street_type].add(street_name)
    
    def is_street_name(elem):
        """
        Check if the element is a tag and the key attribute is a street information
        """
        return (elem.attrib['k'] == "addr:street")
                
    def audit_street_names(osmfile):
        osm_file = open(osmfile, "r")
        street_types = defaultdict(set)
        for event, elem in ET.iterparse(osm_file, events=("start",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    
        return street_types
    
    def get_valid_name(name, mapping):
        name_list = re.findall(r"[\w']+", name)
        end_of_street_name = len(name_list)
        for i in range(len(name_list)):
            word = name_list[i].lower()
            if word in mapping:
                end_of_street_name = i
                name_list[i] = mapping[word]
            
        name_list = name_list[:(end_of_street_name + 1)]
        better_name = ' '.join(name_list)
        return better_name
    

    Our primary goal in auditing and cleaning the street type data is to homogenize the frequently used terms, that is, to make them uniform.

    Because of the street type audit report that we produced, we are able to come up with a set of street type terms which we know will need to be checked and corrected if necessary and include them in a python list called expected.

    We also prepared a dictionary of commonly used street type contractions (mapping) as keys and paired each of them with a corrected version that we like to use in the update.

    Let us put our helper functions to work. The code below loads and iterates through the OSM map file gathering all street types and keeping each uncommon street name that it finds into a dictionary keyed on the street type (the street name contains a street type that is not in our list of expected street types).

    The dictionary (street_types) is then put into another iterative process whereby the street name finds its corrected version based on a mapping of how we want a street type to appear when it finds an incorrect street type (get_valid_name). The proposed street name conversion is then displayed for each visit into the dictionary elements.

    In [65]:
    start_time = time.time()
    
    OSM_FILE = 'orlando.osm'
    street_types = audit_street_names(OSM_FILE)
    
    display(HTML("<h4>Proposed street name conversions</h4>"))
    
    for st_type, ways in street_types.iteritems():
        for street_name_in_file in ways:
            valid_name = get_valid_name(street_name_in_file, mapping)
            if street_name_in_file != valid_name:
                print street_name_in_file, "===>", valid_name
    
    print('\n--- %s seconds ---' % (time.time() - start_time))
    

    Proposed street name conversions

    E Colonial Dr Orlando, FL 32801 ===> East Colonial Drive
    International Dr.  Store 6G ===> International Drive
    N Lakemonte Ave. ===> North Lakemonte Avenue
    e. colonial dr ===> East colonial Drive
    East Colonial Drive #303 ===> East Colonial Drive 303
    E Robinson St ===> East Robinson Street
    Livingston  St ===> Livingston Street
    Curry Ford Rd ===> Curry Ford Road
    S. Kirkman Rd ===> South Kirkman Road
    Aloma Avenue #F ===> Aloma Avenue F
    S Goldenrod Rd Ste 136 ===> South Goldenrod Road Suite
    Central Blvd ===> Central Boulevard
    S Ivanhole Blvd ===> South Ivanhole Boulevard
    N Semoran Blvd ===> North Semoran Boulevard
    Lee Vista Blvd ===> Lee Vista Boulevard
    North Orange Ave ===> North Orange Avenue
    South Park Ave ===> South Park Avenue
    S Division Ave ===> South Division Avenue
    W Fairbanks Ave ===> West Fairbanks Avenue
    Seminole Ave ===> Seminole Avenue
    South Rio Grande Ave ===> South Rio Grande Avenue
    420-E STATE ROAD 434 ===> 420 East
    Lake Ellenor Dr ===> Lake Ellenor Drive
    New York Avenue ste 250 ===> New York Avenue Suite
    N. Semoran blvd. ===> North Semoran Boulevard
    
    --- 6.89700007439 seconds ---
    

    The output shown above shows us the results of the proposed street name conversions that will happen should we execute the data cleaning operations that we have defined. Our efforts to ensure uniformity among certain street types which we identified during the data audit were effective.

    The clean_street_names() function that follows will implement the data cleaning tasks on the street names by passing it our Orlando OSM file. The function also generates a seperate output file, cleaned_street_name.xml, which will be a version of the original OSM file in which the street names have undergone data transformation.

    In [66]:
    def clean_street_names(filename, cleaned_filename):
        """
        Apply data cleaning on street names
        """
        tree = ET.parse(filename)
        root = tree.getroot()
    
        for tag in root.findall('*/tag'):
            if is_street_name(tag):
                name = tag.get('v')
                better_name = get_valid_name(name, mapping)
                tag.set('v', better_name)
    
        return tree.write(cleaned_filename)
    
    
    start_time = time.time()
    
    OSM_FILE = 'orlando.osm'
    cleaned_street_name = 'cleaned_street_name.xml'
    clean_street_names(OSM_FILE, cleaned_street_name)
    
    print('\n--- %s seconds ---' % (time.time() - start_time))
    
    --- 34.6669998169 seconds ---
    

    As the last step in this iteration of the street name data cleaning plan (specifically procedure 4), we shall run the get_valid_name() function on our dataset. However, we will be using our newly-created output file cleaned_street_name.xml to check whether some street type related issues still persist in our dataset.

    I mentioned the word iteration because data cleaning is usually an iterative process. We have to repeat our data cleaning process not only to (1) verify the effectiveness of our data cleaning efforts but also (2) to examine if there is still more cleaning left to be done.

    In [67]:
    cleaned_street_name = 'cleaned_street_name.xml'
    street_types = audit_street_names(cleaned_street_name)
    vcount = 0
    for st_type, ways in street_types.iteritems():
        for name in ways:
            better_name = get_valid_name(name, mapping)
            if name != better_name:
                vcount = vcount + 1
                print name, "=>", better_name
                
    if vcount == 0:
        display(HTML("<br/><p style=\"color:green;\">SUCCESS: There were no street names to convert.</p>"))
    

    SUCCESS: There were no street names to convert.

    Auditing Telephone Numbers

    The general audit that we made on the entire Orlando OSM map file informed us of the existence of certain tag elements bearing the phone key label.

    Let us explore a sample of 52 values that are contained in phone tag elements to get an idea of whether or not they require further processing or cleaning. Again, we are doing this within the context of the desire to homogenize our data whenever it is warranted and applicable.

    In [68]:
    OSM_FILE = 'cleaned_street_name.xml'
    
    display(HTML("<h4>Sample Phone Numbers</h4>"))
    
    count_max = 10
    count_number = 0
    for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
        if  element.tag == "tag":
            if element.attrib['k'] == 'phone' \
            and count_number < count_max:
                print(element.attrib['v'])
                count_number += 1
                
    print '\nTotal count of sample phone numbers:', count_number
    

    Sample Phone Numbers

    +1-407-380-9971
    +1-407-380-9971
    +1-407-290-6977
    +1-407-290-6977
    +1 407 303 1700
    +1 407 303 1700
    407-623-3300
    407-623-3300
    +1-407-645-5311
    +1-407-645-5311
    
    Total count of sample phone numbers: 10
    

    The phone numbers take on different formats. This is expected because the individuals who volunteered to input the data may have different ideas on how a phone number should be entered.

    Our data stores the phone numbers in the following common ways:

    +1-234-567-8901
    123-456-7890
    1234567890
    (123) 456-7890
    

    Data Cleaning Plan for Telephone Numbers

    The design for dealing with the tag elements containing the phone numbers in the OSM file is as follows:

    1. Conduct a data audit of the telephone numbers
      • At this point we have already used and examined the information contained in the map file.
    2. Identify the cause of the issue that we would like to address.
      • We pointed out that the multi-formatted values is the result of the differences in the style in which individuals prefer to enter the phone numbers.
    3. Define a set of operations that will correct the data.
      • We will define some functions that will iterate through the map file and update the phone numbers according to a predetermined standard format.
    4. Test and verify the results of the cleaning operation
      • The results of the data cleaning effors will be re-examined to check for lingering problems or discover new issues.

    To give us an idea of the number of phone numbers the map file actually contains, we will re-use the code we ran earlier.

    In [69]:
    OSM_FILE = 'cleaned_street_name.xml'
    
    count_number = 0
    for _, element in ET.iterparse(OSM_FILE, ('start', 'end')):
        if  element.tag == "tag":
            if element.attrib['k'] == 'phone':
                count_number += 1
                
    
    print '\nTotal count of phone numbers:', count_number
    
    Total count of phone numbers: 236
    

    Next, we will define our goal of cleaning that portion of the data that pertains to phone numbers. By this, cleaning the phone numbers basically means making all the values adhere to a uniform phone number format. And for our situation, the format would be to just retain all the digits and strip all symbols and inner white space (e.g. (407) 123-4567 will become 4071234567).

    This policy is operationalized through the use of a regular expression patterns.

    Let us now find out how many phone numbers actually adhere to our uniform format. The phone numbers can be classified into two groups: numbers_only and other_format (shown below with examples):

    numbers_only
  • 4077309919
  • 407-879-7763
  • other_format
  • +014077303212
  • +1 407 3730622
  • (407) 622-2010
  • In [70]:
    strictly_numbers = re.compile(r'\d+')
    
    def count_phone_types(element, phone_types):
        if element.tag == "tag":
            if element.attrib['k'] == 'phone':
                if strictly_numbers.match(element.attrib['v']):
                    phone_types['numbers_only'] += 1
                else:
                    phone_types['other_format'] += 1
           
        return phone_types
    
    
    def process_map_phone_types(filename):
        phone_types = {"numbers_only": 0, "other_format": 0}
        for _, element in ET.iterparse(filename, ('start', 'end')):
            phone_types = count_phone_types(element, phone_types)
    
        return phone_types
    
    display(HTML("<h4>Phone Types and Occurrences</h4>"))
    pprint.pprint(process_map_phone_types(cleaned_street_name))
    

    Phone Types and Occurrences

    {'numbers_only': 80, 'other_format': 156}
    

    We will endeavor to clean the telephone data and expect to transform the values of the phone number elements into a numbers-only format. This format lends itself conveniently well for building a uniformly formatted set of phone numbers.

    The clean_phone_number() function implements the data cleaning task required to standardize the telephone number format by applying a regular expression substitution method on the OSM file that has recently undergone street type data cleaning from a previous cleaning plan.

    In [71]:
    def clean_phone_number(filename, cleaned_filename):
        """
        Apply the data cleaning process on OSM file
        
        Args:
            filename(string): Name of uncleaned file
            cleaned_filename(string): Name of file after being processed
        """
        tree = ET.parse(filename)
        root = tree.getroot()
    
        for tag in root.findall('*/tag'):
            if tag.attrib['k'] == 'phone':
                phone_number = tag.get('v')
                if not strictly_numbers.match(phone_number):
                    update_phone_number = re.sub(r'[\D|\s]+', '', phone_number)[-10: ]
                    tag.set('v', update_phone_number)
    
        return tree.write(cleaned_filename)
    

    Let us use the cleaned_street_name.xml OSM file and the name of the file that will hold the results of the data cleaning process (cleaned_phone_number.xml) as arguments to the clean_phone_number() function.

    Right after generating the cleaned_phone_number.xml output file, we will use this file again as an attribute to the process_map_phone_types() function to check the results of our data cleaning efforts.

    In [72]:
    cleaned_street_name = 'cleaned_street_name.xml'
    
    start_time = time.time()
    
    cleaned_phone_number = 'cleaned_phone_number.xml'
    clean_phone_number(cleaned_street_name, cleaned_phone_number)
    
    display(HTML("<h4>Data cleaning of phone number results</h4>"))
    pprint.pprint(process_map_phone_types(cleaned_phone_number))
    
    # print('\n--- %s seconds ---' % (time.time() - start_time))
    

    Data cleaning of phone number results

    {'numbers_only': 236, 'other_format': 0}
    

    The function returns a zero under the other_format group which means that the phone numbers have been properly transformed to only hold numbers.

    The output of the telephone data cleaning process, cleaned_phone_number.xml, is going to be used in the next phase of our data cleaning efforts.

    Auditing Postal Codes

    When I set out to get the Orlando dataset from the Open Street Map project's interactive web application, I did so by arbitrarily drawing a square region on top of the area that encompasses much of Orlando. I understand that this may not be precise which means that I may have grabbed some locations that are actually outside the jurisdiction of Orlando.

    In this section of the data cleaning project, we will have the opportunity to be more precise by detecting and keeping those data whose location information only belongs to city of Orlando. And we are going to do this using the zip or postal code information that are stored in the file.

    An interesting element of the Open Street Map file are the postal code values associated with the key: addr:postcode. To get an idea of the various postal codes that are in the data as well as the number of data elements that are associated with each unique postal code, we have developed the count_postcodes() function.

    In [73]:
    OSM_FILE = 'cleaned_street_name.xml'
    
    def count_postcodes(filename):
        """
        Extract postal codes and generate frequencies
        
        Args:
            filename(string): Current OSM XML file that is currently
                              going through the data cleaning pipeline
        Returns:
            postcodes(dictionary): Post code and frequency dictionary
        """
        postcodes = {}
        for event, elem in ET.iterparse(filename, events=('start', 'end')):
            if event == 'end':
                key = elem.attrib.get('k')
                if key == 'addr:postcode':
                    postcode = elem.attrib.get('v')
                    if postcode not in postcodes:
                        postcodes[postcode] = 1
                    else:
                        postcodes[postcode] += 1
        return postcodes
    
    
    start_time = time.time()
    
    postcodes = count_postcodes(OSM_FILE)
    sorted_by_occurrence = [(k, v) for (v, k) in sorted([(value, key) 
                                                         for (key, value) 
                                                         in postcodes.items()], reverse=True)]
    
    
    df = pd.DataFrame(sorted_by_occurrence)
    df.columns = ['Postal Code', 'Frequency']
    display(HTML("<h4>Sample Postal Codes and Frequencies</h4>"))
    display(HTML(df[:5].to_html(index=False)))
    

    Sample Postal Codes and Frequencies

    Postal Code Frequency
    32835 37
    32789 36
    32803 33
    32806 30
    32801 26

    The preliminary audit results tells us that most of the postal codes tend to cluster around a small group of numbers (32835, 32789, 32803, 32806, 32801, 32839, 32804, 32822, 32819, 32792 and 32809).

    It also shows that there are a few postcodes that were entered using a different format (e.g. 32819-7827 and the like) as well as erroneous entries (e.g. 'FL 32803'). Data cleaning sometimes involves manual corrections and we can deal with these few errant entries programmatically using the fix_select_zipcode() function that we have prepared. The function will replace these select post codes with the proper format or data based on the original values.

    Data Accuracy

    Looking into list of postal codes which our data audit revealed, it is possible that within our data, the postal code (and its associated entry) is not actually valid for our purposes. We therefore need to distinguish those codes that are within the jurisdiction of the city of Orlando and those which lie without.

    In order to audit the addr:postcode field for accuracy, we are going to need some means of programmatically processing it and comparing the values that we encounter to some set of data that we can reliably assume is correct.

    The gold standard data that we are going to use for this project is the set of postal codes from USPS which is corroborated by zip-codes.com. There is a total of 59 postal codes ranging from 32801 through 32899.

    We are now able to perform another data audit given this gold standard data in hand. The following code will tell us how many node, way and relation elements are associated with postal codes (addr:postcode) that belong to the city of Orlando (orlando_postcodes) and those that do not (other_postcodes). It does so by iterating over the addr:postcode key values and comparing each with a list of official Orlando postal codes retrieved from a reliable source.

    In [74]:
    OSM_FILE = 'cleaned_street_name.xml'
    valid_orlando_postcodes = [
        '32801','32802','32803','32804','32805','32806','32807','32808','32809',
        '32810','32811','32812','32814','32815','32816','32817','32818','32819',
        '32820','32821','32822','32824','32825','32826','32827','32828','32829',
        '32830','32831','32832','32833','32834','32835','32836','32837','32839',
        '32853','32854','32855','32856','32857','32858','32859','32860','32861',
        '32862','32867','32868','32869','32872','32877','32878','32885','32886',
        '32887','32891','32896','32897','32899']
    
    def get_postcode(elem):
        """
        Extracts the postal code value from the dictionary item
        
        Args:
            elem(Object): OSM element object
        Returns"
            bool: The element is a postal code element
            value(string): The postal code that is inside the dictionary
        """
        if elem.tag in ['node', 'way', 'relation']:
            for tag in elem.iter():
                if tag.get('k') == 'addr:postcode':
                    return True, tag.get('v')
            return False, None
        return False, None
    
    def fix_select_zipcode(zipcode):
        """
        Converts the postal code to the correct format
        
        Args:
            zipcode(string): postal code value
        Returns"
            new_zip(string): corrected postal code value
        """
        new_zip = ''
        if zipcode == 'FL 32803':
            new_zip = '32803'
        elif zipcode == 'FL 32792':
            new_zip = '32792'
        elif zipcode == '32819-7827':
            new_zip = '32819'
        elif zipcode == '32817-4427':
            new_zip = '32817'
        elif zipcode == '32805-1100':
            new_zip = '32805'
        elif zipcode == '32803-4806':
            new_zip = 32803
        elif zipcode == '32792-6801':
            new_zip = 32792
        else:
            new_zip = zipcode
    
        return new_zip
    
    
    def process_map(filename):
        """
        Accumulates postal codes that belong or do not belong to Orlando
        
        Args:
            filename(string): OSM map file name
        Returns:
            number_types(dictionary): Dictionary bins postal codes and counts
        """    
        number_types = {"orlando_postcodes": 0, "other_postcodes": 0}
        tree = ET.parse(filename)
        root = tree.getroot()
        
        for child in ['node', 'way', 'relation']:
            for element in root.findall(child):
                has_postcode, postcode_value = get_postcode(element)
                if has_postcode:
                    postcode_value = fix_select_zipcode(postcode_value)
                    if postcode_value in valid_orlando_postcodes:
                        number_types['orlando_postcodes'] += 1
                    else:
                        number_types['other_postcodes'] += 1
        return number_types
        
    display(HTML("<h4>Postal codes by jurisdiction</h4>"))
    pprint.pprint(process_map(OSM_FILE))
    

    Postal codes by jurisdiction

    {'orlando_postcodes': 283, 'other_postcodes': 64}
    

    Our project's stated scope and limitations permits us to discard the node elements (64) which hold non-Orlando city postal codes.

    Data Cleaning Plan for Orlando Postal Codes

    The design for dealing with the tag elements containing the postal numbers in the workflow OSM file is as follows:

    1. Conduct a data audit of the postal codes
      • This has been done. A preliminary count of all available postal codes present in the dataset was reported which also revealed the presence of values that may need manual or programmatic correction.
    2. Identify the cause of the issue that we would like to address.
      • A data accuracy audit identifies those postal codes that will be considered as invalid and unusable for the purposes of this project because they are situated outside the jurisdiction of the city of Orlando.
    3. Define a set of operations that will correct the data.
      • We will define a process using code that will iterate through the workflow map file and discard node elements which have failed our data accuracy test.
    4. Test and verify the results of the cleaning operation
      • We will re-apply the data audit process on the output OSM file to verify the results of the postal code data cleaning program.

    Continuing on with procedure number 3 from above, we will take the workflow OSM data file that had undergone street type data cleaning (cleaned_street_name.xml) and pass it as one of the arguments to the clean_postcode() function. The other argument would be the cleaned_postcode variable which contains the arbitrary name of the file that will hold the results of our postal code data cleaning process.

    In [75]:
    OSM_FILE = 'cleaned_street_name.xml'
    
    def clean_postcode(filename, cleaned_filename):
        tree = ET.parse(filename)
        root = tree.getroot()
        
        for child in ['node', 'way', 'relation']:
            for elem in root.findall(child):
                has_postcode, postcode_value = get_postcode(elem)
                if has_postcode:
                    if postcode_value not in valid_orlando_postcodes:
                        root.remove(elem)
        
        return tree.write(cleaned_filename)
    
    start_time = time.time()
    
    cleaned_postcode = 'cleaned_postcode.xml'
    clean_postcode(OSM_FILE, cleaned_postcode)
    
    
    print('\n--- %s seconds ---' % (time.time() - start_time))
    
    --- 38.001999855 seconds ---
    

    The clean_postcode() function visits each set of node, way or relation nodes from the OSM file and tries to find and extract any available addr:postcode value. It then performs a test and decides whether to discard the node that contains the postal code. This decision is based on whether or not the postal code can be found in the standard Orlando postal code list.

    In [76]:
    CLEANED_POSTCODE_FILE = 'cleaned_postcode.xml'
    display(HTML("<h4>Postal codes by jurisdiction (after data cleaning)</h4>"))
    pprint.pprint(process_map(CLEANED_POSTCODE_FILE))
    

    Postal codes by jurisdiction (after data cleaning)

    {'orlando_postcodes': 278, 'other_postcodes': 0}
    

    Finally, the code above implements the last step of the postal code data cleaning plan when it resubmits the newly generated OSM file as an argument to the process_map() function. This task will show the number of nodes that contain invalid postal codes. If the the function reports no remaining invalid post codes, then we are through with the data cleaning tasks that we set out to tackle for the purposes of this project.

    Our OSM file has undergone three data cleaning processes producing the final file, cleaned_postcode.xml, which we load into persistent storage to facilitate further data analysis.

    SQL Database Queries

    This section of the project would like to focus on the variety of data analysis options which become available once the dataset has been loaded into a relational database like SQLite. Thus, I shall briefly outline the work that I have done to import the cleaned OSM file into the database.

    • Generate the input CSV files. The cleaned Orlando OSM file (cleaned_postcode.xml) is used as input to a python application that I developed which parses and segregates the primary nodes and tags into their own CSV files, namely: nodes.csv, nodes_tags.csv, ways.csv, ways_nodes.csv and ways_tags.csv.
    • Create the SQLite database and tables. Several Data Definition Language (DDL) queries describing the schema are constructed and executed to create the SQLite database (orlando_osm.db) and the tables that will hold the OSM data. The tables were given names that match the CSV files with which they are populated. A few of them are shown below.
    • Load the CSV files into SQLite Database. The SQLite shell commands below were used to import all the CSV files as well as to remove the carriage return characters that I have found on the timestamp attributes.
      .mode csv
      .import nodes.csv nodes
      .import nodes_tags.csv nodes_tags
      .import ways.csv ways
      .import ways_nodes.csv ways_nodes
      .import ways_tags.csv ways_tags 
      UPDATE nodes SET timestamp = REPLACE(timestamp, X'0D', ''); UPDATE ways SET timestamp = REPLACE(timestamp, X'0D', '');

    The SQLite Database Tables

    Let us verify the results of the CSV file imports. The sqlite3 Python module allows us to connect and query our SQLite orlando_osm.db database file.

    File Name File Size
    Original Orlando OSM file 75.8 MB
    nodes.csv 24.4 MB
    nodes_tags.csv 1.26 MB
    ways.csv 2.98 MB
    ways_nodes.csv 9.29 MB
    ways_tags.csv 8.00 MB

    Code Re-use

    Notice that in the code that follows we will be using the execute_query() function every now and then as a convenient way of executing the various SQL queries. In this fashion, we will only be altering the query itself, which is contained in the static literal variable called QUERY.

    Data Table Record Counts

    Among the first things we would like to know right after loading our file into our database is to get an idea of how many records were loaded in each database table. The following code wil provide us this information.

    In [77]:
    import sqlite3
    import pandas as pd
    from collections import defaultdict
    from IPython.display import display, HTML
    
    DATABASE = 'orlando_osm.db'
    
    QUERY = '''
    SELECT 'nodes' AS nodes, COUNT(*) 
    FROM nodes
    UNION
    SELECT 'nodes_tags' AS nodes_tags, COUNT(*) 
    FROM nodes_tags
    UNION
    SELECT 'ways' AS ways, COUNT(*) 
    FROM ways
    UNION
    SELECT 'ways_nodes' AS ways_nodes, COUNT(*) 
    FROM ways_nodes
    UNION
    SELECT 'ways_tags' AS ways_tags, COUNT(*) 
    FROM ways_tags
    '''
    
    def create_connection(db_file):
        """
        Create a database connection to the SQLite database
        """
        try:
            conn = sqlite3.connect(db_file)
            return conn
        except Error as e:
            print(e)
        return None
    
    def execute_query(db, query):
        """
        Returns the query results as a list
        """
        conn = create_connection(db)
        with conn:
            cur = conn.cursor()
            cur.execute(query)
            return cur.fetchall()
    
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows)
    df.columns = ['Table Name', 'Count']
    display(HTML(df.to_html(index=False, header=True)))
    
    Table Name Count
    nodes 309637
    nodes_tags 34188
    ways 54231
    ways_nodes 396734
    ways_tags 194130

    The tables that have the greatest share of the Open Street Map records are the nodes, ways_nodes and ways_tags tables.

    nodes

    The query below should return the first five records of the nodes table. The interesting attributes of this table are the longitude and latitude coordinates which confirm what a node stands for, an Open Street Map node is a particular point on the earth's surface. The node also contains the the user name of the contributing person.

    In [78]:
    import sqlite3
    import pandas as pd
    
    DATABASE = 'orlando_osm.db'
    
    QUERY = '''
    SELECT id, lat, lon, user, uid, timestamp 
    FROM nodes 
    LIMIT 6;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows[1:])
    df.columns = ['nodes_id', 'latitude', 'longitude', 'user', 'user_id', 'timestamp']
    display(HTML(df.to_html(index=False, header=True)))
    
    nodes_id latitude longitude user user_id timestamp
    26771646 28.456316 -81.309911 KindredCoda 14293 2009-09-19T12:48:58Z
    26787543 28.466059 -81.276102 Mongo Poker 4256778 2018-01-30T19:48:25Z
    26787546 28.469043 -81.309970 NE2 207745 2013-03-11T03:10:23Z
    26787551 28.467188 -81.304381 NE2 207745 2012-01-17T13:28:51Z
    26787552 28.468235 -81.306048 NE2 207745 2010-09-01T18:07:39Z

    nodes tags

    The nodes_tags provides additional information about the node elements of the Open Street Map dataset. The information consists of key-value pairs describing information about road structures such as highway traffic signals, crossings, etc.

    In [79]:
    QUERY = '''
    SELECT id, key, value
    FROM nodes_tags
    LIMIT 6;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows[1:])
    df.columns = ['nodes_id', 'key', 'value']
    display(HTML(df.to_html(index=False, header=True)))
    
    nodes_id key value
    26787543 highway traffic_signals
    26787546 highway traffic_signals
    26915727 highway traffic_signals
    26927708 railway switch
    26928881 railway crossing

    ways

    The Open Street Map data dictionary specification for the ways element describes it as a an ordered list of nodes. By itself, our ways table consists of metadata pertaining to the element. The records in this table will be useful when paired with the ways_nodes intervening table as it will allow us to access the many side of the one-many relationship between the ways entities and the various nodes entities with which they are associated.

    In [80]:
    QUERY = '''
    SELECT id, user, uid, version, changeset, timestamp
    FROM ways
    LIMIT 6;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows[1:])
    df.columns = ['ways_id', 'user', 'user_id', 'version', 'changeset', 'timestamp']
    display(HTML(df.to_html(index=False, header=True)))
    
    ways_id user user_id version changeset timestamp
    4450379 NE2 207745 9 13900685 2012-11-16T23:56:13Z
    4450380 carciofo 2533093 8 46601436 2017-03-05T17:19:14Z
    4450381 NE2 207745 7 13900685 2012-11-16T23:56:04Z
    4450382 Mongo Poker 4256778 14 54913285 2017-12-25T18:38:04Z
    4450385 NE2 207745 7 15348216 2013-03-13T07:43:17Z

    ways nodes

    As introduced earlier, the ways_nodes table concretizes the one-many relationship between the ways and nodes entities. A way is seen as a collection of node elements.

    In [81]:
    QUERY = '''
    SELECT id, node_id, position
    FROM ways_nodes
    LIMIT 6;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows[1:])
    df.columns = ['ways_id', 'nodes_id', 'position']
    display(HTML(df.to_html(index=False, header=True)))
    
    ways_id nodes_id position
    4450379 27336999 0
    4450379 498745566 1
    4450379 498745568 2
    4450379 498745571 3
    4450379 498745577 4

    ways tags

    Just like the nodes_tags record, the ways_tags record keeps descriptive information about its parent element in the form of key-value pairs (e.g. {building: apartments}, {highway: pedestrian}, {landuse: grass}).

    In [82]:
    QUERY = '''
    SELECT id, key, value, type
    FROM ways_tags
    LIMIT 6;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows[1:])
    df.columns = ['ways_id', 'key', 'value', 'type']
    display(HTML(df.to_html(index=False, header=True)))
    
    ways_id key value type
    4450379 bicycle no regular\r
    4450379 bridge yes regular\r
    4450379 bridge_number 750486 regular\r
    4450379 foot no regular\r
    4450379 highway motorway_link regular\r

    Now that we have been briefly acquainted with the tables in our OSM database, we can start exploring it to glean some insights from the dataset.

    Querying unique contributors

    The Open Street Map Project is what can be called a living dataset in that it relies on its users who create, edit and update the data that make up the map system. The query that follows would provide information about the number of unique users who have volunteered to generate and maintain our special OSM map file for Orlando.

    In [83]:
    QUERY = '''
    SELECT COUNT(DISTINCT(U1.uid))
    FROM (SELECT uid 
          FROM nodes 
          UNION ALL 
          SELECT uid 
          FROM ways) U1
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows)
    df.columns = ['unique_uid_count']
    display(HTML("<h4>Number of Unique Contributors</h4>"))
    display(HTML(df.to_html(index=False, header=True)))
    

    Number of Unique Contributors

    unique_uid_count
    709

    There are 709 users, who among them have contributed the most? The query below lists ten of the most prolific contributors along with their number of contributions as well as their contribution's percentage share of the total contributions in our dataset.

    In [84]:
    QUERY = '''
    SELECT x2.user, x2.cnt, round(((x2.cnt*1.0/x2.tot)*100.0),0) as pct
    FROM (
      SELECT
        user,
        cnt,
        (SELECT COUNT(*)
        FROM (
          SELECT user
          FROM nodes
          UNION ALL
          SELECT user
          FROM ways
          )
        ) as tot
      FROM (
        SELECT
          U2.user,
          COUNT(*) AS cnt
        FROM (SELECT user
              FROM nodes
              UNION ALL
              SELECT user
              FROM ways) U2
        GROUP BY U2.user
      )
    ) x2
    order by x2.cnt DESC
    LIMIT 10;
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows)
    df.columns = ['user_id','num_contributions','pct_of_total']
    display(HTML("<h4>Top 10 Orlando Open Street Map Contributors</h4>"))
    display(HTML(df.to_html(index=False, header=True)))
    

    Top 10 Orlando Open Street Map Contributors

    user_id num_contributions pct_of_total
    NE2 148676 41.0
    Mongo Poker 34089 9.0
    crystalwalrein 24348 7.0
    RuckRunnerGRT 20705 6.0
    epcotfan 18929 5.0
    dale_p 10622 3.0
    3yoda 9489 3.0
    Brian@Brea 8876 2.0
    pyram 5240 1.0
    Rub21 5211 1.0

    Querying about postal codes

    I would like to revisit the postal codes which were the subject of our data audit and data cleaning tasks a brief while ago.

    Which postal codes had the highest frequency of use in our Orlando Open Street Map file?

    In [85]:
    QUERY = '''
    SELECT A1.value, COUNT(*)
    FROM
    (
        SELECT X1.value
        FROM ways_tags X1
        WHERE X1.key LIKE '%postcode%' 
        UNION ALL
        SELECT Y1.value
        FROM nodes_tags Y1
        WHERE Y1.key LIKE '%postcode%' 
    ) A1
    GROUP BY A1.value
    ORDER BY COUNT(*) DESC
    '''
    rows = execute_query(DATABASE, QUERY)
    pcodes1_df = pd.DataFrame(rows)
    pcodes1_df.columns = ['postal_code', 'count']
    display(HTML(pcodes1_df.to_html(index=False, header=True)))
        
    
    postal_code count
    32835 36
    32803 31
    32806 29
    32801 24
    32839 24
    32804 20
    32822 18
    32819 15
    32809 12
    32829 9
    32807 7
    32811 7
    32805 6
    32817 6
    32808 5
    32810 4
    32818 3
    32825 3
    32812 1

    The top five open street map contributions where a postal code was used came from the following areas:


    Postal Code Record Count Area Description
    32835 36 The southwest,Metrowest area
    32803 31 Colonialtown-Audubon Park area
    32806 29 Delaney Park area
    32801 24 Downtown Orlando area
    32839 24 Millenia Mall, Americana Blvd area


    These areas are among the busiest places in the city so it makes sense that the OSM contributions reflect this reality. The bar plot below provides a visualization of the post code frequencies that can be found in our dataset.

    In [86]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    QUERY = '''
        SELECT X1.value
        FROM ways_tags X1
        WHERE X1.key LIKE '%postcode%' 
        UNION ALL
        SELECT Y1.value
        FROM nodes_tags Y1
        WHERE Y1.key LIKE '%postcode%' 
    '''
    rows = execute_query(DATABASE, QUERY)
    pcodes2_df = pd.DataFrame(rows)
    pcodes2_df.columns = ['postal_code']
    
    fig, ax = plt.subplots(figsize=(14,6))
    plt.title("Number of OSM Elements with Post Code Data")
    sns.set(style="whitegrid", color_codes=True)
    g = sns.countplot(x="postal_code", 
                      data=pcodes2_df, 
                      palette="RdYlBu",
                      order = pcodes2_df['postal_code'].value_counts().index)
    g.set_xticklabels(g.get_xticklabels(),rotation=90);
    

    Querying about restaurants

    Among the things that I would like to explore from the Orlando Open Street Map dataset is information that would interest a foodie like me.

    I would like to know how many nodes and ways records hold information about restaurants.

    In [87]:
    QUERY = '''
    SELECT 'nodes_tags', COUNT(*)
    FROM nodes_tags
    WHERE key == 'amenity' AND value LIKE '%restaurant%'
    UNION ALL
    SELECT 'ways_tags', COUNT(*)
    FROM ways_tags
    WHERE key == 'amenity' AND value LIKE '%restaurant%'
    '''
    rows = execute_query(DATABASE, QUERY)
    df = pd.DataFrame(rows)
    df.columns = ['Table', 'Restaurants']
    display(HTML(df.to_html(index=False, header=True)))
    
    Table Restaurants
    nodes_tags 104
    ways_tags 43

    Building up on what we have so far, let us associate each node or way tag elements with its postal code. We intend to gather some information about restaurants from each of these collections for use later.

    In [88]:
    QUERY = '''
    SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
    FROM ways_tags X1
         JOIN ways_tags X2 on X1.id = X2.id
    WHERE X1.key LIKE '%postcode%' 
          AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
    UNION 
    SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
    FROM nodes_tags Y1
         JOIN nodes_tags Y2 ON Y1.id = Y2.id
    WHERE Y1.key LIKE '%postcode%' 
          AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%');
    '''
    rows = execute_query(DATABASE, QUERY)
    rest1_df = pd.DataFrame(rows)
    rest1_df.columns = ['source', 'id', 'type_1', 'postal_code', 'type_2', 'value_2']
    display(HTML(rest1_df.to_html(index=False, header=True)))
    
    source id type_1 postal_code type_2 value_2
    nodes_tags 605014875 postcode 32835 amenity restaurant
    nodes_tags 3059662260 postcode 32801 amenity restaurant
    nodes_tags 3059667206 postcode 32801 amenity restaurant
    nodes_tags 3330802043 postcode 32819 amenity restaurant
    nodes_tags 3846680191 postcode 32803 amenity restaurant
    nodes_tags 4142734111 postcode 32801 amenity restaurant
    nodes_tags 4161915546 postcode 32819 amenity restaurant
    nodes_tags 4613278490 postcode 32819 amenity restaurant
    nodes_tags 4834972122 postcode 32819 amenity restaurant
    nodes_tags 4834976721 postcode 32806 amenity restaurant
    nodes_tags 4835237321 postcode 32804 amenity restaurant
    nodes_tags 5194583683 postcode 32803 amenity restaurant
    ways_tags 183282408 postcode 32822 amenity restaurant
    ways_tags 306611107 postcode 32801 amenity restaurant
    ways_tags 313780404 postcode 32818 amenity restaurant
    ways_tags 406324418 postcode 32806 amenity restaurant
    ways_tags 406324420 postcode 32806 amenity restaurant
    ways_tags 406324425 postcode 32806 amenity restaurant

    The code above identifies the nodes and ways tag elements that pertain to restaurants as well as their associated postal codes. We could use this information to determine the number of restaurants that are located within a particular postal code.

    In [89]:
    QUERY = '''
    SELECT A1.value, COUNT(*)
    FROM
    (
        SELECT 'ways_tags', X1.id, X1.key, X1.value, X2.key, X2.value
        FROM ways_tags X1
        JOIN ways_tags X2 on X1.id = X2.id
        WHERE X1.key LIKE '%postcode%' 
            AND (X2.key == 'amenity' AND X2.value LIKE '%restaurant%')
        UNION ALL
        SELECT 'nodes_tags', Y1.id, Y1.key, Y1.value, Y2.key, Y2.value
        FROM nodes_tags Y1
        JOIN nodes_tags Y2 ON Y1.id = Y2.id
        WHERE Y1.key LIKE '%postcode%' 
            AND (Y2.key == 'amenity' AND Y2.value LIKE '%restaurant%')
    ) A1
    GROUP BY value;
    '''
    rows = execute_query(DATABASE, QUERY)
    rest2_df = pd.DataFrame(rows)
    rest2_df.columns = ['postal_code', 'restaurants']
    display(HTML(rest2_df.to_html(index=False, header=True)))
    
    postal_code restaurants
    32801 4
    32803 2
    32804 1
    32806 4
    32818 1
    32819 4
    32822 1
    32835 1

    Limitation: Missing postal codes in restaurants

    It is quite interesting to notice that our query resulted in only 18 restaurants and yet we know from a previous query that there were 147 ways and nodes entries pertaining to restaurant.

    This is because these are the only node and ways elements that had the accompanying postal code information when the OSM record was created. Out of the 147 OSM contributions that pertain to restaurant, OSM users gave additional postal code information on only 18 of them which leaves us with 129 restaurant records with missing postal codes.

    This situation is understandable and albeit not unexpected owing to the voluntary nature of the Open Street Map system. Users are not constrained to enter some additional information that might be otherwise be relevant.

    Let's return to our discussion of the restaurant dataset. It is more useful to visualize what is going on with the restaurant and postal code dataset by putting them in a bar plot.

    In [90]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import seaborn as sns
        
    fig, ax = plt.subplots(figsize=(11,6))
    plt.title("Number of Restaurants by Postal Code", fontsize=16)
    plt.ylabel('Count',fontsize=16)
    plt.xlabel('Postal Code', fontsize=16)
    sns.set(style="whitegrid", color_codes=True)
    sns.countplot(x="postal_code", data=rest1_df,
                  palette="muted",
                  order = rest1_df['postal_code'].value_counts().index);
    

    It is quite revealing to show how, given our limited number of query results, the data still supports the notion of which areas in Orlando can be assumed to have a bigger share of the restaurant market.

    Postal code 32806 represents the Delaney Park area which is about a mile southeast of downtown Orlando. Numerous restaurants in the area have taken advantage of workers coming from the three big businesses there: the Orlando Regional Medical Center, Winnie Palmer Hospital for Children and Orlando's Amtrak Station.

    Postal code 32801 is the main downtown Orlando area. Perhaps it does not need much to explain how restaurants like Le Gourmet Break, The Boheme, Fratelli's, and numerous others exist to cater to the customers and workers who work and do leisure downtown.

    Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.

    Cuisines

    Next, I would like to explore the variety of dishes that are being served in restaurants. The Orlando Open Street Map dataset have entries that have cuisine as key which is being used in both node_tags and ways_tags elements.

    I encountered some values that had a cuisine key that I thought would need further data cleaning. For example, there were multivalued field items that have been put into one field like mediterranean;mexican;lebanese and oriental;thai;sushi. For the purposes of this study, I have excluded these ambiguous data items.

    We would like to further qualify our query to exclude fast food items such as burger, donut, pizza, chicken sandwich, etc.

    In [91]:
    QUERY = '''
    SELECT tags.value, COUNT(*) as count 
    FROM (SELECT key,
                 CASE value
                   WHEN 'chinese' then 'asian'
                   ELSE value
                 END as value
          FROM nodes_tags
          UNION ALL
          SELECT key,
                 CASE value
                   WHEN 'chinese' then 'asian'
                   ELSE value
                 END as value
          FROM ways_tags) tags
    WHERE tags.key LIKE 'cuisine'
        AND tags.value NOT IN ('coffee_shop', 'sandwich',
        'pizza','barbecue,_american', 'chicken_sandwich',
        'chocolate', 'coffee_and_donut_shop', 'doughnut',
        'ice_cream;italian_pizza;noodles', 'subs','regional',
        'chicken', 'burger', 'mediterranean;mexican;lebanese',
        'oriental;thai;sushi')
    GROUP BY tags.value
    ORDER BY count DESC;
    '''
    rows = execute_query(DATABASE, QUERY)
    cuis1_df = pd.DataFrame(rows)
    cuis1_df.columns = ['Cuisine', 'Frequency']
    display(HTML(cuis1_df.to_html(index=False, header=True)))
    
    Cuisine Frequency
    american 10
    mexican 9
    asian 6
    italian 2
    Tex-Mex 1
    barbecue 1
    cuban 1
    japanese 1
    vegetarian 1
    In [92]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    QUERY = '''
    SELECT value
    FROM (SELECT key,
                 CASE value
                   WHEN 'chinese' then 'asian'
                   ELSE value
                 END as value
          FROM nodes_tags
          UNION ALL
          SELECT key,
                 CASE value
                   WHEN 'chinese' then 'asian'
                   ELSE value
                 END as value
          FROM ways_tags) tags
    WHERE tags.key LIKE 'cuisine'
        AND tags.value NOT IN ('coffee_shop', 'sandwich',
        'pizza','barbecue,_american', 'chicken_sandwich',
        'chocolate', 'coffee_and_donut_shop', 'doughnut',
        'ice_cream;italian_pizza;noodles', 'subs','regional',
        'chicken','burger', 'mediterranean;mexican;lebanese',
        'oriental;thai;sushi');
    '''
    rows = execute_query(DATABASE, QUERY)
    cuis2_df = pd.DataFrame(rows)
    cuis2_df .columns = ['value']
    
    fig, ax = plt.subplots(figsize=(12,6))
    plt.title("Number of OSM elements by Cuisine")
    sns.set(style="whitegrid", color_codes=True)
    g2 = sns.countplot(x="value", data=cuis2_df, palette="Set2"
                 ,order = cuis2_df['value'].value_counts().index)
    g2.set_xticklabels(g2.get_xticklabels(),rotation=90);
    

    The American cuisine tops the list from among the captured cuisine data in the Orlando OSM dataset. This is closely followed by the Mexican and Asian cuisines.

    Querying exact locations of restaurants

    We tried to retrieve and present information about restaurants and their postal codes earlier. But that work did not produce much result because of the insufficiency of user-provided postal code data for each associated restaurant.

    However, we will not give up so easily because there is another way of determining a similar feature such as the location for each restaurant. And this one is more precise.

    It involves getting at the longitude and latitude values which we know are present for each node and way element. I start off with our basic nodes_tags and ways_tags tables which both contain information about which node and way elements pertain to restaurants (i.e. {key: amenity, value: restaurant}) and do a natural join with the main nodes database table via the id attribute. The nodes tuple or record contain informaton about the latitude and longitude data for each node.

    In [93]:
    import sqlite3
    import pandas as pd
    import pprint
    from collections import defaultdict
        
    DATABASE = 'orlando_osm.db'
    
    def convert_to_restaurant_list(df):
        resto_list = []
        
        for index, row in resto_df.iterrows():
            location = []
            dict_item = {}
            location.append(row['lon'])
            location.append(row['lat'])
            rname = row['name'].encode("utf-8")
            rname = rname.replace("'", "")
            dict_item = {'loc': location, 
                         'label': rname}
            resto_list.append(dict_item)
        
        return resto_list
        
    QUERY = '''
    SELECT DISTINCT nodes.lat, nodes.lon, N1.value
    FROM (
        SELECT *
        FROM nodes_tags
        UNION ALL
        SELECT *
        FROM ways_tags
    ) b
    JOIN nodes ON b.id = nodes.id
    JOIN (
        SELECT id, key, value
        FROM nodes_tags
        WHERE key = 'name'
    ) N1 ON N1.id = nodes.id
    WHERE b.key = 'amenity' AND b.value = 'restaurant';
    '''
    rows = execute_query(DATABASE, QUERY)
    resto_df = pd.DataFrame(rows)
    resto_df.columns = ['lon','lat','name']
    display(HTML(resto_df[:10].to_html(index=False, header=True)))
    resto_list = convert_to_restaurant_list(resto_df)
    
    lon lat name
    28.488172 -81.428251 Olive Garden
    28.488533 -81.428385 BJs Bar and Grill
    28.489898 -81.430359 TGI Fridays
    28.503706 -81.458043 Wing House
    28.503323 -81.458029 Ruby Tuesdays
    28.502807 -81.458022 Dennys
    28.593953 -81.306419 Wing House
    28.472068 -81.469385 Confisco Grille
    28.472419 -81.471485 Mythos
    28.516199 -81.481889 Tavern on the Lake

    Notice that the name of the restaurant is also displayed in the query result set. I was able to include this by making another natural join to the nodes_tags records using the node ID reference keys.

    The outcome of our query returned a dataframe that contains the map coordinates and names of the restaurants in our dataset. These are the components that we can use to visualize the data points on an interactive map using the folium module. But before we can do so, I developed the convert_to_restaurant_list() function to convert our dataframe into a structured collection (list of dictionaries) that folium neads to generate an interactive map.

    In [94]:
    %matplotlib inline
    import folium
    
    width, height = 650, 650
    restaurant_map = folium.Map(location=[28.539714, -81.379414], zoom_start=11,
                                tiles='OpenStreetMap')
    # Add markers to the map
    for restaurant in resto_list:
        marker = folium.Marker(location=restaurant['loc'], popup=restaurant['label'])
        marker.add_to(restaurant_map)
        
    restaurant_map  
    
    Out[94]:

    It is quite satisfying whenever we see how our data visualization reveals and corraborates the informed assumption that I made earlier when we explored the number of restaurants that are within the jurisdiction of various postal codes, to wit:

    Postal code 32819 stands for the southwest Orlando and Dr. Phillips-Bay Hill areas, most popularly known locally as the "Restaurant Belt". Numerous restaurants line up the major avenues of this neighborhood and they mostly serve Orlando tourists and theme park patrons.

    The southwest portion of the interactive map of Orlando is densely dotted by restaurant markers and we can safely infer that this area most probably falls within the postal code 32819 jurisdiction - Orlando's Restaurant Belt.

    Suggestions for Improvement of the Data and Its Analysis

    The Open Street Map Project is a very commendable undertaking in that it satisfies a public need (providing map information) while doing so at the least amount of upkeep (through the voluntary participation of its users). Its effectiveness and its ability to stay useful and relevant may be attributed to its flexibility and ease of use. Users seem to be allowed to edit and update the map file data arbitrarily. This modest feature may come at a cost in terms of data integrity and consistency.

    I will explain what I meant by consistency using a particular experience that I had while doing data analysis. During the part when I was exploring the data for culinary cuisines, I came across some data that were problematic.

    In [43]:
    QUERY = '''
    SELECT tags.value, COUNT(*) as count
    FROM (SELECT key,
                 value
          FROM nodes_tags
          UNION ALL
          SELECT key,
                value
          FROM ways_tags) tags
    WHERE tags.key LIKE 'cuisine'
        AND tags.value NOT IN ('american', 'mexican',
        'italian','Tex-Mex', 'barbecue', 'cuban','vegetarian')
    GROUP BY tags.value
    ORDER BY count DESC;
    '''
    rows = execute_query(DATABASE, QUERY)
    cuis1_df = pd.DataFrame(rows)
    cuis1_df.columns = ['Cuisine', 'Frequency']
    display(HTML(cuis1_df.to_html(index=False, header=True)))
    
    Cuisine Frequency
    burger 76
    chicken 8
    coffee_shop 7
    sandwich 6
    pizza 4
    asian 3
    chinese 3
    barbecue,_american 1
    chicken_sandwich 1
    chocolate 1
    coffee_and_donut_shop 1
    doughnut 1
    ice_cream;italian_pizza;noodles 1
    japanese 1
    mediterranean;mexican;lebanese 1
    oriental;thai;sushi 1
    regional 1
    subs 1

    I would like to discuss the issues posed by these entries that were meant to satisfy the concept of a cuisine.

    What's a cuisine?

    The data items such as chicken and chocolate that were entered as values for the key of cuisine does not seem to go within the purview of the definition of a cuisine. According to Wikipedia:

    A cuisine is a style of cooking characterized by distinctive ingredients, techniques and dishes, and usually associated with a specific culture or geographic region.

    Data items chicken and chocolate are not cuisines in themselves, but they can be used as ingredients for a particular cuisine (i.e, chocolate and chicken are used in a dish called Chocolate Chicken Mole, a Mexican cuisine dish).

    The part is not the whole

    Data items like burger, sandwich, pizza, doughnut, and subs can properly be considered as dishes or menu items which cater to a particular cuisine. Thus, a burger and donut may instead be classified as belonging to the American cuisine; pizza under Italian, and sandwich and subs may perhaps be included in some catch-all cuisine as every geographic reqion has its version of these common foods.

    The shop is not a cuisine

    There are data items that perhaps try to describe the type of business rather than the cuisine. Items like coffee_shop and coffee_and_donut_shop may need to be re-entered or manually cleaned.

    A common thread

    Some cuisines that were entered can reasonably be classified into one common cuisine. For example asian, chinese and japanese can all be consolidated under just the asian cuisine.

    Multivalued Data

    Perhaps a rather more problematic issue is the case when a field is allowed to contain multiple values. Some examples of this that can be found in the Orlando map dataset are barbecue,_american, ice_cream;italian_pizza;noodles, mediterranean;mexican;lebanese and oriental; thai; sushi. Queries made when these values are present are possible to produce ambiguous and inconsistent results. There are also cuisine data items like regional that are inherently ambiguous by themselves (what particular region pertains to a regional cuisine?)

    How Can We Do Better?

    A data field that is prone to be filled with arbitrary values and/or even multple (non-atomic) values run the risk of data quality (DQ) issues. Such DQ issues often affect the results of SQL queries offering non-results or even false results.

    One way of dealing with multiple values is to disallow users from entering them by the use of option or dropdown elements where users are constrained to select from a predetermined set of options as inputs. However, this may be difficult to implement because the key part of the key-value relationship (cuisine) in the node or way elements are themselves arbitrary. A user may or may not choose to add a cuisine-type key-value data item pair.

    If this is the case, perhaps the Open Street Map project may be able to add a feature that allows users to tell the system about the data item type that they are thinking of creating. If the user intends to augment a node or way element with a new multi-value key-value type entry, then there should be a toggle or check box on the screen for the user to specify the multi-valued characteristic of the data entry. Along with this, the system should keep information of the most often used multi-valued data item sets (i.e. cuisine) along with their proper predetermined value selection set.

    The system should then be able to provide edit notifications to the user should the user attempt to include multiple items in the input field which has been configured to be multi-valued or the system can also offer the predetermined value selection set to help the user choose the correct value data item.

    Moreover, the "cuisine-aware" system can also maintain an internal list of value items that can fall under the common fast food cuisine. Thus, items like chicken sandwich, burger, ice cream, subs, pizza and the like which can be found in our Orlando map dataset can appropriately and automatically be assigned to this category.

    Conclusion

    We are able to illustrate the typical tasks involved with wrangling data from a raw data file to a form that can be efficiently used for data analysis.

    The major process of data cleaning was begun with gathering of as much information about the components of the map file. This involved some demonstrations of the loading and parsing of various XML files.

    Using this information, data on the street names, telephone numbers and postal codes were audited in order to evaluate their accuracy, completeness, consistency and validity. The data audits were performed giving particular emphasis on the importance of adhering to a data cleaning plan to ensure a systematic approach at working with the data.

    Afterwards, data cleaning were performed programmatically and at then end of each phase of the data cleaning exercise, a clean version of the original map file was generated and tested to serve as input to the succeeding phase of the data cleaning process.

    We also discussed how we parsed the cleaned map file into several CSV files. These flat files were then loaded into a SQLite database where we performed various SQL queries to support and facilitate our data exploration tasks.

    Our primary takeaway from all of these exercises is to demonstrate the necessary data wrangling processes that have to take place before a proper data analysis can happen.