Monday, June 24, 2013

KMZ/KML file parsing with Python

Python version: 2.7.5
Source: kmz_parser.py

Brief:
This tutorial describes how to write a Python script that extracts coordinate and label information from KMZ/KML files and exports it to a CSV file.

1. Unzip the KMZ and extract doc.kml

Keyhole Markup Language (KML) files are Google Earth files that can contain points, lines, and shapes. A KMZ file is simply a zipped archive that contains a plain-text XML file, typically named doc.kml. To look at this file, rename your .kmz to .zip, extract it, and open doc.kml.

Python provides many useful built-in libraries; the first one we are going to use is zipfile.

from zipfile import ZipFile

filename = 'test.kmz'

kmz = ZipFile(filename, 'r')
kml = kmz.open('doc.kml', 'r')


This opens the doc.kml file as a standard file for reading.  You can now parse the file.
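
The code above assumes the archive member is called doc.kml. As a minimal sketch for archives where that assumption might not hold, the KML member can be looked up by extension instead. The helper name find_kml_name is hypothetical, and the in-memory archive exists only so the example runs without a file on disk.

```python
import io
from zipfile import ZipFile

def find_kml_name(kmz):
    """Return the name of the first .kml member in an open ZipFile, or None."""
    for member in kmz.namelist():
        if member.lower().endswith('.kml'):
            return member
    return None

# Build a tiny KMZ in memory (stand-in for a real file such as 'test.kmz').
buf = io.BytesIO()
with ZipFile(buf, 'w') as archive:
    archive.writestr('doc.kml', '<kml></kml>')
buf.seek(0)

with ZipFile(buf, 'r') as kmz:
    kml_name = find_kml_name(kmz)   # name of the KML member inside the archive
    kml_bytes = kmz.read(kml_name)  # raw bytes of that member
```

With a real file, pass the filename to ZipFile in place of buf.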

2. Examine the KML file to determine the type of information you want and how it's stored.

IDLE (the editor included with Python) is a good editor for viewing KML files.  To extract the names of items and their coordinates we need to look at three tags: <Placemark>, <name> and <coordinates>.  <Placemark> tags surround each item; inside, each has a <name> and a <coordinates> tag.  Note that the parser we are going to use calls tags "elements".

3. Write a SAX handler

Simple API for XML (SAX) allows us to parse XML files.  Python has a built-in library for this, xml.sax.  To make this work we use inheritance and create our own custom xml.sax.handler.ContentHandler class.  Here is how the SAX parser works: when we feed it our file and our ContentHandler object, it calls the methods of our ContentHandler object at certain points in the file.  So that the parser knows the names of our methods, we create a child class of ContentHandler, which already contains dummy methods for all of these events.  By overriding these methods we can make the parser do our work when it reaches each event.  The ContentHandler documentation in the Python docs lists the names of these methods and describes when each is called.  The ones we are interested in are these:
  • __init__(self)
    • constructor, called when the object is created
  • startElement(self, name, attributes)
    • called at start elements (e.g. '<Placemark>', '<name>', etc.)
  • characters(self, data)
    • called at text between elements
  • endElement(self, name)
    • called at end elements (e.g. '</Placemark>', '</name>', etc.)
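
To see when each of these methods fires, here is a small sketch of a handler that only records the events in order. The class name EventLogger and the one-placemark SAMPLE string are made up for illustration; they are not part of kmz_parser.py.

```python
import xml.sax
import xml.sax.handler

class EventLogger(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each callback in the order it fires."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attributes):
        self.events.append(('start', name))

    def characters(self, data):
        if data.strip():  # ignore whitespace-only text between elements
            self.events.append(('text', data.strip()))

    def endElement(self, name):
        self.events.append(('end', name))

# Made-up sample: a single placemark with a name and a point.
SAMPLE = (b"<Placemark><name>Home</name>"
          b"<Point><coordinates>-122.0,37.0,0</coordinates></Point>"
          b"</Placemark>")

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

Note that a SAX parser is allowed to deliver one run of text in several characters() calls, which is why a handler should accumulate text in a buffer until the matching endElement arrives.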

The data we capture will be stored in a nested dictionary.  The text of each Placemark's <name> element becomes a key mapped to a second dictionary.  Inside that dictionary, each element name becomes a key mapped to the text contained in that element.  This lets us extract all the data contained within each placemark, including the 'coordinates' element.  See the code below:

import xml.sax, xml.sax.handler
class PlacemarkHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inName = False # True while inside a <name> tag
        self.inPlacemark = False
        self.mapping = {}
        self.buffer = ""
        self.name_tag = ""
       
    def startElement(self, name, attributes):
        if name == "Placemark": # on start Placemark tag
            self.inPlacemark = True
            self.buffer = ""

        if self.inPlacemark:
            if name == "name": # on start name tag
                self.inName = True # save name text to follow
           
    def characters(self, data):
        if self.inPlacemark: # on text within a Placemark
            self.buffer += data # accumulate text until the element ends
           

    def endElement(self, name):
        self.buffer = self.buffer.strip() # drop surrounding whitespace, including indentation spaces
       
        if name == "Placemark":
            self.inPlacemark = False
            self.name_tag = "" #clear current name
       
        elif name == "name" and self.inPlacemark:
            self.inName = False # on end name tag
            self.name_tag = self.buffer.strip()
            self.mapping[self.name_tag] = {}

        elif self.inPlacemark and self.name_tag in self.mapping: # skip elements that close before <name> is seen
            if name in self.mapping[self.name_tag]:
                self.mapping[self.name_tag][name] += self.buffer
            else:
                self.mapping[self.name_tag][name] = self.buffer

        self.buffer = ""
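
Once the handler has run, each entry's 'coordinates' value is still one string. KML writes each coordinate as longitude,latitude with an optional altitude, and separates tuples with whitespace. The sketch below splits such a string into numeric tuples. The function name parse_coordinates and the small mapping literal are hypothetical; a real mapping produced by the handler would hold many more keys per placemark.

```python
def parse_coordinates(text):
    """Split a KML <coordinates> string into a list of float tuples."""
    tuples = []
    for chunk in text.split():  # tuples are separated by whitespace
        tuples.append(tuple(float(part) for part in chunk.split(',')))
    return tuples

# Hypothetical, reduced example of the nested dictionary's shape.
mapping = {
    'Home': {'LookAt': '', 'coordinates': '-122.0822,37.4222,0'},
    'Trail': {'LineString': '', 'coordinates': '-122.0,37.0,0 -122.1,37.1,0'},
}

home = parse_coordinates(mapping['Home']['coordinates'])    # one tuple for a point
trail = parse_coordinates(mapping['Trail']['coordinates'])  # one tuple per vertex
```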

4. Create a Parser, set the Handler, and parse the file.

To parse the file we need to create a parser object, set its handler to an instance of the custom class we created, run the parse function on the file, and close the file.  After this, our mapping dictionary is ready to be used.
parser = xml.sax.make_parser()
handler = PlacemarkHandler()
parser.setContentHandler(handler)
parser.parse(kml)
kmz.close()
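
If parsing raises an exception, kmz.close() above is never reached. One possible variation, sketched here under the assumption that the archive member is named doc.kml, wraps the same steps in with blocks so the archive is closed either way. The names parse_kmz and NameCollector are hypothetical; NameCollector is only a small stand-in handler that keeps the example self-contained.

```python
import io
import xml.sax
import xml.sax.handler
from zipfile import ZipFile

def parse_kmz(source, handler, member='doc.kml'):
    """Parse one KML member of a KMZ archive with the given SAX handler."""
    with ZipFile(source, 'r') as kmz:        # closed on exit, even after an error
        with kmz.open(member, 'r') as kml:
            parser = xml.sax.make_parser()
            parser.setContentHandler(handler)
            parser.parse(kml)
    return handler

class NameCollector(xml.sax.handler.ContentHandler):
    """Hypothetical stand-in handler: collects the text of every <name> element."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.names = []
        self._buffer = None  # None means we are not inside a <name> element

    def startElement(self, name, attributes):
        if name == 'name':
            self._buffer = ''

    def characters(self, data):
        if self._buffer is not None:
            self._buffer += data

    def endElement(self, name):
        if name == 'name':
            self.names.append(self._buffer.strip())
            self._buffer = None

# In-memory KMZ so the example runs without a file on disk.
buf = io.BytesIO()
with ZipFile(buf, 'w') as archive:
    archive.writestr('doc.kml', '<kml><Placemark><name>Home</name></Placemark></kml>')
buf.seek(0)

collector = parse_kmz(buf, NameCollector())
```

In the tutorial's script the call would be parse_kmz(filename, PlacemarkHandler()), after which the returned handler's mapping attribute holds the data.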

5. Build the CSV table for output

The mapping we created contains a lot of data that we don't need; however, one thing in it is worth noting.  Points contain the tag <LookAt>, lines contain <LineString>, and shapes contain <Polygon>.  By testing for these keys we can sort our output table so that all the points are together, then the lines, then the polygons.  Below is a function to build the table:
def build_table(mapping):
    sep = ','
       
    output = 'Name' + sep + 'Coordinates\n'
    points = ''
    lines = ''
    shapes = ''
    for key in mapping:
        coord_str = mapping[key].get('coordinates', '') + sep # empty if the placemark has no coordinates
       
        if 'LookAt' in mapping[key]: #points
            points += key + sep + coord_str + "\n"
        elif 'LineString' in mapping[key]: #lines
            lines += key + sep + coord_str + "\n"
        else: #shapes
            shapes += key + sep + coord_str + "\n"
    output += points + lines + shapes
    return output
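
One thing to watch with the table above: the coordinates themselves contain commas, so a spreadsheet will split them across several columns. As a sketch of one alternative (not the method used in kmz_parser.py), the standard csv module can quote such fields automatically. The names build_rows and write_csv and the sample mapping are hypothetical, and the io.StringIO usage assumes Python 3.

```python
import csv
import io

def build_rows(mapping):
    """Return (name, coordinates) rows grouped as points, lines, then shapes."""
    points, lines, shapes = [], [], []
    for name, fields in mapping.items():
        row = (name, fields.get('coordinates', '').strip())
        if 'LookAt' in fields:          # points
            points.append(row)
        elif 'LineString' in fields:    # lines
            lines.append(row)
        else:                           # shapes
            shapes.append(row)
    return points + lines + shapes

def write_csv(mapping, stream):
    """Write a header and the grouped rows; csv quotes fields containing commas."""
    writer = csv.writer(stream)
    writer.writerow(('Name', 'Coordinates'))
    writer.writerows(build_rows(mapping))

# Hypothetical sample data in the same shape the handler produces.
sample = {
    'Trail': {'LineString': '', 'coordinates': '-122.0,37.0,0 -122.1,37.1,0'},
    'Home': {'LookAt': '', 'coordinates': '-122.0,37.0,0'},
}
out = io.StringIO()
write_csv(sample, out)
csv_text = out.getvalue()
```

When writing to a real file on Python 3, open it with newline='' so the csv module controls the line endings.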

6. Save the new file, output the data.

outstr = build_table(handler.mapping)
out_filename = filename[:-3] + "csv" #output filename: input name with its 3-letter extension replaced by csv
f = open(out_filename, "w")
f.write(outstr)
f.close()
print outstr
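
Slicing off the last three characters only works for three-letter extensions, and placemark names with non-ASCII characters need an explicit encoding when written. The following is a minimal sketch of both points; csv_name_for and write_text are hypothetical helper names, and UTF-8 is an assumed choice of encoding. On Python 2, io.open expects the text to be a unicode object.

```python
import io
import os

def csv_name_for(filename):
    """Replace the input file's extension, whatever its length, with .csv."""
    base, _ext = os.path.splitext(filename)
    return base + '.csv'

def write_text(path, text, encoding='utf-8'):
    """Write text to path using an explicit encoding."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)

out_filename = csv_name_for('test.kmz')
```

In the script above, the last step would then become write_text(out_filename, outstr).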

12 comments:

  1. I found it very helpful when working with kml, and it's to the point. When executing the script an error occured " NameError: name 'outstr' is not defined, how do you go about it?, thanks

    Replies
    1. you need to put brackets around outstr.
      print (outstr)

  2. Good script, this is what i was looking for, thanks for post it

  3. Seriously! This is awesome! I'd been writing my own code to parse KML-- didn't even know SAX handlers existed. Thanks!

  4. A simple great job!

    For who have non-ascii characters in the KMZ file, use the "codecs" Python library to open the output CSV using the desired encoding format.

    Replies
    1. Can you please help me to understand encoding part? In which part of the code and what exactly I need to encode. thank you

    2. This comment has been removed by the author.

    3. When output to file, I do below change and work.
      with open(out_filename, "w") as f:
          f.write(outstr.encode('utf8'))

  5. This is amazing. I'm a complete beginner in Python and this was a huge jump start to my project. I'm running Python 3.4 and only had to add parenthesis around outstr in the final command. After that, this thing run like a charm and organized my KMZ data perfectly. Top marks, Tyler!

  6. Thank you for posting all of this, Tyler. It works when I run it but the output CSV file is empty. Any ideas why that might be?

  7. Hi, I have an error when trying to unzip the KMZ file, can anyone please help me with "Bad magic number for central directory" - error when using ZipFile?
