12-09-04 XML Odyssey

I just spent a couple of days rewriting a tool written by one of the interns for Thinking in Java 3rd edition, which creates Ant build.xml files from the code tree produced by extracting all the code files from the book. Some of the code files have extra comment tags that give additional information to this tool.

I rewrote it because it was hard to modify, mostly because it did a lot of things by hand -- for example, much of the XML was produced with print statements. The logic was scattered and the change points were not localized, so modifying it was complex and scary. The intern who wrote it did the best he could, as he was just learning Python, objects, etc., and I was not particularly good at coaching him while he developed this, so I don't fault him for it. But I needed to keep changing it, and it was too problematic.

One of the things I wanted to do was use an XML library so that the result was automatically correct XML without having to worry about the details. I started with ElementTree since I kept hearing about it, and it was OK, but when I produced output I discovered it was unreadable and there was no prettifier that I could discern. I hunted around, ran across some C ones that seemed focused on HTML, and finally discovered that Python's standard xml.dom.minidom module has a toprettyxml() method. So that switched me from ElementTree to xml.dom.minidom (a lightweight implementation of the xml.dom interface).
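As a minimal sketch of what "automatically correct XML" looks like with xml.dom.minidom, the snippet below builds a tiny Ant-style document and pretty-prints it. The element and attribute values (project, target, the description text) are made up for illustration and are not taken from the real build.xml generator.

```python
from xml.dom.minidom import Document, parseString

# Build a small document through the DOM API instead of print statements.
doc = Document()
project = doc.createElement("project")
project.setAttribute("name", "demo")  # illustrative name
doc.appendChild(project)

target = doc.createElement("target")
target.setAttribute("name", "build")
# Quotes and ampersands in attribute values are escaped by the library:
target.setAttribute("description", 'Compile "all" & run')
project.appendChild(target)

pretty = doc.toprettyxml(indent="  ", newl="\n")
print(pretty)

# Parsing the output again recovers the original, unescaped value.
roundTrip = parseString(pretty)
restored = roundTrip.getElementsByTagName("target")[0].getAttribute("description")
print(restored)
```

The point of the round trip is that the escaping is handled by the library in both directions, which is the detail that hand-written print statements tend to get wrong.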

Maybe it's just me, but the documentation for xml.dom.minidom and xml.dom seems incomplete. I finally had to go to the source code and poke around, which worked surprisingly well, so I'm assuming that's what everyone else ends up doing. In the end, using the built-in Python libraries seemed about the same complexity as using ElementTree, but the built-in libraries have more functionality; the thing that finally pushed me over the edge was the need for a cloneNode() function which xml.dom has and ElementTree doesn't.
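For readers who have not used it, here is a small sketch of cloneNode() in xml.dom.minidom. The XML snippet and the attribute names are hypothetical; the only library calls used are parseString(), getElementsByTagName(), cloneNode(), setAttribute(), getAttribute() and appendChild().

```python
from xml.dom.minidom import parseString

# Hypothetical input: one target that serves as a template.
doc = parseString(
    '<project><target name="template"><echo message="hi"/></target></project>')
template = doc.getElementsByTagName("target")[0]

# Passing True makes a deep copy: attributes and child elements come along.
copy = template.cloneNode(True)
copy.setAttribute("name", "copy")
doc.documentElement.appendChild(copy)

names = [t.getAttribute("name") for t in doc.getElementsByTagName("target")]
childCount = len(copy.getElementsByTagName("echo"))
print(names, childCount)
```

Changing the copy leaves the original node untouched, which is what makes this convenient for stamping out many similar elements from one template.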

toprettyxml() is useful but limited; on its own it didn't produce particularly readable XML, but it can be used as a building block in code that does. Here's what I wrote to produce what I consider readable XML (for my needs on this project). Perhaps it will work for you, too, or you can modify it to do so.

(12-10-04: an improved version)

"""ReadifyXML.py: Reformat XML, breaking up long tags to
make it more human-readable"""
from xml.dom.minidom import parseString
import re

triggerLineLength = 60 # Try to reformat lines longer than this

# Amount of indentation to use when reformatting:
indentStep = "  "

def makeXMLReadable(xml):
    '''Take an xml.dom.minidom node and return
    a reformatted XML string'''
    xml = xml.toprettyxml("", "\n")
    # Remove trailing whitespace from each line:
    xml = "\n".join(
        [line.rstrip() for line in xml.split("\n")])
    # Remove all empty lines before opening '<':
    while xml.find("\n\n<") != -1:
        xml = xml.replace("\n\n<", "\n<")
    xml = parseString(xml).toprettyxml(indentStep, "")
    # Now all the xml lines are tight,
    # and we can insert spaces and line breaks:
    xml = addBreaks(xml)
    xml = reformatTooLongLines(xml)
    return xml

############# Support functions #############

addBreak = re.compile(r"\s{0,%d}<[a-zA-Z]" % len(indentStep))
def addBreaks(xml):
    '''Add line breaks to elements with
    zero or one indent level'''
    result = ""
    for line in xml.split("\n"):
        if addBreak.match(line):
            result += "\n"
        result += line + "\n"
    return result

def reformatTooLongLines(xml):
    result = ""
    for line in [line.rstrip() for line in xml.split("\n")]:
        if len(line) < triggerLineLength or not line.lstrip().startswith("<"):
            result += line + "\n"
        else:
            result += reformatLongLine(line) + "\n"
    return result

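oneLiner = re.compile(r"([ ]*)(<\w+)(.*)>")
keyValuePair = re.compile(r'\w+="[^"]*?"')
def reformatLongLine(line):
    '''Reformat an xml tag to put each key-value
    element on a single indented line, for readability'''
    matchobj = oneLiner.match(line.rstrip())
    # Leave closing tags, comments, and lines that hold
    # more than a single tag exactly as they are:
    if matchobj is None or ">" in matchobj.group(3):
        return line
    baseIndent = matchobj.group(1)
    result = baseIndent + matchobj.group(2) + "\n"
    indent = baseIndent + " " # Match indent level of tag
    for pair in keyValuePair.findall(matchobj.group(3)):
        result += indent + pair + "\n"
    # Keep the original ending rather than always self-closing:
    if matchobj.group(3).rstrip().endswith("/"):
        return result.rstrip() + " />"
    return result.rstrip() + ">"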
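To make the attribute-splitting step easier to follow, here is a small sketch that applies the same two regular expressions to a single line. The sample line is invented for illustration (it is not output from the real tool), and the variable names are hypothetical.

```python
import re

# The same two patterns as in the listing, written as raw strings.
oneLiner = re.compile(r"([ ]*)(<\w+)(.*)>")
keyValuePair = re.compile(r'\w+="[^"]*?"')

# An invented self-closing tag with two spaces of indentation.
line = '  <java classname="Demo" dir="../c02" failonerror="true" />'

m = oneLiner.match(line)
baseIndent, tagStart, rest = m.group(1), m.group(2), m.group(3)
pairs = keyValuePair.findall(rest)

# One attribute per line, indented one space past the tag's own indent.
reformatted = baseIndent + tagStart + "\n"
for pair in pairs:
    reformatted += baseIndent + " " + pair + "\n"
reformatted = reformatted.rstrip() + " />"
print(reformatted)
```

One limitation worth knowing: because the attribute name is matched with \w+, a name containing a colon or hyphen (such as xml:lang) is only partially matched, so this approach suits simple attribute names like the ones Ant uses.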

If you see some improvements, please suggest them on the feedback page.
