12-09-04 XML Odyssey
I just spent a couple of days rewriting a tool written by
one of the interns for Thinking in Java 3rd edition, which creates
Ant build.xml files from the code tree produced by extracting all
the code files from the book. Some of the code files have extra
comment tags that give additional information to this tool.
I rewrote it because it was hard to modify, mostly because
it did a lot of things by hand -- for example, much of the XML
was produced with print statements. The logic was scattered and
the change points weren't localized, so modifying
it was complex and scary. The intern who wrote it did the best
he could, as he was just learning Python, objects, etc., and I was
not particularly good at coaching him while he developed it. So I
don't fault him for it. But I needed to keep changing it, and that
had become too painful.
One of the things I wanted to do was use an XML library so that
the result was automatically correct XML without having to worry
about the details. I started with ElementTree since I kept hearing
about it, and it was OK, but when I produced output I discovered it
was unreadable and there was no prettifier that I could discern.
I hunted around, ran across some prettifiers written in C that seemed
focused on HTML, and finally discovered that the Python xml.dom.minidom
module has a toprettyxml() method. So that switched me from ElementTree
to xml.dom.minidom (the lightweight DOM implementation in the xml.dom package).
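As a rough sketch of what building XML through the library looks like, as opposed to print statements, here is a tiny Ant-style document created with xml.dom.minidom and written out with toprettyxml(). The project and target names are invented for illustration, not taken from the book's build files.

```python
from xml.dom.minidom import Document

# Build a tiny Ant-style document; all names here are made up.
doc = Document()
project = doc.createElement("project")
project.setAttribute("name", "demo")
project.setAttribute("default", "run")
doc.appendChild(project)

target = doc.createElement("target")
target.setAttribute("name", "run")
# Special characters in attribute values are escaped when written out.
target.setAttribute("description", "Compile & run")
project.appendChild(target)

# indent is added per nesting level, newl ends each line.
pretty = doc.toprettyxml(indent="  ", newl="\n")
print(pretty)
```

The point is that the library handles the escaping and the closing tags, so the result is well-formed without any hand bookkeeping.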
Maybe it's just me, but the documentation for xml.dom.minidom
and xml.dom seems incomplete. I finally had to go to the source
code and poke around, which worked surprisingly well, so I'm assuming
that's what everyone else ends up doing. In the end, using the
built-in Python libraries seemed about the same complexity as
using ElementTree, but the built-in libraries have more functionality;
the thing that finally pushed me over the edge was the need for a
cloneNode() method, which xml.dom has and ElementTree
doesn't.
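For anyone who hasn't used it, here is a minimal sketch of cloneNode() in xml.dom.minidom. The XML fragment and the "compile-tests" name are hypothetical; the idea is just to copy an existing target and tweak the copy.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<project><target name="compile"><javac srcdir="."/></target></project>')
original = doc.getElementsByTagName("target")[0]

# cloneNode(True) makes a deep copy: attributes and child elements come along.
copy = original.cloneNode(True)
copy.setAttribute("name", "compile-tests")
doc.documentElement.appendChild(copy)

names = [t.getAttribute("name") for t in doc.getElementsByTagName("target")]
print(names)
```

Passing False instead of True would give a shallow copy, with the attributes but without the javac child.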
toprettyxml() is useful but limited; by itself it didn't produce particularly
readable XML, but it could be used as a building block in code that does.
Here's what I wrote to produce what I consider readable XML (for my needs on
this project). Perhaps it will work for you, too, or you can modify it to
do so.
(12-10-04: an improved version)
"""ReadifyXML.py: Reformat XML, breaking up long tags to
make it more human-readable"""
from xml.dom.minidom import parseString
import re
triggerLineLength = 60 # Try to reformat lines longer than this
# Amount of indentation to use when reformatting:
indentStep = " "
def makeXMLReadable(xml):
    xml = xml.toprettyxml("", "\n")
    # Remove trailing whitespace from each line:
    xml = "\n".join(
        [line.rstrip() for line in xml.split("\n")])
    # Remove all empty lines before opening '<':
    while xml.find("\n\n<") != -1:
        xml = xml.replace("\n\n<", "\n<")
    xml = parseString(xml).toprettyxml(indentStep, "")
    # Now all the xml lines are tight,
    # and we can insert spaces and line breaks:
    xml = addBreaks(xml)
    xml = reformatTooLongLines(xml)
    return xml
############# Support functions #############
addBreak = re.compile(r"\s{0,%d}<[a-zA-Z]" % len(indentStep))
def addBreaks(xml):
    '''Add line breaks to elements with
    zero or one indent level'''
    result = ""
    for line in xml.split("\n"):
        if addBreak.match(line):
            result += "\n"
        result += line + "\n"
    return result
def reformatTooLongLines(xml):
    result = ""
    for line in [line.rstrip() for line in xml.split("\n")]:
        if len(line) < triggerLineLength or not line.lstrip().startswith("<"):
            result += line + "\n"
        else:
            result += reformatLongLine(line) + "\n"
    return result
oneLiner = re.compile(r"([ ]*)(<\w+)(.*)>")
keyValuePair = re.compile(r'\w+="[^"]*?"')
def reformatLongLine(line):
    '''Reformat an xml tag to put each key-value
    element on a single indented line, for readability'''
    line = line.rstrip()
    matchobj = oneLiner.match(line)
    if matchobj is None:
        # Not an opening tag (e.g. a closing tag or comment): leave as-is
        return line
    baseIndent = matchobj.group(1)
    result = baseIndent + matchobj.group(2) + "\n"
    indent = baseIndent + " " # Match indent level of tag
    for pair in keyValuePair.findall(matchobj.group(3)):
        result += indent + pair + "\n"
    # Keep the tag's original ending, self-closing or not:
    if line.endswith("/>"):
        return result.rstrip() + " />"
    return result.rstrip() + ">"
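To see what the two regular expressions in reformatLongLine() actually pick out, here is a small standalone sketch that runs them against a single tag. The javac line is a made-up example, and the patterns are repeated here only so the snippet runs on its own.

```python
import re

# Same patterns as in the listing, repeated so this snippet is self-contained.
oneLiner = re.compile(r"([ ]*)(<\w+)(.*)>")
keyValuePair = re.compile(r'\w+="[^"]*?"')

# A made-up Ant-style tag, indented two spaces.
line = '  <javac srcdir="${basedir}" classpath="${basedir}/.." source="1.4" />'

m = oneLiner.match(line)
# group 1: leading indent, group 2: "<" plus tag name, group 3: everything up to the final ">"
base_indent, tag_open, rest = m.groups()
pairs = keyValuePair.findall(rest)

print(repr(base_indent), tag_open)
for pair in pairs:
    print(pair)
```

One limitation worth knowing about: because the attribute name is matched with \w+, a name containing a colon or hyphen (xml:lang, say) would only be partly captured, so this approach suits simple attribute names like the ones Ant uses.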
If you see some improvements, please suggest them on the feedback page.
Note: the following wiki page is editable, as an experiment.