aglyph.compat.ipyetree — an ElementTree parser for IronPython

Release: 2.1.1

This module defines an xml.etree.ElementTree.XMLParser that delegates to the .NET System.Xml.XmlReader XML parser to parse an Aglyph XML context document.

IronPython cannot load CPython’s xml.parsers.expat module, so the default parser used by ElementTree is unavailable.

New in version 2.0.0: To address the missing xml.parsers.expat module, this module now defines the CLRXMLParser class, which replaces XmlReaderTreeBuilder and is used by aglyph.context.XMLContext as the default parser when running under IronPython.

Alternatively, IronPython developers may wish to install expat or an expat-compatible library as a site package. However, this has not been tested with Aglyph.

class aglyph.compat.ipyetree.CLRXMLParser(target=None, validating=False)

Bases: xml.etree.ElementTree.XMLParser

An xml.etree.ElementTree.XMLParser that delegates parsing to the .NET System.Xml.XmlReader parser.

If target is omitted, a standard TreeBuilder instance is used.

If validating is True, the System.Xml.XmlReader parser will be configured for DTD validation.

feed(data)

Add more XML data to be parsed.

Parameters: data (str) – raw XML read from a stream

Note

All data across calls to this method are buffered internally; the parser itself is not actually created until the close() method is called.

close()

Parse the XML from the internal buffer to build an element tree.

Returns: the root element of the XML document
Return type: xml.etree.ElementTree.ElementTree
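
The buffering behavior described in the note under feed() can be illustrated with a small stand-in. This is only a sketch written for Python 3: BufferingParser is a hypothetical class, it uses the standard library parser rather than System.Xml.XmlReader so that it runs outside IronPython, and it does not reflect the actual CLRXMLParser implementation.

```python
import xml.etree.ElementTree as ET


class BufferingParser(object):
    """Hypothetical stand-in that mimics the buffer-then-parse pattern."""

    def __init__(self):
        self._chunks = []

    def feed(self, data):
        # Nothing is parsed here; the chunk is only remembered.
        self._chunks.append(data)

    def close(self):
        # Parsing happens once, over everything that was fed.
        document = "".join(self._chunks)
        self._chunks = []
        return ET.fromstring(document)


parser = BufferingParser()
for chunk in ("<test>", "hello", "</test>"):
    parser.feed(chunk)
root = parser.close()
print(root.tag, root.text)
```

One consequence of this pattern is that a malformed document is only reported when close() is called, not while the offending chunk is being fed.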

class aglyph.compat.ipyetree.XmlReaderTreeBuilder(validating=False)

Bases: aglyph.compat.ipyetree.CLRXMLParser

Build an ElementTree using the .NET System.Xml.XmlReader XML parser.

Changed in version 2.0.0: It is no longer necessary for IronPython applications to use this class explicitly. aglyph.context.XMLContext now uses CLRXMLParser by default if running under IronPython.

Deprecated since version 2.0.0: This class has been renamed to CLRXMLParser. XmlReaderTreeBuilder will be removed in release 3.0.0.

A note on IronPython Unicode issues

IronPython does not have an encoded-bytes str type; rather, the str and unicode types are one and the same:

>>> str is unicode
True

Unfortunately, this means that IronPython cannot properly decode byte streams/sequences to Unicode strings using Python language facilities. Consider the simple example of a UTF-8-encoded XML file test.xml:

<?xml version="1.0" encoding="utf-8"?>
<test>façade</test>

CPython

>>> open("test.xml", "rb").read()
'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'

IronPython

>>> open("test.xml", "rb").read()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'

The byte sequence C3 A7 in a UTF-8-encoded byte string represents a single Unicode code point (U+00E7 LATIN SMALL LETTER C WITH CEDILLA), while the character sequence C3 A7 in a Unicode string represents two code points: U+00C3 LATIN CAPITAL LETTER A WITH TILDE followed by U+00A7 SECTION SIGN. Clearly the latter is incorrect.
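
The same distinction can be reproduced on Python 3, where bytes and str are separate types. This is a minimal sketch, assuming that decoding each byte as one code point (Latin-1) is a fair model of what IronPython's open(..., "rb").read() returns; the variable names are illustrative.

```python
import unicodedata

raw = b"fa\xc3\xa7ade"  # UTF-8 bytes for the text of <test>

correct = raw.decode("utf-8")     # two bytes C3 A7 become one code point
mojibake = raw.decode("latin-1")  # assumed model of IronPython: one code point per byte

# Name the code points that came from the bytes C3 A7 in each case.
print([unicodedata.name(c) for c in correct[2:3]])
print([unicodedata.name(c) for c in mojibake[2:4]])
```

The first list contains a single name and the second contains two, which is exactly the one-versus-two code point difference described above.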

In many cases, this difference between CPython and IronPython will be transparent. For example:

CPython

>>> "fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'

IronPython

>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'

However, IronPython’s behavior poses a problem for Aglyph XML context parsing because the xml.etree.ElementTree.ElementTree class uses open(source, "rb") (as in the first comparison) to read the file contents when the source argument to xml.etree.ElementTree.ElementTree.parse() is a filename string. As a result, the XML parser would return the Unicode string u"fa\xc3\xa7ade" as the value of the text node under <test>. If, for example, this text were in an Aglyph <str> or <bytes> element (e.g. <str encoding="iso-8859-1">façade</str>), Aglyph would (correctly) attempt to encode the Unicode string using ISO-8859-1, which would produce an incorrect ISO-8859-1 string under IronPython:

>>> u"fa\xc3\xa7ade".encode("iso-8859-1")
u'fa\xc3\xa7ade'

This happens because both '\xc3' and '\xa7' represent valid ISO-8859-1 characters (LATIN CAPITAL LETTER A WITH TILDE and SECTION SIGN, respectively), so the encoding succeeds instead of raising an error.
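
To make the damage concrete, the following Python 3 sketch compares the ISO-8859-1 bytes produced from the correctly decoded text with those produced from the mis-decoded text. Python 3 str values stand in for IronPython Unicode strings here, which is an assumption of the illustration.

```python
good_text = "fa\xe7ade"     # correctly decoded: 6 code points
bad_text = "fa\xc3\xa7ade"  # mis-decoded: 7 code points

good_bytes = good_text.encode("iso-8859-1")
bad_bytes = bad_text.encode("iso-8859-1")

# Both calls succeed, so nothing signals that bad_bytes is wrong.
print(len(good_bytes), len(bad_bytes))
print(good_bytes == bad_bytes)
```

The mis-decoded text yields one byte more than the correct ISO-8859-1 encoding, and those bytes are in fact the original UTF-8 bytes rather than ISO-8859-1 data.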

One workaround is to use the .NET System.IO.StreamReader class instead of the Python built-in function open():

>>> from System.IO import StreamReader
>>> from System.Text import Encoding
>>> sr = StreamReader("test.xml", Encoding.UTF8)
>>> sr.ReadToEnd()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xe7ade</test>\n'

Unfortunately, this requires knowledge of the file encoding prior to reading, which isn’t always possible when parsing XML. (Arguably, it should not need to be known in advance for XML parsing, since the XML declaration should convey this piece of metadata to the XML parser.)

Aglyph’s aglyph.compat.ipyetree.XmlReaderTreeBuilder takes a two-step approach to work around IronPython’s Unicode issues when parsing an Aglyph XML context document:

  1. Save the document encoding from the XML declaration.
  2. Use the document encoding to decode data before handing it off to aglyph.context.XMLContext.

Step #1 is possible because, luckily, the System.Xml.XmlReader class reports the XML declaration as a node of type XmlNodeType.XmlDeclaration.

Note

If the XML document does not specify an explicit encoding in the XML declaration, XmlReaderTreeBuilder assumes UTF-8.
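
Step #1 and the UTF-8 fallback can be sketched without .NET. The following is an illustration only: declared_encoding is a hypothetical helper, and it scans the XML declaration with a regular expression, whereas XmlReaderTreeBuilder obtains the declaration from System.Xml.XmlReader.

```python
import re

# Matches encoding="..." or encoding='...' inside a leading XML declaration.
_DECL = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(document, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL.match(document)
    return match.group(1) if match else default


print(declared_encoding('<?xml version="1.0" encoding="iso-8859-1"?><test/>'))
print(declared_encoding('<?xml version="1.0"?><test/>'))
```

A regular expression is a simplification chosen to keep the sketch short; it does not handle everything a real XML parser does (for example, a byte order mark before the declaration).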

Step #2 works because the same “glitch” that causes IronPython’s Unicode issues can be exploited to work around it:

>>> str is unicode
True
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
>>> "no non-ascii bytes".decode("utf-8")
'no non-ascii bytes'

Because of this, the text node string u"fa\xc3\xa7ade" can actually be decoded to u"fa\xe7ade" before being handed off to aglyph.context.XMLContext, allowing XMLContext to remain ignorant of IronPython’s Unicode issues.
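
On Python 3, where str has no decode() method, the equivalent of step #2 has to be spelled out as a round trip through Latin-1. This is a minimal sketch, assuming the mis-decoded text contains only code points below U+0100 (one per original byte); repair_text is a hypothetical name and is not part of Aglyph.

```python
def repair_text(text, encoding="utf-8"):
    """Reinterpret a one-code-point-per-byte string using the real encoding."""
    # Latin-1 maps code points 0-255 to the identical byte values, so this
    # recovers the original bytes before decoding them properly.
    return text.encode("latin-1").decode(encoding)


repaired = repair_text("fa\xc3\xa7ade")
unchanged = repair_text("no non-ascii bytes")
print(repaired == "fa\xe7ade")
print(unchanged)
```

Pure ASCII text passes through unchanged, which mirrors the last doctest above.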