Variations on a (data object) theme

In this article, I’ll trace the evolution of a simple Python data object type (something akin to a Data Transfer Object (DTO) or Parameter Object) from its simplest possible implementation to a final “recipe”: a pure-Python data object type that offers fast attribute get/set and low memory consumption while retaining the extensibility of a basic class.

The simplest-possible data object type in Python looks like this:

>>> class SimpleDataObject:
...     pass

This bare-bones definition can’t be beat in situations that call for extreme simplicity (rapid prototyping and, arguably, configuration-as-code come to mind). Any kind of data object can be created and populated with minimal code:

>>> attribute = SimpleDataObject()
>>> attribute.name = 'color'
>>> attribute.value = 'orange'

While this may suffice for very simple use cases, it’s not really suitable for any moderately complex application. For starters, it’s not explicit enough: apart from the code that creates and populates an instance, nothing indicates what kind of data object is being created or which fields it is expected to have. Furthermore, reading a field that has not been explicitly assigned a value raises AttributeError. Finally, objects created from this base class have no meaningful string representation.
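The last two shortcomings are easy to reproduce. The following is a minimal sketch that repeats the bare class from above; the variable names `missing_field_error` and `default_repr` are only illustrative.

```python
class SimpleDataObject:
    pass


attribute = SimpleDataObject()
attribute.name = 'color'

# Reading a field that was never assigned raises AttributeError.
try:
    attribute.value
    missing_field_error = None
except AttributeError as exc:
    missing_field_error = exc

# The default repr identifies only the class (and, on CPython, a memory
# address); it says nothing about the fields the object carries.
default_repr = repr(attribute)

print(type(missing_field_error).__name__)
print(default_repr)
```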

Fortunately, these issues are easy to address by subclassing:

>>> class SimpleDataObject:
...     def __getattr__(self, name):
...         return None
...     def __repr__(self):
...         return '<%s %r>' % (self.__class__.__name__, self.__dict__)
... 
>>> class Attribute(SimpleDataObject):
...     def __init__(self, name=None, value=None):
...         self.name = name
...         self.value = value

The __getattr__ method is called only when normal attribute lookup fails, that is, when the name is found neither in the instance’s __dict__ nor on its class. SimpleDataObject takes advantage of this behavior to return None as a default value for undefined fields. All SimpleDataObjects also have a meaningful string representation.
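A short sketch of that fallback in action, repeating the two classes from above. The misspelled name `nmae` is deliberate: it illustrates a trade-off of this design, namely that every name appears to exist, so typos return None instead of raising.

```python
class SimpleDataObject:
    def __getattr__(self, name):
        # Reached only after normal attribute lookup has failed.
        return None

    def __repr__(self):
        return '<%s %r>' % (self.__class__.__name__, self.__dict__)


class Attribute(SimpleDataObject):
    def __init__(self, name=None, value=None):
        self.name = name
        self.value = value


attribute = Attribute(name='color')

assigned = attribute.name            # found in __dict__; __getattr__ is not called
undeclared = attribute.description   # lookup fails, so __getattr__ returns None
typo_looks_present = hasattr(attribute, 'nmae')  # True, because __getattr__ never raises

print(repr(attribute))
```

If silent defaults for misspelled names are a concern, one option is to have __getattr__ return None only for a known set of field names and raise AttributeError otherwise.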

Additional data object types can be defined in a similar fashion, and it is trivial to extend a new data object type from an existing one.  Some additional methods that may make sense to define in the base class include __eq__, __ne__, __hash__, and __iter__.
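As a rough sketch of what those extra methods might look like on the __dict__-based base class (one possible design, not the only one): equality and hashing compare field values, and iteration yields (field, value) pairs. Hashing on field values is only safe if the values are hashable and the object is not mutated while it is stored in a set or used as a dict key.

```python
class SimpleDataObject:
    def __getattr__(self, name):
        return None

    def __eq__(self, other):
        return (isinstance(other, self.__class__) and
                self.__dict__ == other.__dict__)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Field names are unique, so sorting the items never compares values.
        return hash(tuple(sorted(self.__dict__.items())))

    def __iter__(self):
        # Yield (field, value) pairs in a stable, name-sorted order.
        return iter(sorted(self.__dict__.items()))


class Attribute(SimpleDataObject):
    def __init__(self, name=None, value=None):
        self.name = name
        self.value = value


first = Attribute('color', 'orange')
second = Attribute('color', 'orange')
other = Attribute('color', 'red')

print(first == second, first != other, list(first))
```

Note that `list(first)` is used rather than `dict(first)`: because __getattr__ answers None for any name, `dict()` would mistake the object for a mapping with a `keys` attribute and fail.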

This approach or a similar variation should suffice for the vast majority of use cases.  But there are two aspects of this approach that can still be improved upon as needs dictate:

  1. Arbitrary fields can be created on any instance.  For the purposes of a DTO or Parameter Object, this is usually undesirable.
  2. Each instance __dict__ is a hash table, so the fast lookup time comes at the expense of memory. An application that keeps many instances alive (thousands or more) may consume a significant amount of memory.

Both concerns can be addressed with a lesser-known Python feature called __slots__. Declaring a data object’s “expected” fields as slots prevents the creation of an instance __dict__ (provided every class in the hierarchy declares __slots__) and disallows the creation of arbitrary fields.
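A minimal sketch of both effects, assuming CPython (sys.getsizeof reports shallow, implementation-specific sizes). The class names `WithDict` and `WithSlots` are hypothetical and exist only for this comparison.

```python
import sys


class WithDict(object):
    def __init__(self):
        self.name = None
        self.value = None


class WithSlots(object):
    __slots__ = ['name', 'value']

    def __init__(self):
        self.name = None
        self.value = None


plain = WithDict()
slotted = WithSlots()

# A fully slotted instance has no __dict__ at all.
slotted_has_dict = hasattr(slotted, '__dict__')

# Assigning a field that was not declared as a slot is rejected.
try:
    slotted.description = 'not declared'
    rejected = False
except AttributeError:
    rejected = True

# Shallow sizes: the plain instance also pays for its attribute dictionary.
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)

print(slotted_has_dict, rejected, plain_size, slotted_size)
```

The exact byte counts vary between Python versions, so they are best treated as a rough indication rather than a benchmark.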

However, because using __slots__ prevents the creation of an instance __dict__, instances of such a class cannot be pickled with the older pickle protocols (0 and 1) unless the class says how to capture and restore its state. If pickling support is needed, the __getstate__ and __setstate__ methods should be defined.
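The sketch below illustrates the difference on CPython 3. `Bare` and `Point` are hypothetical classes: the first declares slots only, the second adds the two state methods. Protocol 0 is requested explicitly because newer protocols can handle slots on their own.

```python
import pickle


class Bare(object):
    __slots__ = ['x', 'y']


class Point(object):
    __slots__ = ['x', 'y']

    def __getstate__(self):
        # Describe the instance as a plain dictionary of slot values.
        return {'x': self.x, 'y': self.y}

    def __setstate__(self, state):
        for name, value in state.items():
            setattr(self, name, value)


bare = Bare()
bare.x, bare.y = 1, 2

# Without __getstate__, the old protocol refuses a slotted instance.
try:
    pickle.dumps(bare, protocol=0)
    bare_pickled_with_protocol_0 = True
except TypeError:
    bare_pickled_with_protocol_0 = False

point = Point()
point.x, point.y = 1, 2

# With both methods defined, the round trip works under any protocol.
restored_old = pickle.loads(pickle.dumps(point, protocol=0))
restored_new = pickle.loads(pickle.dumps(point))

print(bare_pickled_with_protocol_0, restored_old.x, restored_new.y)
```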

The final recipe for a high-performance, yet versatile data object class looks like this:

>>> class DataObject(object):
...     __slots__ = []
...     def __getattr__(self, name):
...         return None
...     def __getstate__(self):
...         # Collect slot values from every class in the MRO, so that
...         # slots declared by parent classes are included.
...         state = {}
...         for cls in self.__class__.__mro__:
...             slots = getattr(cls, '__slots__', [])
...             if isinstance(slots, str):
...                 slots = [slots]
...             for slot in slots:
...                 state[slot] = getattr(self, slot)
...         return state
...     def __setstate__(self, state):
...         for (name, value) in state.items():
...             setattr(self, name, value)
...     def __eq__(self, other):
...         return (isinstance(other, self.__class__) and 
...                 (self.__getstate__() == other.__getstate__()))
...     def __ne__(self, other):
...         return (not self.__eq__(other))
...     def __repr__(self):
...         return '<%s %r>' % (self.__class__.__name__, self.__getstate__())

Notice that this base class defines an empty ([]) __slots__. This is necessary because otherwise DataObject instances would have a __dict__, and every subclass would inherit it: declaring __slots__ in subclasses would then neither save memory nor prevent the creation of arbitrary fields.
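The effect of the empty __slots__ can be checked directly. In this sketch all four class names are hypothetical; the only difference between the two hierarchies is whether the base class declares __slots__.

```python
class BaseWithoutSlots(object):
    pass


class BaseWithSlots(object):
    __slots__ = []


class Leaky(BaseWithoutSlots):
    __slots__ = ['name']


class Strict(BaseWithSlots):
    __slots__ = ['name']


leaky = Leaky()
leaky.anything = 1   # accepted: stored in the __dict__ inherited from the base

strict = Strict()
try:
    strict.anything = 1
    strict_rejected = False
except AttributeError:
    strict_rejected = True

print(hasattr(leaky, '__dict__'), hasattr(strict, '__dict__'), strict_rejected)
```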

Types of data objects can now be defined by subclassing DataObject and defining __slots__. Notice how the creation of an undeclared attribute (description) raises AttributeError, as expected:

>>> class Attribute(DataObject):
...     __slots__ = ['name', 'value']
... 
>>> attribute = Attribute()
>>> attribute.name = 'color'
>>> attribute.value = 'orange'
>>> attribute.description = 'The color orange (RGB: 255, 165, 0).'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Attribute' object has no attribute 'description'

A subclass of Attribute can be defined to account for an attribute data object that supports a text description:

>>> class AttributeWithDescription(Attribute):
...     __slots__ = ['description'] # in addition to name and value
... 
>>> attribute = AttributeWithDescription()
>>> attribute.name = 'color'
>>> attribute.value = 'orange'
>>> attribute.description = 'The color orange (RGB: 255, 165, 0).'
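To see the pieces working together across two levels of inheritance, here is a self-contained sketch written for Python 3. It assumes that __getstate__ returns only after it has visited every class in the MRO, so that slots declared by parent classes are part of the state. copy.copy is used for the round trip because it goes through the same __getstate__/__setstate__ hooks that pickle uses.

```python
import copy


class DataObject(object):
    __slots__ = []

    def __getattr__(self, name):
        return None

    def __getstate__(self):
        state = {}
        for cls in self.__class__.__mro__:
            slots = getattr(cls, '__slots__', [])
            if isinstance(slots, str):
                slots = [slots]
            for slot in slots:
                state[slot] = getattr(self, slot)
        # Return only after every class in the MRO has been visited.
        return state

    def __setstate__(self, state):
        for name, value in state.items():
            setattr(self, name, value)

    def __eq__(self, other):
        return (isinstance(other, self.__class__) and
                self.__getstate__() == other.__getstate__())

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return '<%s %r>' % (self.__class__.__name__, self.__getstate__())


class Attribute(DataObject):
    __slots__ = ['name', 'value']


class AttributeWithDescription(Attribute):
    __slots__ = ['description']  # in addition to name and value


attribute = AttributeWithDescription()
attribute.name = 'color'
attribute.value = 'orange'

state = attribute.__getstate__()   # includes the inherited slots
duplicate = copy.copy(attribute)   # round trip through __getstate__/__setstate__

print(repr(attribute))
print(duplicate == attribute, duplicate is attribute)
```

One consequence worth knowing: because the class defines __eq__ without __hash__, Python 3 makes its instances unhashable. Add a __hash__ method if instances need to live in sets or serve as dict keys.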

Summary

The final recipe defines a base class for DTO- or Parameter Object-like data object types that exhibits the following behaviors:

  • performs on par with simple class objects in terms of get/set speed
  • consumes significantly less memory than simple class objects (20% – 50% savings, depending on version of Python)
  • uses None as a reasonable default value for fields that have not been assigned a value
  • is subclassable, pickleable, and equality/inequality-comparable
  • has a meaningful string representation (for debugging, logging, etc.)
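The speed and memory claims depend on the interpreter, so they are worth measuring locally. The following is a rough measurement sketch, not the benchmark behind the figures above: it compares two hypothetical stand-in classes (`PlainAttribute` and `SlottedAttribute`) using only the standard timeit and tracemalloc modules, and the helper names are illustrative.

```python
import timeit
import tracemalloc


class PlainAttribute(object):
    def __init__(self):
        self.name = 'color'
        self.value = 'orange'


class SlottedAttribute(object):
    __slots__ = ['name', 'value']

    def __init__(self):
        self.name = 'color'
        self.value = 'orange'


def allocated_bytes(factory, count=10000):
    """Bytes allocated while creating `count` instances and keeping them alive."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    instances = [factory() for _ in range(count)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


def get_set_seconds(factory, number=20000):
    """Seconds spent on `number` set-then-get operations on one instance."""
    obj = factory()

    def work():
        obj.value = 'orange'
        return obj.value

    return timeit.timeit(work, number=number)


results = {}
for cls in (PlainAttribute, SlottedAttribute):
    results[cls.__name__] = {
        'bytes': allocated_bytes(cls),
        'seconds': get_set_seconds(cls),
    }

for name, measured in results.items():
    print('%-16s %8d bytes  %.4f s' % (name, measured['bytes'], measured['seconds']))
```

Timings from a single run are noisy; repeating the measurement several times and comparing the best results gives a fairer picture.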

The best part about this recipe is that it can be easily tailored to suit the needs of any application, simply by implementing (or not) methods such as __init__, __repr__, __eq__, __ne__, __iter__, and __getstate__/__setstate__.
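As one illustration of such tailoring, the sketch below adds a keyword-argument __init__ and an __iter__ to a trimmed-down base class. Both additions, and the `_field_names` helper, are hypothetical extensions rather than part of the recipe above.

```python
class DataObject(object):
    __slots__ = []

    def __init__(self, **fields):
        # setattr raises AttributeError for any name not declared as a slot.
        for name, value in fields.items():
            setattr(self, name, value)

    def __getattr__(self, name):
        return None

    def _field_names(self):
        # Walk the MRO from the base class down, so inherited fields come first.
        names = []
        for cls in reversed(self.__class__.__mro__):
            slots = getattr(cls, '__slots__', [])
            if isinstance(slots, str):
                slots = [slots]
            names.extend(slots)
        return names

    def __iter__(self):
        for name in self._field_names():
            yield name, getattr(self, name)


class Attribute(DataObject):
    __slots__ = ['name', 'value']


class AttributeWithDescription(Attribute):
    __slots__ = ['description']


attribute = AttributeWithDescription(name='color', description='A warm color.')
pairs = list(attribute)

try:
    Attribute(colour='orange')   # misspelled field name
    undeclared_rejected = False
except AttributeError:
    undeclared_rejected = True

print(pairs, undeclared_rejected)
```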

The PseudoStruct gist on GitHub provides a reasonable starting point.

Discuss or critique PseudoStruct in the comments below, in the gist comments, or at ActiveState Code Recipes.

Comparative timings of the PseudoStruct base class versus several alternatives (simple class, namedtuple, and recordtype) are available in my pseudostruct Bitbucket sandbox.

Published by

Matt

I am a software developer living and working in Northeast Ohio. I spend most of my time hacking on various Python projects.