Collections in Python’s standard library: dict

Warning: this post’s content is based on the Python 3.6rc1 source code – the most recent version available at the time of writing.

Dicts are omnipresent

A lot of things in Python are dicts. Your programs use dictionaries extensively even if you do not explicitly instantiate them. For example, there is a global registry of loaded modules, so they do not have to be read from disk every time you request them with an import statement.

>>> import sys
>>> type(sys.modules)
<class 'dict'>

Every class and every object uses a dict internally to store attributes – methods and variables. Unless they use __slots__, but that’s a different story.

class A:
    """Demo of using dicts in classes"""
    # class-level property
    SOME_PROPERTY = 123
    def __init__(self, some_property):
        # instance-level property
        self.some_property = some_property

A.__dict__  # {'__module__': '__main__', '__doc__': 'Demo of using dicts in classes', '__init__': <function __init__ at 0x10f109e60>, 'SOME_PROPERTY': 123}

Notice that the class stores all methods and class-level variables in its __dict__. The presence of __doc__ here may also be surprising, but it makes perfect sense – that is where the docstring lives.

a = A('some_value')
a.__dict__  # {'some_property': 'some_value'}

Creating an instance of A and assigning some_property in the __init__ method results in saving this property in the instance’s __dict__.

In the end we are able to reach class-level properties from an instance using the exact same notation as if we were asking the object for one of its instance-level attributes.

a.SOME_PROPERTY

This works because Python uses both __dict__s to find the requested attribute. First the instance-level __dict__ is checked, then the interpreter moves on to the class-level __dict__ (and those of the base classes). Finally, if no attribute is found, AttributeError is raised.
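We can observe this lookup order directly (a small illustration, reusing the class A defined above):

a = A('some_value')
'some_property' in a.__dict__   # True – found at the instance level
'SOME_PROPERTY' in A.__dict__   # True – found at the class level
a.SOME_PROPERTY                 # 123, resolved through A.__dict__
a.no_such_attribute             # raises AttributeError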

These examples are just the tip of the iceberg. Nevertheless, they show how heavily dict is utilized in Python.

Dicts are meant to be fast

…which is not trivial, given that dicts have to be

all things to all people

as was once stated in Beautiful Code. There are some interesting details about Python’s dict implementation. This blog post revolves around them, pointing out how they can be leveraged.

A good dictionary is expected to provide an efficient lookup operation. In other words, if we ask a dict whether a value is present under a given key, we expect the answer to be very fast. Inserting a new (key, value) pair into the dictionary is the second most commonly optimized operation.

The most common dictionary implementation is based on a so-called hash table. This means that it resides in memory as a contiguous area where individual values are stored in separate cells.

Resembles a list, doesn’t it? Indeed, under the hood it uses the same data structure. The special part about dict is that the underlying table is sparse – there are many unused cells. So, if a dict’s values are stored in a numerically indexed table (hey, that’s just a C array of PyObject pointers!), how are they mapped to nice, human-readable keys? This is where hashing comes into play.
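To make the idea more tangible, here is a deliberately naive sketch of such a sparse table, built on a plain Python list – a toy model only, not how CPython actually stores dict entries:

TABLE_SIZE = 8
table = [None] * TABLE_SIZE          # sparse table – most cells stay empty

def insert(table, key, value):
    index = hash(key) % len(table)   # translate the key into a numerical index
    table[index] = (key, value)      # collisions are ignored in this toy model

insert(table, 'spam', 1)
insert(table, 'eggs', 2)
# the two pairs now sit somewhere among mostly empty cells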

In simple words, hashing is the process of translating a key into a numerical value that is used to index the underlying table. An ideal hash function would never translate different keys to the same index (when it does, we speak of a “collision”) and would never leave empty space in the table. As we all know, there are no ideal algorithms, only solutions that solve a certain problem in a specific context while causing new ones – or, more diplomatically, “having many undesired features”. In real life, collisions occur.

Collision: the same index (0) was assigned to k1 and k2.
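With such a small table, a collision is easy to provoke. Since string hashes are randomized between interpreter runs, the sketch below searches for a colliding pair instead of hard-coding one:

TABLE_SIZE = 8
slots = [None] * TABLE_SIZE
n = 0
while True:
    key = 'key%d' % n
    index = hash(key) % TABLE_SIZE
    if slots[index] is not None:
        print('collision: %r and %r both map to slot %d' % (slots[index], key, index))
        break
    slots[index] = key
    n += 1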

In real life, the tables used for dictionaries contain many empty, unused cells that are nothing but wasted memory.

Actually, these two properties are connected. Extra space is always left to better cope with collisions: when a collision occurs, we cannot just pretend that nothing happened – we can neither discard the old key-value pair nor ignore the attempt to insert the new value.

In CPython there is quite a sophisticated algorithm for finding a free slot when a collision occurs, but it is beyond the scope of this post.

Instead, let’s talk about the space occupied – or: how many empty cells have to be maintained? The source code says that dictionaries are filled up to 2/3 of their capacity; once that threshold is crossed, they are extended using the formula:

used*2 + capacity/2

where used is the number of cells filled with data and capacity is the overall number of cells, whether they are used or not. A less obvious consequence of this formula is that dicts double in size when growing, as long as no deletions have happened.
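A back-of-the-envelope sketch of that rule (the function name is mine, not CPython’s; the real resize additionally rounds the result up to the next power of two, which is where the doubling comes from):

def new_capacity(used, capacity):
    # growth rule quoted above, before rounding up to a power of two
    return used * 2 + capacity // 2

capacity = 8
used = 5                        # roughly 2/3 of the cells are filled
new_capacity(used, capacity)    # 14 – rounded up to 16, i.e. doubled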

The dict implementation changes slightly with every version of CPython. Python 3.6 introduced another feature, namely…

dict preserves order of inserted keys since Python 3.6

That’s an implementation detail and you should never rely on it! It may change in future Python releases. If you like this feature, use collections.OrderedDict instead.
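If ordering matters to your code, spelling that intent out keeps it portable across Python versions and implementations:

from collections import OrderedDict

d = OrderedDict()
d['first'] = 1
d['second'] = 2
d['third'] = 3
list(d)   # ['first', 'second', 'third'] – guaranteed, not an implementation accident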

Prior to version 3.6 the order of keys was arbitrary and could change when the dict grew and the whole structure was reindexed as the dict was mutated.

The general idea of the new implementation may be found here.

Conclusion

Dicts are powerful, yet memory-costly collections. They are very fast once populated. However, big dictionaries are much less efficient when it comes to continuously adding and removing elements.
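A rough way to see the memory cost for yourself – the exact numbers depend on the Python version and platform, so treat them as illustrative only:

import sys

keys = list(range(1000))
d = dict.fromkeys(keys)

sys.getsizeof(keys)   # the list object holding 1000 references
sys.getsizeof(d)      # noticeably larger – the sparse table has its price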

 
