I often end up with configuration dictionaries in my projects. They provide a convenient way of storing parameters at runtime. But what if we want to uniquely identify a configuration dictionary, for instance to check whether an identical configuration has already been run and potentially rerun it?

Hash functions are often used to detect changes to content. For example, running md5sum [file] prints a fixed-size digest of the file. The MD5 message-digest algorithm is very common for simple hashing needs where security is not a concern. The main use case of Python's built-in hash function, on the other hand, is comparing dictionary keys during a lookup. Anything that is hashable can be used as a key in a dictionary, for example {(1,2): "hi there"}.
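
To make the contrast concrete, here is a minimal sketch using only the standard library. The byte string being hashed and the variable names are made up for illustration.

```python
import hashlib

# A content digest: the same bytes always give the same fixed-size digest,
# regardless of machine or Python process.
digest = hashlib.md5(b"learning_rate=0.1").hexdigest()
print(len(digest))  # MD5 digests are 128 bits, i.e. 32 hex characters

# The built-in hash() is what dictionary lookups rely on.
# Hashable objects such as tuples can be used as keys...
lookup = {(1, 2): "hi there"}
print(lookup[(1, 2)])

# ...but a dictionary is mutable, so it is not hashable itself.
try:
    hash({"a": 1})
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False
print(dict_is_hashable)
```

So the built-in hash cannot be applied to a dictionary directly, while a digest such as MD5 works on any bytes we can produce from it.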

This situation sets us up for a simple MD5-based hashing of dictionaries. Here are some things I think we need to assume and look out for:

  • The hash needs to be aware that {'a': 1, 'b': 2} and {'b': 2, 'a': 1} are the same dictionary. Ordering of keys should not matter.
  • Values could be of any type, such as lists and floats. We will assume every value can be serialised as a string.
  • Finally, we assume the keys are strings, which allows us to order them.
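
A small sketch of why these assumptions matter, using the standard json module as one possible serialiser. The example dictionaries and variable names are illustrative only.

```python
import json

a = {"a": 1, "b": 2}
b = {"b": 2, "a": 1}

# Equal as dictionaries, but a naive serialisation keeps insertion order,
# so the two strings (and any digest of them) would differ.
assert a == b
naive_differs = json.dumps(a) != json.dumps(b)

# Sorting the keys yields one canonical string for both.
canonical_a = json.dumps(a, sort_keys=True)
canonical_b = json.dumps(b, sort_keys=True)

# Keys of mixed types cannot be ordered, which is why we assume string keys.
try:
    json.dumps({1: "x", "y": 2}, sort_keys=True)
    mixed_keys_ok = True
except TypeError:
    mixed_keys_ok = False

print(naive_differs, canonical_a == canonical_b, mixed_keys_ok)
```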

With these constraints, we cover a reasonable range of dictionaries we would like to hash. Here is the dictionary hashing function I ended up using:

from typing import Dict, Any
import hashlib
import json

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort the keys so {'a': 1, 'b': 2} hashes
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()

which uses json as the serialisation method and sorts the keys to handle the first constraint. It is also stable across platforms and runs since it does not rely on Python's built-in hash function, whose values for strings are randomised per process by default. You can think of this snippet as the start of a bigger library for handling hyper-parameters, configuration and more. I prefer small libraries that can be extended over a single Python library that does it all, and this snippet provides an easy-to-understand starting point for handling dictionary hashes.
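
As a rough usage sketch, the digest can serve as a key for remembering which configurations have already been run. The `seen` set, the `is_new_config` helper and the example configurations are hypothetical names introduced here for illustration. dict_hash is repeated in compact form so the snippet runs on its own.

```python
import hashlib
import json
from typing import Any, Dict, Set


def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary (same approach as the function above)."""
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    return hashlib.md5(encoded).hexdigest()


# Hypothetical bookkeeping: digests of configurations seen so far.
seen: Set[str] = set()


def is_new_config(config: Dict[str, Any]) -> bool:
    """Return True the first time a configuration is seen, False afterwards."""
    digest = dict_hash(config)
    if digest in seen:
        return False
    seen.add(digest)
    return True


first = is_new_config({"lr": 0.1, "layers": [64, 64]})
second = is_new_config({"layers": [64, 64], "lr": 0.1})  # same config, reordered
third = is_new_config({"lr": 0.01, "layers": [64, 64]})  # different value
print(first, second, third)
```

Values that json cannot serialise raise a TypeError. Passing default=str to json.dumps is one possible fallback, at the cost of relying on each object's string form.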