UPDATE: After some consideration and requests from readers, I have put together a simpler, argparse-based, multiple-file configuration setup in this post.

When building machine learning scripts there are a lot of things we set as hyper-parameters: the number of layers, the number of hidden neurons, the batch size and even the number of epochs to train for all need to be specified. There are really four common ways to handle configuration:

  • A monster configuration object such as CONFIG = {'host': 'localhost'} stored in a module and imported wherever it is needed. The user edits the module to adjust settings. This is common in web servers where you specify global variables in a module, as in Flask. I find it works when the configuration is fairly static, and since it is Python you can use other functions to compute values, for example detecting the operating system. The main disadvantage is that it is not portable: you are tied to Python, a programming language playing the role of a configuration file.
  • A safe option is the built-in ConfigParser, which reads a .ini configuration file and essentially provides a dictionary-style object to read values from. This setup is a good start for smaller projects with simple parameters. By simple, I mean it only produces string parameters: EPOCHS = 20 needs to be cast to an integer, which I find creates overhead when reading parameters (see the sketch after this list).
  • Something not to overlook is environment variables, which can store environment-specific options such as a host name via os.environ.get('HOST'), but they are ill-suited to storing machine-learning-style hyper-parameters.
  • Finally and most commonly, a lot of people use argparse to pass parameters from the command line, for example to change the number of epochs. You’ll often see python3 train.py --epochs 20 --input_file data.txt style trains of parameters being specified to configure the script. In fact, I was part of this group, desperately trying to manage an ever-increasing number of parameters and orchestrating global variables around the modules that store the arguments.
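
To make the casting overhead of the ConfigParser route concrete, here is a minimal sketch; the settings.ini file name and the [training] section are hypothetical, purely for illustration:

# configparser_sketch.py
from configparser import ConfigParser

config = ConfigParser()
config.read("settings.ini")  # e.g. a [training] section with epochs = 20 and batch_size = 32

# Every value comes back as a string, so numeric parameters need explicit casting.
epochs = config.getint("training", "epochs")          # or int(config["training"]["epochs"])
batch_size = config.getint("training", "batch_size")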

Having played around with almost all of these configuration options, I decided to come up with a more subtle but scalable option for my needs. Let’s take a simple example of a common function:

def train(dataf="data.txt", epochs=20, batch_size=32, model="lstm", layers=2):
    d = load_data(dataf)
    m = create_model(model, layers)
    m.fit(d, epochs=epochs, batch_size=batch_size)

I end up either creating a monster global variable that gets passed around or, as in the example above, cascading the configuration down the call chain. The train function shouldn’t have to worry about the data file or the model; it only cares about the number of epochs and the batch size, while load_data and create_model should handle the rest. I found both options, global arguments and cascading, rather ugly and difficult to scale. Let’s look at the current setup I have:

# train.py
from config import extern

@extern
def load_data(dataf="data.txt"):
    # loads data...
    return data

@extern
def create_model(model="lstm", layers=2):
    # creates model...
    return model

@extern
def train(epochs=20, batch_size=32):
    d = load_data()
    m = create_model()
    m.fit(d, epochs=epochs, batch_size=batch_size)

Then I specify the keyword arguments through a YAML file. The reason I chose YAML is that it handles type conversion, so I don’t have to cast numbers to actual integers or floats, and it allows arrays and even Python objects. Basically, it is very expressive. As a starting point I create my configuration file (a quick check of the parsed types follows it):

# default.yaml
train.py:
  load_data:
    dataf: "some/path/data.txt"
  create_model:
    model: "gru"
    # don't have to specify layers, default from code will be used
  train:
    epochs: 40
    batch_size: 16
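
As a quick check of the type handling, loading the file with PyYAML (assuming it is saved as configs/default.yaml, the path the script uses later) already returns properly typed values:

# yaml_check.py -- small sanity check, not part of the actual setup
import yaml

with open("configs/default.yaml") as f:
    conf = yaml.safe_load(f)  # safe_load handles ints, floats, strings, lists and dicts

print(conf["train.py"]["train"])
# {'epochs': 40, 'batch_size': 16} -- integers, no casting needed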

This is just one way of organising the configuration, granular at the level of individual functions in a file. Then I create a decorator that loads the configuration file and uses it to update only the keyword arguments:

# config.py
import argparse
import functools
from pathlib import Path
import yaml

# We only specify the yaml file from argparse and handle the rest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("-f", "--config_file", default="configs/default.yaml", help="Configuration file to load.")
ARGS = parser.parse_args()

# Let's load the yaml file here
with open(ARGS.config_file, 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)  # recent PyYAML requires an explicit Loader
print(f"Loaded configuration file {ARGS.config_file}")

def extern(func):
    """Wraps keyword arguments from the configuration."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        """Injects configuration keywords."""
        # We get the file name in which the function is defined, ex: train.py
        fname = Path(func.__globals__['__file__']).name
        # Then we extract the arguments corresponding to the function name,
        # ex: train.py -> load_data; we copy so the loaded config is never mutated,
        # and missing entries simply fall back to the defaults in the signature
        conf = dict(config.get(fname, {}).get(func.__name__, {}))
        # And update with any explicitly passed keyword arguments,
        # so arguments given at the call site always win over the file
        conf.update(kwargs)
        return func(*args, **conf)
    return wrapper

Job done. Now whenever a function needs keyword arguments from the configuration, it can use from config import extern and the @extern decorator, which reads the YAML configuration file and updates the passed keyword arguments on the fly. When trying different hyper-parameters I create different YAML files that clearly state the configuration for that particular train.py or model.py, so things don’t spiral out of control.
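
For example, assuming train.py calls train() when executed as a script, a second experiment is just another file; the name experiment1.yaml below is purely illustrative:

# configs/experiment1.yaml
train.py:
  load_data:
    dataf: "some/path/data.txt"
  create_model:
    model: "lstm"
    layers: 4
  train:
    epochs: 100
    batch_size: 64

which is then selected on the command line without touching any code:

python3 train.py -f configs/experiment1.yaml

Any keyword argument passed explicitly, such as train(epochs=5), still takes precedence over the file, since the decorator applies the configuration first and the passed arguments last.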