Training neural networks with CBO#

This notebook shows how to train a simple neural network on the MNIST dataset. We employ PyTorch, which provides a very convenient machine learning environment. In particular, we will see how CBX can run on the GPU.

We start by loading the necessary packages.

[ ]:
%load_ext autoreload
%autoreload 2
import cbx as cbx
from cbx.dynamics.cbo import CBO
import torch
import torch.nn as nn
import torchvision
import cbx.utils.resampling as rsmp

Load the data: MNIST#

We load the train and test data. In this case we use the MNIST dataset, which we assume to be available locally. You can set the variable data_path to point to the directory where the data is stored.

[ ]:
data_path = "../../../../datasets/" # This path directs to one level above the CBX package
transform = torchvision.transforms.ToTensor()
train_data = torchvision.datasets.MNIST(data_path, train=True, transform=transform, download=False)
test_data = torchvision.datasets.MNIST(data_path, train=False, transform=transform, download=False)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True, num_workers=0)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=False, num_workers=0)

Specify the device#

We now specify the device to run everything on. If CUDA is available, we perform most of the calculations on the GPU!

[ ]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

Load model type#

We now load the model that we want to employ in the following. The variable model_class specifies the kind of network we want to use, in this case a Perceptron with one hidden layer.

[ ]:
from models import Perceptron

model_class = Perceptron
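
The module models is not part of the cbx package itself but is presumably a local models.py next to this notebook. Its exact contents may differ, but a minimal Perceptron with one hidden layer, matching the sizes=[784, 100, 10] constructor used below, could look like the following sketch:

[ ]:
# Hypothetical sketch of models.Perceptron; the file shipped with the notebook may differ.
import torch.nn as nn

class Perceptron(nn.Module):
    def __init__(self, sizes=(784, 100, 10)):
        super().__init__()
        self.fc1 = nn.Linear(sizes[0], sizes[1])  # input -> hidden
        self.fc2 = nn.Linear(sizes[1], sizes[2])  # hidden -> output (logits)
        self.act = nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)                # flatten the 28x28 MNIST images
        return self.fc2(self.act(self.fc1(x)))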

Model initialization#

We now initialize the parameters that will be used in the optimization. First we decide how many particles we want to use in the following, in this case N=50.

Initializing the weights#

We initialize the parameters by creating a list containing N different initializations: models = [model_class(sizes=[784,100,10]) for _ in range(N)]. This list is then transformed into a torch tensor w of shape (N, d) with the function flatten_parameters, where d is the number of trainable parameters per network. We save further properties, such as the names and shapes of each parameter, in the variable pprop, which later allows us to perform the inverse operation of flatten_parameters.

Do we save the whole list models?#

The important thing to realize in the following is that we do not work with the list models anymore; we only created it to initialize the parameters. The only thing that is updated is the tensor w that stores the flattened parameters. Every time we want to evaluate the ensemble, we make use of the function `functional_call <https://pytorch.org/docs/stable/generated/torch.func.functional_call.html>`__, which takes the tensor w together with one realization of the model and applies it. That is why we set model = models[0], in order to have one such realization.
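
As a toy illustration (this cell is not part of the training pipeline), functional_call applies a module using an externally supplied parameter dictionary instead of the parameters stored on the module itself:

[ ]:
# Toy illustration of torch.func.functional_call: evaluate a module with an
# external set of parameters instead of the ones registered on the module.
import torch
import torch.nn as nn
from torch.func import functional_call

lin = nn.Linear(4, 2)
params = {k: v.clone() for k, v in lin.named_parameters()}  # external copy of the weights
x = torch.randn(3, 4)
out = functional_call(lin, params, (x,))  # same as lin(x), but using the supplied weights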

[ ]:
from cbx.utils.torch_utils import (
    flatten_parameters, get_param_properties, eval_losses, norm_torch,
    compute_consensus_torch, standard_normal_torch, eval_acc,
    effective_sample_size
)
N = 50  # number of particles in the ensemble
models = [model_class(sizes=[784,100,10]) for _ in range(N)]
model = models[0]  # one realization, used together with functional_call
pnames = [p[0] for p in model.named_parameters()]  # names of the trainable parameters
w = flatten_parameters(models, pnames).to(device)  # ensemble tensor of shape (N, d)
pprop = get_param_properties(models, pnames=pnames)  # parameter shapes etc., needed to un-flatten

The objective function#

We now define the function that we want to optimize. For a single particle \(w_i\) it is defined as

\[f(w_i) := \sum_{(x,y)\in\mathcal{T}} \ell(N_{w_i}(x), y),\]

where \(\mathcal{T}\) is our training set, \(N_{w_i}\) denotes the neural net parametrized by \(w_i\), and \(\ell\) is some loss function. We usually do not evaluate the loss on the whole training set, but rather on so-called mini-batches \(B\subset\mathcal{T}\), which is incorporated in the objective function below.

[ ]:
class objective:
    def __init__(self, train_loader, N, device, model, pprop):
        self.train_loader = train_loader
        self.data_iter = iter(train_loader)
        self.N = N
        self.epochs = 0
        self.device = device
        self.loss_fct = nn.CrossEntropyLoss()
        self.model = model
        self.pprop = pprop
        self.set_batch()

    def __call__(self, w):
        # evaluate the loss of every particle on the current mini-batch
        return eval_losses(self.x, self.y, self.loss_fct, self.model, w[0,...], self.pprop)

    def set_batch(self):
        # draw the next mini-batch; restart the loader after a full epoch
        (x, y) = next(self.data_iter, (None, None))
        if x is None:
            self.data_iter = iter(self.train_loader)
            (x, y) = next(self.data_iter)
            self.epochs += 1
        self.x = x.to(self.device)
        self.y = y.to(self.device)

Set up CBX Dynamic#

We now set up the dynamic. First we set some parameters.

[ ]:
kwargs = {'alpha': 50.0,
          'dt': 0.1,
          'sigma': 0.1,
          'lamda': 1.0,
          'max_it': 200,
          'verbosity': 0,
          'batch_args': {'batch_size': N},
          'check_f_dims': False}
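
For orientation (this is the generic CBO iteration, not code from this notebook), the parameters above roughly enter the anisotropic CBO update

\[c_\alpha(x) := \frac{\sum_{i=1}^{N} x_i\, e^{-\alpha f(x_i)}}{\sum_{j=1}^{N} e^{-\alpha f(x_j)}}, \qquad x_i \leftarrow x_i - \lambda\, \Delta t\, \big(x_i - c_\alpha(x)\big) + \sigma \sqrt{\Delta t}\, \big(x_i - c_\alpha(x)\big) \odot \xi_i,\]

with \(\xi_i \sim \mathcal{N}(0, I)\): alpha weights the consensus point \(c_\alpha\), dt is the step size \(\Delta t\), lamda the drift strength \(\lambda\), sigma the noise strength \(\sigma\), and max_it the maximal number of iterations.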

How to incorporate torch in CBX?#

The interesting question now is how we can incorporate torch into CBX. Usually, CBX assumes numpy as the underlying array package; however, this can be modified. If we only work on the CPU, we would not have to change too much, because then numpy and torch can use the same underlying memory and numpy's ufuncs can act on torch tensors. This does not work if we want to work on the GPU!
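
A quick, purely illustrative check of this point: numpy operations accept a CPU tensor, but fail once the tensor lives on the GPU.

[ ]:
# Illustration only: numpy can consume CPU torch tensors, but not CUDA tensors.
import numpy as np
import torch

t_cpu = torch.ones(3)
print(np.exp(t_cpu))                  # works on the CPU tensor

if torch.cuda.is_available():
    t_gpu = torch.ones(3, device='cuda')
    try:
        np.exp(t_gpu)                 # fails: CUDA tensors cannot be converted implicitly
    except (TypeError, RuntimeError) as e:
        print(e)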

Where is numpy actually relevant?#

Since standard CBO does not require many advanced array operations, the question arises: where does it actually make a difference what the ensemble x is? We list the most important aspects below:

  • The ensemble x must implement basic array operations, such as +, -, *, /, scalar multiplication, some broadcasting rules and numpy-style indexing.

  • Copying: in certain situations it is important to copy the ensemble. In numpy we use np.copy; the torch analogue is torch.clone. The ParticleDynamic allows us to specify which function should be used with the keyword argument copy=....

  • Generating random numbers: all noise methods in cbx call an underlying random number generator. In fact, most of the time we want to obtain an array of a certain size, with the entries distributed according to the standard normal distribution. Therefore, the ParticleDynamic allows us to specify a callable (passed as sampler=... below) that is used whenever we want to sample. In our case we employ a wrapper class around torch.normal, defined in cbx.utils.torch_utils; a conceptual sketch of such a wrapper is given after this list.

  • Computing the consensus: the consensus computation also requires some modification. Most notably, for numerical stability we employ a `logsumexp <https://en.wikipedia.org/wiki/LogSumExp>`__ function, which has to be replaced by the torch variant. We can specify which function to use with the keyword compute_consensus=... and use the function defined in cbx.utils.torch_utils.
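
As referenced above, here is a conceptual sketch of such a sampler wrapper. It is not the actual standard_normal_torch implementation from cbx, and the call signature cbx expects is an assumption made for illustration:

[ ]:
# Conceptual sketch only; the real cbx.utils.torch_utils.standard_normal_torch may differ,
# and the call signature assumed here (a single shape argument) is an assumption.
import torch

class normal_sampler_torch:
    def __init__(self, device):
        self.device = device

    def __call__(self, size):
        # standard-normal samples of the requested shape, created directly on self.device
        return torch.randn(size, device=self.device)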

[ ]:
f = objective(train_loader, N, device, model, pprop)
resampling = rsmp.resampling([rsmp.loss_update_resampling(wait_thresh=40)])

dyn = CBO(f, f_dim='3D', x=w[None,...], noise='anisotropic',
          norm=norm_torch,
          copy=torch.clone,
          sampler=standard_normal_torch(device),
          compute_consensus=compute_consensus_torch,
          post_process = lambda dyn: resampling(dyn),
          **kwargs)
sched = effective_sample_size(maximum=1e7, name='alpha')

Train the network#

Now that everything is set up, we can start the training loop :)

[ ]:
e = 0
while f.epochs < 10:
    dyn.step()           # one CBO iteration on the current mini-batch
    sched.update(dyn)    # adapt alpha according to the effective sample size
    f.set_batch()        # draw the next mini-batch
    if e != f.epochs:    # report once per epoch
        e = f.epochs
        print(30*'-')
        print('Epoch: ' + str(f.epochs))
        acc = eval_acc(model, dyn.best_particle[0,...], pprop, test_loader)
        print('Accuracy: ' + str(acc.item()))
        print(30*'-')