Backpropagation Details

The backpropagation process contains reasonably complex mathematics (at least for people without an extensive background in maths). Even if you can grasp the individual formulas and equations involved, their overall, high level effect can be difficult to see, as the examples given in machine learning courses often involve large matrices of weights and activation values (e.g with written character recognition). This means that the reason weights change by certain values can be hard to see and understand. To compound this, the implementation of the formulas involved is often abstracted inside large matrix multiplications, so it can be hard to follow individual granular steps in a given equation.

The SimpleML.BasicFeedForwardNeuralNetwork class implements gradient descent using backpropagation in its Train() method. Its implementation of backpropagation is kept simple and easy to understand by the following features...

The backpropagation algorithm implemented uses minimal regularization techniques (just a simple learning rate)
The algorithm is implemented using 2-dimensional arrays and for loops rather than matrices and matrix multiplication

Throughout this page I'll refer to Peter Sadowski's document 'Notes on Backpropagation' which is one of the most straightforward explanations of backpropagation I've found.

Basics

This section explains the core entities and mathematical functions involved in the backpropagation process...

Cost Function (aka loss function)

The cost function is a fundamental element in backpropagation, and any gradient descent process. It's used to measure the difference between the actual result values of a set of training data, and the activation values produced by the neural network on the same set of training data. There are several different cost functions which can be applied to neural networks and other machine learning techniques. BackPropagationVisualization uses the cross-entropy cost function. Squared error and quadratic are other types of cost function.

Derivative of Cost Function (aka gradient)

The derivative of the cost function is used in backpropagation to find the amount/degree that a change in one of the weights, produces a change in the overall cost. During the backpropagation process this amount of change is used to update the weights.

Activation Function

This is a mathematical function that produces the output value of a single neuron/unit in the network, based on the input and weight values which feed into that neuron. The sigmoid function is used in BasicFeedForwardNeuralNetwork, but there are many other variations.

Target Values (aka result values, y values, labels)

These are the known correct results for each of the training examples. In the context of BackPropagationVisualization, the target values are always 0 or 1, representing logical false and true.

Neural Network Representation

A neural network is often depicted with the neurons/units in each layer as circles and the weights between them as lines. What's sometimes not clear from these diagrams is that the line representing each weight has a value attached to it, and is equally important as the neurons it connects. The diagram below shows how the elements in a typical neural network diagram map to the UI elements in BackPropagationVisualization.

Backpropagation Step by Step

Terminology

Firstly we'll define some of the parameters used in the training...

Batch Size - This is the number of training examples that are pushed through the backpropagation process before the weights are updated. The batch size must be less than or equal to the number of training examples.

Number of Epochs - Completing 1 epoch means pushing all the training examples through the process once. Since there are always 4 training examples in BackPropagationVisualization, 1 epoch is completed each time all 4 of these training examples have been processed.

Note that the batch size can be less than the total number of training examples, and it's possible to set a batch size not equally divisible into the number of training examples. For example, using a batch size of 3, with 4 epochs will result in 6 total updates of the weights and a final 'partial' batch as depicted below...

The following section describes the backpropagation process implemented in the SimpleML.BasicFeedForwardNeuralNetwork class step by step...

Initialization and Setup

At the start of the Train() method, 2-dimensional arrays 'inputToHiddenLayerGradients' and 'hiddenToOutputLayerGradients' are declared, which hold the gradients corresponding to each of the weights in the network...

// Create arrays to hold the gradients for each of the weights
Double[,] inputToHiddenLayerGradients = new Double[numberOfHiddenUnits, numberOfInputUnits];
Double[,] hiddenToOutputLayerGradients = new Double[numberOfOutputUnits, numberOfHiddenUnits];

Forward Pass

During the forward pass, the input values are multiplied forward through the layers of the network, producing a logit (n.b. not 100% sure if I'm using the correct terminology here) and then activation value at each unit in the hidden layer, and then the output layer. The following nested for loops calculate the activation (sigmoid function) value for each of the hidden layer units...

// Calculate the activation values for the hidden layer
for (Int32 k = 0; k < numberOfHiddenUnits; k++)
{
    Double currentHiddenUnitLogitValue = 0.0;

    // Sum up the products of the input unit and the corresponding weight which feeds into the current hidden unit
    for (Int32 m = 0; m < numberOfInputUnits; m++)
    {
        currentHiddenUnitLogitValue += currentTrainingCase[m] * inputToHiddenLayerWeights[k, m];
    }
    // Apply the activation function, and store the value
    hiddenLayerActivationValues[k] = ApplySigmoidFunction(currentHiddenUnitLogitValue);
}

The for loops map to the following UI and equation components (below equations are taken from page 1 of 'Notes on Backpropagation')...

Then a similar routine is used to calculate the activation value for the output unit

// Calculate the activation value for the output layer
for (Int32 k = 0; k < numberOfOutputUnits; k++)
{
    Double currentOutputUnitLogitValue = 0.0;

    for (Int32 m = 0; m < numberOfHiddenUnits; m++)
    {
        currentOutputUnitLogitValue += hiddenLayerActivationValues[m] * hiddenToOutputLayerWeights[k, m];
    }
    outputLayerActivationValues[k] = ApplySigmoidFunction(currentOutputUnitLogitValue);
}

Finally the method CalculateCost() is used to calculate the cost difference between the output unit activation value, and the actual result of the current training example (again, the cost function equation is taken from page 1 of 'Notes on Backpropagation')...

private Double CalculateCost(Double[] targetValues, Double[] outputLayerActivationValues)
{
    Double cost = 0.0;
    for (Int32 i = 0; i < targetValues.Length; i++)
    {
        cost = cost + (targetValues[i] * Math.Log(outputLayerActivationValues[i])) + ((1 - targetValues[i]) * Math.Log(1 - outputLayerActivationValues[i]));
    }
    return -cost;
}

Backward Pass

In the backward pass the derivative of the cost function (gradient) is calculated for each neuron/unit in the network. This shows how much a change in each unit will affect the cost function. First the hidden to output layer gradients are calculated according to the equation below (see page 2 of 'Notes on Backpropagation')...

// Calculate the hidden to output layer gradients
for (Int32 k = 0; k < numberOfOutputUnits; k++)
{
    for (Int32 m = 0; m < numberOfHiddenUnits; m++)
    {
        hiddenToOutputLayerGradients[k, m] += (outputLayerActivationValues[k] - currentTarget[k]) * hiddenLayerActivationValues[m];
    }
}

Then the input to hidden layer gradients are calculated as per below (again page 2 of 'Notes on Backpropagation') ...

// Calculate the input to hidden layer gradients
for (Int32 k = 0; k < numberOfHiddenUnits; k++)
{
    for (Int32 m = 0; m < numberOfInputUnits; m++)
    {
        Double currentHiddenToOutputLayerGradient = 0.0;

        for (Int32 n = 0; n < numberOfOutputUnits; n++)
        {
            currentHiddenToOutputLayerGradient += ((outputLayerActivationValues[n] - currentTarget[n]) * hiddenToOutputLayerWeights[n, k] * (hiddenLayerActivationValues[k] * (1 - hiddenLayerActivationValues[k])) * currentTrainingCase[m]);
        }

        inputToHiddenLayerGradients[k, m] += currentHiddenToOutputLayerGradient;
    }
}

The above forward and backward pass process is repeated for the specified batch size. The gradient values for each training example in the batch are accumulated into arrays 'inputToHiddenLayerGradients' and 'hiddenToOutputLayerGradients'. When the batch is completed, the gradients are averaged over the batch size by the DivideArrayElements() method.

// Average the gradients by the number of training cases in the current batch
DivideArrayElements(inputToHiddenLayerGradients, Convert.ToDouble(completedTrainingCasesInCurrentBatchCount));
DivideArrayElements(hiddenToOutputLayerGradients, Convert.ToDouble(completedTrainingCasesInCurrentBatchCount));

Finally, the weight layers are updated by simply subtracting the corresponding gradient, multiplied by the learning rate...

// Update the input to hidden layer weights
for (Int32 j = 0; j < inputToHiddenLayerWeights.GetLength(0); j++)
{
    for (Int32 k = 0; k < inputToHiddenLayerWeights.GetLength(1); k++)
    {
        inputToHiddenLayerWeights[j, k] -= inputToHiddenLayerGradients[j, k] * learningRate;
    }
}

// Update the hidden to output layer weights
for (Int32 j = 0; j < hiddenToOutputLayerWeights.GetLength(0); j++)
{
    for (Int32 k = 0; k < hiddenToOutputLayerWeights.GetLength(1); k++)
    {
        hiddenToOutputLayerWeights[j, k] -= hiddenToOutputLayerGradients[j, k] * learningRate;
    }
}