How Do Neural Networks Learn?

Joana Owusu-Appiah
Jul 2, 2024


The short answer: through gradient descent and backpropagation. The long answer is, well, let’s explore it in detail.

In this post, I present information on neural networks, gradient descent, and backpropagation in a non-technical manner, explaining how these networks learn to perform tasks.

Any introductory material on deep learning will discuss neurons and neural networks. Both deep learning and machine learning involve training a model to learn patterns from data in order to perform a task. The key difference is that deep learning mimics the behavior of biological neurons: numerical input data is weighted, processed, and passed on to other connected neurons until the task is carried out.

A neuron, the basic unit of a neural network, is where it all begins. A layer contains several neurons stacked vertically; edges (lines) connect them horizontally to neurons in neighboring layers (the operations happen inside the neurons, while the edges carry the data between them). A typical neural network consists of input, hidden, and output layers.

perceptron (artificial neuron) — drawn by me

Neural Network

  • Input Layer: The input layer takes the numerical input and passes it on to the hidden layers.
  • Hidden Layers: The hidden layers consist of interconnected neurons that further process the input data (here, the inputs are scaled by weights, the weighted inputs are summed with a bias, and the result is passed through an activation function).
  • Output Layer: The output layer produces the final result for the task at hand, such as classification or regression.
multi-layer perceptron, from GeeksforGeeks

Key terms

Key terms you will encounter include weights, biases, activation functions, and loss/cost functions. Let’s break them down:

Consider a simple diagram of a neuron fed with inputs x_i (the values flowing into the neuron). Random weights are assigned to these inputs, so each input gets scaled up or down. The weighted inputs w_i·x_i are summed, and a bias (b) is added. The bias parameter is an offset, intended to shift the distribution of the data. The summation looks like this:

z = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b

This summation is represented by z. z is then passed through an activation function a, which introduces non-linearity, enabling the neuron to learn complex patterns in the data. A common choice is the sigmoid function:

a = 1 / (1 + e^(-z))
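To make this concrete, here is a minimal sketch of a single neuron in Python with NumPy; the input, weight, and bias values are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs and (randomly chosen) parameters.
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.8, 0.1, -0.4])   # weights w_i
b = 0.2                          # bias

z = np.dot(w, x) + b             # z = w_1*x_1 + w_2*x_2 + ... + b
a = sigmoid(z)                   # activated output of the neuron
print(z, a)
```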

A neural network contains several layers of the operation just described. A basic configuration of neurons connected in layers by edges is called a multi-layer perceptron (MLP).

How Is This Applicable to Learning?

Consider the MNIST dataset, which contains images of handwritten digits (0–9). The task is to train a model to recognize and predict the digit in a given image.

MNIST data

The output layer assigns probabilities to each digit based on how similar the input image is to each of the ten classes. If the input image is a 2, its pixel values are fed into the network, random weights scale each pixel, and the weighted sums plus biases are mapped through activation functions, layer by layer, until the output layer is reached. The nodes in the output layer then assign each class a probability based on how close it is to the input. The digit 2 should receive the highest probability, close to 1, while the other digits should receive lower probabilities.
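The post does not name the function that turns the output layer’s raw scores into probabilities; a common choice for multi-class problems like MNIST is softmax, so treat this sketch (with invented scores) as one way that final step could work:

```python
import numpy as np

def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # shift scores for numerical stability
    return exp / exp.sum()

# Hypothetical raw scores from the output layer for digits 0-9.
scores = np.array([0.1, 0.3, 4.2, 0.5, 0.2, 0.1, 0.4, 0.3, 1.1, 0.2])
probs = softmax(scores)
print(probs.argmax())   # 2 -> the network's predicted digit
print(probs[2])         # probability assigned to the digit 2
```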

In the initial iteration, the probabilities are likely to be inaccurate. Let’s call these initial probabilities the “disappointment probabilities” y^, as they differ from the true/expected probabilities y. The difference between these values (y - y^) gives us a loss.

To learn from the errors, we need a quantitative measure of the error value, which is provided by the loss function. A common loss function is the Root Mean Squared Error (RMSE). Several other loss functions exist depending on the specific use case.

The cost function aggregates the loss over the entire training set, giving a measure of how well the model is performing across all training examples.

This initial loss/cost is the starting point of the learning process. The idea is that if we can minimize the cost, we can get the estimated output (y^) to be as close to the expected output (y) as possible.
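As a small illustration, here is the loss for a single example using RMSE as mentioned above; the one-hot target represents the digit 2, and the prediction values are made up:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between expected and predicted outputs."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# One-hot expected output for the digit 2, vs. a poor first prediction.
y     = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y_hat = np.array([0.1, 0.2, 0.3, 0.1, 0.0, 0.1, 0.0, 0.1, 0.05, 0.05])
print(rmse(y, y_hat))   # the loss we want to drive toward zero
```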

To minimize this loss, we use gradient descent.

Gradient Descent (GD)

Gradient (m) = (y_2 - y_1) / (x_2 - x_1)

From high school math, this equation gives us the gradient m, indicating how steeply a function changes and in what direction. A positive gradient means the function is increasing, while a negative gradient means it is decreasing. In the context of climbing a hill, the gradient can be likened to the steepest path to the top (maximum point) or the bottom (minimum point).

In GD, the gradient of the loss is calculated, and the parameters (weights and biases) that contributed to the loss are adjusted in the direction opposite to the gradient, which is the direction that reduces the loss.

There is a parameter called the learning rate, which determines the size of the steps taken down the gradient.
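Here is a minimal sketch of that update rule on a toy one-parameter cost function, cost(w) = (w - 3)^2, chosen purely for illustration:

```python
# Gradient descent on a toy one-parameter cost, cost(w) = (w - 3)^2.
# Its gradient is 2 * (w - 3); the minimum is at w = 3.
learning_rate = 0.1
w = 0.0                             # arbitrary starting point

for step in range(50):
    grad = 2 * (w - 3)              # gradient of the cost at w
    w = w - learning_rate * grad    # step against the gradient
print(w)                            # ~3.0
```

With a much larger learning rate the steps can overshoot the minimum and diverge; with a much smaller one, convergence slows to a crawl.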

Problems with GD

  1. Gradient descent is a methodical (deterministic) and slow process. It follows a downward path to the lowest point, but this can be computationally expensive with large datasets. Its deterministic nature means the model can quickly learn a path to the lowest point for that dataset, and the lowest point in question may be false, i.e. a local minimum.
  2. Networks can also become overfitted to the training data, i.e. they can learn the exact path to the minimum point without generalizing well to new data. The network should learn the underlying patterns of the training set, because only then can it perform well on data that was not part of training. Sometimes a model performs very well on the training set but fails on the test set; this phenomenon is called overfitting.
  3. Additionally, gradient descent can get stuck in a local minimum, which is not the optimal solution across the entire parameter space. At a local minimum, the cost is low relative only to nearby parameter values, not over the entire range. We want the global minimum.

Stochastic Gradient Descent (SGD)

SGD addresses these issues by using a randomized approach. Instead of the entire dataset, smaller random batches of data are used to compute the gradient and update the parameters. This introduces variability into the descent, which helps the network avoid overfitting and escape local minima. Momentum, which accumulates a velocity across updates, can further improve SGD by helping the optimizer push past local minima and damping oscillations in the loss.
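Below is a rough sketch of SGD with momentum; the fake dataset, the toy gradient function, and the hyperparameter values are all invented for illustration (in a real network the gradient would come from backpropagation):

```python
import numpy as np

learning_rate = 0.01
beta = 0.9                      # momentum coefficient

w = np.zeros(5)                 # parameters to learn
velocity = np.zeros_like(w)     # running velocity for momentum

def grad_on_minibatch(w, batch):
    """Placeholder gradient: pulls w toward the batch mean (toy example)."""
    return 2 * (w - batch.mean(axis=0))

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(1000, 5))        # fake dataset

for epoch in range(5):
    rng.shuffle(data)                             # new random order each epoch
    for batch in np.array_split(data, 20):        # 20 random mini-batches
        g = grad_on_minibatch(w, batch)
        velocity = beta * velocity - learning_rate * g   # accumulate velocity
        w = w + velocity                                  # momentum step
print(w)    # drifts toward ~3.0, the minimum of the toy cost
```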

Backpropagation (BP)

Backpropagation ‘moves’ the derivatives of the loss from the output layer back toward the input layer using the chain rule; the resulting gradients are then used to adjust the weights and biases and improve the network. It involves two main stages:

  1. Forward Pass: The input data is passed through the network, layer by layer, to compute the output and the value of the cost function.
  2. Backward Pass: Involves error calculation, gradient calculation, and weight updates. The error is propagated from the output layer back toward the input layer.

Error Calculation: The error is computed from the difference between the predicted and expected outputs.
Gradient Calculation and Weight Update: The gradient of the loss is calculated from the error, and the weights are adjusted in the opposite (descent) direction of the gradient to reduce the error.

This process is iterated over multiple epochs until the network’s performance is optimized.

1 iteration: one mini-batch

1 epoch: one pass through all mini-batches (the full training set)

There is some mathematics involved in the backpropagation process, but in general, this is what happens in the BP phase of the learning process.
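To make the chain rule concrete, here is a minimal sketch of backpropagation through a single sigmoid neuron with a squared-error loss; all values are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -1.2]), 1.0    # input and expected output
w, b = np.array([0.8, 0.1]), 0.2     # parameters (arbitrary start)
lr = 0.5                             # learning rate

for step in range(100):
    # Forward pass: compute the prediction and the loss.
    z = np.dot(w, x) + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # Backward pass: chain rule, factor by factor.
    dloss_da = 2 * (a - y)           # d(loss)/da
    da_dz = a * (1 - a)              # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    dloss_dw = dloss_dz * x          # dz/dw_i = x_i
    dloss_db = dloss_dz              # dz/db = 1

    # Update: step against the gradient.
    w -= lr * dloss_dw
    b -= lr * dloss_db
print(loss)   # approaches 0 as the neuron learns to output y
```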

A few terms to think about

Gradient saturation: happens when the loss does not produce a significant gradient for certain neurons, so those neurons do not learn. The solution is to redesign the loss to ensure that all neurons contribute to the learning process.

Neuron saturation: happens when a neuron’s activation sits in the flat region of a bounded activation function (like the sigmoid), where the gradient is close to zero. These neurons will also not learn.
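A quick numerical illustration of why saturation stalls learning, using the sigmoid’s gradient a·(1 - a):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid's gradient a*(1-a) vanishes for large |z|: a saturated
# neuron receives almost no gradient, so it barely learns.
for z in [0.0, 2.0, 10.0]:
    a = sigmoid(z)
    print(z, a, a * (1 - a))   # gradient shrinks from 0.25 toward ~0
```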

This is generally how a neural network learns to perform a specified task given labeled input data.

Resources

  1. Machine and Deep Learning course, University of Cassino
  2. What is Backpropagation, IBM Technology
  3. How Neural Networks Learn, 3Blue1Brown
