Let's understand neural networks, both in theory and practically. We'll first discuss the theory behind how a neural network works, while also defining some of the confusing jargon that you may hear around this topic. Then we'll build a basic neural network library from scratch with only numpy (because without it, we'd spend most of our time trying to implement ndarrays). This resource is generally for people who have already seen neural networks to some extent but want to explore them at a deeper level. However, if you are interested and have a good mathematical background, you will likely be able to follow along.
You should be familiar with differential multivariable calculus and linear algebra, especially vector and matrix operations. For a quick refresher on these concepts, you can check out Paul's Online Notes and his lesson on matrices and vectors.
Proficiency with programming is also important for the practical implementations and background. The implementations are done in Python, but experience with other programming languages is fine.
We can start by reviewing the basic operations of vectors and matrices. You should already be familiar with these operations, but they are here as a reference if you need it, as we will be rewriting a lot of expressions later on.
Let $A$ and $B$ be matrices of the same shape. The Hadamard (element-wise) product of two matrices is defined as:

$$(A \odot B)_{ij} = A_{ij} B_{ij}$$
The matrix-matrix product of two matrices is defined such that, for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB \in \mathbb{R}^{m \times p}$ has entries:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Similarly, a matrix-vector product is the dot product of the vector with each row vector of the matrix. In other words, $(Av)_i = \sum_{j} A_{ij} v_j$.
Note that matrix multiplication does not commute.
Some less conventional notation that is often used in machine learning is the addition of matrices and vectors, where the smaller object is "filled" (broadcast) to allow for element-wise operations. For example, in $X + b$, where $X$ is a matrix and $b$ is a vector, $b$ is repeated across the rows of $X$ so that the shapes match.
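As a quick illustration (an example of ours, not a formal convention), this is exactly what numpy's broadcasting does:

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # shape (3, 2)
b = np.array([10.0, 20.0])  # shape (2,)

# b is broadcast (repeated) across each row of X
print(X + b)
# [[11. 22.]
#  [13. 24.]
#  [15. 26.]]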
Remember that the gradient vector of a scalar-valued function $f(x_1, \dots, x_n)$ is $\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$, and that it points in the direction of the function's steepest ascent. This is an important detail that will become very relevant to neural networks later on.
Assume that the output of our learning model is given by a function $\hat{y} = f(x, w)$ of an input $x$ and a parameter $w$. We also have another function $L(\hat{y})$ that depends on the model's output. Note that $L$ depends on $w$ only indirectly, through $\hat{y}$, so we can't differentiate it with respect to $w$ directly. To perform this computation, we need to use the chain rule of multiple variables. Formally written, if $z = f(y_1, \dots, y_m)$ and each $y_i$ is a function of $t$, then:

$$\frac{dz}{dt} = \sum_{i=1}^{m} \frac{\partial z}{\partial y_i} \frac{dy_i}{dt}$$

For proofs of these rules, see the Wikipedia page on the multivariable chain rule. To apply the rule to our case, the value of $\frac{dL}{dw}$ is found by multiplying the derivative of $L$ with respect to each intermediate quantity by the derivative of that quantity with respect to $w$, and summing the contributions.
This computation that we've just done will prove useful in the future, as you'll find out soon.
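To make the rule concrete, here is a small numeric sanity check (an illustrative example of ours): we differentiate $z = y_1 y_2$ with $y_1 = t^2$ and $y_2 = \sin t$ using the multivariable chain rule, and compare against a finite-difference estimate.

import numpy as np

def z_of_t(t):
    y1 = t ** 2
    y2 = np.sin(t)
    return y1 * y2

t = 1.3

# Chain rule: dz/dt = (dz/dy1) * (dy1/dt) + (dz/dy2) * (dy2/dt)
y1, y2 = t ** 2, np.sin(t)
dz_dt = y2 * (2 * t) + y1 * np.cos(t)

# Finite-difference check
h = 1e-6
approx = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)

print(dz_dt, approx)  # both ~2.9573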
Neural networks, modeled broadly around the concept of a human brain, can be thought of as really good function approximators.
They are often considered "black box" algorithms because, although they can approximate functions with relatively high accuracy, it can be challenging to draw patterns and relationships between data by studying the outputs and parameters of the algorithm. This will become more apparent as you begin to understand how these black boxes work.
Machine learning models generally need to be trained, or fit, to some data. The data used to train the model is called the training data, and the data used to test the model is called the testing data.
We can stay away from the concept of a "neural network" for a bit and look at something that might be more familiar: linear regression.
Say we have a matrix of data $X \in \mathbb{R}^{N \times d}$, where each of the $N$ rows is a data sample with $d$ features. Generally, the data samples are indexed using a superscript, so $x^{(i)}$ denotes the $i$-th sample. The prediction of a linear regression model for a single sample is then $\hat{y} = w \cdot x + b$, where $w$ is a vector of weights and $b$ is a scalar bias. It is usually computationally (and mathematically) convenient to split our data into smaller partitions called batches. We train the model using each batch rather than each individual data sample. If we do this, we can rewrite the equation from before using a matrix-vector product, where $X$ is now a batch of $n$ samples:

$$\hat{y} = X w + b$$
Looking at the new equation, we can see that our prediction is now a vector, and the bias is also a vector. This is because we add one bias term and return a prediction for each data sample in the batch. Logically, for a batch of $n$ samples, $\hat{y} \in \mathbb{R}^n$. Note that $b \in \mathbb{R}^n$ here is a vector of $n$ copies of the same scalar bias from above (the broadcasting notation from the preliminaries). In other words, each entry of the prediction is $\hat{y}_i = w \cdot x^{(i)} + b$.
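A minimal numpy sketch of this batched linear model (the dimensions and names are our own):

import numpy as np

N, d = 4, 3                      # batch size, number of features
X = np.random.randn(N, d)        # a batch of data samples
w = np.random.randn(d)           # weight vector
b = 0.5                          # bias, broadcast across the batch

y_hat = X @ w + b                # one prediction per sample
print(y_hat.shape)               # (4,)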
Surprise! We've made a basic neural network (kind of). We can call the model above a linear neural network. This will make more sense once we discuss the architecture of a neural network (which just so happens to be the topic of the next section).
An artificial neural network (ANN) is a machine learning algorithm that uses a set of connected nodes (called neurons) arranged in groups called layers.
Usually, nodes are connected to each other with each connection holding some quantity called a weight. Each node can also have a bias term, which is essentially the same as the bias $b$ from our regression model.
To make more sense of this, let's view the regression model that we observed in the last section as a neural network.
Above, we have a basic neural network. Each column of neurons (nodes) is called a layer.
The first layer is called the input layer, and it is the only layer that does not have inputs from a previous layer. The last layer is called the output layer, and it is the only layer that does not output to another layer. The layers in the middle are called hidden layers (in this example we only have one hidden layer).
In this network, we have an input layer that connects directly to an output layer, with a weight on each connection. By the time our input reaches the output layer, it has been scaled by the weights of the connections and shifted by the biases of the output neurons.
The output of a single neuron in a neural network is going to be the following, where $w_i$ are the weights of the incoming connections, $x_i$ are the outputs of the neurons in the previous layer, and $b$ is the neuron's bias:

$$y = \sum_i w_i x_i + b$$

The only exception is the output of the neurons in the input layer, which output the raw input values unchanged.
Notice that this expression is the same as the output of our regression model. The neural network structured above is just the regression model represented using neurons and layers.
Let's look at an example where we have a hidden layer.
We now have many more weights, with a total of two layers that transform the input (not counting the input layer, which passes the input through unchanged). To manage these transformations, let's count the number of connections we have. From the input layer to the hidden layer, there is one connection (and therefore one weight) for every pair of input and hidden neurons, and likewise from the hidden layer to the output layer. Each group of weights can be viewed as independent from the others, as each one applies a separate transformation to the output of the previous layer.
Let's first generalize the equation for the output of a single neuron to multiple layers, where $a^{(l)}$ is the (row) vector of outputs of layer $l$, $W^{(l)}$ is the matrix of weights connecting layer $l-1$ to layer $l$ (entry $W^{(l)}_{ij}$ is the weight from neuron $i$ to neuron $j$), and $b^{(l)}$ is the vector of biases of layer $l$. By definition, we can write the output of the $l$-th layer as:

$$a^{(l)} = a^{(l-1)} W^{(l)} + b^{(l)}$$

To extend this to batches of data, we can use a matrix $A^{(l)}$ whose rows are the layer-$l$ outputs for each sample in the batch:

$$A^{(l)} = A^{(l-1)} W^{(l)} + b^{(l)}$$

Another way to think of that second equation is that every row (every sample) undergoes the same transformation independently, with the bias vector broadcast across the rows.
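To make the shapes concrete, here is a quick numpy sketch of this layered computation (the dimensions and names are our own):

import numpy as np

batch, n_in, n_hidden, n_out = 4, 3, 5, 2
A0 = np.random.randn(batch, n_in)      # batch of inputs
W1 = np.random.randn(n_in, n_hidden)   # input -> hidden weights
b1 = np.random.randn(n_hidden)
W2 = np.random.randn(n_hidden, n_out)  # hidden -> output weights
b2 = np.random.randn(n_out)

A1 = A0 @ W1 + b1                      # hidden layer output
A2 = A1 @ W2 + b2                      # network output
print(A2.shape)                        # (4, 2)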
This type of neural network, where we go forward in only one direction (as opposed to allowing connections to neurons of previous layers), is called a feedforward neural network (FFNN). We only talk about FFNNs in this post. The term "multilayer perceptron" is also used loosely to refer to FFNNs.
A deep neural network is one with multiple layers contained between the input and output layers.
The network above has three hidden layers. The weights for each group of connections between consecutive layers can be represented as matrices $W^{(1)}, W^{(2)}, W^{(3)}, \dots$, with bias vectors $b^{(1)}, b^{(2)}, b^{(3)}, \dots$. Based on the formula above, we can say that the output of our network (writing out three layers for brevity) is:

$$\hat{Y} = \left(\left(X W^{(1)} + b^{(1)}\right) W^{(2)} + b^{(2)}\right) W^{(3)} + b^{(3)}$$

Substituting the values of each layer's output into the next and expanding, we get:

$$\hat{Y} = X W^{(1)} W^{(2)} W^{(3)} + \left(b^{(1)} W^{(2)} + b^{(2)}\right) W^{(3)} + b^{(3)}$$

Now that we're here, note that in terms of the mathematics, we don't actually care what the value of any intermediate layer's output is; only the overall products matter. With that said, let's look at the shapes of the outputs here: the product $W' = W^{(1)} W^{(2)} W^{(3)}$ has the shape of a single weight matrix mapping the input features directly to the output neurons, and $b' = \left(b^{(1)} W^{(2)} + b^{(2)}\right) W^{(3)} + b^{(3)}$ has the shape of a single bias vector. If we look more closely at this, we can see that the network can be rewritten back as a single-layered neural network, but now with $\hat{Y} = X W' + b'$.
So, what's the point in having multiple layers? The answer is, there is no point! For linear models, having multiple layers is not very useful. So now we have two problems: our model's output is linear, and we can't actually make use of the concept of multiple layers.
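We can verify this collapse numerically with a quick sketch (our own example):

import numpy as np

x = np.random.randn(4, 3)
W1, b1 = np.random.randn(3, 5), np.random.randn(5)
W2, b2 = np.random.randn(5, 2), np.random.randn(2)

# Two stacked linear layers...
deep = (x @ W1 + b1) @ W2 + b2

# ...are equivalent to one linear layer with combined parameters
W_prime = W1 @ W2
b_prime = b1 @ W2 + b2
shallow = x @ W_prime + b_prime

print(np.allclose(deep, shallow))  # True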
Now that we've established that multiple layers are useless for linear models, let's try to solve that---and escape the trap of linearity at the same time.
The most straightforward way to tackle the first problem is to simply introduce some type of non-linearity to the outputs. To make use of the multiple layers at the same time, we can introduce a nonlinear transformation at each layer output. This transformation is called an activation function.
The activation function is some function $\sigma$ that is applied element-wise to the output of a layer. Modifying the rest appropriately, we get (note that $\sigma$ may be a different function at each layer):

$$A^{(l)} = \sigma\left(A^{(l-1)} W^{(l)} + b^{(l)}\right)$$
So now, if we were to do the same expansion from before but with these nonlinearities, we get:

$$\hat{Y} = \sigma_3\left(\sigma_2\left(\sigma_1\left(X W^{(1)} + b^{(1)}\right) W^{(2)} + b^{(2)}\right) W^{(3)} + b^{(3)}\right)$$

Notice that we cannot rewrite this as a single linear expression, because each layer's entire output is transformed by an arbitrary nonlinear function before it is passed to the next layer.
Let's look at some common activation functions. The graphs are from PyTorch's documentation.
ReLU is generally the most common activation function, as it provides just enough nonlinearity to be useful, allows for quick learning, and is mathematically convenient. The output is:

$$\mathrm{ReLU}(x) = \max(0, x)$$

The graph follows directly: it is the identity line $y = x$, cut off at $x = 0$, with all negative inputs mapped to $0$.
The output of a sigmoid function is restricted between $0$ and $1$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The graph is an S-shaped curve that saturates at $0$ for large negative inputs and at $1$ for large positive inputs.
The tanh activation function is the hyperbolic tangent function, meaning that its values are restricted between $-1$ and $1$:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
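As a preview of the implementation section, these can be written in a few lines of numpy (a sketch of ours, not the library code yet):

import numpy as np

def relu(x):
    # Zero out all negative entries
    return np.maximum(0, x)

def sigmoid(x):
    # Squash values into (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # numpy ships a tanh; values lie in (-1, 1)
    return np.tanh(x)

print(relu(np.array([-2.0, 3.0])))  # [0. 3.]
print(sigmoid(np.array([0.0])))     # [0.5]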
We'll explore these activation functions (and their derivatives) more in the next section.
So far, we've talked about these arbitrary objects called weights and biases. Although they were defined vaguely, it's likely that what they are doesn't completely make sense yet. To define them, we need to discuss how the learning in neural networks happens.
Initially, these parameters are randomized by some algorithm. It turns out that the choice of initial values for our weights and biases is very nontrivial. There are different methods of initialization that will be discussed later.
Learning in machine learning generally involves the tuning of some parameters such that the prediction gets closer and closer to the actual value. Likewise, neural networks learn by tuning internal parameters, called weights and biases, so that the predictions become as accurate as possible. The actual process of how that is done is detailed in the next section.
There are also parameters that are generally not optimized within the model (they are still optimized, just in different ways). These are called hyperparameters. These are separate from the model parameters as the model parameters can be inferred directly from the data input. Some examples of hyperparameters in neural networks include the number of layers, the output dimensionalities of the layers, and the activation functions that are utilized.
Now that we have defined the structure of nonlinear and linear neural networks, we can put it all together into something called the forward pass. For each pass, we send in a batch of data and propagate it through every layer, applying the weights, biases, and activation functions, until we receive a prediction.
Everything we've talked about so far---that is, the computation of the output of the neural network transformed by some weights and activation functions---is sometimes referred to as the forward pass (because we are going forward through each layer until we reach the output layer).
However, just giving us a prediction won't get us too far, because then our neural network would just be guessing. Let's explore how our neural network can learn the appropriate weights and biases for some data.
To learn anything, we also need to know what we are doing wrong. Neural networks learn from their mistakes, but this is only possible if we know how "much of a mistake" we are making. This means that we need some way to tell how inaccurate our model is so that we can figure out what to do and when to stop.
A loss function quantifies this error, meaning that we have a scalar-valued function that returns a quantity representing the inaccuracy of our model. There are various loss functions that are used in neural networks, but we will only be discussing the squared error loss function. It is defined as:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the target value for sample $i$ and $\hat{y}_i$ is the model's prediction.
Now, all that's left (and the fun part) is to minimize the inaccuracy (therefore maximizing the accuracy) of the model.
We have a scalar-valued function that returns the loss of the model given some weights and biases. If we can figure out the point at the global minimum of the loss function, we can find the weights that result in the lowest loss (resulting in the highest accuracy).
The idea is that we can use optimization methods to find this global minimum. Ideally, we would look for some type of analytical solution, but the functions outputted by neural networks generally don't allow for closed form solutions to finding the extremum.
For its computational efficiency and effectiveness, this optimization is done through an iterative process called gradient descent.
If you need a refresher on some of the concepts of calculus, look at the preliminaries section.
We first start with some initial value of the weights $w_0$ (more on how to choose it later). Assume that we already have the gradient of the loss with respect to the weights, $\nabla_w L$ (the upcoming sections explain how to compute it). Now, we set some arbitrary positive learning rate denoted $\eta$. For each iteration $t$ of the algorithm, we perform the update:

$$w_{t+1} = w_t - \eta \nabla_w L(w_t)$$
Intuitively, we are taking a step in the opposite (negative) direction of the gradient for each iteration of the algorithm. Since the gradient points in the direction of the greatest increase, we are going to go in the opposite direction to travel towards the greatest decrease. If we keep going in this direction, we expect to eventually end up at a minimum.
The learning rate $\eta$ controls the size of the step we take on each iteration.
It is important to note that in GD we must perform the forward pass for every single sample in the dataset to make a single update to the parameters. This will prove to be extremely inefficient for larger datasets.
Note that we update the biases in a similar fashion, using the gradient of the loss with respect to the biases: $b_{t+1} = b_t - \eta \nabla_b L(b_t)$.
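Here is gradient descent in action on a toy problem (our own illustrative example): minimizing $f(w) = (w - 3)^2$, whose gradient is $f'(w) = 2(w - 3)$.

def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2 * (w - 3)

w = 0.0    # initial value
eta = 0.1  # learning rate

for t in range(100):
    w = w - eta * grad(w)  # step against the gradient

print(w)  # ~3.0, the minimum of f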
There are other variants of gradient descent that are used in deep learning. Usually, libraries provide multiple "optimizers" (optimization algorithms) as certain algorithms perform better in certain scenarios. In this post, we'll only talk about one of those algorithms. Vanilla gradient descent (discussed in the previous section) is very uncommon for training neural networks.
Both gradient descent (GD) and stochastic gradient descent (SGD) are very similar. The difference in SGD is that we go forward for only a subset of data in the dataset instead of the entire dataset.
By doing this, we also get much noisier loss landscapes, meaning that we might be less likely to get stuck in a local minimum, as the noise sometimes causes the gradient to stray away from local minima (though it can also stray from the global minimum).
Another technique, varying the learning rate, is often used to control the convergence of the algorithm. If we do not vary the learning rate, it can become very difficult to reach convergence at an optimal point.
It is especially important in SGD because of the noisiness of the objective function. In the gradient descent example, we had a constant learning rate for every iteration. However, we can also decay the learning rate over time, for example exponentially:

$$\eta_t = \eta_0 e^{-kt}$$

In the exponential decay, $k$ is known as the decay constant. It controls the rate at which the learning rate decays.
If we decay the learning rate too quickly, convergence time will increase significantly. However, if we don't decay enough, we risk never settling at the optimal point because of all the noise.
In addition, the effectiveness of the learning rate can differ between dimensions of the function. When the gradients are consistently small in one dimension but large in another, a learning rate that works well in one dimension may work poorly in the other.
To compensate for this, we introduce the concept of momentum. Momentum allows us to fix this problem by taking into account the values of the previous gradients when taking the next step. More formally, we keep a running "velocity" that accumulates past gradients:

$$v_{t+1} = \beta v_t + \nabla_w L(w_t), \qquad w_{t+1} = w_t - \eta\, v_{t+1}$$

With some substitution, we can rewrite this as a single update rule:

$$w_{t+1} = w_t - \eta \sum_{\tau = 0}^{t} \beta^{\,t - \tau}\, \nabla_w L(w_\tau)$$

We also have a variable $\beta \in [0, 1)$, the momentum coefficient, which controls how strongly the past gradients influence the current step.
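A minimal sketch of this update rule on the toy quadratic from the gradient descent example (the constants are our own):

def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2 * (w - 3)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9  # learning rate, momentum coefficient

for t in range(300):
    v = beta * v + grad(w)  # accumulate past gradients
    w = w - eta * v         # step using the velocity

print(w)  # ~3.0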
In some cases, fitting our neural network to a dataset may result in something called overfitting. This happens when the neural network gets so "used to" the training data that the model won't give as accurate predictions when it encounters new, slightly different data.
The idea behind regularization is to solve this problem by putting a penalty on the weights in our loss function. Our previous loss function can be restated in terms of the target vector $y$ and the prediction vector $\hat{y}$:

$$L = \frac{1}{N} \left\| y - \hat{y} \right\|_2^2$$

With regularization, we take the sum of this loss function and the norm of our weight vector, such that our loss becomes:

$$L_{\text{reg}} = L + \lambda \left\| w \right\|_2^2$$

Note that in $\left\| w \right\|_2^2$ we square the norm, which keeps the penalty differentiable and mathematically convenient. By doing this, we expect that if our weights get too big, our optimizer will eventually try to minimize the norm of the weights instead of just the loss. The parameter $\lambda$ controls how strongly the penalty is weighted against the original loss.
Assuming we are using SGD as our optimizer and the squared $L_2$ penalty from above, the weight update gains an extra term: the gradient of $\lambda \left\| w \right\|_2^2$ with respect to $w$ is $2 \lambda w$, so each step also shrinks the weights slightly toward zero.
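As a small sketch (our own), here is the regularized SGD update in numpy, with a placeholder gradient for the data loss:

import numpy as np

def loss_grad(w):
    # Placeholder for the gradient of the data loss; assume it's given.
    # This toy loss is minimized at w = 1 in every coordinate.
    return 2 * (w - 1.0)

w = np.array([5.0, -3.0])
eta, lam = 0.1, 0.01  # learning rate, regularization strength

for t in range(200):
    g = loss_grad(w) + 2 * lam * w  # add the gradient of the L2 penalty
    w = w - eta * g

print(w)  # both entries ~0.99: pulled to the optimum, slightly shrunk toward 0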
To compute the derivatives, namely the gradients of the loss with respect to every weight and bias, we use a technique called automatic differentiation (autodiff).
There are two "variants" of automatic differentiation, each more efficient for a certain use case. For some function
To understand the concept of reverse autodiff, we need to go back to the multivariable chain rule that we talked about in the beginning of this post.
For some function built by composing elementary operations, the chain rule lets us express the derivative of the final output with respect to any variable in terms of the derivatives of the operations along the way, summed over every path through which that variable affects the output.
For some intuition before getting into the mathematics, we are going to split the computation of our gradient into two parts. First, we will go forward and record the operations that were done from our starting point in what is called a gradient tape. Then, we will go in backwards order and apply the derivative equivalent of the operation on the appropriate parameters, updating the gradients in accordance with the differentiation rules.
Say we have a function built from a handful of elementary operations. We first rewrite it in terms of abstracted compositions of expressions, introducing one intermediate variable for each operation. Now, we compute the value of each intermediate variable in a forward sweep, recording every operation on the gradient tape as we go. Using the chain rule, we can write the derivative of the output with respect to each variable in terms of the derivatives of the variables that directly depend on it.
Now, let's go step by step for the reverse-mode differentiation:
Notice how we went step by step in the backwards direction. If we program the basic derivative rules, we can see that adding a number passes the incoming gradient through unchanged (a derivative of $1$), while multiplying by a number scales the incoming gradient by that number. By generalizing these rules and chaining them together, we are able to compute the derivatives of the output with respect to every input and intermediate variable. The final result from our computation is exactly the gradient we set out to find.
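Here is a tiny self-contained illustration of the two sweeps (the function $f(x, y) = (x + 2)\,y$ and all names are our own, standing in for the original example):

# Forward sweep: compute values and record the operations
x, y = 3.0, 4.0

a = x + 2  # intermediate: a = x + 2
b = a * y  # output:       b = a * y

# Backward sweep: start with db/db = 1 and apply derivative rules in reverse
db = 1.0
da = y * db    # d(a * y)/da = y
dy = a * db    # d(a * y)/dy = a
dx = 1.0 * da  # d(x + 2)/dx = 1

print(dx, dy)  # 4.0 (= y) and 5.0 (= x + 2)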
In neural networks, the algorithm that computes the derivatives of the weight and bias objects is called backpropagation. Backpropagation uses reverse-mode automatic differentiation to compute the gradients of the weights in the same way that is described above.
In case you need clarification, the functions produced by neural networks are not like the simple example function shown above. They are made up of function compositions (activation functions) and matrix multiplications (the output of each layer).
Generally, the derivatives of activation functions are hard-coded in deep learning libraries. Let's discuss these derivatives here so that we can use them later in our implementations:

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}, \qquad \sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right), \qquad \tanh'(x) = 1 - \tanh^2(x)$$

Notice that the derivative of ReLU is undefined at $x = 0$; in practice, implementations conventionally just define it to be $0$ there.
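In numpy, these derivatives might look like the following sketch (our own, mirroring the activation functions from before):

import numpy as np

def relu_grad(x):
    # 1 where x > 0, else 0 (we choose 0 at x = 0)
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

print(relu_grad(np.array([-1.0, 0.0, 2.0])))  # [0. 0. 1.]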
Now we have a set of tools for learning: a way to compute derivatives, some loss functions, and a number of methods for optimizing this loss function. Let's put this all together and create a concrete process that we can use to learn parameters in the network.
Let's say that at this point we have sent a batch of data through the network and received a prediction for every sample in the batch.
Now, we quantify the error using a loss function; this is our objective function that will be minimized using our optimizer. For this single batch, we take one step with our optimizer, and we are done with a full forward and backward pass of our neural network.
This process of forward pass, calculate loss, compute gradients, and update parameters is looped for every batch in our dataset, and we take an optimizer step for every batch (every time we go forward). We repeat this process for some number of epochs, where one epoch is a full pass over the entire dataset.
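Putting it together as a runnable sketch (everything here, the toy data and the plain linear model, is our own stand-in for the real library pieces we build later):

import numpy as np

# Toy stand-ins: a linear model y = X @ w trained on made-up data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=(32,))
w = np.zeros(3)
eta, num_epochs, batch_size = 0.1, 20, 8

for epoch in range(num_epochs):
    for i in range(0, len(X), batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        y_hat = Xb @ w                              # forward pass
        loss = np.mean((yb - y_hat) ** 2)           # quantify the error
        grad = -2 * Xb.T @ (yb - y_hat) / len(Xb)   # compute gradients
        w = w - eta * grad                          # update parameters

print(loss)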
This concludes the discussion of neural network theory. The next section deals with the technicalities that arise when we want to implement neural networks in program code.
Let's take a step away from mathematics and towards engineering. The implementation of neural networks comes with some complexities and issues that need to be attended to in order to attain the most desirable results.
Throughout the previous sections, we've talked about weights and biases. However, we only left these as arbitrary variables, because that is fine in mathematics. But, if we want to actually implement neural nets, it's important that we deal with real numbers, not variables.
When we say model parameters in the context of neural networks, we often mean the weights and biases of each layer. We already know from the previous section that the end values of our weights and biases will be the optimal weights and biases to minimize the loss for the given data. However, when starting out, how do we know what weights and biases to use?
A naive approach would be to initialize every weight to the same constant. However, we are more likely to get stuck in a local minimum if all of the weights are the same. In other words, all of the neurons will learn the same things, and the network will be unable to generalize like we want it to.
In an attempt to optimize the initialization process, parameters are generally initialized using a method called Kaiming initialization. We sample each weight from a normal distribution with mean $0$ and variance $2 / n_{\text{in}}$, where $n_{\text{in}}$ is the number of input neurons of the layer.
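In numpy, sampling one Kaiming-initialized weight matrix looks like this (a small sketch; we reuse the same expression in the library later):

import numpy as np

n_in, n_out = 64, 32
# Normal distribution with mean 0 and standard deviation sqrt(2 / n_in)
W = np.random.normal(0, (2 / n_in) ** 0.5, (n_in, n_out))
print(W.std())  # ~0.177, i.e. sqrt(2 / 64)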
Throughout the math for both the forward and backwards pass, we've talked about matrices and vectors. In the context of deep learning, these are simply convenient methods of representing data and computations involving them.
As you go deeper into your study of neural networks, you may find that we might need more than just a list, or a list of lists to store data. Thus, for mathematical convenience, we use tensors to represent data.
For context, a rank-$n$ tensor is an object with $n$ indices: a rank-$0$ tensor is a scalar, a rank-$1$ tensor is a vector, and a rank-$2$ tensor is a matrix.
Moving forward in our implementations, we will use tensors to represent our data, weights, and biases.
Okay, so we now know how neural nets work mathematically and in theory. We also have some handy tricks for making neural networks easier to work with computationally. Now, it's time to actually make basic implementations of our theory. In this section, we'll be making a very basic neural network library.
One interesting exercise might be to follow along without referring to the code in the GitHub page. This will force you to write a large chunk of the automatic differentiation API by yourself.
Note that implementing neural networks from scratch should, in most cases, never be done for practical use. There are existing libraries like PyTorch or Keras that allow for the same functionality and more in an efficient and stable manner.
We'll make a very basic tensor class that stores a numpy ndarray as its elements. First, let's import numpy.
import numpy as np
Next, we can create our tensor class.
class Tensor:
The constructor simply stores the elements of the tensor as an ndarray and initializes the gradient objects (for autodiff). For autodiff, we need to store a gradient "tape" that keeps track of all operations, and we also need to initialize our gradients to zero. Note that we usually work with floating point numbers (float32 or float64) in deep learning; we only allow float Tensors for autodifferentiation.
def __init__(self, els, requires_grad=False, dtype=None, tape=None):
    # Store the elements of our tensor in a numpy ndarray
    self.els = np.array(els, dtype=dtype)
    # Decide whether or not we are computing gradients
    self.requires_grad = requires_grad
    # Initialize an empty gradient tape for autodiff
    self.tape = tape or []
    # If the dtype is not "float", raise an error
    if not self.els.dtype == float and requires_grad:
        raise TypeError("Autodiff is only allowed for tensors of type \"float\"!")
    # Otherwise, initialize our gradient
    if requires_grad:
        self.zero_grad()
    else:
        self.grad = None
Note that our zero_grad function is still not defined. The point of this function is exactly what its name says---it initializes the gradient to an ndarray of zeros. The reason we are making this a separate function is that we'll need it later for SGD.
def zero_grad(self):
    self.grad = np.zeros_like(self.els)
We also need a way to make our Tensor subscriptable, meaning that indexing it returns the corresponding items of the underlying ndarray. For convenience, we'll also make sure that we return the els attribute of our Tensor object every time a user tries to print it.
def __getitem__(self, n):
    return self.els[n]

def __repr__(self):
    return self.els.__str__()
And we are now done with the base of our Tensor class.
Next, let's implement autodiff. We will overload the multiplication, addition, subtraction, and division operations in the Tensor class.
Before we do that, however, it's important to generalize all of our operations. Essentially, we'll build an API for operations that have derivatives so that we can make our lives easier when actually implementing the operations. Our API takes in the following:
- other - The other operand (other than self)
- operation - A function that represents the operation that is to be performed
- derivative_a - The function for the derivative of the first operand
- derivative_b - The function for the derivative of the second operand

Using this, we can put together a function definition.
def __op(self, other, operation, derivative_a, derivative_b=None):
Note that we are not always guaranteed to have another Tensor as the other argument. An easy workaround to this would be to just convert it into one:
# Turn the other argument into a tensor if it isn't already one
if not isinstance(other, Tensor):
    other = Tensor([other], dtype=self.els.dtype, requires_grad=False)
We also need to decide whether or not we want to compute gradients for this new output tensor. Generally, we want to if either one of our operands requires grad.
# If either one of our tensors requires grad, then the output tensor also requires grad
requires_grad = self.requires_grad or other.requires_grad
Let's go ahead and compute the output. Note that we need to append the tapes of both this Tensor and the other Tensor to make sure that operations aren't left out when we chain them. If the gradient tape is out of order or has missing operations, our multivariable chain rule will not work.
Generally, this is taken care of by constructing a directed acyclic graph (DAG) and going backwards in topological order, but that is beyond the scope of our neural network library.
# Find the output tensor by executing the given operation
out = Tensor(operation(self.els, other.els), requires_grad=requires_grad, tape=(other.tape + self.tape))
Finally, we add a function that computes the derivatives during the backwards pass. To prevent problems going forward, we'll also make sure to preserve the shapes of the outputs by summing the values on axis 0 whenever broadcasting has expanded a gradient's shape:
def tape_action():
    # Execute the derivative functions for the operation if the gradients exist
    if self.grad is not None:
        self.grad = self.grad + derivative_a(self.els, other.els, out.grad)
        if self.grad.shape != self.els.shape:
            self.grad = np.add.reduce(self.grad)
    if other.grad is not None and derivative_b is not None:
        other.grad = other.grad + derivative_b(self.els, other.els, out.grad)
        if other.grad.shape != other.els.shape:
            other.grad = np.add.reduce(other.grad)

out.tape.append(tape_action)

# Return the outputted value
return out
And the "API" is done.
Moving on to the actual operations, only a few are implemented here. For the full code, head over to the GitHub page.
Remember that our API takes in three functions:

- operation(a, b) - The operation that is being performed, where a and b are the elements of the operands.
- derivative_a(a, b, dc) - The derivative of a that is added to a.grad as per the chain rule. a and b are the elements of the operands and dc is the gradient of the output.
- derivative_b(a, b, dc) - The derivative of b that is added to b.grad as described above.
Using this, let's override the __add__ method of our Tensor object:
def __add__(self, other):
    return self.__op(other,
                     lambda a, b: a + b,
                     lambda a, b, dc: dc,
                     lambda a, b, dc: dc)
Here, we have an anonymous function lambda a, b: a + b that reads, "the sum of a and b is a + b". The other two functions are a bit less obvious. Since the derivative of a + b with respect to either operand is 1, the chain rule tells us to add dc (the gradient of the output) to the gradient of a and to the gradient of b for every addition operation.
The concepts are the same for subtraction, multiplication, etc.
def __sub__(self, other):
    return self.__op(other,
                     lambda a, b: a - b,
                     lambda a, b, dc: dc,
                     lambda a, b, dc: -dc)

def __mul__(self, other):
    return self.__op(other,
                     lambda a, b: a * b,
                     lambda a, b, dc: b * dc,
                     lambda a, b, dc: a * dc)

# ... See the GitHub page.
If you are interested in implementing autodiff yourself, the operations that need to be supported include addition, subtraction, multiplication, division, exponentiation, matrix multiplication, and summation (similar to np.sum()). Remember that the reverse operation functions, __radd__, __rsub__, etc., must also be implemented.
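For example, the reflected operators can usually be expressed in terms of the forward ones (a sketch of ours; the GitHub implementation may differ):

def __radd__(self, other):
    # other + self: addition commutes, so reuse __add__
    return self.__add__(other)

def __rsub__(self, other):
    # other - self == -(self - other)
    return self.__sub__(other) * -1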
We now have completed most of what it takes to build neural networks. Implementing MSE is relatively straightforward. Recall that MSE is:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$
In our library, we will return a function that can be called to compute the loss, so that we can add additional options to the outer function (such as reduction, specifying whether we want to take the average or only the sum).
def MSE():
    # Compute the mean squared error loss given a prediction (yhat) and a target (y)
    def compute_loss(yhat, y):
        N = np.prod(y.els.shape, dtype=float)
        return ((y - yhat) ** 2).sum() / N
    return compute_loss
That's it for loss functions!
Let's talk about how neural networks are structured. Anything that we create should preferably be reusable for many different purposes.
In deep learning, we want to generalize the concept of a neural network to each of its components, such as activation functions, or layers. The easiest way to do this is to extract what all of these have in common: some inputs and some outputs. Even a single layer in a neural network is just an input-output gate. In fact, our entire neural network is an input-output gate.
To make use of this notion, we introduce the concept of blocks, or modules. A module is simply an object that holds some parameters, takes in an input, and transforms it to create an output. Modules can be put together to form layers, groups of layers, or neural networks.
Let's implement this, starting off with a base class.
class Module:
    def __init__(self, *args):
        # Initialize module with given arguments as children
        self.__children = args or []

    def forward(self, x):
        # Call .forward() on all of the children
        for child in self.__children:
            x = child.forward(x)
        return x
We also need a way to keep track of block parameters. To identify parameters, we can create a very basic class:
class Parameter:
    def __init__(self, data, requires_grad=True):
        # Set data of parameter
        self.data = Tensor(data, requires_grad, float)
This will just be a convenient way to keep track of module parameters. It also helps to have a convenient way to get all of the parameters in a module:
# Note: This is a method in the Module class
def parameters(self):
    # Fetch all of the parameters by going through each child
    parameters = []
    for child in self.__children:
        for (key, value) in child.__dict__.items():
            if isinstance(value, Parameter):
                parameters.append(value)
    return parameters
And that's it, we've successfully implemented modules.
Using our modules, it helps to give users some starter layers so that they can build their own neural networks.
Starting off with the linear module, we initialize the weights with Kaiming initialization. We then override the forward function with the output computation of a linear layer (see the sketch after the constructor).
# Just a basic linear layer with an optional bias node
class Linear(Module):
    def __init__(self, input_neurons, output_neurons, bias=True):
        # Kaiming initialization: normal with standard deviation sqrt(2 / input_neurons)
        self.weights = Parameter(np.random.normal(0, (2 / input_neurons) ** 0.5, (input_neurons, output_neurons)))
        self.bias = bias
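The forward computation itself isn't shown in this excerpt. Under the conventions above it might look like the following sketch; it assumes a matrix-multiplication overload on Tensor and a bias Parameter (hypothetically named self.biases here), so refer to the GitHub page for the actual implementation:

def forward(self, x):
    # Compute x @ W (+ b) using the Tensor operator overloads
    out = x @ self.weights.data
    if self.bias:
        # Hypothetical: assumes the constructor also created
        # self.biases = Parameter(np.zeros(output_neurons))
        out = out + self.biases.data
    return out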