Build a Neural Network from Scratch

In this post, we’ll walk through how to build a simple neural network from scratch using just NumPy. No high-level libraries like TensorFlow or PyTorch, just the fundamentals.


What is a Neural Network?

A neural network is a set of interconnected layers of simple computational units called neurons. Each neuron receives inputs and returns an output value. It does this by multiplying each input by a learned weight, summing the results, and adding a bias term. The resulting value is then passed through a nonlinear activation function. We’ll explain later why this last step is necessary.

If the activation function is a sigmoid, a single neuron is essentially a logistic regression: given multiple inputs, it returns a probability between 0 and 1.
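As a quick illustration, here is a minimal sketch of a single sigmoid neuron in NumPy. The weight and bias values are made up for illustration, not learned:

import numpy as np

def neuron(x, weights, bias):
    # weighted sum of the inputs plus a bias, squashed through a sigmoid
    z = x @ weights + bias
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # three input features
w = np.array([0.8, 0.1, -0.4])   # one weight per input (hypothetical values)
b = 0.2
print(neuron(x, w, b))           # a probability between 0 and 1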

The real power of neural networks comes from stacking many of these simple neurons into layers and then chaining layers together by connecting the outputs of one layer to the inputs of another. This composition allows the network to learn increasingly complex patterns, a process often referred to as automatic feature engineering. So instead of manually designing features, as we have seen in a previous post, the model can learn them automatically, provided there is enough training data and the model is sufficiently complex.

A classic example of this is image classification. The first layer might learn to detect edges, the next layer might learn to combine those edges into shapes, and subsequent layers might learn to recognize more complex patterns like objects or faces.

Facial recognition feature hierarchy (Source: Zhanli Liu, 2019)

We can use neural networks to approximate virtually any function. That’s why neural networks are called universal function approximators.


Architecture

We’ll define the neural network as a series of Layer objects.

Each layer has:

  • A weight matrix connecting inputs to outputs,
  • A bias vector,
  • An optional activation function (e.g., ReLU, Sigmoid, Tanh).

During the forward pass, the layer computes:

z = input @ weights + bias
a = activation(z)

In the backward pass, we improve the model performance by adjusting the weights and biases slightly. First, we compute the loss, a measure of how far off the model’s guess was. Then we calculate the gradient of the loss, which tells us how much each part of the output contributed to that error.

We propagate this gradient backward through the network, layer by layer, figuring out how much each weight and bias contributed to the final output. Using this information, we update the weights and biases to reduce the loss. Each layer also calculates how much its inputs contributed to the output, and passes that information to the layer before it. This way, the entire network learns to make better predictions over time.
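In the same pseudocode style as the forward pass, the backward pass of a single layer looks like this, where d_output is the gradient of the loss with respect to the layer’s output:

d_z = activation_backward(d_output)   # backprop through the activation first
d_weights = input.T @ d_z             # how much each weight contributed
d_bias = sum(d_z, axis=0)             # how much each bias contributed
d_input = d_z @ weights.T             # gradient passed to the previous layer

weights -= learning_rate * d_weights  # gradient descent update
bias -= learning_rate * d_bias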

Why does this work?

We can do this because of the chain rule: $$ F(x) = f(g(x)) $$ $$ F'(x) = f'(g(x)) \cdot g'(x) $$

So if we imagine $f$ is the loss function and $g$ is the model, we can compute the gradient of the loss with respect to the model’s input by first computing the gradient of the loss with respect to the model’s output ($f'(y)$ where $y=g(x)$), then multiplying that by the gradient of the model with respect to its input ($g'(x)$). Also notice that we already computed $y$ in the forward pass, so we can simply keep track of that result instead of recomputing it in the backward pass.

This principle can be applied recursively, since the model is just a series of layers. The gradient of the full model can again be decomposed into the gradient of the last layer with respect to the previous layer’s output, times the gradient of all previous layers with respect to the model’s inputs. When we fully decompose the model this way, we have arrived at our implementation of the backward pass.
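Here is a quick numerical sanity check of the chain rule, using made-up example functions $f(u) = u^2$ (standing in for a loss) and $g(x) = 3x$ (standing in for a model): the analytic gradient $f'(g(x)) \cdot g'(x)$ should match a finite-difference estimate.

f = lambda u: u ** 2   # stand-in for a loss
g = lambda x: 3 * x    # stand-in for a model
x = 1.5

# analytic gradient via the chain rule: f'(g(x)) * g'(x) = 2*(3x) * 3
analytic = 2 * g(x) * 3

# numerical gradient via central finite differences
eps = 1e-6
numerical = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numerical)   # both should be ~27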

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import tqdm
class Layer:
    def __init__(self, input_size, output_size, activation=None):
        self.weights = np.random.normal(scale=np.sqrt(1/input_size), size=(input_size, output_size))
        self.bias = np.zeros((1, output_size))
        if activation is None:
            self.activation = None
        else:
            self.activation = activation()

    def forward(self, x):
        self.input = x
        self.z = x @ self.weights + self.bias  # pre-activation output
        if self.activation is not None:
            self.a = self.activation.forward(self.z)
            return self.a
        return self.z

    def backward(self, d_output, learning_rate):
        # If activation is present, backprop through it first
        if self.activation is not None:
            d_output = self.activation.backward(d_output)

        d_weights = self.input.T @ d_output
        d_bias = np.sum(d_output, axis=0, keepdims=True)
        d_input = d_output @ self.weights.T

        # Update parameters
        self.weights -= learning_rate * d_weights
        self.bias -= learning_rate * d_bias

        return d_input
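Before wiring layers into a full network, it can help to sanity-check a single Layer on random data. This is just a minimal sketch; the shapes and learning rate are arbitrary:

layer = Layer(input_size=3, output_size=2, activation=None)

x = np.random.randn(4, 3)    # a batch of 4 samples with 3 features each
out = layer.forward(x)
print(out.shape)             # (4, 2)

# pretend the upstream gradient is all ones and backprop through the layer
d_input = layer.backward(np.ones_like(out), learning_rate=0.01)
print(d_input.shape)         # (4, 3), passed on to the previous layer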

Activation Functions

Why do we need non-linear activation functions? Because otherwise each layer is a linear operation, and stacking multiple linear operations still results in an overall linear operation. This means we don’t gain any expressive power by making our model deeper. Non-linearities like ReLU, Tanh, and Sigmoid prevent this and let the network learn complex patterns and approximate any function.
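To see this, note that two linear layers can always be collapsed into a single equivalent one. A small sketch with made-up shapes:

W1 = np.random.randn(4, 8)
b1 = np.random.randn(1, 8)
W2 = np.random.randn(8, 3)
b2 = np.random.randn(1, 3)

x = np.random.randn(5, 4)

# two stacked linear layers without an activation in between ...
two_layers = (x @ W1 + b1) @ W2 + b2

# ... are equivalent to a single linear layer with combined weights and bias
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   # True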

ReLU (f(x) = max(0, x)) is the most commonly used. It’s fast, works well in deep networks, and avoids vanishing gradients for positive inputs. However, some neurons can “die” and stop updating if they get stuck on the negative side.

Tanh outputs values between -1 and 1, which centers the data and can help with convergence. But like sigmoid, it still suffers from vanishing gradients at the extremes.

Sigmoid squashes inputs to the range [0, 1], making it intuitive for binary outputs. It’s rarely used in hidden layers today due to slow learning and saturation issues but remains useful in binary classification output layers.

class Relu:
    def forward(self, x):
        self.input = x
        return np.maximum(0, x)

    def backward(self, d_output):
        return d_output * (self.input > 0)

class Tanh:
    def forward(self, x):
        self.output = np.tanh(x)
        return self.output

    def backward(self, d_output):
        return d_output * (1 - self.output ** 2)


class Sigmoid:
    def forward(self, x):
        self.output = 1 / (1 + np.exp(-x))
        return self.output

    def backward(self, d_output):
        return d_output * (self.output * (1 - self.output))

Activation Functions
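As a quick check that the backward methods are consistent with the forward methods, we can compare the analytic gradient of Tanh against a finite-difference estimate. This is a small sketch, not part of the training code:

tanh = Tanh()
x = np.array([[0.3, -1.0, 2.0]])

# analytic gradient: backward with an upstream gradient of ones gives 1 - tanh(x)^2
tanh.forward(x)
analytic = tanh.backward(np.ones_like(x))

# numerical gradient via central finite differences
eps = 1e-6
numerical = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

print(np.allclose(analytic, numerical))   # True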

Putting it all together

The NeuralNetwork class brings everything together—it’s the main model that chains multiple layers to form a complete feedforward neural network. It takes care of creating the input layer, a stack of hidden layers, and the output layer, each with their own activation functions. During training, we call forward() to compute predictions and backward() to update the weights based on the loss. This class is the high-level structure that lets us switch between tasks like regression or classification just by changing layer sizes and activations, without rewriting any of the core logic.

class NeuralNetwork:
    def __init__(self, input_size, output_size, hidden_layer_size, num_hidden_layers, head_activation=None, hidden_activation=Relu):
        self.layers = []

        # first hidden layer (input layer) 
        self.layers.append(Layer(input_size, hidden_layer_size, activation=hidden_activation))
        
        # Hidden layers
        for _ in range(num_hidden_layers - 1):
            self.layers.append(Layer(hidden_layer_size, hidden_layer_size, activation=hidden_activation))

        #output layer
        self.layers.append(Layer(hidden_layer_size, output_size, activation=head_activation))

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, loss_grad, learning_rate):
        for layer in reversed(self.layers):
            loss_grad = layer.backward(loss_grad, learning_rate)
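A quick shape check on random data confirms the layers are wired correctly. This is a minimal sketch with arbitrary sizes:

model = NeuralNetwork(input_size=4, output_size=2, hidden_layer_size=16, num_hidden_layers=2)

x = np.random.randn(8, 4)       # batch of 8 samples with 4 features
print(model.forward(x).shape)   # (8, 2)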

Training

To train the model, we define a loss function and a fairly standard training loop. For regression we’ll use mean squared error (MSE) loss.
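For a batch of $N$ predictions $\hat{y}_i$ and targets $y_i$, the loss and its gradient with respect to each prediction are: $$ L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 $$ $$ \frac{\partial L}{\partial \hat{y}_i} = \frac{2(\hat{y}_i - y_i)}{N} $$ which is exactly what the loss and grad methods below compute.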

class MeanSquaredErrorLoss:
    @staticmethod
    def loss(y_pred, y_true):
        return np.mean(np.square(y_pred - y_true))

    @staticmethod
    def grad(y_pred, y_true):
        return 2 * (y_pred - y_true) / y_true.shape[0]
def train(model, X_train, y_train, loss_fn, epochs, batch_size, learning_rate, X_test=None, y_test=None, bar=True):
    train_losses = []
    eval_losses = []

    if bar:
        pbar = tqdm.tqdm(total=epochs, desc="Training Progress", position=0)

    for epoch in range(epochs):
        permutation = np.random.permutation(X_train.shape[0])
        X_shuffled = X_train[permutation]
        y_shuffled = y_train[permutation]
    
        losses = []
        grads = []
    
        for i in range(0, X_train.shape[0], batch_size):
            x_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            y_pred = model.forward(x_batch)
            loss = loss_fn.loss(y_pred, y_batch)
            losses.append(loss)
            grad = loss_fn.grad(y_pred, y_batch)
            grads.append(np.mean(np.abs(grad)))
            
            model.backward(grad, learning_rate)
    
    
        train_loss = np.mean(losses)
        train_losses.append(train_loss)
        avg_grad = np.mean(grads)
        if X_test is not None and y_test is not None:
            eval_loss = loss_fn.loss(model.forward(X_test), y_test)
            eval_losses.append(eval_loss)
        else:
            eval_loss = None
        if bar:
            pbar.update(1)
            pbar.set_postfix(train_loss=train_loss, eval_loss=eval_loss, avg_grad=avg_grad)
    return train_losses, eval_losses

We can test our model as a universal function approximator. Here we define a few non-linear functions we want our model to approximate. We set up a fairly small model (2 hidden layers of size 32), the hidden activation is Tanh, and we use no head activation, since we want the output to be able to take any value on the y axis.

functions = [lambda x: x**2, lambda x: abs(x), lambda x: 3*np.sin(x)]

plt.figure(figsize=(15, 5))

for i, fn in enumerate(functions):
    X = np.linspace(-3, 3, 100).reshape(-1, 1)
    y = fn(X)
    model = NeuralNetwork(input_size=1, output_size=1, hidden_layer_size=32,  num_hidden_layers=2, hidden_activation=Tanh)
    train(model, X, y, loss_fn=MeanSquaredErrorLoss, epochs=400, batch_size=64, learning_rate=0.05)
    plt.subplot(1, 3, i + 1)
    plt.plot(X, y, label='True Function')
    plt.plot(X, model.forward(X), label='Model Prediction')
    plt.legend()

Regression

We see that the model’s prediction (orange) roughly follows the true function (blue).

Classification

Neural networks are also great for classification tasks. To demonstrate this, we’ll use the classic MNIST dataset—a collection of thousands of 28x28 pixel images of handwritten digits from 0 to 9. The goal is to correctly identify which digit is shown in each image.

# Load data
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target'].astype(int)

# Normalize input
X = X / 255.0
X = X.to_numpy()

# One-hot encode targets
encoder = OneHotEncoder(sparse_output=False)
y_encoded = encoder.fit_transform(y.to_numpy().reshape(-1, 1))

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

For classification, we can’t just return raw values. We use the softmax activation function, which turns the output into probabilities across the classes. This makes it possible to interpret the model’s prediction as a confidence score for each digit. We then use cross-entropy loss, which works well with probability outputs and is the standard choice for multi-class classification problems.
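Softmax turns the raw outputs (logits) $z$ into probabilities, and cross-entropy compares those probabilities against the one-hot targets $y$: $$ \hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$ $$ L = -\frac{1}{N} \sum_n \sum_i y_{ni} \log \hat{y}_{ni} $$ A convenient property is that the gradient of the combined softmax-plus-cross-entropy with respect to the logits simplifies to $(\hat{y} - y) / N$. That is why Softmax.backward below can simply pass the gradient through unchanged: CrossEntropyLoss.grad already returns the gradient with respect to the logits.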

class Softmax:
    def forward(self, x):
        # Stability trick: subtract max
        x_stable = x - np.max(x, axis=1, keepdims=True)
        exp_x = np.exp(x_stable)
        self.output = exp_x / np.sum(exp_x, axis=1, keepdims=True)
        return self.output

    def backward(self, d_output):
        # Placeholder, won't be used since we'll derive directly from cross-entropy
        return d_output
class CrossEntropyLoss():
    @staticmethod
    def loss(y_pred, y_true):
        # Avoid log(0)
        epsilon = 1e-12
        y_pred = np.clip(y_pred, epsilon, 1. - epsilon)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    
    @staticmethod
    def grad(y_pred, y_true):
        # Gradient of softmax + cross-entropy
        return (y_pred - y_true) / y_true.shape[0]

We set up the model with 784 inputs (the number of pixels in each image) and an output size of 10 (the number of classes). For the hidden activation we choose Relu, and the head activation is Softmax, as discussed earlier.

nn = NeuralNetwork(
    input_size=784,
    output_size=10,
    hidden_layer_size=32,
    num_hidden_layers=2,
    hidden_activation=Relu,
    head_activation=Softmax
)
train_losses, eval_losses = train(nn, X_train, y_train, X_test=X_test, y_test=y_test, loss_fn=CrossEntropyLoss, epochs=40, batch_size=64, learning_rate=0.05)

When we plot the train and evaluation loss, we can see that the evaluation loss plateaus pretty quickly, but the training loss continues to go down. If we keep training for too long, we will overfit to our training data.
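A plot like the one below can be produced from the lists returned by train. A minimal sketch; the label and axis choices are arbitrary:

plt.plot(train_losses, label='Train loss')
plt.plot(eval_losses, label='Eval loss')
plt.xlabel('Epoch')
plt.ylabel('Cross-entropy loss')
plt.legend()
plt.show()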

Train and evaluation loss over epochs

Performance is pretty good: our neural network is able to predict the correct digit in the evaluation set with about 97% accuracy.

def accuracy(model, X, y_true):
    y_pred = model.forward(X)
    pred_labels = np.argmax(y_pred, axis=1)
    true_labels = np.argmax(y_true, axis=1)
    return np.mean(pred_labels == true_labels)

print("Train accuracy:", accuracy(nn, X_train, y_train))
print("Test accuracy:", accuracy(nn, X_test, y_test))
Train accuracy: 0.9944642857142857
Test accuracy: 0.9665714285714285

We can display the 10-dimensional output of the model using t-SNE, a popular visualization technique for high-dimensional data. It projects the output down to 2 dimensions while trying to preserve clusters. Doing this, we can see nice groups for each class in the test set. Notice how similar-looking numbers like 4 and 9 appear closer to each other.
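A sketch of how such a plot could be produced with scikit-learn's TSNE (imported earlier); the subsample size and plotting style are arbitrary choices:

# project the 10-dimensional model outputs for a subsample of the test set
sample = np.random.choice(X_test.shape[0], 2000, replace=False)
outputs = nn.forward(X_test[sample])
labels = np.argmax(y_test[sample], axis=1)

embedded = TSNE(n_components=2, random_state=42).fit_transform(outputs)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap='tab10', s=5)
plt.colorbar(label='Digit')
plt.show()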

t-SNE projection of the model outputs on the test set

Finally, we can look at a few of the samples where our neural network guessed wrong. Most of these misclassifications are understandable and can be attributed to sloppy handwriting. An accuracy of about 97% is still really good for a simple model like this.
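A sketch of how these misclassified samples could be collected and displayed; the grid size is an arbitrary choice:

pred_labels = np.argmax(nn.forward(X_test), axis=1)
true_labels = np.argmax(y_test, axis=1)
wrong = np.where(pred_labels != true_labels)[0]

plt.figure(figsize=(10, 4))
for i, idx in enumerate(wrong[:10]):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_test[idx].reshape(28, 28), cmap='gray')
    plt.title(f"pred {pred_labels[idx]}, true {true_labels[idx]}")
    plt.axis('off')
plt.show()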

Misclassified test samples with predicted and true labels