Neural Net
An implementation of a deep neural network, from scratch.
An exploration of my experience coding a neural network from scratch. I chose Python for this implementation because I want to port it to, and compare it against, the Mojo programming language, which aims to be a syntactic superset of Python. GitHub
Python
Let’s begin with the constructor method (in the full code, the class is called smol_brain and NumPy is imported as np):
def __init__(self, layer_sizes, learning_rate=0.01, max_epochs=1000, batch_size=20):
    self.layer_sizes = layer_sizes
    self.learning_rate = learning_rate
    self.max_epochs = max_epochs
    self.batch_size = batch_size
    self.num_layers = len(layer_sizes) - 1
    # weights: He-style initialization, scaled by the average of fan-in and fan-out
    self.weights = [np.random.randn(layer_sizes[i], layer_sizes[i+1])
                    * np.sqrt(4. / (layer_sizes[i] + layer_sizes[i+1]))
                    for i in range(self.num_layers)]
    # biases initialized to zero
    self.biases = [np.zeros((1, layer_sizes[i+1])) for i in range(self.num_layers)]
    # cache for storing pre-activations and activations
    self.cache = {}
Here, layer_sizes is a list of integers giving the number of neurons in each layer of the network. The weights are initialized with a He-style initialization (the scale factor uses the average of fan-in and fan-out), and the biases are initialized to zero. The cache dictionary stores the pre-activations and activations of each layer during forward and backward propagation. I was following Simon Prince’s explanation of neural networks from his new book, Understanding Deep Learning.
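As a quick sanity check of that scale factor (a sketch, not part of the original code; the 50-unit layer is just an example), the standard deviation of a freshly initialized 50 x 50 weight matrix should come out near sqrt(4 / (50 + 50)) = 0.2:

import numpy as np

fan_in, fan_out = 50, 50  # an example hidden layer with 50 inputs and 50 outputs
W = np.random.randn(fan_in, fan_out) * np.sqrt(4. / (fan_in + fan_out))

print(np.sqrt(4. / (fan_in + fan_out)))  # 0.2, the target standard deviation
print(W.std())                           # empirically close to 0.2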
Interestingly, I had a learning moment here. Initially, my implementation was a shallow network with hard-coded layer sizes, and I had also left out the biases (thinking that this would simplify the development experience). Boy, was I wrong. After an embarrassingly long time of testing and not realizing what was wrong with my implementation, it finally dawned on me that the function I was trying to fit, sinc(x), never passes through the origin, and there was no way for my neural network to fit such a function without biases, which are what make each layer an affine rather than purely linear transformation.
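To make that concrete, here is a small sketch (not from the original post) of why the bias-free version was doomed: with no biases, every layer maps the zero vector to the zero vector, so the prediction at x = 0 is always 0, while sinc(0) = 1.

import numpy as np

# a tiny bias-free ReLU network: each layer is just x @ W, so zero in means zero out
rng = np.random.default_rng(0)
weights = [rng.standard_normal((1, 50)), rng.standard_normal((50, 1))]

a = np.zeros((1, 1))          # the input x = 0
for i, W in enumerate(weights):
    z = a @ W                 # no bias term
    a = np.maximum(0, z) if i < len(weights) - 1 else z

print(a)            # [[0.]] -- stuck at the origin no matter what the weights are
print(np.sinc(0.))  # 1.0 -- the value the network would need to produce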
Forward propagation:
def forward_propagation(self, x):
    self.cache['a0'] = x
    for i in range(self.num_layers):
        z = self.cache[f'a{i}'] @ self.weights[i] + self.biases[i]
        self.cache[f'z{i+1}'] = z
        self.cache[f'a{i+1}'] = self.relu(z) if i < self.num_layers - 1 else z
    return self.cache[f'a{self.num_layers}']
Here, x is the input to the network. The forward propagation method computes each layer’s pre-activation z and activation a, applying the ReLU activation to every layer except the last, which is left linear so the network can output arbitrary regression values. The pre-activations and activations of each layer are stored in the cache dictionary for use in the backward propagation step.
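As a usage sketch (the class is called smol_brain, as in the fit example at the end of the post; the layer sizes here are just for illustration), the forward pass maps a (batch, 1) input to a (batch, 1) output and leaves the intermediate values in the cache:

import numpy as np

net = smol_brain(layer_sizes=[1, 50, 1])  # one hidden layer, for illustration
x = np.random.randn(20, 1)                # a batch of 20 scalar inputs

y_hat = net.forward_propagation(x)        # equivalent to net.predict(x)
print(y_hat.shape)             # (20, 1): one linear output per sample
print(net.cache['a1'].shape)   # (20, 50): hidden ReLU activations
print(net.cache['z2'].shape)   # (20, 1): pre-activation of the linear output layer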
Backward propagation:
def backpropagate(self, y):
    deltas = [None] * self.num_layers
    grads_w = [None] * self.num_layers
    grads_b = [None] * self.num_layers
    # Output layer error
    deltas[-1] = self.cache[f'a{self.num_layers}'] - y
    grads_w[-1] = self.cache[f'a{self.num_layers-1}'].T @ deltas[-1]
    grads_b[-1] = np.sum(deltas[-1], axis=0, keepdims=True)
    # Propagate the error backwards through the hidden layers
    for i in range(self.num_layers - 2, -1, -1):
        deltas[i] = (deltas[i+1] @ self.weights[i+1].T) * self.relu(
            self.cache[f'z{i+1}'], derivative=True)
        grads_w[i] = self.cache[f'a{i}'].T @ deltas[i]
        grads_b[i] = np.sum(deltas[i], axis=0, keepdims=True)
    return grads_w, grads_b
The backpropagate method computes the gradients of the loss function with respect to the weights and biases of each layer using the chain rule. The deltas represent the error at each layer: the output-layer delta is simply the prediction minus the target, and each earlier delta is the next layer’s delta propagated back through that layer’s weights and masked by the ReLU derivative. The weight gradients are then the cached activations (transposed) multiplied by these deltas, and the bias gradients are the deltas summed over the batch.
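One way to sanity-check this (a sketch, not part of the original code; net, X_batch, and y_batch stand for a smol_brain instance and a small mini-batch) is a finite-difference gradient check. Note that backpropagate uses delta = y_hat - y rather than the full derivative 2 * (y_hat - y) / N of the mean squared error, so the analytic gradient only matches the numerical gradient of compute_loss after rescaling by 2 / batch_size:

import numpy as np

def gradient_check(net, X_batch, y_batch, layer=0, eps=1e-6):
    # analytic gradients from backpropagation
    net.forward_propagation(X_batch)
    grads_w, _ = net.backpropagate(y_batch)

    # numerical gradient of compute_loss with respect to one weight matrix
    W = net.weights[layer]
    num_grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = net.compute_loss(y_batch, net.forward_propagation(X_batch))
        W[idx] = old - eps
        loss_minus = net.compute_loss(y_batch, net.forward_propagation(X_batch))
        W[idx] = old
        num_grad[idx] = (loss_plus - loss_minus) / (2 * eps)

    # for a single-output network, backpropagate returns (batch_size / 2) * d(MSE)/dW
    analytic = grads_w[layer] * 2.0 / X_batch.shape[0]
    return np.max(np.abs(analytic - num_grad))

A result near floating-point noise means the analytic and numerical gradients agree.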
One step of stochastic gradient descent:
def stochastic_gradient_descent_step(self, grads_w, grads_b):
    for i in range(self.num_layers):
        self.weights[i] -= self.learning_rate * grads_w[i] / self.batch_size
        self.biases[i] -= self.learning_rate * grads_b[i] / self.batch_size
In this method, the weights and biases of each layer are updated using the gradients computed in the backpropagation step, scaled by the learning rate and divided by the batch size.
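Because grads_w and grads_b are built from a matrix product and a sum over the batch axis, they are sums of per-example gradients, so dividing by batch_size makes the update an average over the mini-batch. A small sketch (with net, X_batch, and y_batch as in the gradient check above) that confirms this:

import numpy as np

net.forward_propagation(X_batch)
batch_grads_w, _ = net.backpropagate(y_batch)

# accumulate per-example gradients for the first weight matrix
summed = np.zeros_like(net.weights[0])
for i in range(X_batch.shape[0]):
    net.forward_propagation(X_batch[i:i+1])
    gw, _ = net.backpropagate(y_batch[i:i+1])
    summed += gw[0]

print(np.allclose(batch_grads_w[0], summed))  # True: the batch gradient is the per-example sum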
Helper function to compute loss during training:
def compute_loss(self, y, y_hat):
    loss = np.mean((y_hat - y)**2)
    return loss
Helper functions: a predict wrapper around forward propagation, and the ReLU activation along with its derivative:
def predict(self, X):
    return self.forward_propagation(X)

def relu(self, x, derivative=False):
    if derivative:
        return np.where(x > 0, 1, 0)
    return np.maximum(0, x)
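A tiny illustration (not from the post) of the two relu modes; note that the derivative at exactly x = 0 is taken to be 0 under this convention:

import numpy as np

x = np.array([-2.0, 0.0, 3.0])
print(np.maximum(0, x))       # [0. 0. 3.] -- the activation
print(np.where(x > 0, 1, 0))  # [0 0 1]    -- its derivative, 0 at x = 0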
Finally, the training loop:
def fit(self, X, y, X_val=None, y_val=None, get_validation_loss=False):
    training_loss = []
    if get_validation_loss:
        validation_loss = []
    n_samples = X.shape[0]
    n_batches = n_samples // self.batch_size
    for epoch in range(self.max_epochs):
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for batch in range(n_batches):
            start = batch * self.batch_size
            end = start + self.batch_size
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            y_hat = self.forward_propagation(X_batch)
            loss = self.compute_loss(y_batch, y_hat)
            grads_w, grads_b = self.backpropagate(y_batch)
            self.stochastic_gradient_descent_step(grads_w, grads_b)
            training_loss.append(loss)
            if get_validation_loss:
                X_val_batch = X_val[start:end]
                y_val_batch = y_val[start:end]
                y_val_hat = self.forward_propagation(X_val_batch)
                val_loss = self.compute_loss(y_val_batch, y_val_hat)
                validation_loss.append(val_loss)
        if epoch % 100 == 0:
            print(f'Epoch {epoch}, Loss: {loss}')
    if get_validation_loss:
        return training_loss, validation_loss
    return training_loss
In the training loop, the training data is shuffled at the beginning of each epoch, and the network is trained on mini-batches of data. For each batch, the loss is recorded, the gradients are computed with backpropagation, and the weights and biases are updated with a stochastic gradient descent step. If validation data is provided, the loss is also computed on the corresponding mini-batch of the validation set.
I chose the sinc(x) function to test my network. I wrote a brief script that uses NumPy to generate a noisy dataset with just 500 samples. Fitting actually worked out quite well, despite not having implemented the Adam solver:
layer_sizes = [1, 50, 50, 50, 50, 50, 50, 50, 50, 50, 1]
smol = smol_brain(layer_sizes=layer_sizes, learning_rate=0.01, max_epochs=1000, batch_size=20)
training_loss, validation_loss = smol.fit(x_train, y_train, x_val, y_val, get_validation_loss=True)
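The data-generation script isn’t shown above, but it could look something like the following sketch; the x range, the noise level, and the use of a separate, same-sized validation set are assumptions (keeping the validation set the same size as the training set means the mini-batch indexing inside fit also works on the validation arrays):

import numpy as np

rng = np.random.default_rng(42)

def noisy_sinc(n, noise=0.05):
    # unnormalized sinc: sin(x) / x, plus Gaussian noise
    x = rng.uniform(-10, 10, size=(n, 1))
    y = np.sinc(x / np.pi) + rng.normal(0, noise, size=(n, 1))
    return x, y

x_train, y_train = noisy_sinc(500)  # "just 500 samples"
x_val, y_val = noisy_sinc(500)      # same size, so fit()'s batch indexing applies cleanly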
Mojo
Mojo implementation writeup in progress.