Basic architecture of a Neural Network
A standard neural network model (also known as an artificial neural network) is organized into three parts: an input layer, one or more hidden layers, and an output layer, as explained below:
- An input layer receives the initial data or features that need to be processed. Each neuron in the input layer corresponds to one feature of the data. Depending on the specific application, the input layer may preprocess the input data to make it suitable for neural net training.
- An output layer provides the final result of the neural network’s computation. The number of neurons in the output layer depends on the task at hand. For example, in the case of multi-class classification, the output layer might contain as many nodes as classes.
- Hidden layers are intermediate layers between the input and output layers. Each hidden neuron computes a weighted sum of its inputs, which is then transformed by an activation function. The hidden layers, in essence, are responsible for extracting features and patterns from the data.
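As a minimal sketch of how the three kinds of layers fit together, the following pure-Python example pushes one input through a single hidden layer and an output layer. The layer sizes (3 features, 4 hidden neurons, 2 classes), the random weights, and the helper names `dense`, `relu`, `softmax` and `random_layer` are illustrative assumptions, not part of any particular library.

```python
import math
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def dense(inputs, weights, biases):
    """Weighted sum of the inputs (plus a bias) for every neuron in a layer."""
    return [sum(w * x for w, x in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out):
    """Hypothetical helper: random weights and zero biases for one layer."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

n_features, n_hidden, n_classes = 3, 4, 2
hidden_w, hidden_b = random_layer(n_features, n_hidden)
output_w, output_b = random_layer(n_hidden, n_classes)

x = [0.5, -1.2, 3.0]                           # input layer: one value per feature
h = relu(dense(x, hidden_w, hidden_b))         # hidden layer: weighted sum + activation
probs = softmax(dense(h, output_w, output_b))  # output layer: one probability per class
print(probs)
```

The output list has one entry per class and sums to 1, which matches the multi-class case described above where the output layer has as many nodes as classes.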
[Figure: a few examples of neural network structures]
Each layer consists of a certain number of units, or nodes. The connections between neurons are represented by weights that determine the strength of each connection. The neural network learns these weights during a training process to optimize its performance on a specific task.
Deep neural networks (DNNs) are neural networks with multiple hidden layers. A few examples of neural network architectures are recurrent neural networks, convolutional neural networks, and transformers.
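To make the relationship between layer sizes and learned weights concrete, here is a small sketch that counts the learnable parameters of a fully connected network. It assumes every neuron is connected to every neuron in the previous layer and has one bias term; the function name `count_parameters` and the layer sizes are made up for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

# 784 input features, two hidden layers (128 and 64 units), 10 output classes
print(count_parameters([784, 128, 64, 10]))  # 109386
```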
Training process of a Neural Network
Neural networks are, in the true sense, just mathematical expressions that take the training data and weights as inputs and emit predictions and loss values as outputs.
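A tiny sketch of this view, assuming the simplest possible "network" (a single linear neuron) and a squared-error loss: the data is held fixed, and the loss changes only because the weights change. The functions `predict` and `squared_error` and all numbers are illustrative assumptions.

```python
def predict(x, weights, bias):
    """Prediction as a plain expression of the input and the weights."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def squared_error(prediction, target):
    return (prediction - target) ** 2

x, target = [1.0, 2.0], 3.0  # one training example and its actual value

# Same data, two different weight settings, two different loss values
loss_a = squared_error(predict(x, [0.5, 0.25], 0.1), target)
loss_b = squared_error(predict(x, [1.0, 1.0], 0.0), target)
print(loss_a, loss_b)
```

Training amounts to searching for the weights that make this expression's loss output as small as possible.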
To train a neural network model, the initial step involves collecting and preprocessing the input data (cleaning, normalizing, etc.) to prepare it for training. Next, determine the architecture of your neural network. This involves choosing layer types (such as convolutional, recurrent, or transformer layers), deciding on the number of layers and the number of neurons within each layer, initializing weights, and selecting activation functions, a loss function, a regularization technique, and an optimization algorithm.
In the training process, a mini-batch of training data is sampled during each iteration, and each mini-batch goes through the same sequence of steps. A typical training loop looks like this:
- Sample a mini-batch of data from the training dataset
- Forward Propagation of data through the hidden layers
- Calculate weighted sum of inputs at each neuron
- Apply activation function to the weighted sum
- Predict output and calculate loss (difference between predicted and actual values)
- Calculate gradients of the loss with respect to all model parameters (backpropagation)
- Update the network weights using an optimization algorithm
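The steps above can be sketched in pure Python for a very small network. This is an illustration under stated assumptions rather than a reference implementation: the toy dataset (`y = 2x + 1`), the single hidden layer of 4 tanh neurons, the mean squared error loss, plain mini-batch SGD, and all hyper-parameter values are chosen only to keep the example short.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy regression data (assumed for illustration): y = 2x + 1 for x in [-1, 1]
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]

n_hidden, learning_rate, batch_size, n_iterations = 4, 0.1, 8, 2000
w1 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # input -> hidden weights
b1 = [0.0] * n_hidden
w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # hidden -> output weights
b2 = 0.0

def forward(x):
    z = [w1[j] * x + b1[j] for j in range(n_hidden)]  # weighted sum at each neuron
    h = [math.tanh(v) for v in z]                      # activation function
    y_hat = sum(w2[j] * h[j] for j in range(n_hidden)) + b2
    return h, y_hat

def mean_squared_error(samples):
    return sum((forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)

initial_loss = mean_squared_error(data)

for _ in range(n_iterations):
    batch = random.sample(data, batch_size)            # sample a mini-batch
    g_w1, g_b1, g_w2, g_b2 = [0.0] * n_hidden, [0.0] * n_hidden, [0.0] * n_hidden, 0.0
    for x, y in batch:
        h, y_hat = forward(x)                          # forward propagation
        d_out = 2.0 * (y_hat - y) / batch_size         # gradient of the batch MSE w.r.t. y_hat
        g_b2 += d_out
        for j in range(n_hidden):                      # gradients for all parameters
            g_w2[j] += d_out * h[j]
            d_hidden = d_out * w2[j] * (1.0 - h[j] ** 2)  # tanh'(z) = 1 - tanh(z)^2
            g_w1[j] += d_hidden * x
            g_b1[j] += d_hidden
    for j in range(n_hidden):                          # SGD weight update
        w1[j] -= learning_rate * g_w1[j]
        b1[j] -= learning_rate * g_b1[j]
        w2[j] -= learning_rate * g_w2[j]
    b2 -= learning_rate * g_b2

final_loss = mean_squared_error(data)
print(initial_loss, final_loss)
```

In practice the gradients are computed by a framework's automatic differentiation rather than by hand; they are written out here only to show what each step of the loop does.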
The process of iteratively adjusting the weights based on gradients aims to minimize the loss function and improve the model’s performance on the training data. This training loop continues until the predefined stopping criteria are met, which could be a fixed number of epochs, early stopping based on validation performance, or other convergence indicators.
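One of the stopping criteria mentioned above, early stopping, can be sketched as follows. The sketch assumes a "patience" rule (stop once the validation loss has not improved for a given number of consecutive epochs), which is one common variant among several. The function name `stopping_epoch`, its parameters, and the recorded loss values are hypothetical; in a real run each validation loss would be computed after the corresponding training epoch.

```python
def stopping_epoch(validation_losses, max_epochs=100, patience=3):
    """Return (stop_epoch, best_epoch) for one validation loss per epoch.

    Training stops after max_epochs, or earlier once the validation loss
    has failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(validation_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping: no recent improvement
    return stop_epoch, best_epoch

losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
print(stopping_epoch(losses))  # (7, 4)
```

Here training halts at epoch 7, and the weights saved at epoch 4 (the best validation loss) would typically be the ones kept.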
During the training process, the model’s performance is periodically evaluated on a separate validation set to check how well the model generalizes to new data. Hyper-parameters such as the learning rate and batch size are adjusted based on validation performance. Once training is complete, the trained model is evaluated on a held-out test dataset and the performance is reported.
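A minimal sketch of how the three datasets can be carved out of one collection of samples. The 80/10/10 proportions, the function name `split_dataset`, and its parameters are assumptions for illustration; the right proportions depend on how much data is available.

```python
import random

def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, reproducible shuffle
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

The validation set is consulted repeatedly while tuning, while the test set is used once at the end, so the reported performance is not biased by the tuning decisions.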
Hyper-parameters in a Neural Network Model
For a detailed understanding of hyper-parameters, please refer to this post: What are the key hyper-parameters of a neural network model?
The following hyper-parameters can be tuned to improve the performance of a neural network:
- Weight initialization at the beginning of network training. Eg: random initialization, Xavier initialization
- Number of hidden layers (also known as depth of the network)
- Number of neurons per layer impacts the network’s capacity to capture complex patterns. Larger layers can capture more intricate features but might also lead to overfitting
- Activation function introduces non-linearity into the model, allowing it to capture complex relationships in the data. Eg: ReLU, tanh, Leaky ReLU
- Loss function quantifies the difference between the predicted and actual values. Neural Network training aims to minimize the loss function. Eg: Mean Squared Error (MSE), Huber Loss for regression, Cross-entropy for classification
- Optimization algorithm (optimizer) is used to update the weights based on the gradients of the loss function. Eg: Stochastic Gradient Descent (SGD), Adam
- Learning rate determines the step size taken during weight updates in the optimization process. Learning rate decay, which gradually reduces the learning rate as training progresses, is often used to improve convergence. Eg: 0.1, 0.01, 0.001
- Batch Size is the number of training examples used in a single iteration of gradient descent. Typical values for batch size are 16, 32, 64, 128, 256, 512 and 1024.
- Epochs is the number of times the entire training dataset is passed through the network during training. Eg: 1, 10, 50, 100
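- Number of iterations is the number of times the model performs a forward and a backward pass and updates its weight and bias parameters. With one update per mini-batch, it equals the number of mini-batches per epoch multiplied by the number of epochs.
- Regularization techniques are used to prevent overfitting. Eg: Dropout, L1 (Lasso) / L2 (Ridge), Batch normalization, Early Stopping. (Dead neurons are a separate problem, usually addressed through the choice of activation function, such as Leaky ReLU.)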
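Several of the hyper-parameters listed above can be written down as simple formulas. The sketch below is illustrative only: the configuration values, the exponential form of the learning rate decay, and all function names are assumptions rather than recommended settings, and the Xavier formula shown is the uniform variant.

```python
import math

# Hypothetical configuration gathering the hyper-parameters from the list above
hyperparameters = {
    "weight_init": "xavier",
    "hidden_layers": 2,
    "neurons_per_layer": 64,
    "activation": "leaky_relu",
    "loss": "cross_entropy",
    "optimizer": "sgd",
    "learning_rate": 0.01,
    "lr_decay": 0.5,   # assumed schedule: multiply the learning rate by 0.5 each epoch
    "batch_size": 32,
    "epochs": 10,
    "dropout": 0.1,
}

def xavier_limit(n_in, n_out):
    """Bound of the uniform Xavier range: weights drawn from [-limit, +limit]."""
    return math.sqrt(6.0 / (n_in + n_out))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Keeps a small gradient for negative inputs instead of a flat zero
    return z if z > 0 else slope * z

def decayed_learning_rate(initial_lr, decay, epoch):
    """Exponential decay: the rate shrinks by a constant factor every epoch."""
    return initial_lr * decay ** epoch

def total_iterations(n_samples, batch_size, epochs):
    """One weight update per mini-batch; the last batch of an epoch may be smaller."""
    return math.ceil(n_samples / batch_size) * epochs

print(total_iterations(50_000, hyperparameters["batch_size"], hyperparameters["epochs"]))  # 15630
print(decayed_learning_rate(hyperparameters["learning_rate"], hyperparameters["lr_decay"], 2))
```

For example, with 50,000 training examples, a batch size of 32 and 10 epochs, each epoch takes 1,563 iterations, for 15,630 weight updates in total.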
[Figure: an example of hyper-parameters used for training GPT models]
- Great video by Professor Winston from MIT explaining the concepts of neural nets and back propagation (Runtime: 50 minutes)
- “Neural Networks: Zero to Hero” is an incredible video series by Andrej Karpathy, the former AI Director at Tesla. In this series, he delves into the fundamentals of neural networks, explaining concepts such as forward propagation, loss functions, and backward propagation from the ground up. (Runtime for the video: ~2 hours, Runtime for the full series: ~13 hours)