Introduction
Creating new neural network architectures can be quite time-consuming, especially in real-world workflows where numerous models are trained during the experimentation and design phase. Training every new model from scratch is wasteful and slows down the entire design process. In a typical workflow, several models are trained in sequence, each attempting to improve on the one before it. However, because the iterative design approach requires a full cycle of training and evaluation for every model, it takes a long time to find out whether each change is an improvement.
The Net2Net procedure offers a solution to this problem. It’s a cool and simple method that helps address the challenges of the iterative design process to some extent.
Learning Objectives
 Understand the Net2Net methods (Net2WiderNet and Net2DeeperNet) and how they speed up neural network training.
 Implement Net2WiderNet and Net2DeeperNet in code using TensorFlow.
 Learn how the Net2Net procedure addresses the challenges of the iterative design process.
 Learn how Net2DeeperNet increases the depth of a network and what role the ReLU function plays.
 Compare the results at the end to judge how well the method performs on the task.
This article was published as a part of the Data Science Blogathon.
Net2Net Procedure
The Net2Net strategy involves the teacher network and the student network. We initialize the student network (new model) to represent the same function as the teacher network (previous model). In this process, we perform knowledge transfer by using the previous model as the base model and applying the Net2Net methods (Net2WiderNet and Net2DeeperNet). We adopt the knowledge from the teacher network to the student network. Before training the student network, its output matches that of the teacher network, even though the architectures of these two networks may vary.
Mathematics
Suppose we have a teacher network, represented by the function y=f(x;θ), where:
 x is the input to the network,
 y is the output of the network,
 θ are the parameters of the network
Now, we want to initialize a student network, represented by the function g(x;θ′), where:
 x is, again, the input to the network,
 θ′ are the parameters of the student network.
The goal is to choose a new set of parameters θ′ for the student network in such a way that, for every input x, the output of the student network matches the output of the teacher network:
∀x, f(x;θ) = g(x;θ′)
Simple Flowchart
There are two ways of using Net2Net: Increase the width or the depth of the network.
Net2WiderNet Method
In Net2WiderNet, the width of the neural network increases. The method involves replacing the layer with a wider layer so the number of units or channels is increased. In convolutional architecture, this means having more channels.
Specifically, if layer i and layer i+1 are both fully connected layers and layer i uses an elementwise nonlinearity, Net2WiderNet allows you to replace layer i with a layer that has more units (wider layer).
The teacher network weights are denoted W^(i), where i is the layer index. For each layer to be widened, draw a random mapping g^(i) that assigns every unit of the wider layer to a unit of the original layer.
Replicate the current weights of each layer i according to this mapping to fill the new weight matrix U^(i) of the wider layer.
Before training, make sure that the wider layer has been fully initialized and that the student reproduces the teacher's outputs. If so, move on to training. If not, carry out the initialization step again.
Mathematical Example:
Let us examine a particular scenario in which layers i and i+1 are fully connected layers, with original weights W^(i) ∈ R^(m×n) and W^(i+1) ∈ R^(n×p). The aim is to widen layer i so that it has q outputs, where q > n.
Random Mapping Function g^(i)
Define a random mapping function g^(i): {1,2,…,q} → {1,2,…,n} that satisfies the following:
For every j ≤ n, g^(i)(j) = j.
For every j > n, g^(i)(j) is a random sample drawn from {1,2,…,n}.
Weight Replication
For the wider layer, new weight matrices U^(i) and U^(i+1) are introduced. The random mapping function is used to copy the weights from the original layer into the wider layer.
The replication factor records how many times a given unit is reproduced in the wider layer. Each copied unit's outgoing weights in U^(i+1) are divided by this factor, so the next layer receives the same total input as before.
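The mapping and weight replication steps can be sketched in NumPy for the two-layer case above. This is a minimal illustration rather than a reference implementation: the helper name widen_layer, the toy sizes, and the use of a bias vector are assumptions made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def widen_layer(w1, b1, w2, new_width, rng):
    """Widen a dense layer from n to new_width units (hypothetical helper).

    w1: (m, n) weights into the layer, b1: (n,) bias, w2: (n, p) weights out of it.
    """
    n = w1.shape[1]
    # Random mapping g: identity on the first n units, random copies afterwards
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=new_width - n)])
    # Replication factor: how many widened units copy each original unit
    counts = np.bincount(g, minlength=n)
    u1 = w1[:, g]                        # copy incoming weights
    c1 = b1[g]                           # copy biases
    u2 = w2[g, :] / counts[g][:, None]   # split outgoing weights among the copies
    return u1, c1, u2

rng = np.random.default_rng(0)
m, n, p, q = 4, 3, 2, 5                  # toy sizes with q > n
w1 = rng.normal(size=(m, n))
b1 = rng.normal(size=n)
w2 = rng.normal(size=(n, p))
u1, c1, u2 = widen_layer(w1, b1, w2, q, rng)

# Compare teacher and student outputs on a random batch
x = rng.normal(size=(8, m))
teacher_out = relu(x @ w1 + b1) @ w2
student_out = relu(x @ u1 + c1) @ u2
print(np.allclose(teacher_out, student_out))
```

Dividing by the replication factor is what keeps the student's output equal to the teacher's: the copies of a unit all produce the same activation, so their scaled outgoing weights add back up to the original weight.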
Structure
Input

Teacher Network (Original Size)

Layer 1: (W, U)

Layer 2: (W, U)

Layer 3: (W, U)

…

Layer n: (W, U)

Wider Layer: (U, New Connections)

Output
Input: The network’s first input.
Teacher Network: The original neural network, represented by W^(i), with weights for each layer i.
Layers 1–n: The teacher network’s existing layers, each with weights W and extra broader weights U.
Wider Layer: The layer broadened by the Net2WiderNet method includes new connections and weights.
Output: The network’s ultimate output.
# Importing libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def net2wider_net(teacher_model, scale_factor, seed=0):
    """Widen every hidden Dense layer while preserving the teacher's function.

    Assumes a Sequential stack of Dense layers that all use a bias.
    """
    rng = np.random.default_rng(seed)
    dense_layers = [l for l in teacher_model.layers if isinstance(l, layers.Dense)]
    # One [kernel, bias] pair per Dense layer
    weights = [layer.get_weights() for layer in dense_layers]

    # Widen each hidden layer; the output layer keeps its width
    for i in range(len(dense_layers) - 1):
        w1, b1 = weights[i]
        w2, b2 = weights[i + 1]
        old_width = w1.shape[1]
        new_width = int(old_width * scale_factor)
        # Random mapping: keep the original units, then add random copies of them
        extra = rng.integers(0, old_width, size=new_width - old_width)
        mapping = np.concatenate([np.arange(old_width), extra])
        # Replication factor of each original unit
        counts = np.bincount(mapping, minlength=old_width)
        weights[i] = [w1[:, mapping], b1[mapping]]
        # Divide outgoing weights by the replication factor so the output is unchanged
        weights[i + 1] = [w2[mapping, :] / counts[mapping][:, np.newaxis], b2]

    # Build the student with the widened layers and load the transformed weights
    input_dim = weights[0][0].shape[0]
    student_model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for layer, (kernel, bias) in zip(dense_layers, weights):
        student_model.add(layers.Dense(kernel.shape[1], activation=layer.activation))
    for layer, layer_weights in zip(student_model.layers, weights):
        layer.set_weights(layer_weights)
    return student_model


# Example usage:
teacher_model = tf.keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Apply Net2WiderNet with a scale factor of 1.5
scale_factor = 1.5
wider_student_model = net2wider_net(teacher_model, scale_factor)
Experiment with Net2WiderNet
In this experiment, the researchers started with a smaller neural network (teacher network) built by reducing the number of convolution channels in each layer, which gave a simpler model with fewer parameters. They trained this smaller network and then used it to speed up the training of a regular-sized network (student network) through Net2WiderNet.
The results showed that the Net2WiderNet approach led to faster convergence (the model learning quickly) compared to other methods. Importantly, despite the faster training, the final accuracy of the model using Net2WiderNet was the same as a model trained from scratch. This means that using Net2WiderNet allows researchers to reach the same level of accuracy more quickly, saving time in running experiments without sacrificing the final performance of the model.
Net2DeeperNet Method
In the Net2DeeperNet method, the depth of the neural network is increased by converting the existing network into a deeper one. The basic concept is to replace the layer h^(i) = ϕ(h^(i-1)^T W^(i)) with two layers, h^(i) = ϕ(U^(i)^T ϕ(W^(i)^T h^(i-1))).
The main constraint is that we are increasing the depth of the network while keeping the structure of the network in a similar manner. The reason for increasing the depth of the network is that deeper architectures have the ability to gain more information and capture complex patterns in the data.
 Layer Transformation: We replace the initial h^(i) layer with a deeper structure, including the matrices U^(i) and W^(i). U^(i) is initialized as an identity matrix, preserving the initial structure.
 Activation Function ϕ: The selection of the activation function is critical to the success of this transformation. The ReLU (Rectified Linear Unit) is an appropriate choice since it fulfils the criterion ϕ(Iϕ(v))=ϕ(v) for all vectors v
 Application to Convolutional Networks: Setting the convolution kernels to be identity filters simplifies the procedure for convolutional networks. This ensures that the convolutional layers are similarly suitably modified.
The Net2DeeperNet method divides a layer L^(i) into two layers: the identity mapping layer I and the updated layer L^(i). This factorization enables a smooth shift to deeper topologies, hence unleashing the potential for greater network performance.
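The ReLU criterion ϕ(Iϕ(v)) = ϕ(v) is easy to check numerically. The sketch below uses toy shapes and illustrative variable names, and adds tanh purely as a contrasting example of an activation for which a plain identity initialization does not preserve the output.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(6, 4))          # teacher weights W^(i)
h_prev = rng.normal(size=(10, 6))    # batch of activations h^(i-1)
u = np.eye(4)                        # new layer U^(i), initialized to the identity

# ReLU: adding the identity layer leaves the output unchanged
relu_teacher = relu(h_prev @ w)
relu_student = relu(relu(h_prev @ w) @ u)
print("ReLU preserved:", np.allclose(relu_teacher, relu_student))

# tanh: the same construction changes the output
tanh_teacher = np.tanh(h_prev @ w)
tanh_student = np.tanh(np.tanh(h_prev @ w) @ u)
print("tanh preserved:", np.allclose(tanh_teacher, tanh_student))
```

ReLU works because its output is already non-negative, so applying ReLU a second time changes nothing.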
Structure
Original Layer: h^(i) = phi(h^(i-1)T * W^(i))
Net2DeeperNet Transformation:
New Layer 1: a = phi(W^(i)T * h^(i-1))
New Layer 2: h^(i) = phi(U^(i)T * a), with U^(i) = I
Note: I is the identity matrix, so with ReLU phi(I * a) = a and the output is unchanged.
This transformation replaces a single layer h^(i) with two layers, creating a deeper structure while retaining the original network’s general function. The type of the layers involved and the activation function phi determine the precise shape of the transformation.
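For a convolutional layer, the identity matrix becomes an identity filter: a kernel with a single 1 at its centre that maps each channel to itself. The sketch below checks this with a naive NumPy convolution. The helper names are hypothetical, and the kernel layout (height, width, in_channels, out_channels) is an assumption chosen to match what Keras Conv2D layers store.

```python
import numpy as np

def identity_kernel(kh, kw, channels):
    """Identity filter in (height, width, in_channels, out_channels) layout."""
    k = np.zeros((kh, kw, channels, channels))
    for c in range(channels):
        k[(kh - 1) // 2, (kw - 1) // 2, c, c] = 1.0   # 1 at the centre, channel c -> c
    return k

def conv2d_same(x, k):
    """Naive stride-1 'same' cross-correlation; x: (H, W, C_in), k: (kh, kw, C_in, C_out)."""
    kh, kw, _, c_out = k.shape
    ph, pw = (kh - 1) // 2, (kw - 1) // 2
    xp = np.pad(x, ((ph, kh - 1 - ph), (pw, kw - 1 - pw), (0, 0)))
    out = np.zeros(x.shape[:2] + (c_out,))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + kh, j:j + kw, :]            # (kh, kw, C_in)
            out[i, j] = np.tensordot(patch, k, axes=3)   # sum over kh, kw, C_in
    return out

rng = np.random.default_rng(0)
feature_map = np.maximum(rng.normal(size=(5, 5, 3)), 0.0)   # post-ReLU activations
kernel = identity_kernel(3, 3, 3)
deeper_out = np.maximum(conv2d_same(feature_map, kernel), 0.0)
print(np.allclose(deeper_out, feature_map))
```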
Code
# Importing libraries
import tensorflow as tf
from tensorflow.keras import layers


def net2deeper_net(teacher_model):
    """Add an identity-initialized Dense layer after every hidden ReLU Dense layer.

    Assumes a Sequential stack of Dense layers.
    """
    dense_layers = [l for l in teacher_model.layers if isinstance(l, layers.Dense)]
    input_dim = dense_layers[0].get_weights()[0].shape[0]
    student_model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for idx, layer in enumerate(dense_layers):
        # Copy the teacher layer together with its weights
        copied_layer = layers.Dense(layer.units, activation=layer.activation,
                                    use_bias=layer.use_bias)
        student_model.add(copied_layer)
        copied_layer.set_weights(layer.get_weights())
        # Only hidden ReLU layers are deepened: relu(I @ relu(v)) == relu(v)
        is_hidden = idx < len(dense_layers) - 1
        is_relu = getattr(layer.activation, '__name__', '') == 'relu'
        if is_hidden and is_relu:
            # New layer: identity kernel and zero bias, so the output is unchanged
            student_model.add(layers.Dense(layer.units, activation='relu', use_bias=True,
                                           kernel_initializer=tf.keras.initializers.Identity(),
                                           bias_initializer='zeros'))
    return student_model


# Example usage:
teacher_model = tf.keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Apply Net2DeeperNet
deeper_student_model = net2deeper_net(teacher_model)
Experiment with Net2DeeperNet:
In these experiments, the researchers used the Net2DeeperNet method to make the model deeper, focusing on the convolutional layers. They started from an Inception model and increased its depth. The convolutional layers used rectangular kernels arranged in pairs: one layer used a vertical kernel, and the following layer used a horizontal kernel.
The results indicated that using Net2DeeperNet led to significantly faster improvement in accuracy compared to training from random initialization, both in terms of training and validation accuracy. In simpler terms, they made the Inception model deeper, and it learned more quickly while achieving good accuracy.
Fig: Training Accuracy of Different Methods
Fig: Validation Accuracy of Different Methods
Code For MNIST Data Via Knowledge Transfer
We are developing code for the MNIST dataset. Initially, we create the teacher model and then transfer all the weights to expand the depth of the architecture. Subsequently, we build both the student and deeper student architectures. Finally, we observe the output.
Step 1: Install Required Libraries
 Here, we can run the code on Jupyter Notebook or Colab.
!pip install keras numpy
Step 2: Import Packages
 We are importing all the required packages.
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.datasets import mnist
#from keras.utils import to_categorical
from tensorflow.keras.utils import to_categorical
import numpy as np
Step 3: Set Seed for Reproducibility
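 Setting a seed makes NumPy's random operations deterministic, which helps produce consistent results across runs. Note that np.random.seed only seeds NumPy; Keras weight initialization may need its own seed for fully reproducible training.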
np.random.seed(1337)
Step 4: Define Input Shape and Load/Preprocess Data
 Specify the input shape for the neural network and load the MNIST dataset, preparing it for training by normalizing pixel values and categorizing labels.
input_shape = (28, 28, 1) # Image shape
# Load and preprocess data
(train_x, train_y), (validation_x, validation_y) = mnist.load_data()
# Preprocess input data: reshape and normalize
preprocess_input = lambda x: x.reshape((-1, 28, 28, 1)) / 255.
preprocess_output = lambda y: to_categorical(y)
train_x, validation_x = map(preprocess_input, [train_x, validation_x])
train_y, validation_y = map(preprocess_output, [train_y, validation_y])
# Display data shapes
print("Loading MNIST data...")
print("train_x shape:", train_x.shape, "train_y shape:", train_y.shape)
print("validation_x shape:", validation_x.shape, "validation_y shape", validation_y.shape, "\n")
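The preprocessing step can be tried without downloading MNIST. The sketch below runs the same reshape and scaling on synthetic stand-in data and builds the one-hot labels with np.eye, which should give the same values as to_categorical for 10 classes. The array names and the five-sample batch are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for a few MNIST images (uint8 pixels) and their digit labels
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1, 3])

# -1 lets NumPy infer the number of samples; the trailing 1 is the channel axis
x = images.reshape((-1, 28, 28, 1)) / 255.0
# One-hot encode by picking rows of a 10x10 identity matrix
y = np.eye(10)[labels]

print("x shape:", x.shape, "value range:", x.min(), "to", x.max())
print("y shape:", y.shape)
```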
Step 5: Define Functions for Weight Manipulation
 Create functions like wider2net_fc and deeper2net_conv2d to manipulate weights for expanding neural network architectures, enabling wider, fully connected layers and deeper convolutional layers.
def wider2net_fc(teacher_w1, teacher_b1, teacher_w2, new_width, init):
    """Get initial weights for a wider, fully connected (dense) layer with a bigger nout,
    by 'randompad' or 'net2wider'.

    # Arguments
        teacher_w1: `weight` of fc layer to become wider, of shape (nin1, nout1)
        teacher_b1: `bias` of fc layer to become wider, of shape (nout1, )
        teacher_w2: `weight` of next connected fc layer, of shape (nin2, nout2)
        new_width: new `nout` for the wider fc layer
        init: initialization algorithm for new weights, either 'randompad' or 'net2wider'
    """
    assert teacher_w1.shape[1] == teacher_w2.shape[0]  # nout1 == nin2 for connected layers
    assert teacher_w1.shape[1] == teacher_b1.shape[0]
    assert new_width > teacher_w1.shape[1]
    n = new_width - teacher_w1.shape[1]  # number of units to add
    if init == 'randompad':
        # Baseline: pad with small random weights
        new_w1 = np.random.normal(0, 0.1, size=(teacher_w1.shape[0], n))
        new_b1 = np.ones(n) * 0.1
        new_w2 = np.random.normal(0, 0.1, size=(n, teacher_w2.shape[1]))
    elif init == 'net2wider':
        # Pick existing units to copy and count each unit's copies (original included)
        index = np.random.randint(teacher_w1.shape[1], size=n)
        factors = np.bincount(index)[index] + 1.
        new_w1 = teacher_w1[:, index]
        new_b1 = teacher_b1[index]
        # Divide outgoing weights by the replication factor
        new_w2 = teacher_w2[index, :] / factors[:, np.newaxis]
    else:
        raise ValueError("Unsupported weight initializer: %s" % init)
    student_w1 = np.concatenate((teacher_w1, new_w1), axis=1)
    student_w2 = np.concatenate((teacher_w2, new_w2), axis=0)
    if init == 'net2wider':
        # The copied original units get the rescaled outgoing weights too
        student_w2[index, :] = new_w2
    student_b1 = np.concatenate((teacher_b1, new_b1), axis=0)
    return student_w1, student_b1, student_w2
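def deeper2net_conv2d(teacher_w):
    """Get initial weights for a deeper conv2d layer by 'net2deeper'.

    # Arguments
        teacher_w: `weight` of previous conv2d layer, of shape (nb_filter, nb_channel, h, w)
    """
    # Note: Keras Conv2D stores kernels as (h, w, nb_channel, nb_filter),
    # so transpose such weights before and after using this helper
    nb_filter, nb_channel, h, w = teacher_w.shape
    student_w = np.zeros((nb_filter, nb_filter, h, w))
    # Identity filter: a single 1 at the kernel centre, mapping channel i to channel i
    for i in range(nb_filter):
        student_w[i, i, (h - 1) // 2, (w - 1) // 2] = 1.
    student_b = np.zeros(nb_filter)
    return student_w, student_b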
def copy_weights(teacher_model, student_model, layer_names):
    """Copy weights from teacher_model to student_model,
    for layers listed in layer_names, ensuring compatible shapes."""
    for name in layer_names:
        teacher_layer = teacher_model.get_layer(name)
        student_layer = student_model.get_layer(name)
        if teacher_layer.get_weights()[0].shape == student_layer.get_weights()[0].shape:
            student_layer.set_weights(teacher_layer.get_weights())
            print(f"Weights successfully copied to layer: {name}")
        else:
            print(f"Skipping layer {name} due to incompatible shapes.")
Step 6: Experiment Setup – Define Teacher Model
 Establish a simple Convolutional Neural Network (CNN) as the teacher model for training on the MNIST dataset. This serves as the baseline model from which knowledge will be transferred to student models.
def make_teacher_model(train_data, validation_data):
    """Train a simple CNN as a teacher model."""
    model = Sequential()
    model.add(Conv2D(64, (3, 3), input_shape=input_shape, padding="same", name="conv1"))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), name="pool1"))
    model.add(Conv2D(128, (3, 3), padding="same", name="conv2"))
    model.add(MaxPooling2D(name="pool2"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation="relu", name="fc1"))
    model.add(Dense(10, activation="softmax", name="fc2"))
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
    train_x, train_y = train_data
    history = model.fit(train_x, train_y, epochs=1, validation_data=validation_data)
    # Print layer shapes for verification
    print("Shapes after training:")
    for layer in model.layers:
        print(layer.name, layer.output_shape)
    return model, history
 The teacher model is a simple CNN with two convolutional layers (conv1 and conv2), each followed by a max-pooling layer (pool1 and pool2), and two fully connected layers (fc1 and fc2).
 After training for 1 epoch, the accuracy on the validation set is around 94.49%.
Step 7: Experiment Setup – Define Deeper Student Model
 Design a deeper student model based on the teacher model. Two initialization options are available: “randominit” (baseline) and “net2deeper.” In the latter, we expand the depth of the original architecture and copy weights from the corresponding layers of the teacher model to maintain knowledge transfer.
def make_deeper_student_model(teacher_model, train_data, validation_data, init):
    """Train a deeper student model based on teacher_model, with either 'randominit' (baseline)
    or 'net2deeper'
    """
    model = Sequential()
    # Note: Conv2D(64, 3, 3) is read by Keras as kernel_size=3, strides=3,
    # which differs from the teacher's Conv2D(64, (3, 3)) with stride 1
    model.add(Conv2D(64, 3, 3, input_shape=input_shape, padding="same", name="conv1"))
    model.add(MaxPooling2D(name="pool1"))
    model.add(Conv2D(128, 3, 3, padding="same", name="conv2"))
    # Check the dimensions after the second convolutional layer
    model.add(MaxPooling2D(name="pool2"))
    print("Dimensions after pool2:", model.output_shape)
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation="relu", name="fc1"))
    # Add another fc layer to make original fc1 deeper
    if init == "net2deeper":
        # Net2deeper for fc layer with relu is just an identity initializer
        model.add(Dense(128, kernel_initializer="identity", activation="relu", name="fc1deeper"))
    elif init == "randominit":
        model.add(Dense(128, activation="relu", name="fc1deeper"))
    else:
        raise ValueError("Unsupported weight initializer: %s" % init)
    model.add(Dense(10, activation="softmax", name="fc2"))
    # Copy weights for other layers
    copy_weights(teacher_model, model, layer_names=["conv1", "conv2", "fc1", "fc2"])
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
    train_x, train_y = train_data
    history = model.fit(train_x, train_y, epochs=3, validation_data=validation_data)
    return model, history
 The deeper student model is built by adding one more fully connected layer (fc1deeper) after fc1 in the architecture of the teacher model.
 The dimensions after the second max-pooling layer (pool2) are (None, 1, 1, 128).
 Weights are successfully copied for the convolutional layers (conv1 and conv2), but the fully connected layer (fc1) is skipped due to incompatible shapes.
 The student model is trained for 3 epochs, achieving an accuracy of around 93.78% on the validation set.
Step 8: Run Experiment
 Execute the experiment to benchmark the performances of three models – the teacher model, a deeper student model with “randominit” weights, and a deeper student model with “net2deeper” weights. The training and validation accuracies are observed to analyze the impact of the depth expansion on model performance.
def net2deeper_experiment():
    train_data = (train_x, train_y)
    validation_data = (validation_x, validation_y)
    print("Experiment of Net2DeeperNet ...")
    # Build teacher model
    teacher_model, teacher_history = make_teacher_model(train_data, validation_data)
    # Build deeper student model with random initialization
    random_student_model, random_student_history = make_deeper_student_model(
        teacher_model, train_data, validation_data, "randominit")
    # Build deeper student model with net2deeper initialization
    net2deeper_student_model, net2deeper_student_history = make_deeper_student_model(
        teacher_model, train_data, validation_data, "net2deeper")


# Run the experiment
net2deeper_experiment()
 Both the randominit and net2deeper initialization approaches result in deeper student models.
 The skipping of the fully connected layer (fc1) during weight copying points to a dimension mismatch between the teacher and student models. A likely cause is that the student calls Conv2D(64, 3, 3), which Keras reads as a 3x3 kernel with stride 3, whereas the teacher uses Conv2D(64, (3, 3)) with stride 1, so the flattened input to fc1 has a different size.
 The training accuracy and validation accuracy of the student models are comparable, indicating that the deeper student models can still learn effectively.
 Writing the student's convolutions as Conv2D(64, (3, 3)) to match the teacher should align the fc1 shapes, allow all weights to be copied, and potentially improve the performance of the deeper student models.
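The reported dimensions can be reproduced with a little arithmetic. The sketch below assumes that Keras reads Conv2D(filters, 3, 3) as kernel_size=3 and strides=3, that padding="same" gives ceil(size / stride), and that the default 2x2 max pooling halves the size (rounding down). The helper names are hypothetical.

```python
import math

def conv_same(size, stride):
    """Output size of a convolution with padding='same'."""
    return math.ceil(size / stride)

def max_pool(size, pool=2):
    """Output size of max pooling with padding='valid' and stride equal to the pool size."""
    return (size - pool) // pool + 1

def flatten_size(conv_stride, size=28, channels=128):
    """Spatial size after pool2 and the length of the flattened vector fed to fc1."""
    size = max_pool(conv_same(size, conv_stride))   # conv1 + pool1
    size = max_pool(conv_same(size, conv_stride))   # conv2 + pool2
    return size, size * size * channels

teacher = flatten_size(conv_stride=1)   # Conv2D(64, (3, 3)): stride 1
student = flatten_size(conv_stride=3)   # Conv2D(64, 3, 3): stride 3
print("teacher (size after pool2, fc1 inputs):", teacher)   # (7, 6272)
print("student (size after pool2, fc1 inputs):", student)   # (1, 128)
```

Under these assumptions the teacher's fc1 kernel has shape (6272, 128) while the student's has shape (128, 128), which is consistent with the (None, 1, 1, 128) output and the skipped fc1 layer reported above.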
Is Net2Net Effective?
Because of the function-preserving strategy adopted, the new larger network (student network) immediately performs exactly as well as the old network (teacher network), rather than going through a period of low performance.
Additionally, compared to randomly initialized networks, Net2Net-trained networks converge to the same accuracy more quickly. In the Net2WiderNet experiment described above, the final accuracy matched that of a model trained from scratch, which suggests that it depends on the size of the network rather than on the training method.
The authors of the paper illustrate the benefits of training with Net2Net when developing new designs and conducting testing through graphs showing the results of tests.
Challenges of the Net2Net Method
 When coding Net2Net, shape mismatches are a common source of errors, so check the shapes of the original (teacher) weights before copying them.
 Net2Net transformations may not be universally applicable to all types of neural network architectures.
 The effectiveness of Net2Net could be task-dependent.
 Generalizing Net2Net to novel or custom architectures is still an open challenge.
Limitations of the Net2Net Method
 Certain architectures may not benefit as much from widening or deepening transformations, potentially limiting the scope of knowledge transfer.
 Some tasks may not exhibit the same level of improvement, and the benefits might vary across different domains and problem complexities.
 It may not be well-established how effective the method is on non-standard architectures or architectures designed for specific tasks.
Future Improvements
 Examine Different Architectures: To discover Net2Net’s versatility, run it through multiple neural network designs.
 Generalization of the Task: Extend its application beyond image classification to other machine learning problems.
 Strategies for Fine-tuning Transferred Knowledge: Create ways of fine-tuning transferred knowledge for task-specific nuances.
 Concerns about Scalability: Address scalability difficulties for larger and more sophisticated models.
 Analysis of Robustness: Determine the robustness of Net2Net-transferred models under various situations.
Conclusion
In conclusion, the Net2Net method is valuable for designing neural networks and transferring knowledge between models during training. The results indicate faster training and less time spent constructing models compared with building each one from scratch. The researchers experimented with two types of Net2Net: Net2WiderNet, which increases the width of the neural network, and Net2DeeperNet, which increases the depth while preserving the function of the initial model. Both methods helped the student models reach good accuracy sooner. However, further improvements are needed before Net2Net can enable more efficient neural network designs, especially as deep learning continues to advance.
Key Takeaways
 Net2Net proves to be a valuable method in the design of neural networks in deep learning.
 Net2WiderNet and Net2DeeperNet are two methods that help speed up the training of a model.
 By effectively sharing information between models, Net2Net provides a novel approach to accelerating neural network training.
 In Net2WiderNet, we increase the width of the model.
 In Net2DeeperNet, we increase the depth of the model to capture complex information from the data.
Frequently Asked Questions
A. The Net2Net procedure accelerates training by efficiently transferring knowledge from a smaller network (teacher) to a larger one (student), reducing the need for training the larger network from scratch.
A. Net2Net enables quick exploration of the design space by transforming existing state-of-the-art architectures, allowing for faster experimentation and improved results in deep learning.
A. Net2WiderNet accelerates convergence to the same accuracy as random initialization, while Net2DeeperNet achieves good accuracy much faster than training from random initialization.
A. Net2Net demonstrates the possibility of transferring knowledge rapidly between neural networks, providing a technique for exploring model families more rapidly and reducing the time required for typical machine learning workflows.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.