import torch
from torch.autograd import Variable
================================================================================
Forward propagation and back propagation
================================================================================
torch.autograd computes derivatives (gradient values) for you
================================================================================
A torch.autograd.Variable is composed of "data", "grad", and "grad_fn"
"data" is the tensor data stored in the Variable
"grad" is the gradient value, which is filled in after the gradient has been calculated
"grad_fn" stores the operation that produced the Variable,
which is what autograd uses when you calculate the gradient
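Since PyTorch 0.4, Variable has been merged into Tensor, so the same three attributes can be read from a plain tensor. The following is a minimal sketch, assuming a recent PyTorch release; the names w and y are illustrative and do not come from these notes.

```python
import torch

# A plain tensor created with requires_grad=True plays the role of the old Variable
w = torch.ones(2, 2, requires_grad=True)

print(w.data)     # the stored values
print(w.grad)     # None until backward() has been called
print(w.grad_fn)  # None, because w was created directly and not by an operation

# A tensor produced by an operation records that operation in grad_fn
y = w * 3
print(type(y.grad_fn).__name__)  # MulBackward0
```

The Variable wrapper used in the rest of these notes still works, but it simply returns a tensor.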
================================================================================
Suppose the following computational graph structure
1. input_tensor
(you don't need requires_grad=True on input data
such as an input image unless you explicitly need its gradient)
2. layer1:
$$$\text{output_from_layer1}=\text{required_grad_is_True_and_randomly_initialized_trainable_variable_1}\times \text{input_tensor}+2$$$
3. loss function:
$$$\text{calculated_loss_value}=\text{output_from_layer1}+100$$$
When input_tensor is passed through layer1,
the "grad_fn" of output_from_layer1 holds the operation specified in layer1
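The graph above can be sketched in a few lines. This is only an illustration: param_layer1 is a shortened, hypothetical name for the randomly initialized trainable variable, and the constant offset 2 follows the code that appears later in these notes.

```python
import torch

# Constant input: no gradient is needed for it
input_tensor = torch.ones(2, 2)
# Hypothetical name for the randomly initialized trainable variable
param_layer1 = torch.rand(2, 2, requires_grad=True)

# layer1 (constant offset taken from the code later in these notes)
output_from_layer1 = param_layer1 * input_tensor + 2
# loss function layer
calculated_loss_value = output_from_layer1 + 100

# The last operation in layer1 is the addition, so grad_fn is an add node
print(type(output_from_layer1.grad_fn).__name__)
# The input does not take part in gradient tracking
print(input_tensor.requires_grad)
```

Only tensors that depend on a requires_grad=True tensor get a grad_fn; input_tensor itself has none.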
================================================================================
Your ultimate goal is to update all trainable parameters in all layers
so that they reflect the patterns in your training data.
Updating a trainable parameter is a problem that has a direction,
like a mathematical vector rather than a scalar.
That direction comes from the gradient $$$\frac{\partial loss}{\partial \text{param_layer1}}$$$:
to minimize the loss, you move each parameter against its gradient,
as in the following gradient descent update
$$$\text{adjusted_new_param_in_layer1} \leftarrow \text{current_param_in_layer1} - \text{learning_rate} \times \frac{\partial loss}{\partial \text{current_param_in_layer1}}$$$
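One step of this update rule can be sketched by hand as follows. The learning rate 0.1 and the use of sum() to reduce the loss to a scalar are assumptions made for illustration; they are not prescribed by these notes.

```python
import torch

input_tensor = torch.ones(2, 2)
current_param_in_layer1 = torch.ones(2, 2, requires_grad=True)
learning_rate = 0.1  # assumed value, for illustration only

# Forward pass; sum() (an assumption) reduces the loss to a scalar
# so that backward() needs no argument
loss = (current_param_in_layer1 * input_tensor + 2 + 100).sum()
loss.backward()

# Apply the update outside of gradient tracking
with torch.no_grad():
    current_param_in_layer1 -= learning_rate * current_param_in_layer1.grad
# Clear the accumulated gradient before the next step
current_param_in_layer1.grad.zero_()

print(current_param_in_layer1)
```

Here the gradient of the loss wrt every element of the parameter is 1, so each element moves from 1 to 0.9. In practice an optimizer such as torch.optim.SGD performs the same update.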
================================================================================
# @ c input_image: torch tensor as input image
input_image=torch.ones(2,2)
print(input_image)
# tensor([[1., 1.],
# [1., 1.]])
# @ c trainable_param_in_layer1: trainable parameter in layer1, which is initialized with 1s
trainable_param_in_layer1=torch.ones(2,2)
print(trainable_param_in_layer1)
# tensor([[1., 1.],
# [1., 1.]])
================================================================================
@ As the constant tensor input_image goes through all layers,
the parameters (torch.autograd.Variables) in all layers need to be defined with requires_grad=True
to track gradient values.
It means you ask for the gradient of the loss wrt each trainable parameter to be calculated
@ Parameters of built-in layers, such as those used in CNNs and RNNs, are defined
with requires_grad=True by default
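The default can be checked directly on built-in layers. The sketch below uses nn.Conv2d and nn.RNN as stand-ins for "CNN and RNN" parameters; the layer sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

# Arbitrary layer sizes, for illustration only
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)
rnn = nn.RNN(input_size=4, hidden_size=3)

# Every parameter registered by these layers tracks gradients by default
for name, param in list(conv.named_parameters()) + list(rnn.named_parameters()):
    print(name, param.requires_grad)

# A freshly created tensor, by contrast, does not track gradients
print(torch.ones(2, 2).requires_grad)
```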
================================================================================
# @ c trainable_param_in_layer1: trainable parameter in layer1
# with option requires_grad=True
trainable_param_in_layer1=Variable(trainable_param_in_layer1,requires_grad=True)
print(trainable_param_in_layer1)
# tensor([[1., 1.],
# [1., 1.]], requires_grad=True)
print(trainable_param_in_layer1.data)
# tensor([[1., 1.],
# [1., 1.]])
print(trainable_param_in_layer1.grad)
# None
# because backward() hasn't been called yet
print(trainable_param_in_layer1.grad_fn)
# None
# because trainable_param_in_layer1 was created by you, not by an operation
# @ Create layer1 by using trainable_param_in_layer1
# layer1: (trainable_param_in_layer1*x)+2
# @ Pass input_image into layer1 and get output_from_layer1
output_from_layer1=(trainable_param_in_layer1*input_image)+2
# @ Create loss function layer
# loss: loss_value=output_from_layer1+100
loss_value=output_from_layer1+100
================================================================================
@ To update the network
(in this case, there is only one parameter, trainable_param_in_layer1),
you need to calculate $$$\frac{\partial \text{loss_value}}{\partial \text{trainable_param_in_layer1}}$$$
================================================================================
@ In PyTorch, you can do it as follows
# Initialize trainable parameter with 1s
trainable_param_in_layer1=torch.ones(2,2)
# Note you use requires_grad=True
trainable_param_in_layer1=Variable(trainable_param_in_layer1,requires_grad=True)
output_from_layer1=input_image*trainable_param_in_layer1
final_loss_value=output_from_layer1.sum()
# Compute \frac{\partial final_loss_value}{\partial trainable_param_in_layer1}
final_loss_value.backward()
================================================================================
@ The gradient of "final_loss_value" wrt "trainable_param_in_layer1" is stored
in "trainable_param_in_layer1.grad"
@ Note that there are 2 operations in your whole neural network
One operation is in layer1 and the other is in the loss function layer
So $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}}$$$ can't be calculated in one step
To calculate $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}}$$$, you should use the chain rule
For example,
$$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}} = \dfrac{\partial \text{final_loss_value}}{\partial \text{output_from_layer1}} \times \dfrac{\partial \text{output_from_layer1}}{\partial \text{trainable_param_in_layer1}}$$$
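The two factors of the chain rule can be checked against what autograd stores. This is a sketch built on the same multiply-then-sum network as above; retain_grad() is used only so that the gradient of the intermediate tensor stays available, and the names d_loss_d_output, d_output_d_param, and manual_gradient are illustrative.

```python
import torch

input_image = torch.ones(2, 2)
trainable_param_in_layer1 = torch.ones(2, 2, requires_grad=True)

output_from_layer1 = input_image * trainable_param_in_layer1
output_from_layer1.retain_grad()  # keep the gradient of this intermediate tensor
final_loss_value = output_from_layer1.sum()
final_loss_value.backward()

# Factor 1: d final_loss_value / d output_from_layer1 (all ones, because of sum())
d_loss_d_output = output_from_layer1.grad
# Factor 2: d output_from_layer1 / d trainable_param_in_layer1
# (the multiplication is elementwise, so this equals input_image)
d_output_d_param = input_image

manual_gradient = d_loss_d_output * d_output_d_param
print(torch.equal(manual_gradient, trainable_param_in_layer1.grad))  # True
```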
================================================================================
@ The contribution of torch.autograd is
that it calculates the gradient $$$\dfrac{\partial \text{loss_value}}{\partial \text{trainable_param_in_layer1}}$$$ for you.
Otherwise you would have to manually perform the multiple calculations
of the chain rule to find one gradient
================================================================================
Practical example 1
import torch
from torch.autograd import Variable
a=torch.ones(2,2)
# 1 1
# 1 1
a=Variable(a,requires_grad=True)
# ================================================================================
print(a.data)
# 1 1
# 1 1
print(a.grad)
# None
# Because backward() hasn't been called yet
print(a.grad_fn)
# None
# Because a was created by you, not by an operation
# ================================================================================
b=a+2
print(b)
# 3 3
# 3 3
c=b**2
print(c)
# 9 9
# 9 9
out=c.sum()
print(out)
# 36
# ================================================================================
# To update a, you should calculate \frac{\partial out}{\partial a}
# \frac{\partial out}{\partial a} is stored into a.grad
# torch.autograd calculates \frac{\partial out}{\partial a} for you,
# so you don't apply the chain rule dout/da=dout/dc*dc/db*db/da by hand
out.backward()
# ================================================================================
print(a.data)
# 1 1
# 1 1
print(a.grad)
# 6 6
# 6 6
print(a.grad_fn)
# None
# Because a is a leaf: it was not produced by any operation
# ================================================================================
print(b.data)
# 3 3
# 3 3
print(b.grad)
# None
# because b is an intermediate (non-leaf) tensor, so its gradient is not kept
# unless you call b.retain_grad() before out.backward()
print(b.grad_fn)
# AddBackward0
# To calculate dout/da, PyTorch propagated the gradient through AddBackward0
# because b=a+2
# ================================================================================
print(c.data)
# 9 9
# 9 9
print(c.grad)
# None
# because c is an intermediate (non-leaf) tensor, so its gradient is not kept
# unless you call c.retain_grad() before out.backward()
print(c.grad_fn)
# PowBackward0
# To calculate dout/da, PyTorch propagated the gradient through PowBackward0
# because c=b^2
# ================================================================================
print(out.data)
# 36
print(out.grad)
# None
# because out is a non-leaf tensor, so its gradient is not kept
print(out.grad_fn)
# SumBackward0
# To calculate dout/da, PyTorch propagated the gradient through SumBackward0
# because out=c.sum()
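If you want to inspect the gradients of the intermediate tensors b and c, one option is retain_grad(). The sketch below repeats the same computation with plain tensors instead of Variable, assuming a recent PyTorch release.

```python
import torch

a = torch.ones(2, 2, requires_grad=True)
b = a + 2
c = b ** 2
# Ask autograd to keep the gradients of these non-leaf tensors
b.retain_grad()
c.retain_grad()
out = c.sum()
out.backward()

print(c.grad)  # dout/dc = 1 for every element
print(b.grad)  # dout/db = dout/dc * dc/db = 1 * 2*b = 6
print(a.grad)  # dout/da = dout/db * db/da = 6 * 1 = 6
```

Each printed gradient is one factor of the chain rule accumulated so far, which is what backward() computes step by step through SumBackward0, PowBackward0, and AddBackward0.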
================================================================================
Practical example 2
Consider the function z
$$$z=3\times x^{2}$$$
$$$\frac{\partial{z}}{\partial{x}} = 3\times 2\times x$$$
When x=1, $$$\frac{\partial{z}}{\partial{x}} = 6$$$
================================================================================
x=torch.ones(3)
x=Variable(x,requires_grad=True)
y=x**2
z=y*3
print(z)
# 3
# 3
# 3
# ================================================================================
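# z is not a scalar, so backward() needs one weight per element of z
grad=torch.tensor([0.1,1.0,10.0])
# Compute \frac{\partial z}{\partial x}, with each element weighted by grad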
z.backward(grad)
# ================================================================================
print(x.data)
# 1
# 1
# 1
print(x.grad)
# 0.6
# 6.0
# 60.0
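# Since \frac{\partial z}{\partial x}=6 at x=1 and you passed grad=[0.1,1,10],
# each element of x.grad is 6 multiplied by the corresponding element of grad
print(x.grad_fn)
# None
# because x is a leaf: it was not produced by any operation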
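Passing a tensor to backward() is equivalent to weighting each element of z, summing to a scalar, and then calling backward() with no argument. The sketch below checks this; x2 and z2 are illustrative names for a second copy of the same computation.

```python
import torch

grad = torch.tensor([0.1, 1.0, 10.0])

# Version 1: non-scalar output, weights passed to backward()
x = torch.ones(3, requires_grad=True)
z = 3 * x ** 2
z.backward(grad)

# Version 2: weight the elements, reduce to a scalar, then call backward()
x2 = torch.ones(3, requires_grad=True)
z2 = 3 * x2 ** 2
(z2 * grad).sum().backward()

print(x.grad)
print(torch.allclose(x.grad, x2.grad))
```

Both versions give 6 times grad, because dz/dx is 6 for every element when x=1.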