import torch
from torch.autograd import Variable
================================================================================
Figure: Forward propagation and back propagation (2018-06-17 08-12-25.png)
================================================================================
torch.autograd calculates differentiation (gradient) values on your behalf
================================================================================
autograd.Variable type
A Variable is composed of "data", "grad", and "grad_fn"
"data" is the data stored in the torch.autograd.Variable type variable
"grad" is the calculated gradient value
"grad_fn" stores the operation that produced the Variable, which autograd uses when it calculates gradients
================================================================================
Suppose the following computational graph structure
1. input_tensor (you don't need to use requires_grad=True on input data like an input image unless you explicitly need its gradient)
2. layer1: $$$\text{output_from_layer1}=\text{trainable_param_in_layer1}\times \text{input_tensor}+2$$$ where trainable_param_in_layer1 is a trainable Variable with requires_grad=True
3. loss function: $$$\text{loss_value}=\text{output_from_layer1}+100$$$
When input_image is passed through layer1, "grad_fn" of the output holds the operation specified in layer1
================================================================================
Your ultimate goal is to update all trainable parameters in all layers so that they reflect the pattern of your big data.
Updating trainable parameters is a problem that has a direction, like a mathematical vector rather than a scalar.
That direction comes from your need to minimize the loss: the gradient $$$\frac{\partial \text{loss}}{\partial \text{param_layer1}}$$$ tells you which way the loss increases, so you step against it when you use the following gradient descent algorithm
$$$\text{adjusted_new_param_in_layer1} \leftarrow \text{current_param_in_layer1} - \text{learning_rate} \times \frac{\partial \text{loss}}{\partial \text{current_param_in_layer1}}$$$
================================================================================
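# @ A minimal sketch (hypothetical scalar values, not from the original note) of the
# gradient descent update rule written above; grad_of_loss_wrt_param stands in for
# an already-computed d(loss)/d(param)
learning_rate=0.01
current_param_in_layer1=torch.tensor(1.0)
grad_of_loss_wrt_param=torch.tensor(4.0)
adjusted_new_param_in_layer1=current_param_in_layer1-learning_rate*grad_of_loss_wrt_param
print(adjusted_new_param_in_layer1)
# tensor(0.9600)
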
# @ c input_image: torch tensor as input image
input_image=torch.ones(2,2)
print(input_image)
# tensor([[1., 1.],
#         [1., 1.]])
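
# @ A quick hedged check tied to the note above: a plain input tensor does not track
# gradients by default, so no requires_grad=True is needed for input data like input_image
print(input_image.requires_grad)
# False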

# @ c trainable_param_in_layer1: trainable parameter in layer1, which is initialized by 1s
trainable_param_in_layer1=torch.ones(2,2)
print(trainable_param_in_layer1)
# tensor([[1., 1.],
#         [1., 1.]])
================================================================================
@ As the constant tensor input_image goes through all layers, the parameters (torch.autograd.Variables) in all layers need to be defined with requires_grad=True to track gradient values. It means the gradient of the loss wrt each trainable parameter is requested to be calculated
@ Parameters of built-in layers such as CNN and RNN modules are defined with requires_grad=True by default, as the sketch below shows
================================================================================
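# @ A minimal sketch (uses torch.nn, which is not imported elsewhere in this note)
# confirming that parameters of a built-in layer already have requires_grad=True
import torch.nn as nn
conv_layer=nn.Conv2d(in_channels=1,out_channels=1,kernel_size=3)
print(conv_layer.weight.requires_grad)
# True
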
# @ c trainable_param_in_layer1: trainable parameter in layer1 
# with option requires_grad=True
trainable_param_in_layer1=Variable(trainable_param_in_layer1,requires_grad=True)
print(trainable_param_in_layer1)
# tensor([[1., 1.],
#         [1., 1.]], requires_grad=True)

print(trainable_param_in_layer1.data)
# tensor([[1., 1.],
#         [1., 1.]])

print(trainable_param_in_layer1.grad)
# None
# because backward() hasn't been called yet, so no gradient has been computed

print(trainable_param_in_layer1.grad_fn)
# None
# because trainable_param_in_layer1 is a leaf Variable created by you,
# not the result of an operation

# @ Create layer1 by using trainable_param_in_layer1
# layer1: (trainable_param_in_layer1*x)+2

# @ Pass input_image into layer1 and get output_from_layer1
output_from_layer1=(trainable_param_in_layer1*input_image)+2

# @ Create loss function layer
# loss function layer: loss_value=output_from_layer1+100
loss_value=output_from_layer1+100
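
# @ A quick hedged check (assuming the lines above were run as written): the output
# of layer1 records the operation that produced it in grad_fn, as described earlier
print(output_from_layer1.grad_fn)
# AddBackward0
# because the last operation applied in layer1 was the addition of 2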
================================================================================
@ To update the network (in this case, there is only one parameter, trainable_param_in_layer1), you need to calculate $$$\frac{\partial \text{loss_value}}{\partial \text{trainable_param_in_layer1}}$$$
================================================================================
@ In PyTorch, you can do it by using the following code
# Initialize trainable parameter by 1
trainable_param_in_layer1=torch.ones(2,2)

# Note you use requires_grad=True
trainable_param_in_layer1=Variable(trainable_param_in_layer1,requires_grad=True)

output_from_layer1=input_image*trainable_param_in_layer1

final_loss_value=output_from_layer1.sum()

# Do \frac{\partial final_loss_value}{\partial trainable_param_in_layer1}
final_loss_value.backward()
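
# @ A hedged check (assuming the code above was run exactly as written): since
# final_loss_value=sum(input_image*trainable_param_in_layer1), the gradient wrt the
# parameter is just input_image, i.e. a 2x2 tensor of ones
print(trainable_param_in_layer1.grad)
# tensor([[1., 1.],
#         [1., 1.]])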
================================================================================
@ The gradient of "final_loss_value" wrt "trainable_param_in_layer1" is stored in "trainable_param_in_layer1.grad"
@ Note that there are 2 operations in your whole neural network: one operation is in layer1 and the other is in the loss function layer.
You can't directly calculate $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}}$$$ in a single step.
To calculate $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}}$$$, you should use the chain rule.
For example, $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}} = \dfrac{\partial \text{final_loss_value}}{\partial \text{output_from_layer1}} \times \dfrac{\partial \text{output_from_layer1}}{\partial \text{trainable_param_in_layer1}}$$$
================================================================================
@ The contribution of torch.autograd.Variable is that it calculates that gradient $$$\dfrac{\partial \text{final_loss_value}}{\partial \text{trainable_param_in_layer1}}$$$ on your behalf, whereas you would otherwise have to perform the chain of multiplications in the chain rule manually to find one gradient
================================================================================
Practical example1
import torch
from torch.autograd import Variable

a=torch.ones(2,2)
# 1 1
# 1 1

a=Variable(a,requires_grad=True)

# ================================================================================
print(a.data)
# 1 1
# 1 1

print(a.grad)
# None
# Because you didn't perform any operation

print(a.grad_fn)
# None
# Because you didn't perform any operation

# ================================================================================
b=a+2
print(b)
# 3 3
# 3 3

c=b**2
print(c)
# 9 9
# 9 9

out=c.sum()
print(out)
# 36

# ================================================================================
# To update a, you should calculate \frac{\partial out}{\partial a} 
# \frac{\partial out}{\partial a} is stored into a.grad

# torch.autograd directly calculates \frac{\partial out}{\partial a} 
# without you applying the chain rule dout/da=dout/dc*dc/db*db/da by hand
out.backward()

# ================================================================================
print(a.data)
# 1 1
# 1 1

print(a.grad)
# 6 6
# 6 6

print(a.grad_fn)
# None
# Because a is a leaf Variable created by you, not the result of an operation

# ================================================================================
print(b.data)
# 3 3
# 3 3

print(b.grad)
# None
# because b is a non-leaf Variable, so its gradient is not retained after backward()
# (see the retain_grad() sketch below)

print(b.grad_fn)
# AddBackward0
# To calculate dout/da, PyTorch performed AddBackward0 operation
# because b=a+2
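
# @ A hedged side note (fresh tensors, separate from the example above): gradients
# of non-leaf Variables can be kept by calling retain_grad() before backward()
b_retained=Variable(torch.ones(2,2),requires_grad=True)+2
b_retained.retain_grad()
(b_retained**2).sum().backward()
print(b_retained.grad)
# 6 6
# 6 6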

# ================================================================================
print(c.data)
# 9 9
# 9 9

print(c.grad)
# None
# because c is a non-leaf Variable, so its gradient is not retained after backward()

print(c.grad_fn)
# PowBackward0
# To calculate dout/da, PyTorch performed PowBackward0 operation
# because c=b^2

# ================================================================================
print(out.data)
# 36

print(out.grad)
# None
# because out is a non-leaf Variable, so its gradient is not retained after backward()

print(out.grad_fn)
# SumBackward0
# To calculate dout/da, PyTorch performed SumBackward0 operation
# because out=c.sum()
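
# @ A hedged, hand-computed check of the chain rule that autograd applied above:
# out=sum(c), c=b**2, b=a+2, so
# dout/da = dout/dc * dc/db * db/da = 1 * 2*b * 1 = 2*(a+2) = 6 when a=1
manually_computed_grad=2*(a.data+2)
print(manually_computed_grad)
# 6 6
# 6 6
# which matches a.grad printed above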
================================================================================
Practical example2
Suppose the function z
$$$z=3\times x^{2}$$$
$$$\frac{\partial{z}}{\partial{x}} = 3\times 2\times x$$$
When x=1, $$$\frac{\partial{z}}{\partial{x}} = 6$$$
================================================================================
x=torch.ones(3)
x=Variable(x,requires_grad=True)

y=x**2

z=y*3
print(z)
# 3
# 3
# 3

# ================================================================================
grad=torch.Tensor([0.1,1,10])

# Perform \frac{\partial z}{\partial x}
# Since z is not a scalar, backward() needs a gradient tensor with the same shape as z;
# each element of grad weights the corresponding element of \frac{\partial z}{\partial x}
z.backward(grad)

# ================================================================================
print(x.data)
# 1
# 1
# 1

print(x.grad)
# 0.6
# 6.0
# 60.0
# Since \frac{\partial z}{\partial x}=6 at x=1 and you passed grad=torch.Tensor([0.1,1,10]),
# each element of x.grad is 6 multiplied by the corresponding element of grad

print(x.grad_fn)
# None
# because x is a leaf Variable created by you, not the result of an operation
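
# @ A minimal sketch (fresh tensors named x2/z2, not part of the original note) showing
# the same gradient without weighting: passing a tensor of ones as the gradient
# argument gives the plain \frac{\partial z}{\partial x}=6 for every element
x2=Variable(torch.ones(3),requires_grad=True)
z2=3*x2**2
z2.backward(torch.ones(3))
print(x2.grad)
# 6
# 6
# 6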