==================================================
Let's look at the idea of how to train a neural network.
The goal of training a neural network is to minimize the "cost".
In the graph, the "cost" is on the vertical axis and the weight w is on the horizontal axis.
Another way to state the goal is to drive the derivative (slope) of the cost with respect to w toward zero.
In the graph, the slope at the top-right point is steep, but the slope decreases as training proceeds.
Finally, if training converges, you reach the lowest point, where the slope is almost 0.
==================================================
Conceptually, you need to know the "amount of effect on $$$\hat{y}$$$ by $$$x_1$$$" in order to adjust $$$w_1$$$.
==================================================
Aside from the effect on $$$\hat{y}$$$ by $$$x_1$$$,
you also need to know the effects on $$$\hat{y}$$$ by all the other inputs such as $$$x_2$$$, $$$x_3$$$, ...
But computing all of these directly is a very heavy computational burden.
==================================================
This is why Professor M. Minsky concluded that this kind of computation would not be possible, because the computational burden would be too high.
==================================================
The issue above was resolved by the backpropagation algorithm.
The idea of backpropagation is:
1. You calculate the error (loss or cost).
2. You pass the error from the last layer back to the first layers, that is, in the backward direction.
3. Through step "2", you can calculate the "derivative values" and decide what to update and how.
==================================================
$$$f=wx+b$$$
You can replace $$$wx$$$ with $$$g$$$, so you can write $$$g=wx$$$.
Then you can rewrite $$$f=wx+b$$$ as follows:
$$$f=g+b$$$
==================================================
You can draw a graph (computational graph) of the above equations.
==================================================
$$$\dfrac{\partial f}{\partial w}$$$: effect on f by w
$$$\dfrac{\partial f}{\partial x}$$$: effect on f by x
$$$\dfrac{\partial f}{\partial b}$$$: effect on f by b
==================================================
One technique you need to know in order to use backpropagation is the chain rule.
$$$y=f(g(x))$$$
$$$\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial g} \times \dfrac{\partial g}{\partial x}$$$
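==================================================
As a small illustration of the chain rule (this sketch is not from the notes; the functions $$$g(x)=3x$$$ and $$$f(g)=g^2$$$ are arbitrary choices for the example), you can verify the rule numerically in Python:

# Chain rule check for y = f(g(x)) with g(x) = 3x and f(g) = g^2.
# Analytically: dy/dg = 2g, dg/dx = 3, so dy/dx = 2g * 3 = 18x.
def g(x):
    return 3.0 * x

def f(g_val):
    return g_val ** 2

x = 2.0
analytic = 2.0 * g(x) * 3.0           # dy/dg * dg/dx = 18x = 36

eps = 1e-6                            # central-difference derivative for comparison
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numeric)              # both are about 36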
==================================================
Step 1. Forward phase
Just suppose you have weight w, input x, and bias b as -2, 5, and 3.
Then you get the values of $$$g$$$ and $$$f$$$ as $$$-10$$$ and $$$-7$$$.
==================================================
You compute several derivatives in advance for backpropagation.
From $$$g=wx$$$, you can derive
$$$\dfrac{\partial g}{\partial w}=x$$$
$$$\dfrac{\partial g}{\partial x}=w$$$
From $$$f=g+b$$$, you can derive
$$$\dfrac{\partial f}{\partial g}=1$$$
$$$\dfrac{\partial f}{\partial b}=1$$$
==================================================
In the above screenshot, you can see that $$$\dfrac{\partial f}{\partial g}$$$ and $$$\dfrac{\partial f}{\partial b}$$$ are both calculated as 1.
==================================================
You calculate $$$\dfrac{\partial f}{\partial w}$$$ by using the chain rule:
$$$\dfrac{\partial f}{\partial w} = \dfrac{\partial f}{\partial g} \times \dfrac{\partial g}{\partial w}$$$
$$$= 1 \times 5$$$
$$$= 5$$$
==================================================
$$$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial g} \times \dfrac{\partial g}{\partial x}$$$
$$$= 1 \times -2$$$
$$$= -2$$$
==================================================
Now you have calculated all the effects on f by $$$w, x, b$$$.
==================================================
Meaning:
$$$\dfrac{\partial f}{\partial w}$$$: the effect on f by w is 5 in magnitude.
So if you change w from -2 to -1, that means you performed +1 on w.
Since $$$1 \times 5 = 5$$$, f should increase by 5, according to $$$\dfrac{\partial f}{\partial w}=5$$$.
Let's perform the operation manually to check whether the calculation above is correct.
$$$w=-1, x=5 \rightarrow -1 \times 5 = -5$$$
$$$-5 + 3 = -2$$$
Since $$$-7 + 5 = -2$$$, the calculation above is correct.
==================================================
Case where there are more nodes (like more layers):
the logic is the same.
==================================================
You start with the last node (next to f).
The operation of that node is +.
Since you know the operation on that node and you know the function f as $$$f=a+b$$$,
you can calculate $$$\dfrac{\partial f}{\partial a}$$$.
==================================================
Suppose you've already calculated $$$\dfrac{\partial f}{\partial g}$$$.
Then what you actually want to calculate is $$$\dfrac{\partial f}{\partial x}$$$.
In that situation, you already know $$$\dfrac{\partial g}{\partial x}$$$, because you know the operation ($$$*$$$) on that node.
Since that operation is $$$*$$$, you can write $$$x*y=g$$$.
Then you can calculate $$$\dfrac{\partial g}{\partial x}$$$ using $$$x*y=g$$$, which gives $$$\dfrac{\partial g}{\partial x}=y$$$.
==================================================
You can calculate $$$\dfrac{\partial f}{\partial x}$$$ using the chain rule:
$$$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial g} \times \dfrac{\partial g}{\partial x}$$$
Then you can finally calculate the effect on f by x,
which will be used for updating the neural network to minimize the loss value.
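==================================================
The sketch below is only an illustration added to these notes; it reproduces the worked example for $$$f=wx+b$$$ with $$$w=-2, x=5, b=3$$$: a forward pass, a backward pass using the chain rule, and the manual check that increasing w by 1 increases f by 5.

# Forward pass for f = wx + b with w = -2, x = 5, b = 3.
w, x, b = -2.0, 5.0, 3.0
g = w * x                     # g = -10
f = g + b                     # f = -7

# Local derivatives prepared in advance:
# dg/dw = x, dg/dx = w, df/dg = 1, df/db = 1.
df_dg = 1.0
df_db = 1.0
dg_dw = x
dg_dx = w

# Backward pass (chain rule): df/dw = df/dg * dg/dw, df/dx = df/dg * dg/dx.
df_dw = df_dg * dg_dw         # 1 * 5  = 5
df_dx = df_dg * dg_dx         # 1 * -2 = -2

print(g, f)                   # -10.0 -7.0
print(df_dw, df_dx, df_db)    # 5.0 -2.0 1.0

# Manual check: change w from -2 to -1 (+1 on w).
# df/dw = 5 predicts f should change by +5, from -7 to -2.
w_new = w + 1.0
f_new = w_new * x + b
print(f_new)                  # -2.0, which equals -7 + 5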