Notes for Neural Network

If we set all the initial \(\Theta\) values to be the same, the units in the next layer that receive the same \(x_i\) will compute the same result, so all units in the same layer will produce the same output. In the end, the cost function will also yield the same cost for each of them, so we will update every \(\Theta\) with the same step (see the sketch below).

It seems \(\delta^{(l)}_{i}\) means the error (cost) of the \(i\)-th unit in the \(l\)-th layer.
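
Here is a minimal numeric sketch of that symmetry problem, assuming a tiny 2-2-1 network with sigmoid units and no bias terms; the architecture and numbers are only an illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])      # one training sample (bias omitted for brevity)
y = 1.0                        # its label

Theta1 = np.full((2, 2), 0.5)  # every hidden unit gets the SAME weights
Theta2 = np.full((1, 2), 0.5)

h = sigmoid(Theta1 @ x)        # both hidden units compute the same activation
out = sigmoid(Theta2 @ h)

delta_out = out - y                             # output-layer "error"
delta_h = (Theta2.T @ delta_out) * h * (1 - h)  # hidden-layer "error"

grad1 = np.outer(delta_h, x)   # gradient w.r.t. Theta1
print(h)      # both entries equal
print(grad1)  # both rows equal -> both hidden units get the same update
```

Since both rows of the gradient are identical, the two hidden units receive the same update and stay equal forever; random initialization breaks this symmetry.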

  • How to decide the number of iterations

  • How to initialize the weights \(w_i\)

    • Solved: random initialization (see the sketch after this list). But I still don't fully understand the formula.
  • How to choose the learning rate \(\alpha\)

  • How to choose the activation function

    • What's the difference between the sigmoid and other activation functions?
  • How to decide the number of hidden layers and the number of units in each hidden layer

    • Briefly answered: most networks take 3 layers, and the hidden layers usually all use the same fixed number of units.
  • What \(\Delta^{(i)}\) does in the \(\Theta^{(i)}\) update.

  • What does the Back Propagation algorithm do in the training process? Is it only used to calculate the partial derivatives of the Cost Function that are used to update \(\Theta\)?

  • Why does every article about NNs mention that a Perceptron can perform logical operations? Is there some theory that, with logical operations, we can simulate the human brain or something?
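
On the initialization question above: a minimal sketch of one common heuristic, \(\epsilon_{init} = \sqrt{6} / \sqrt{L_{in} + L_{out}}\); I am assuming this is the formula the note refers to, and the layer sizes below are only an example.

```python
import numpy as np

def rand_initialize(L_in, L_out):
    """Return an (L_out, L_in + 1) weight matrix drawn uniformly from
    [-epsilon, epsilon]; the extra column is for the bias unit."""
    epsilon = np.sqrt(6) / np.sqrt(L_in + L_out)
    return np.random.rand(L_out, L_in + 1) * 2 * epsilon - epsilon

Theta1 = rand_initialize(400, 25)  # e.g. 400 inputs -> 25 hidden units
Theta2 = rand_initialize(25, 10)   # 25 hidden units -> 10 output classes
```

Keeping \(\epsilon\) small keeps the weighted sums near zero, where the sigmoid's gradient is largest, so learning starts quickly.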

We can simply divide this into two parts:

  1. Train the \(\Theta\) for the MLP.

  2. Use the trained \(\Theta\) to predict the input’s classification.

  3. The second part is much easier, so let's first assume that we already have a set of trained \(\Theta\), and that we are now trying to use these \(\Theta\) to predict a test sample.

    The thing you need to do is just:

    1. Multiply the input \(X\) by each layer's \(\Theta\)
    2. Do some small fix-ups along the way (adding the bias unit, applying the activation function, choosing the most probable option, etc.)

    And you can get the prediction! What an easy job! (See the `predict` sketch after this list.)

  4. Then we face the training part:

    1. First, randomly initialize \(\Theta\). (Why don't we simply use 1 or 0? Read the note at the top!)
    2. Build a function that calculates the difference between our prediction and the ground truth. We call it the Cost Function, and we use it to evaluate our prediction.
    3. Then, taking that Cost Function as a measure, we use a search algorithm (e.g. Gradient Descent) to find the \(\Theta\) that minimizes the cost. (See the `train` sketch below.)
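
To make both parts concrete, here is a minimal numpy sketch, assuming a 3-layer network with sigmoid activations, one-hot labels, and the common formulation where the output-layer error is simply \(a^{(3)} - y\); the function names and sizes are my own illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, Theta1, Theta2):
    """Part 2: forward propagation. Multiply by each layer's Theta,
    add the bias unit, and choose the most probable class."""
    a1 = np.append(1.0, x)                     # add bias
    a2 = np.append(1.0, sigmoid(Theta1 @ a1))  # hidden layer (+ bias)
    a3 = sigmoid(Theta2 @ a2)                  # output layer
    return np.argmax(a3)                       # most probable option

def train(X, Y, Theta1, Theta2, alpha=0.5, iters=1000):
    """Part 1: gradient descent. X is (m, n); Y is one-hot, (m, K)."""
    m = X.shape[0]
    for _ in range(iters):
        Delta1 = np.zeros_like(Theta1)  # gradient accumulators
        Delta2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            a1 = np.append(1.0, x)
            a2 = np.append(1.0, sigmoid(Theta1 @ a1))
            a3 = sigmoid(Theta2 @ a2)
            # Back propagation: compute the per-layer "errors" ...
            delta3 = a3 - y
            delta2 = (Theta2.T @ delta3)[1:] * a2[1:] * (1 - a2[1:])
            # ... and accumulate the partial derivatives.
            Delta2 += np.outer(delta3, a2)
            Delta1 += np.outer(delta2, a1)
        Theta1 -= alpha * Delta1 / m    # the actual Theta update
        Theta2 -= alpha * Delta2 / m
    return Theta1, Theta2
```

In this sketch, `Delta1` and `Delta2` play the role of \(\Delta^{(i)}\) from the question list: they accumulate the partial derivatives over all samples, and back propagation's only job is to compute them; the \(\Theta\) update itself is the plain gradient-descent step at the end of each pass.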
