# Machine Learning
## Introduction
### Definition
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
There are two common types of Machine Learning: Supervised Learning and Unsupervised Learning.
### Supervised Learning
The term Supervised Learning refers to the fact that we give the algorithm a data set in which the "right answers" are given, and the task of the algorithm is to produce more of these right answers.
The problem is called a
- regression problem, if the algorithm should predict a continuous-valued output;
- classification problem, if the algorithm should predict a discrete-valued output.
For example, if you have a large inventory of identical items and want to predict how many of these items will sell over the next 3 months, this is a regression problem.
If you would like to examine individual customer accounts, and for each account decide if it has been hacked/compromised, this is a classification problem.
### Unsupervised Learning
Unsupervised Learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the variables in the data.
With unsupervised learning there is no feedback based on the prediction results.
For example,
- Clustering: Take a collection of 1000000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles and so on.
- Non-Clustering: the "Cocktail Party Problem", which allows you to find structure in a chaotic environment (e.g., isolating individual voices from a mix of overlapping audio).
## Linear Regression with One Variable
### Model Representation
To establish notation for future use, we use:

- $x^{(i)}$ to denote the "input" variables (also called input features);
- $y^{(i)}$ to denote the "output" variable we are trying to predict.

A pair $(x^{(i)}, y^{(i)})$ is called a training example, and a list of $m$ training examples $(x^{(i)}, y^{(i)}),\ i = 1, \dots, m$, is called a training set.

We also use:

- $X$ to denote the space of input values;
- $Y$ to denote the space of output values.

To describe the supervised learning problem more formally, our goal is, given a training set, to learn a function $h : X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a *hypothesis*. For linear regression with one variable, the hypothesis takes the form:

$$h_\theta(x) = \theta_0 + \theta_1 x$$
### Cost Function
We can measure the accuracy of our hypothesis function by using a cost function, most commonly the "squared error function" (also called the "mean squared error"):

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

We should try to minimize the cost function to find the best $\theta_0$ and $\theta_1$.
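As a quick illustration (a minimal NumPy sketch; the function name `compute_cost` and the toy data are mine, not from the original notes):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) for one-variable linear regression."""
    m = len(y)
    predictions = theta0 + theta1 * x          # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data: y = 2x, so theta0 = 0, theta1 = 2 should give zero cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 1.0, x, y))  # positive cost for a worse fit
```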
### Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

We put $\theta_0$ on the $x$ axis and $\theta_1$ on the $y$ axis, with the cost function on the vertical $z$ axis; gradient descent takes steps down this surface until it converges to a minimum.
The gradient descent algorithm is:

repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where:

- $j \in \{0, 1\}$ represents the feature index number;
- $\alpha$ is called the "learning rate", chosen to ensure that the gradient descent algorithm converges in a reasonable time.
### Gradient Descent for the Cost Function
We can substitute our cost function and our actual hypothesis function and modify the equation to:

repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

where:

- $m$ is the size of the training set;
- $\theta_0$ is a constant that will be changing simultaneously with $\theta_1$;
- $x^{(i)}, y^{(i)}$ are values of the given training set.
With these equations, we can repeatedly update $\theta_0$ and $\theta_1$ until they converge. This method looks at every example in the entire training set on every step, and is called *batch gradient descent*.
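To make the update rule concrete, here is a minimal sketch of batch gradient descent in NumPy (the function name and toy data are mine, not from the notes):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Every step uses *all* m training examples, and theta0, theta1
    are updated simultaneously.
    """
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y   # h_theta(x^(i)) - y^(i)
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        theta0 -= alpha * grad0             # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # y = 2x + 1
print(batch_gradient_descent(x, y))         # approximately (1.0, 2.0)
```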
## Linear Regression with Multiple Variables
### Multiple Features
Linear regression with multiple variables is also known as *multivariate linear regression*.
The notation for equations where we can have any number of input variables is as below:

- $x_j^{(i)}$ is the value of feature $j$ in the $i^{th}$ training example;
- $x^{(i)}$ is the input (features) of the $i^{th}$ training example;
- $m$ is the number of training examples;
- $n$ is the number of features.
The multivariable form of the hypothesis function is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \theta^T x$$

assuming $x_0^{(i)} = 1$ for every training example.
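A small illustration of the vectorized form (variable names and data are mine):

```python
import numpy as np

# Design matrix with x0 = 1 prepended to each example (m = 3, n = 2).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, 2.0])   # [theta0, theta1, theta2]

h = X @ theta                        # theta^T x^(i) for all examples at once
print(h)                             # [ 8.5 14.5 20.5]
```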
### Gradient Descent for Multiple Variables
The gradient descent equation has the same form; we just have to repeat it for our $n$ features. In other words:

repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j := 0, 1, \dots, n$$

where $x_0^{(i)} = 1$.
### Feature Scaling to Speed up Gradient Descent
We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ descends quickly on small ranges and slowly on large ranges, and so oscillates inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$-1 \le x_i \le 1 \qquad \text{or} \qquad -0.5 \le x_i \le 0.5$$

These are not exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
Two techniques to help with this are:

- feature scaling: dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just $1$;
- mean normalization: subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero.
To implement both of these techniques, adjust your input values as shown in this formula:

$$x_i := \frac{x_i - \mu_i}{s_i}$$

where:

- $\mu_i$ is the average of all the values for feature $i$;
- $s_i$ is the range of values ($\max - \min$), or alternatively the standard deviation.
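A minimal sketch of both techniques combined, assuming NumPy (names and data are mine):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize and range-scale each feature column: (x - mu) / s."""
    mu = X.mean(axis=0)                    # average value of each feature
    s = X.max(axis=0) - X.min(axis=0)      # range (max - min); std would also work
    return (X - mu) / s, mu, s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled, mu, s = scale_features(X)
print(X_scaled)   # every column now has mean 0 and range 1
```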
### Choosing a Correct Learning Rate
To debug gradient descent, we can make a plot with the number of iterations on the x-axis and the cost function $J(\theta)$ on the y-axis. If $J(\theta)$ ever increases as the iterations proceed, then you probably need to decrease $\alpha$.
The learning rate affects the convergence of the algorithm:

- if $\alpha$ is too small: slow convergence;
- if $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
### Improving Features and the Hypothesis
We can improve our features and the form of our hypothesis function in a couple different ways.
#### Features
We can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.
#### Hypothesis Function
Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a:

- quadratic function;
- cubic function;
- square root function.
One important thing to keep in mind is: if you choose your features this way, then feature scaling becomes very important.
### Normal Equation for Multiple Variables
Gradient descent gives one way of minimizing $J$. The *normal equation* method is another way of doing so: this time we minimize $J$ explicitly, by taking its derivatives with respect to the $\theta_j$'s and setting them to zero, without resorting to an iterative algorithm. The normal equation formula is:

$$\theta = (X^T X)^{-1} X^T y$$
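A quick check of this formula in NumPy (toy data mine); note the use of the pseudoinverse `pinv`, which also copes with a noninvertible $X^TX$ (see the Noninvertibility section below):

```python
import numpy as np

# Toy data: y = 1 + 2x, with x0 = 1 already prepended.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

# theta = pinv(X^T X) X^T y: no learning rate and no iterations needed.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # approximately [1.0, 2.0]
```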
Following is a comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| Time complexity $O(kn^2)$ | Time complexity $O(n^3)$, to compute $(X^TX)^{-1}$ |
| Works well when $n$ is large | Slow if $n$ is very large |
With the normal equation, computing the inversion has complexity $O(n^3)$, so if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
### Noninvertibility
The normal equation should use the pseudoinverse of $X^TX$ (e.g. `pinv` rather than `inv`), because $X^TX$ may be *noninvertible*. The common causes of noninvertibility are:

- redundant features, where two features are very closely related (i.e. they are linearly dependent);
- too many features (e.g. $m \le n$); in this case, delete some features or use "regularization".
## Logistic Regression
Logistic regression is a method for classifying data into discrete outcomes. For example, we might use logistic regression to classify an email as spam or not spam. In this module, we introduce the notion of classification, the cost function for logistic regression, and the application of logistic regression to multi-class classification.
The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. To attempt classification, one method is to use linear regression and map all predictions greater than $0.5$ to $1$ and all less than $0.5$ to $0$. However, this method doesn't work well because classification is not actually a linear function.
### Hypothesis Representation
We could approach the classification problem ignoring the fact that $y$ is a discrete value, and use our old linear function to predict $y$ given $x$. However, it doesn't make sense for $h_\theta(x)$ to take values larger than $1$ or smaller than $0$ when we know that $y \in \{0, 1\}$, so we instead pass $\theta^T x$ through the *Logistic Function*.

The Logistic Function (or *Sigmoid Function*) is defined as:

$$h_\theta(x) = g(\theta^T x), \qquad z = \theta^T x, \qquad g(z) = \frac{1}{1 + e^{-z}}$$
The function $g(z)$ maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification; $h_\theta(x)$ gives us the probability that our output is $1$.
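A minimal sketch of the sigmoid in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)): maps any real into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5, the decision threshold
print(sigmoid(np.array([-10.0, 10.0])))      # close to 0 and close to 1
```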
### Decision Boundary
In order to get our discrete $0$ or $1$ classification, we can translate the output of the hypothesis function as follows:

$$h_\theta(x) \ge 0.5 \rightarrow y = 1, \qquad h_\theta(x) < 0.5 \rightarrow y = 0$$

The way our logistic function $g$ behaves is that its output is greater than or equal to $0.5$ whenever its input is greater than or equal to zero:

$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$

So if the input to $g$ is $\theta^T x$, then that means:

$$h_\theta(x) = g(\theta^T x) \ge 0.5 \quad \text{when} \quad \theta^T x \ge 0$$

From these statements we can now say:

$$\theta^T x \ge 0 \Rightarrow y = 1, \qquad \theta^T x < 0 \Rightarrow y = 0$$

The *decision boundary* is the line that separates the area where $y = 0$ and where $y = 1$; it is created by our hypothesis function.
### Cost Function
We cannot use the same cost function that we use in linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it isn't a convex function.
Instead, our cost function for logistic regression looks like:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$$

$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1$$

$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0$$
When $y = 1$, $\mathrm{Cost} = -\log(h_\theta(x))$: the cost is $0$ if $h_\theta(x) = 1$, and it grows without bound as $h_\theta(x) \to 0$, heavily penalizing a confident wrong prediction.
Similarly, when $y = 0$, $\mathrm{Cost} = -\log(1 - h_\theta(x))$: the cost is $0$ if $h_\theta(x) = 0$, and it grows without bound as $h_\theta(x) \to 1$.
### Simplified Cost Function
We can compress our cost function's two conditional cases into one case:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$
Then the entire cost function is as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right]$$
A vectorized implementation is:

$$h = g(X\theta), \qquad J(\theta) = \frac{1}{m}\left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$$
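A sketch of this vectorized cost in NumPy (the function name and data are mine):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """Vectorized J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h))."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))     # h = g(X theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

X = np.array([[1.0, 0.5], [1.0, -0.5]])
y = np.array([1.0, 0.0])
print(logistic_cost(np.zeros(2), X, y))        # log(2) ~ 0.693 when h = 0.5 everywhere
```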
### Gradient Descent
The general form of gradient descent is:

repeat:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
We can work out the derivative part using calculus to get:

repeat:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Notice that this is identical in form to the rule we used in linear regression; only the definition of $h_\theta(x)$ has changed.
A vectorized implementation is:

$$\theta := \theta - \frac{\alpha}{m} X^T \big( g(X\theta) - \vec{y} \big)$$
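And a sketch of the corresponding vectorized update step (again, names and data are mine):

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One vectorized update: theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return theta - (alpha / m) * (X.T @ (h - y))

theta = np.zeros(2)
X = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 1.5], [1.0, -1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)   # theta[1] > 0: larger x_1 pushes the prediction toward y = 1
```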
### Advanced Optimization
Instead of gradient descent, there are more sophisticated, faster ways to optimize $\theta$:

- "Conjugate gradient",
- "BFGS",
- "L-BFGS".

We don't need to write these algorithms ourselves; optimization libraries provide them, and we only need to supply a function that computes $J(\theta)$ and its partial derivatives.
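For example, SciPy exposes L-BFGS through `scipy.optimize.minimize`; a minimal sketch, reusing the logistic cost and gradient from above (data is mine):

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Logistic-regression cost J(theta) and its gradient, both vectorized."""
    m = len(y)
    h = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), 1e-10, 1 - 1e-10)
    cost = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

X = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 1.5], [1.0, -1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# jac=True tells SciPy the function returns (cost, gradient); no learning
# rate is needed, since the optimizer chooses its own step sizes.
result = minimize(cost_and_grad, np.zeros(2), args=(X, y),
                  jac=True, method='L-BFGS-B')
print(result.x)
```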
### Multiclass Classification: One-vs-all
Now we will approach the classification of data when we have more than two categories. Instead of $y \in \{0, 1\}$, we will expand our definition so that $y \in \{0, 1, \dots, n\}$.

Since $y \in \{0, 1, \dots, n\}$, we divide our problem into $n + 1$ binary classification problems; in each one, we predict the probability that $y$ is a member of one of our classes:

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \qquad i \in \{0, 1, \dots, n\}$$
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
To summarize:

- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$;
- To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
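A minimal sketch of the prediction step, assuming the per-class parameter vectors have already been trained (all names and numbers here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Given one trained theta vector per class (rows of all_theta),
    pick the class whose classifier returns the highest probability."""
    probabilities = sigmoid(all_theta @ x)
    return int(np.argmax(probabilities))

# Hypothetical pre-trained parameters for 3 classes (2 features plus bias).
all_theta = np.array([[ 1.0,  2.0, -1.0],
                      [-1.0, -2.0,  3.0],
                      [ 0.5, -1.0, -1.0]])
x = np.array([1.0, 0.2, 0.9])        # x0 = 1, then the two features
print(predict_one_vs_all(all_theta, x))   # class 1 wins on this input
```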
## Neural Networks
### Model Representation
In this chapter, we will represent a hypothesis function using neural networks. At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features $x_1, \dots, x_n$, and the output is the result of our hypothesis function. The $x_0$ input node is sometimes called the "bias unit" and is always equal to $1$.
Visually, a simplistic representation looks like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \to [\ ] \to h_\theta(x)$$

Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function; this final layer is known as the "output layer".
We can have intermediate layers of nodes between the input layer and output layer, called "hidden layers".
In this example, we label these intermediate or "hidden" layer nodes $a_0^{(2)}, \dots, a_n^{(2)}$ and call them "activation units":

$$a_i^{(j)} = \text{"activation" of unit } i \text{ in layer } j$$

$$\Theta^{(j)} = \text{matrix of weights controlling function mapping from layer } j \text{ to layer } j + 1$$
If we had one hidden layer, it would look like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \to \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix} \to h_\theta(x)$$
The values for each of the "activation" nodes are obtained as follows:

$$a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big)$$
$$a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big)$$
$$a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big)$$
$$h_\Theta(x) = a_1^{(3)} = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big)$$

This is saying that we compute our activation nodes by using a $3 \times 4$ matrix of parameters: we apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix $\Theta^{(2)}$ containing the weights for our second layer of nodes.
If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
The $+1$ comes from the addition of the "bias nodes", $x_0$ and $\Theta_0^{(j)}$: the output of a layer does not include a bias node, but its input does.
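A minimal sketch of forward propagation for an arbitrary list of weight matrices (names are mine; the weights are random just to show the dimensions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward propagation: for each layer, prepend the bias unit a0 = 1,
    then apply a = g(Theta @ a)."""
    a = x
    for Theta in thetas:
        a = np.concatenate(([1.0], a))   # add bias unit
        a = sigmoid(Theta @ a)
    return a                             # activations of the output layer

# Layer sizes 3 -> 3 -> 1, so Theta1 is 3x4 and Theta2 is 1x4 (s_{j+1} x (s_j + 1)).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
x = np.array([0.5, -1.0, 2.0])
print(forward(x, [Theta1, Theta2]))      # h_Theta(x), a single value in (0, 1)
```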
### Application
#### Implementing a Logical Operator
A simple example of applying neural networks is predicting $x_1 \text{ AND } x_2$, the logical AND operator, which is true only if both $x_1$ and $x_2$ are $1$.
The graph of our functions will look like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \to \big[ g(z^{(2)}) \big] \to h_\Theta(x)$$

Remember that $x_0$ is our bias variable and is always $1$.

Let's set our first theta matrix as:

$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$$
This will cause the output of our hypothesis to be positive only if both $x_1$ and $x_2$ are $1$. In other words:

$$h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)$$
| $x_1$ | $x_2$ | $g(z)$ | $h_\Theta(x)$ |
| --- | --- | --- | --- |
| 0 | 0 | $g(-30)$ | $\approx 0$ |
| 0 | 1 | $g(-10)$ | $\approx 0$ |
| 1 | 0 | $g(-10)$ | $\approx 0$ |
| 1 | 1 | $g(10)$ | $\approx 1$ |
So we have constructed one of the fundamental operations in computers by using a small neural network rather than an actual AND gate. Neural networks can also be used to simulate all the other logical gates.
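A quick numeric check of this AND network (a sketch, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-30.0, 20.0, 20.0])    # weights for [x0, x1, x2]

for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(theta @ np.array([1.0, x1, x2]))
        print(x1, x2, round(h))          # rounds to x1 AND x2
```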
#### Implementing a Complex Logical Operator
The $\Theta^{(1)}$ matrices for AND, NOR and OR are:

- AND: $\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$
- NOR: $\Theta^{(1)} = \begin{bmatrix} 10 & -20 & -20 \end{bmatrix}$
- OR: $\Theta^{(1)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$
We can combine these to get the XNOR logical operator (which gives $1$ if $x_1$ and $x_2$ are both $0$ or both $1$).
For the transition between the first and second layer, we'll use a $\Theta^{(1)}$ matrix that combines the values for AND and NOR:

$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}$$

For the transition between the second and third layer, we'll use a $\Theta^{(2)}$ matrix that uses the value for OR:

$$\Theta^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$$
Let's write out the values for all our nodes:

$$a^{(2)} = g\big(\Theta^{(1)} \cdot x\big), \qquad a^{(3)} = g\big(\Theta^{(2)} \cdot a^{(2)}\big), \qquad h_\Theta(x) = a^{(3)}$$

The neural network looks like this:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \to \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} \to \big[ a^{(3)} \big] \to h_\Theta(x)$$
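A numeric check of the full XNOR network (a sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30.0,  20.0,  20.0],    # row 1: AND
                   [ 10.0, -20.0, -20.0]])   # row 2: NOR
Theta2 = np.array([[-10.0,  20.0,  20.0]])   # OR over [a1, a2]

for x1 in (0, 1):
    for x2 in (0, 1):
        a2 = sigmoid(Theta1 @ np.array([1.0, x1, x2]))      # hidden layer
        h = sigmoid(Theta2 @ np.concatenate(([1.0], a2)))   # output layer
        print(x1, x2, round(h.item()))                      # 1 iff x1 == x2
```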
### Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. We still use the one-vs-all method.
For example, if we want to classify our data into one of four classes, we can define our set of resulting values as:

$$y^{(i)} \in \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right\}$$

Each $y^{(i)}$ is a vector with a $1$ in the position of the class it represents and $0$ elsewhere. Our resulting hypothesis for one set of inputs may look like:

$$h_\Theta(x) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

in which case our prediction is the third class.
### Cost Function
Let's define a few variables that we will need to use:

- $L$: total number of layers in the network;
- $s_l$: number of units (not counting the bias unit) in layer $l$;
- $K$: number of output units/classes.
In neural networks, we may have many output nodes; we denote $h_\Theta(x)_k$ as the hypothesis that results in the $k^{th}$ output.
Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
For neural networks, it is going to be slightly more complicated:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$$
We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
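A sketch of this cost function in NumPy, assuming a one-hot matrix `Y` of labels (all names and data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta). X is m x n, Y is m x K
    (one-hot rows), thetas is a list of matrices of shape s_{l+1} x (s_l + 1)."""
    m = X.shape[0]
    A = X
    for Theta in thetas:                        # forward propagation, all examples at once
        A = np.hstack([np.ones((m, 1)), A])     # bias column
        A = sigmoid(A @ Theta.T)
    H = np.clip(A, 1e-10, 1 - 1e-10)            # output activations, m x K
    data = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)  # skip bias weights
    return data + reg

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 3)), rng.normal(size=(2, 4))]  # 2 -> 3 -> 2 network
X = rng.normal(size=(5, 2))
Y = np.eye(2)[rng.integers(0, 2, size=5)]                    # one-hot labels
print(nn_cost(thetas, X, Y, lam=1.0))
```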
### Backpropagation Algorithm
"BackPropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:
That is, we want to minimize our cost function $J$ using an optimal set of parameters in $\Theta$, which requires computing the partial derivatives $\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$.
To do so:

1. Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, set $\Delta_{i,j}^{(l)} := 0$ for all $(l, i, j)$ (hence we end up with a matrix full of zeros for each layer).
2. For each training example $t = 1$ to $m$:
    1. Set $a^{(1)} := x^{(t)}$;
    2. Perform *forward propagation* to compute $a^{(l)}$ for $l = 2, 3, \dots, L$;
    3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$, where $L$ is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences between our actual results in the last layer and the correct outputs in $y$; to get the delta values of the layers before the last layer, we use an equation that steps us back from right to left;
    4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \big( (\Theta^{(l)})^T \delta^{(l+1)} \big) \;.\!\ast\; a^{(l)} \;.\!\ast\; \big(1 - a^{(l)}\big)$;
    5. Set $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \big(a^{(l)}\big)^T$.
The equation in step 2.4 works as follows: the delta values of layer $l$ are calculated by multiplying the delta values in the next layer by the theta matrix of layer $l$. We then element-wise multiply that with a function called $g'$, or g-prime, which is the derivative of the activation function $g$ evaluated with the input values given by $z^{(l)}$:

$$g'\big(z^{(l)}\big) = a^{(l)} \;.\!\ast\; \big(1 - a^{(l)}\big)$$
Hence we update our new $\Delta$ matrices:

$$D_{i,j}^{(l)} := \frac{1}{m} \left( \Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)} \right) \quad \text{if } j \ne 0$$

$$D_{i,j}^{(l)} := \frac{1}{m} \Delta_{i,j}^{(l)} \quad \text{if } j = 0$$

The capital-delta matrix $D$ is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivatives:

$$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = D_{i,j}^{(l)}$$
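A minimal sketch of the whole algorithm in NumPy, following the steps above (unregularized, i.e. the $j = 0$ case applied everywhere; all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, X, Y):
    """Backpropagation gradients (lambda = 0) for a network given as a list
    of weight matrices. Returns one gradient matrix D^(l) per Theta^(l)."""
    m = X.shape[0]
    Delta = [np.zeros_like(T) for T in thetas]       # the Delta accumulators
    for t in range(m):
        # Forward propagation, storing each layer's activations with its bias unit.
        a = np.concatenate(([1.0], X[t]))
        activations = [a]
        for Theta in thetas:
            a = np.concatenate(([1.0], sigmoid(Theta @ a)))
            activations.append(a)
        # delta^(L) = a^(L) - y^(t)  (strip the bias entry of the output layer).
        delta = activations[-1][1:] - Y[t]
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            Delta[l] += np.outer(delta, a_l)         # Delta^(l) += delta^(l+1) (a^(l))^T
            if l > 0:
                # delta^(l) = ((Theta^(l))^T delta^(l+1)) .* a^(l) .* (1 - a^(l)),
                # then drop the bias component.
                delta = ((thetas[l].T @ delta) * a_l * (1 - a_l))[1:]
    return [D / m for D in Delta]                    # D^(l) for lambda = 0

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 3)), rng.normal(size=(2, 4))]  # 2 -> 3 -> 2 network
X = rng.normal(size=(4, 2))
Y = np.eye(2)[rng.integers(0, 2, size=4)]                    # one-hot labels
print([D.shape for D in backprop(thetas, X, Y)])             # [(3, 3), (2, 4)]
```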