# Machine Learning by Stanford University

## Week 1

### Introduction

#### What is Machine Learning?

- Definitions of machine learning given by computer scientists:
    - Arthur Samuel (1959): Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
    - Tom Mitchell (1998): Well-posed learning problem: A computer program is said to *learn* from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$.
- Types of machine learning algorithms:
    - **Supervised learning**: teach the computer how to do something
    - **Unsupervised learning**: let the computer learn by itself
    - Others:
        - Reinforcement learning
        - Recommender systems

#### Supervised Learning

- **Definition**: Give the computer a data set in which the right answers are given. The computer is then responsible for producing *more* right answers from what it was given.
- Types of supervised learning problems:
    - **Regression problem**: predict a continuous (real-valued) output, e.g. house prices.
    - **Classification problem**: predict a discrete-valued output, e.g. whether a breast tumor is malignant or benign, using tumor size as the attribute or feature.

#### Unsupervised Learning

- **Definition**: The data all have the same label or no labels at all. The computer is left to find structure in the data.
- Approaches: **clustering algorithms** and **non-clustering algorithms**

### Model and Cost Function

#### Model Representation

- This training set will be used in the following section:

| Size in feet^2 (x) 	| Price ($) in 1000's (y) 	|
|:------------------:	|:-----------------------:	|
|        2104        	|           460           	|
|        1416        	|           232           	|
|        1534        	|           315           	|
|         852        	|           178           	|
|         ...        	|           ...           	|

- The model is described with the following notation:
    - $m$ = number of training examples
    - $x$'s = input variables/features
    - $y$'s = output variable/"target" variable
    - $(x, y)$ = one training example
    - $(x^{(i)}, y^{(i)}); i=1,...,m$ = the training example in row $i$ of the table, where $i$ is an index into the training set (not an exponent)
    - $X$ = space of input values, for example $X = \mathbb{R}$
    - $Y$ = space of output values, for example $Y = \mathbb{R}$
- Supervised learning (for the house pricing problem) consists of:
    - A training set or data set $(x^{(i)}, y^{(i)}); i=1,...,m$
    - A learning algorithm, which outputs $h$, the *hypothesis function*
    - The *hypothesis function* $h$ takes an input $x$ and tries to output the estimated value of the corresponding $y$, i.e. $h: X \rightarrow Y$
- There are many ways to represent $h$, depending on the learning algorithm. For the house pricing problem, which is a supervised regression problem, the hypothesis can be written as

$$h_\theta(x) = \theta_0 + \theta_1x$$

which is called *linear regression model with one variable* or *univariate linear regression*.
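As a minimal sketch, the univariate hypothesis is a one-line function. The parameter values below are hypothetical and chosen only for illustration; they are not fitted to the training set.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters (not fitted values), used only to show the call.
theta0, theta1 = 50.0, 0.1

# Predict the price (in $1000's) of a 2104 square-foot house.
predicted_price = hypothesis(theta0, theta1, 2104)
print(predicted_price)
```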

#### Cost Function

The cost function measures how well a hypothesis fits the training data: the smaller its value, the more *accurate* the hypothesis.

Consider the training set of the house pricing problem below, where $m = 47$:

| Size in feet^2 (x) 	| Price ($) in 1000's (y) 	|
|:------------------:	|:-----------------------:	|
|        2104        	|           460           	|
|        1416        	|           232           	|
|        1534        	|           315           	|
|         852        	|           178           	|
|         ...        	|           ...           	|

The hypothesis of this linear regression problem can be written as:

$$h_\theta(x) = \theta_0 + \theta_1x$$

For the house pricing linear regression problem, we need to choose $\theta_0$ and $\theta_1$ so that the hypothesis $h_\theta(x_i)$ (predicted value) is close to $y_i$ (actual value) for every training example, i.e. $h_\theta(x_i) - y_i$ must be small. The **mean squared error (MSE)**, also called the **mean squared deviation (MSD)**, measures the average of the squares of these errors or deviations. The cost function for this problem is the MSE halved (the extra factor $\frac{1}{2}$ is a convenience that cancels when the square is differentiated):

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$
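The cost function can be sketched in a few lines of Python. This example uses only the four rows visible in the table, so $m = 4$ here rather than 47, and the parameter values passed in are hypothetical guesses rather than fitted values.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h_theta(x_i) - y_i
        total += error ** 2
    return total / (2 * m)


# The four visible rows of the training set (size in feet^2, price in $1000's).
xs = [2104, 1416, 1534, 852]
ys = [460, 232, 315, 178]

# Hypothetical parameter guess: no intercept, 0.2 ($1000's) per square foot.
print(compute_cost(0.0, 0.2, xs, ys))
```

Evaluating the function for several parameter guesses and keeping the smallest result is the manual version of what the learning algorithm automates.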

#### Cost Function Intuition I

To find the best hypothesis, i.e. the straight line that best predicts the output for the house pricing problem, we look for the parameters that minimize the cost function. The cost of the best-fit hypothesis should be as close to zero as possible, ideally exactly zero.

#### Cost Function Intuition II

This section introduces contour plots, which are a convenient way to visualize the cost function of a more complex hypothesis.

![Example of hypothesis with contour plots to find the best hypothesis based on result of cost function](images/1.png)


### Parameter Learning

#### Gradient Descent

![Gradient descent algorithm](images/2.png)

Gradient descent is an algorithm that can be used to minimize the cost function $J$, as well as other functions. The basic idea of the gradient descent algorithm is:

- Start with some $\theta_0$, $\theta_1$
- Keep changing $\theta_0$, $\theta_1$ to reduce $J(\theta_0, \theta_1)$ until we reach a minimum

The gradient descent algorithm is:

$$\text{repeat until convergence } \lbrace \; \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(for } j=0 \text{ and } j=1\text{)} \; \rbrace$$

Unpacking the algorithm:

- $:=$ is the assignment operator
- $=$ is a truth assertion
- $\alpha$ is the learning rate, i.e. *the size of the step we take downhill*
- $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is the derivative term
- $\text{for } j=0 \text{ and } j=1$ means the update is applied to both parameters
- The assignments to $\theta_0$ and $\theta_1$ must happen simultaneously: compute both right-hand sides first, then update both parameters

![Gradient descent: correct way](images/3.png)
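The difference between a simultaneous and a sequential update can be sketched as follows. The cost function $J = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$ and the function names are assumptions made up for this illustration; they are not taken from the course material.

```python
# Partial derivatives of the hypothetical cost J = t0^2 + t0*t1 + t1^2.
def dJ_dtheta0(t0, t1):
    return 2 * t0 + t1


def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1


def simultaneous_step(t0, t1, alpha):
    """Correct: both derivatives are evaluated at the old (t0, t1)."""
    temp0 = t0 - alpha * dJ_dtheta0(t0, t1)
    temp1 = t1 - alpha * dJ_dtheta1(t0, t1)
    return temp0, temp1


def sequential_step(t0, t1, alpha):
    """Incorrect: the second derivative already sees the updated t0."""
    t0 = t0 - alpha * dJ_dtheta0(t0, t1)
    t1 = t1 - alpha * dJ_dtheta1(t0, t1)
    return t0, t1


print(simultaneous_step(1.0, 2.0, 0.1))
print(sequential_step(1.0, 2.0, 0.1))
```

The two functions return different values for $\theta_1$, because the sequential version mixes old and new parameter values within a single step.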

#### Gradient Descent Intuition

The derivative term in gradient descent, $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, gives the slope of $J$ at the current point. Its sign determines the direction of each update, so the parameters keep moving downhill until they reach a minimum, where the slope is zero.

![The derivative term explanation](images/4.png)

The learning rate $\alpha$ determines the size of each step. If $\alpha$ is too small, gradient descent can be slow; if $\alpha$ is too large, gradient descent can overshoot the minimum and may fail to converge, or even diverge.

As we approach a local minimum, gradient descent automatically takes smaller steps because the derivative term shrinks. There is therefore no need to decrease the learning rate $\alpha$ over time.
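The effect of the learning rate can be seen on a toy problem. The sketch below assumes the one-parameter cost $J(\theta) = \theta^2$, whose derivative is $2\theta$; this cost and the three values of $\alpha$ are chosen only for illustration.

```python
def run(alpha, theta=1.0, steps=10):
    """Run gradient descent on J(theta) = theta^2 and record every iterate."""
    history = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # derivative of theta^2 is 2*theta
        history.append(theta)
    return history


slow = run(0.01)       # too small: barely moves in 10 steps
good = run(0.4)        # reasonable: approaches the minimum at 0 quickly
diverging = run(1.1)   # too large: overshoots further on every step

# With a fixed alpha, the step size shrinks as the slope shrinks.
step_sizes = [abs(b - a) for a, b in zip(good, good[1:])]

print(slow[-1], good[-1], diverging[-1])
print(step_sizes[:3])
```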

#### Gradient Descent for Linear Regression

- The cost function of linear regression is convex (bowl-shaped), so it has no local optima other than the single global optimum
- *Batch* gradient descent: each step of gradient descent uses all the training examples
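A minimal sketch of batch gradient descent for univariate linear regression follows. The data set is made up for this example (it lies exactly on the line $y = 1 + 2x$), and the learning rate and iteration count are assumptions that happen to work for this data, not recommended defaults.

```python
def batch_gradient_descent(xs, ys, alpha, iterations):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # "Batch": every step uses the errors of all m training examples.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000)
print(theta0, theta1)
```

Because the cost is convex, the starting point does not matter here: any initial parameters lead to the same minimum, provided the learning rate is small enough.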

### Linear Algebra Review

#### Matrices and Vectors

- Matrix: rectangular array of numbers
- Dimension of matrix: number of rows x number of columns
- $A_{ij}$ refers to the element in the $i^{th}$ row and $j^{th}$ column of matrix $A$.
- A vector with 'n' rows is referred to as an 'n'-dimensional vector. It has only one column.
- $v_i$ refers to the element in the $i^{th}$ row of the vector.
- In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
- Matrices are usually denoted by uppercase names while vectors are lowercase.
- "Scalar" means that an object is a single value, not a vector or matrix.
- $\mathbb{R}$ refers to the set of scalar real numbers.
- $\mathbb{R}^n$ refers to the set of n-dimensional vectors of real numbers.

#### Addition and Scalar Multiplication

- Addition and subtraction of matrices are done element-wise: each $A_{ij}$ is added to (or subtracted from) the corresponding $B_{ij}$. $A$ and $B$ must have the same dimensions.
- Scalar multiplication and division are done by multiplying (or dividing) each element of $A$ by the scalar, one at a time.
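Both operations can be sketched with nested lists in plain Python. The helper names `mat_add` and `scalar_mul` are made up for this example.

```python
def mat_add(A, B):
    """Element-wise sum of two matrices with identical dimensions."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("A and B must have the same dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scalar_mul(c, A):
    """Multiply every element of A by the scalar c."""
    return [[c * a for a in row] for row in A]


A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]

print(mat_add(A, B))
print(scalar_mul(3, A))
```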

#### Matrix-Vector Multiplication

We map the column of the vector onto each row of the matrix, multiplying each element and summing the result.

$$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} x \newline y \newline \end{bmatrix} =\begin{bmatrix} a*x + b*y \newline c*x + d*y \newline e*x + f*y\end{bmatrix}$$

The result is a vector. The number of columns of the matrix must equal the number of rows of the vector. An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.
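As a small sketch, the rule above translates directly into Python, with the matrix stored as a list of rows. The function name `mat_vec` and the sample numbers are assumptions for illustration.

```python
def mat_vec(A, v):
    """Multiply an m x n matrix (list of rows) by an n-dimensional vector."""
    if any(len(row) != len(v) for row in A):
        raise ValueError("columns of A must equal rows of v")
    # Each output element is one row of A multiplied element-wise with v, summed.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2 matrix
v = [2, 1]                    # 2 x 1 vector

result = mat_vec(A, v)        # 3 x 1 vector
print(result)
```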

#### Matrix-Matrix Multiplication

We multiply two matrices by breaking the second matrix into its column vectors, multiplying the first matrix by each of them, and concatenating the results.

$$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a*w + b*y & a*x + b*z \newline c*w + d*y & c*x + d*z \newline e*w + f*y & e*x + f*z\end{bmatrix}$$

An m x n matrix multiplied by an n x o matrix results in an m x o matrix. In the above example, a 3 x 2 matrix times a 2 x 2 matrix resulted in a 3 x 2 matrix.

To multiply two matrices, the number of columns of the first matrix must equal the number of rows of the second matrix.
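The same procedure can be sketched in plain Python. The function name `mat_mul` and the sample matrices are made up for this example.

```python
def mat_mul(A, B):
    """Multiply an m x n matrix by an n x o matrix, giving an m x o matrix."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("columns of A must equal rows of B")
    columns = list(zip(*B))  # split B into its column vectors
    # Entry (i, j) is row i of A multiplied element-wise with column j of B, summed.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
B = [[1, 0], [2, 1]]          # 2 x 2

product = mat_mul(A, B)       # 3 x 2
print(product)
```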

#### Matrix Multiplication Properties

- Matrix multiplication is not commutative: $A*B \neq B*A$
- Matrix multiplication is associative: $(A*B)*C = A*(B*C)$

The **identity matrix**, when multiplied by any matrix of the same dimensions, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's elsewhere.

$$\begin{bmatrix} 1 & 0 & 0 \newline 0 & 1 & 0 \newline 0 & 0 & 1 \newline \end{bmatrix}$$

When multiplying the identity matrix after some matrix ($A*I$), the square identity matrix's dimension should match the other matrix's columns. When multiplying the identity matrix before some other matrix ($I*A$), the square identity matrix's dimension should match the other matrix's rows.
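These properties can be checked numerically. The sketch below uses small integer matrices chosen for illustration; `mat_mul` and `identity` are hypothetical helper names defined here so the example is self-contained.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def identity(n):
    """n x n identity matrix: 1's on the diagonal, 0's elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

# Not commutative: A*B and B*A differ for this pair.
print(mat_mul(A, B), mat_mul(B, A))

# Associative: the grouping does not change the result.
print(mat_mul(mat_mul(A, B), C) == mat_mul(A, mat_mul(B, C)))

# Identity: for a 3 x 2 matrix M, use a 2 x 2 identity after and a 3 x 3 before.
M = [[1, 2], [3, 4], [5, 6]]
print(mat_mul(M, identity(2)) == M, mat_mul(identity(3), M) == M)
```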

#### Inverse and Transpose

The **inverse** of a matrix A is denoted $A^{-1}$. Multiplying by the inverse results in the identity matrix.

A non-square matrix does not have an inverse. In Octave and MATLAB, `inv(A)` computes the inverse and `pinv(A)` computes the pseudo-inverse, which also returns a result when $A$ is not invertible. Matrices that don't have an inverse are *singular* or *degenerate*.
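For intuition, the 2 x 2 case can be worked by hand. The sketch below uses the standard closed-form inverse of a 2 x 2 matrix, which these notes do not cover, and treats a zero determinant as the singular case; the function name `inverse_2x2` is made up for this example.

```python
def inverse_2x2(A):
    """Inverse of [[a, b], [c, d]] via the closed form (1/det) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (it has no inverse)")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[4, 7], [2, 6]]
A_inv = inverse_2x2(A)
print(A_inv)

# Multiplying A by its inverse should give (approximately) the identity matrix.
check = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(check)
```

Calling `inverse_2x2([[1, 2], [2, 4]])` raises the error, because the second row is a multiple of the first and the determinant is zero.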

The **transpose** of a matrix is like rotating the matrix 90° clockwise and then reversing it: the rows of $A$ become the columns of $A^T$. In MATLAB, the transpose can be computed with the `transpose(A)` function or `A'`:

$$A = \begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix}; A^T = \begin{bmatrix} a & c & e \newline b & d & f \newline \end{bmatrix}$$

In other words:

$$A_{ij} = A^T_{ji}$$
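In plain Python, a transpose can be sketched with `zip`, which regroups the rows of a matrix into its columns. The function name `transpose` below is a local helper for this example.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*A)]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
At = transpose(A)             # 2 x 3
print(At)

# Check the defining property A_ij = (A^T)_ji for every element.
print(all(A[i][j] == At[j][i] for i in range(3) for j in range(2)))
```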

## Week 2

### Multivariate Linear Regression

#### Multiple Features

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

$$\begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \text{the number of features} \end{align*}$$

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$$

In order to develop intuition about this function, we can think about $\theta_0$ as the basic price of a house, $\theta_1$ as the price per square meter, $\theta_2$ as the price per floor, etc. $x_1$ will be the number of square meters in the house, $x_2$ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}$$

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: for convenience, in this course we assume $x_0^{(i)} = 1$ for $i \in 1, \dots, m$. This allows us to do matrix operations with $\theta$ and $x$, because the two vectors $\theta$ and $x^{(i)}$ then match each other element-wise (that is, they have the same number of elements: $n+1$).
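A minimal sketch of the vectorized hypothesis $\theta^T x$ for one training example, with $x_0 = 1$ prepended automatically. The parameter values and the two features (size and number of floors) are hypothetical and used only to show the shapes involved.

```python
def hypothesis(theta, features):
    """Compute theta^T x for one example, prepending x_0 = 1 to the features."""
    x = [1.0] + list(features)
    if len(x) != len(theta):
        raise ValueError("theta must have n + 1 elements for n features")
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameters: base price, price per unit of size, price per floor.
theta = [80.0, 0.1, 20.0]

# One example with n = 2 features: size 2104 and 2 floors.
print(hypothesis(theta, [2104, 2]))
```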

#### Gradient Descent for Multiple Variables

The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:

$$\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}$$

In other words:

$$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0...n}\newline \rbrace\end{align*}$$

The following image compares gradient descent with one variable to gradient descent with multiple variables:

![Gradient Descent for Multiple Variables](images/5.png)
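The update rule for $j = 0 \dots n$ can be sketched as a loop over all parameters. The data set below is made up (it was generated from $y = 1 + 2x_1 + 3x_2$), and the learning rate and iteration count are assumptions that suit this small example.

```python
def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression with n features."""
    m = len(X)
    rows = [[1.0] + list(row) for row in X]  # prepend x_0 = 1 to every example
    theta = [0.0] * len(rows[0])
    for _ in range(iterations):
        # Error h_theta(x^(i)) - y^(i) for every training example.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(rows, y)]
        # One partial derivative per parameter theta_j.
        gradients = [sum(e * row[j] for e, row in zip(errors, rows)) / m
                     for j in range(len(theta))]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, gradients)]
    return theta


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2.
X = [[1, 1], [2, 1], [1, 2], [2, 3], [3, 2]]
y = [6, 8, 9, 14, 13]

theta = gradient_descent(X, y, alpha=0.1, iterations=10000)
print(theta)
```

With $n = 1$ this reduces to the univariate update, since $x_0^{(i)} = 1$ makes the $\theta_0$ rule a special case of the general one.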

#### Gradient Descent in Practice I - Feature Scaling

#### Gradient Descent in Practice II - Learning Rate

#### Features and Polynomial Regression

### Computing Parameters Analytically

#### Normal Equation

#### Normal Equation Noninvertibility