# Machine Learning by Stanford University

## Week 1

### Introduction

#### What is Machine Learning?

- Definitions of machine learning given by computer scientists:
    - Arthur Samuel (1959): Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
    - Tom Mitchell (1998): Well-posed learning problem: A computer program is said to *learn* from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$.
- Types of machine learning algorithms:
    - **Supervised learning**: teach the computer how to do something
    - **Unsupervised learning**: let the computer learn by itself
    - Others:
        - Reinforcement learning
        - Recommender systems

#### Supervised Learning

- **Definition**: Give the computer a data set in which the right answers are given. The computer is then responsible for producing *more* right answers based on what it was given.
- Types of supervised learning problems:
    - **Regression problem**: predict a continuous (real-valued) output, e.g. house prices.
    - **Classification problem**: predict a discrete-valued output, e.g. whether a breast tumor is malignant or benign, using tumor size as the attribute (feature).

#### Unsupervised Learning

- **Definition**: The data have no labels, or all have the same label. Let the computer find structure in the data.
- Approaches: **clustering algorithms** and **non-clustering algorithms**

### Model and Cost Function

#### Model Representation

- This training set will be used in the following section:

| Size in feet^2 (x) 	| Price ($) in 1000's (y) 	|
|:------------------:	|:-----------------------:	|
|        2104        	|           460           	|
|        1416        	|           232           	|
|        1534        	|           315           	|
|         852        	|           178           	|
|         ...        	|           ...           	|

- The model is described with the following notation:
    - $m$ = number of training examples
    - $x$'s = input variables/features
    - $y$'s = output variable/"target" variable
    - $(x, y)$ = one training example
    - $(x^i, y^i); i=1,...,m$ = the $i$-th training example (the $i$-th row of the table); $i$ is an index into the training set, not an exponent
    - $X$ = space of input values, for example: $X = \mathbb{R}$
    - $Y$ = space of output values, for example: $Y = \mathbb{R}$
- Supervised learning (for the house pricing problem) consists of:
    - A training set (data set) $(x^i, y^i); i=1,...,m$
    - A learning algorithm, which outputs $h$, the *hypothesis function*
    - $h$ takes an input $x$ and tries to output an estimate of the corresponding $y$, i.e. $h: X \rightarrow Y$
- There are many ways to represent $h$, depending on the learning algorithm. For the house pricing problem, which is a supervised regression problem, the hypothesis can be written as

$$h_\theta(x) = \theta_0 + \theta_1x$$

which is called *linear regression model with one variable* or *univariate linear regression*.
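As a minimal sketch of the hypothesis above, the function below evaluates $h_\theta(x)$ for one house size. The function name `hypothesis` and the parameter values are hypothetical, chosen only for illustration; they are not fitted to the training set.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Illustrative (not fitted) parameters: a base price of 50 and 0.1 per square foot,
# both in thousands of dollars.
predicted_price = hypothesis(50.0, 0.1, 2104)
print(predicted_price)
```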

#### Cost Function

The cost function measures the *accuracy* of a hypothesis: the lower its value, the better the hypothesis fits the training data.

Consider the training set of the house pricing problem below, where $m = 47$:

| Size in feet^2 (x) 	| Price ($) in 1000's (y) 	|
|:------------------:	|:-----------------------:	|
|        2104        	|           460           	|
|        1416        	|           232           	|
|        1534        	|           315           	|
|         852        	|           178           	|
|         ...        	|           ...           	|

The hypothesis for this linear regression problem can be written as:

$$h_\theta(x) = \theta_0 + \theta_1x$$

For the house pricing linear regression problem, we need to choose $\theta_0$ and $\theta_1$ so that the hypothesis $h_\theta(x_i)$ (predicted value) is close to $y_i$ (actual value) for every training example, that is, so that $h_\theta(x_i) - y_i$ is small. The **mean squared error (MSE)**, also called **mean squared deviation (MSD)**, measures the average of the squares of these errors or deviations. The cost function for this problem is the MSE halved (the factor $\frac{1}{2}$ simplifies the derivative used later by gradient descent):

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$
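The formula can be sketched directly in Python. This is a minimal illustration, not a reference implementation: the function name `cost` and the parameter values tried at the end are assumptions, and only the four rows visible in the table are used instead of the full $m = 47$ training set.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h_theta(x_i)
        total += (prediction - y) ** 2
    return total / (2 * m)


# The four rows shown in the table (the full training set has m = 47 rows).
sizes = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]

# Hypothetical parameter values, used only to show that different choices
# of theta0 and theta1 give different costs.
print(cost(0.0, 0.2, sizes, prices))
print(cost(0.0, 0.0, sizes, prices))
```

A lower value of `cost` means the line defined by `theta0` and `theta1` passes closer to the training examples.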

#### Cost Function Intuition I

To find the best hypothesis, that is, the straight line that best predicts the output for the house pricing problem, we look for the parameters that minimize the cost function. The cost of the best-fit hypothesis should be as close to zero as possible, and ideally exactly zero.
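To make this concrete, the sketch below fixes $\theta_0 = 0$ and evaluates the cost for a few values of $\theta_1$. The toy data set (points lying exactly on $y = x$) and the name `cost_theta1` are assumptions made for illustration.

```python
def cost_theta1(theta1, xs, ys):
    """Cost J(theta1) for the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Toy data lying exactly on the line y = x (an assumption for illustration).
xs = [1, 2, 3]
ys = [1, 2, 3]

for theta1 in (0.0, 0.5, 1.0, 1.5):
    print(theta1, cost_theta1(theta1, xs, ys))
```

The cost is exactly zero only at $\theta_1 = 1$, where the line passes through every point, and it grows as $\theta_1$ moves away from 1 in either direction.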

#### Cost Function Intuition II

This section introduces contour plots, which are a convenient way to visualize the cost function $J(\theta_0, \theta_1)$ when the hypothesis has two parameters.

![Example of hypothesis with contour plots to find the best hypothesis based on result of cost function](images/1.png)


### Parameter Learning

#### Gradient Descent

![Gradient descent algorithm](images/2.png)

Gradient descent is an algorithm that can be used to minimize the cost function $J$, as well as other functions. The basic idea of the gradient descent algorithm is:

- Start with some $\theta_0$, $\theta_1$
- Keep changing $\theta_0$, $\theta_1$ to reduce $J(\theta_0, \theta_1)$ until we reach a minimum

The gradient descent algorithm is:

$$\text{repeat until convergence, for } j=0 \text{ and } j=1\text{: } \left\{ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \right\}$$

Unpacking the algorithm:

- $:=$ is the assignment operator
- $=$ is a truth assertion
- $\alpha$ is the learning rate, i.e. *how big a step we take downhill* on each update
- $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is the derivative term
- "for $j=0$ and $j=1$" means the update is applied to both parameters
- The assignments to $\theta_0$ and $\theta_1$ must happen simultaneously: compute both new values from the old parameters first, then assign them

![Gradient descent: correct way](images/3.png)
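The simultaneous update can be sketched as follows. The helper `gradient_step` and the example cost $J = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$ are hypothetical, chosen so that each partial derivative depends on both parameters and the order of assignment matters.

```python
def gradient_step(theta0, theta1, alpha, grad0, grad1):
    """One simultaneous gradient descent update of (theta0, theta1).

    grad0 and grad1 are functions returning the partial derivatives of J
    with respect to theta0 and theta1 at the given point.
    """
    # Evaluate both derivatives at the OLD parameter values first ...
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    # ... and only then assign, so theta1's update does not see the new theta0.
    return temp0, temp1


# Partial derivatives of the hypothetical cost J = theta0^2 + theta0 * theta1 + theta1^2.
def grad0(t0, t1):
    return 2 * t0 + t1


def grad1(t0, t1):
    return t0 + 2 * t1


print(gradient_step(1.0, 1.0, 0.1, grad0, grad1))
```

Updating $\theta_0$ in place before computing the derivative for $\theta_1$ would give a different $\theta_1$, which is the incorrect variant.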

#### Gradient Descent Intuition

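The derivative term in gradient descent, $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, gives the slope of $J$ at the current point. Its sign determines the direction of the update: a positive slope decreases $\theta_j$ and a negative slope increases it, so each step moves towards a minimum.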

![The derivative term explanation](images/4.png)

The learning rate $\alpha$ determines how far each step moves. If $\alpha$ is too small, gradient descent can be slow; if $\alpha$ is too large, gradient descent can overshoot the minimum, and it may fail to converge or even diverge.
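The effect of the learning rate can be seen on a one-parameter toy problem. The cost $J(\theta) = \theta^2$ (with derivative $2\theta$ and minimum at $\theta = 0$), the function name `run_gradient_descent`, and the three values of $\alpha$ are assumptions made for this sketch.

```python
def run_gradient_descent(alpha, theta=1.0, steps=10):
    """Minimize the toy cost J(theta) = theta^2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


# A small, a moderate, and a too-large learning rate.
for alpha in (0.01, 0.4, 1.1):
    print(alpha, run_gradient_descent(alpha))
```

With `alpha = 0.01` the parameter has barely moved after 10 steps, with `alpha = 0.4` it is very close to the minimum, and with `alpha = 1.1` every step overshoots and the parameter ends up further from 0 than where it started.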

As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative term shrinks towards zero. There is therefore no need to decrease the learning rate $\alpha$ over time.
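A short sketch of this behaviour, again assuming the toy cost $J(\theta) = \theta^2$ and a hypothetical helper `step_sizes`: the learning rate stays fixed, yet each step is smaller than the previous one.

```python
def step_sizes(alpha, theta=1.0, steps=5):
    """Return |step| for each update on the toy cost J(theta) = theta^2."""
    sizes = []
    for _ in range(steps):
        step = alpha * 2 * theta  # alpha times the derivative dJ/dtheta = 2 * theta
        theta = theta - step
        sizes.append(abs(step))
    return sizes


# alpha is constant; only the derivative changes between steps.
print(step_sizes(0.1))
```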

#### Gradient Descent for Linear Regression

- Linear regression => convex function => bowl-shaped cost function => no local optima other than the single global optimum
- *Batch* gradient descent: each step of gradient descent uses all the training examples
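Putting the pieces together, the following is a minimal sketch of batch gradient descent for univariate linear regression. It uses the partial derivatives of the squared-error cost, $\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^m (h_\theta(x_i) - y_i)$ and $\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^m (h_\theta(x_i) - y_i)\,x_i$. The function name, the toy data, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
def batch_gradient_descent(xs, ys, alpha, iterations):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the squared-error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x_i) - y_i over ALL m training examples ("batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = 1/(2m) * sum(errors^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys, alpha=0.1, iterations=5000)
print(theta0, theta1)
```

Because the cost is convex, the result does not depend on the starting point: the parameters approach the single global optimum, here $\theta_0 = 1$ and $\theta_1 = 2$.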