## Machine Learning Video Lectures

I taught an introductory Machine Learning course to BS students at FAST Peshawar in Fall 2015. The feedback was quite positive so I decided to offer another course to the MS/PhD students in the next semester. The mode of teaching was also a bit different: we tried doing the pen-tablet-augmented-multimedia-slides model. The semester is still in progress but we have the core of the basics done now.

The lectures are in Urdu, so they might be easier to follow for those who understand the language. I will be uploading future videos as they come up inshaallah. You can see the first video below and follow the complete collection on Vimeo here: https://vimeo.com/album/3770825

## Machine Learning Self-Study Track

I started with Machine Learning a while back and had a slightly hard time getting help from the local community. The main reason is that the Machine Learning community in general is way behind the state of the art in industry and research. This is true for almost all fields nowadays, but with Machine Learning the issues are more pronounced due to the recent fast-paced developments in the industry.

On the other hand, once you know what to study, things are much easier than in many other fields such as security. Here I outline the plan I followed to get to where I am (which isn’t too far ahead but still a little better than what most people know, IMHO).

So, here’s my guide for getting started with Machine Learning self-study.

1. Start with Andrew Ng’s Coursera course, Machine Learning. That’s the advice almost everyone seems to give, and it’s great advice. The course is very basic and eases you into the field with few prerequisites and not much depth. Be careful though: do not think after completing the course that you are an expert in Machine Learning. It misses quite a few areas and the skills needed to be above average. It does get you started with practical work, so it is easy to think you’re already done after finishing it.
2. So, after you complete the course in its entirety, including the assignments, I suggest you start with Prof. Nando de Freitas’ undergrad course. This is a much more detailed course and gives you a very different view of ML than traditional outlines. Of course, you might have to brush up on your Probability, Calculus and Linear Algebra. You can’t really do anything without these three.
3. For the above three, I suggest the following courses:
   1. Probability: Probability for Life Sciences by UCLA’s Math Department. You can find videos for this easily.
   2. Calculus: I strongly suggest you go with Virtual University Pakistan’s Calculus-I course by Dr. Faisal Shah Khan. It’s a great course but it’s in Urdu. If you don’t know Urdu, you can find your own series. Please let me know in the comments about great resources for this.
   3. Linear Algebra: Of course, this can only be done with Gilbert Strang’s Linear Algebra course from OCW.
4. After that, you can start with the grad course and the second grad course by Prof. Nando de Freitas. Both have very detailed video lectures.

Of course, you also need to work with tools other than Matlab. I strongly suggest the Python PyData stack. The full list would be:

1. Python PyData full stack (plus go through their yearly videos as well)
2. Theano
3. Torch
4. Keras

That’s what I have till now. I might add more when I know more inshaallah.

## A Basic Naive Bayes classifier in Matlab

This is the second in my series on implementing low-level machine learning algorithms in Matlab. We first did linear regression with gradient descent and now we’re working with the more popular Naive Bayes (NB) classifier. As is evident from the name, NB is a classifier, i.e. it sorts data points into classes based on some features. We’ll be writing code for NB using low-level Matlab (meaning we won’t use Matlab’s implementation of NB). Here’s the example we’ve taken (with a bit of modification) from here.

Consider the following vector:

$(\text{likes shortbread}, \text{likes lager}, \text{eats porridge}, \text{watched England play football}, \text{nationality})^T$

A vector $x = (1, 0, 1, 0, 1)^T$ would describe a person who likes shortbread, does not like lager, eats porridge, has not watched England play football and is a national of Scotland. The final element is the class that we want to predict and takes two values: 1 for Scottish, 0 for English.

Here’s the data we’re given:

```
X = [ 0 0 1 1 0 ;
      1 0 1 0 0 ;
      1 1 0 1 0 ;
      1 1 0 0 0 ;
      0 1 0 1 0 ;
      0 0 1 0 0 ;
      1 0 1 1 1 ;
      1 1 0 1 1 ;
      1 1 1 0 1 ;
      1 1 1 0 1 ;
      1 1 1 1 1 ;
      1 0 1 0 1 ;
      1 0 0 0 1 ];
```

Notice that when we represent data, we usually write instances in rows and features in columns, as above. For the code that follows we need the opposite orientation: features in rows, instances in columns. That’s the convention we use here. Also, we need to separate the class from the feature set:

```
Y = X(:,5);      % class labels, one per instance
X = X(:,1:4)';   % X in proper format now: features in rows, instances in columns
```

Alright. Now that we have the data, let’s hear some theory. As always, this isn’t a tutorial on statistics. Go read about the theory somewhere else. This is just a refresher:

In order to predict the class from a feature set, we need to find the probability of Y given X, where

$X = (x_1, x_2, \ldots, x_n)$

with n being the number of features. We denote the number of instances given to us as m. In our example, n = 4, m = 13. The probability of Y given X is:

$P(Y=1|X) = \frac{P(X|Y=1) \, P(Y=1)}{P(X)}$

This is Bayes’ rule. Now we make the NB assumption: all features in the feature set are independent of each other given the class. It is a strong assumption, but it usually works. Given this assumption, we need to find $P(X|Y=1)$, $P(Y)$ and $P(X)$.

(The weird braces notation that follows is the indicator notation. $1\{v\}$ means use 1 if condition v holds, 0 otherwise.)

$P(X) = P(X|Y=1) \, P(Y=1) + P(X|Y=0) \, P(Y=0)$

$P(X|Y=1) = \prod_{i=1}^{n} P(x_i|Y=1)$

To find $P(X|Y=1)$, you just have to find $P(x_i|Y=1)$ for all features and multiply them together. This is where the assumption comes in. You need the assumption of independence here for this.

$P(x_i|Y=1) = \frac{\sum_{j=1}^{m} 1\{x_i^j = 1, y^j = 1\}}{\sum_{j=1}^{m} 1\{y^j = 1\}}$

This equation basically means: count the number of instances for which both $x_i$ and Y are 1 and divide by the count of Y being 1. That’s the estimated probability of $x_i$ being 1 given that Y is 1. Fairly straightforward if you think about it.
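As a worked instance of this count (a sketch using the 13 training rows listed earlier, of which the last seven are Scottish): four of the seven Scots like lager, which is feature 2, and all seven like shortbread, which is feature 1. So:

$P(x_2|Y=1) = \frac{4}{7} \approx 0.571$

$P(x_1|Y=1) = \frac{7}{7} = 1$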

$P(Y=1) = \frac{\sum_{j=1}^{m} 1\{y^j = 1\}}{m}$

Same idea as above: divide the count of Y=1 by the total number of instances. Notice that we need to calculate all these for both Y=0 and Y=1 because we need both in the first equation. Let’s begin from the bottom up. For all of below, consider E as 0 and S as 1 since we consider being Scottish as being in class 1 (positive example).

P(Y):

```
pS = sum(Y)/size(Y,1);      % all rows with Y = 1
pE = sum(1 - Y)/size(Y,1);  % all rows with Y = 0
```

P(x_i|Y):

```
phiS = X * Y / sum(Y);        % all instances for which attrib phi(i) and Y are both 1,
                              % meaning the fraction of Scots with attribute phi(i) = 1
phiE = X * (1-Y) / sum(1-Y);  % all instances for which attrib phi(i) = 1 and Y = 0,
                              % meaning the fraction of English with attribute phi(i) = 1
```
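To see what the matrix product is doing, here is a minimal sketch that computes the same numbers with explicit loops and compares them with the one-line version. The names `D`, `countS` and `phiS_loop` are introduced here for illustration only and are not part of the original code.

```
% Training data as given above: four features followed by the class.
D = [ 0 0 1 1 0 ; 1 0 1 0 0 ; 1 1 0 1 0 ; 1 1 0 0 0 ; 0 1 0 1 0 ;
      0 0 1 0 0 ; 1 0 1 1 1 ; 1 1 0 1 1 ; 1 1 1 0 1 ; 1 1 1 0 1 ;
      1 1 1 1 1 ; 1 0 1 0 1 ; 1 0 0 0 1 ];
Y = D(:,5);        % class labels
X = D(:,1:4)';     % features in rows, instances in columns
[n, m] = size(X);  % n = 4 features, m = 13 instances

% Count, for each feature i, the instances where x_i = 1 and y = 1.
countS = zeros(n, 1);
for i = 1:n
    for j = 1:m
        if X(i,j) == 1 && Y(j) == 1
            countS(i) = countS(i) + 1;
        end
    end
end
phiS_loop = countS / sum(Y);  % divide by the number of Scots

% The vectorised form: row i of X times Y is exactly countS(i).
phiS = X * Y / sum(Y);
disp(max(abs(phiS_loop - phiS)));  % largest difference between the two
```

The loop makes the indicator sum from the formula explicit; the product `X * Y` gives the same counts because both `X` and `Y` contain only 0s and 1s.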

phiS and phiE are vectors that store the probabilities for all attributes. Now that we have the probabilities, we’re ready to make a prediction. Let’s get a test datapoint:

```
x = [1 0 1 0]';  % test point
```

And calculate the probabilities P(X|Y=1) and P(X|Y=0)

```
pxS = prod(phiS.^x .* (1-phiS).^(1-x));  % P(X|Y=1)
pxE = prod(phiE.^x .* (1-phiE).^(1-x));  % P(X|Y=0)
```
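The exponents in that expression act as a selector: where `x` is 1 the term reduces to `phi`, and where `x` is 0 it reduces to `1 - phi`. A small sketch of the same selection written with logical indexing, using the `phiS` values that the training data above yields (typed in by hand here; `terms` and `picked` are illustrative names):

```
phiS = [1; 4/7; 5/7; 3/7];  % P(x_i = 1 | Scottish), from the training data
x = [1 0 1 0]';             % test point

% Exponent form: phi^1 * (1-phi)^0 = phi, and phi^0 * (1-phi)^1 = 1 - phi.
terms = phiS.^x .* (1 - phiS).^(1 - x);

% Same selection with logical indexing.
picked = zeros(size(x));
picked(x == 1) = phiS(x == 1);
picked(x == 0) = 1 - phiS(x == 0);

disp([terms picked]);  % the two columns match
disp(prod(terms));     % P(X|Y=1) for this test point
```

The exponent form is handy because it needs no branching and works on the whole vector at once.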

And finally, the probabilities of P(Y=1|X) and P(Y=0|X)

```
pX = pxS * pS + pxE * pE;  % P(X): each likelihood weighted by its class prior
pxSF = (pxS * pS) / pX     % P(Y=1|X)
pxEF = (pxE * pE) / pX     % P(Y=0|X)
```

They should add up to 1 since there are only two classes. Now you can define a threshold for deciding whether the class should be considered 1 or 0 based on these probabilities. In this case, we can consider this test point to belong to class 1 since the probability pxSF > 0.5.
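The threshold step can be sketched as follows. This is an illustration, not part of the original code: the parameter values are the ones the training data above yields, typed in by hand, and `lik`, `postS`, `threshold` and `label` are hypothetical names. The denominator weights each class likelihood by its prior, which is what makes the two posteriors sum to 1.

```
% Parameters estimated from the 13 training rows above.
phiS = [1; 4/7; 5/7; 3/7];    % P(x_i = 1 | Scottish)
phiE = [0.5; 0.5; 0.5; 0.5];  % P(x_i = 1 | English)
pS = 7/13;                    % P(Y = 1)
pE = 6/13;                    % P(Y = 0)

% Likelihood of a binary column vector x under one class (illustrative helper).
lik = @(phi, x) prod(phi.^x .* (1 - phi).^(1 - x));

% Posterior P(Y = 1 | x) by Bayes' rule.
postS = @(x) lik(phiS, x) * pS / (lik(phiS, x) * pS + lik(phiE, x) * pE);

threshold = 0.5;           % assumed cut-off; raise or lower it if one kind of error costs more
x = [1 0 1 0]';            % the test point used above
pxSF = postS(x);           % roughly 0.77 for this point
label = pxSF > threshold;  % true means class 1 (Scottish)
disp([pxSF label]);
```

Note that a test point with `x(1) = 0` gets a posterior of exactly 0 for class 1 here, because every Scot in the training data likes shortbread.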

And there you have it!