Naive Bayes#
Intuition#
Pen and Paper Exercises#
Exercise 1#
Problem#
Suppose we have the following training data to predict whether a person buys a computer or not:
| RID | age | income | student | credit_rating | Class: buys_computer |
|---|---|---|---|---|---|
| 1 | <=30 | high | no | fair | no |
| 2 | <=30 | high | no | excellent | no |
| 3 | 31…40 | high | no | fair | yes |
| 4 | >40 | medium | no | fair | yes |
| 5 | >40 | low | yes | fair | yes |
| 6 | >40 | low | yes | excellent | no |
| 7 | 31…40 | low | yes | excellent | yes |
| 8 | <=30 | medium | no | fair | no |
| 9 | <=30 | low | yes | fair | yes |
| 10 | >40 | medium | yes | fair | yes |
| 11 | <=30 | medium | yes | excellent | yes |
| 12 | 31…40 | medium | no | excellent | yes |
| 13 | 31…40 | high | yes | fair | yes |
| 14 | >40 | medium | no | excellent | no |
Use the Naive Bayes algorithm to:
a. Calculate the prior probabilities of the classes.
b. Predict whether a person buys a computer or not, given age<=30, income=medium, student=yes, credit_rating=fair.
Answer#
a. Let \(B\) be the event that a person buys a computer, and \(\bar{B}\) be the event that a person does not buy a computer. Out of the 14 training records, 9 are labeled yes and 5 are labeled no, so the prior probabilities are:
\( P(B) = \frac{count(B)}{count(B) + count(\bar{B})} = \frac{9}{14} \approx 0.643 \) and \( P(\bar{B}) = \frac{count(\bar{B})}{count(B) + count(\bar{B})} = \frac{5}{14} \approx 0.357 \)
b. The posterior probabilities of the classes are given by Bayes' theorem:
\( P(B | A) = \frac{P(A | B) \, P(B)}{P(A)} \) and \( P(\bar{B} | A) = \frac{P(A | \bar{B}) \, P(\bar{B})}{P(A)} \)
where \(A\) is the evidence. Let \(A_1 = \text{age} \leq 30\), \(A_2 = \text{income} = \text{medium}\), \(A_3 = \text{student} = \text{yes}\), and \(A_4 = \text{credit\_rating} = \text{fair}\); then \(A = \{A_1, A_2, A_3, A_4\}\).
Then, for each event \(A_i \in A\), we have:
\( P(A_1 | B) = \frac{count(A_1 \cap B)}{count(B)} = \frac{2}{9} \) and \( P(A_1 | \bar{B}) = \frac{count(A_1 \cap \bar{B})}{count(\bar{B})} = \frac{3}{5} \)
\( P(A_2 | B) = \frac{count(A_2 \cap B)}{count(B)} = \frac{4}{9} \) and \(P(A_2 | \bar{B}) = \frac{count(A_2 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)
\( P(A_3 | B) = \frac{count(A_3 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_3 | \bar{B}) = \frac{count(A_3 \cap \bar{B})}{count(\bar{B})} = \frac{1}{5} \)
\( P(A_4 | B) = \frac{count(A_4 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_4 | \bar{B}) = \frac{count(A_4 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)
The likelihood of the evidence \(A\) for each class is given by:
\( P(A | B) = P(A_1 | B) \times P(A_2 | B) \times P(A_3 | B) \times P(A_4 | B) = \frac{2}{9} \times \frac{4}{9} \times \frac{6}{9} \times \frac{6}{9} = \frac{32}{729} \)
\( P(A | \bar{B}) = P(A_1 | \bar{B}) \times P(A_2 | \bar{B}) \times P(A_3 | \bar{B}) \times P(A_4 | \bar{B}) = \frac{3}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{2}{5} = \frac{12}{625} \)
The posterior probabilities of the classes are then:
\( P(B | A) = \frac{P(A | B) \, P(B)}{P(A)} = \frac{\frac{32}{729} \times \frac{9}{14}}{P(A)} \approx \frac{0.0282}{P(A)} \)
and
\( P(\bar{B} | A) = \frac{P(A | \bar{B}) \, P(\bar{B})}{P(A)} = \frac{\frac{12}{625} \times \frac{5}{14}}{P(A)} \approx \frac{0.0069}{P(A)} \)
where \(P(A)\) is the probability of the evidence \(A\).
Since \(P(A)\) is the same for both classes, we can compare the numerators of the posterior probabilities to determine which class is more likely. Because \(0.0282 > 0.0069\), we have \(P(B | A) > P(\bar{B} | A)\), so we predict that the person buys a computer.
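As a cross-check, the whole calculation can be reproduced in a few lines of Python. The sketch below encodes the table as a list of tuples (our own encoding, not part of the exercise), recomputes the priors and conditionals by counting, and prints the two unnormalized posteriors.

```python
from collections import Counter

# The 14 training records from the table:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

query = ("<=30", "medium", "yes", "fair")  # age, income, student, credit_rating

labels = Counter(row[-1] for row in data)  # Counter({'yes': 9, 'no': 5})
n = len(data)

for cls in ("yes", "no"):
    rows = [row for row in data if row[-1] == cls]
    score = labels[cls] / n  # prior P(class)
    # Multiply in the estimate P(A_i | class) for each attribute in the query.
    for i, value in enumerate(query):
        score *= sum(1 for row in rows if row[i] == value) / len(rows)
    print(cls, round(score, 4))  # yes 0.0282, no 0.0069
```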
Exercise 2#
Problem#
We are training a document classifier with Naive Bayes. Each document is represented as a sequence of characters, and each character is treated as a word. We have the following training data:
a. Calculate \(P(A)\), \(P(B)\), \(P(a|A)\), \(P(b|A)\), \(P(a|B)\), and \(P(b|B)\) using Laplace smoothing for the conditional probabilities.
b. For each of the following new data points, predict the class of the data point:
- \(aaba\)
- \(a\)
- \(bbba\)
- \(bccbba\)
- \(bbbb\)
Answer#
a. The probabilities are:
\(P(A) = \dfrac{count(A)}{count(A) + count(B)} = \dfrac{3}{4}\) and \(P(B) = \dfrac{count(B)}{count(A) + count(B)} = \dfrac{1}{4}\)
There are two words in the vocabulary, \(a\) and \(b\). Using Laplace smoothing, we have:
\(P(a|A) = \dfrac{count(a|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{4 + 1}{6 + 2} = \dfrac{5}{8}\)
\(P(b|A) = \dfrac{count(b|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{2 + 1}{6 + 2} = \dfrac{3}{8}\)
and
\(P(a|B) = \dfrac{count(a|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{0 + 1}{2 + 2} = \dfrac{1}{4}\)
\(P(b|B) = \dfrac{count(b|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{2 + 1}{2 + 2} = \dfrac{3}{4}\)
where \(V\) is the vocabulary, \(\sum_{w \in V} count(w|A) = 6\) is the total count of words in class \(A\), \(\sum_{w \in V} count(w|B) = 2\) is the total count of words in class \(B\), and \(|V|\) is the size of the vocabulary.
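The raw training documents are not reproduced above, so the following minimal Python sketch starts from the per-class counts used in the answer (3 documents in class \(A\) containing 4 \(a\)'s and 2 \(b\)'s in total, and 1 document in class \(B\) containing 2 \(b\)'s); those counts are assumptions read off the calculations, not the original data.

```python
from fractions import Fraction

vocab = ["a", "b"]

# Per-class word counts, read off the answer above (the raw training
# documents are not shown, so these totals are taken as given).
word_counts = {
    "A": {"a": 4, "b": 2},  # 3 training documents, 6 words in total
    "B": {"a": 0, "b": 2},  # 1 training document, 2 words in total
}
docs_per_class = {"A": 3, "B": 1}

total_docs = sum(docs_per_class.values())

def prior(cls):
    """P(cls) = (number of documents in cls) / (total number of documents)."""
    return Fraction(docs_per_class[cls], total_docs)

def likelihood(word, cls):
    """Laplace-smoothed P(word | cls) = (count + 1) / (total words in cls + |V|)."""
    total_words = sum(word_counts[cls].values())
    return Fraction(word_counts[cls][word] + 1, total_words + len(vocab))

print(prior("A"), prior("B"))                      # 3/4 1/4
print(likelihood("a", "A"), likelihood("b", "A"))  # 5/8 3/8
print(likelihood("a", "B"), likelihood("b", "B"))  # 1/4 3/4
```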
b. Considering each of the new data points:
\(aaba\):
\(\begin{align} P(A|aaba) &= \frac{P(A) \times P(a|A) \times P(a|A) \times P(b|A) \times P(a|A)}{P(aaba)} \\ &= \frac{\frac{3}{4} \times \frac{5}{8} \times \frac{5}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(aaba)} \approx \frac{0.069}{P(aaba)} \end{align}\)
\(\begin{align} P(B|aaba) &= \frac{P(B) \times P(a|B) \times P(a|B) \times P(b|B) \times P(a|B)}{P(aaba)} \\ &= \frac{\frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(aaba)} \approx \frac{0.003}{P(aaba)} \end{align}\)
\(\Rightarrow P(A|aaba) > P(B|aaba)\), so we predict that the class of the data point is \(A\).
\(a\):
\(\begin{align} P(A|a) &= \frac{P(A) \times P(a|A)}{P(a)} = \frac{\frac{3}{4} \times \frac{5}{8}}{P(a)} \approx \frac{0.469}{P(a)} \end{align}\)
\(\begin{align} P(B|a) &= \frac{P(B) \times P(a|B)}{P(a)} = \frac{\frac{1}{4} \times \frac{1}{4}}{P(a)} \approx \frac{0.063}{P(a)} \end{align}\)
\(\Rightarrow P(A|a) > P(B|a)\), so we predict that the class of the data point is \(A\).
\(bbba\):
\(\begin{align} P(A|bbba) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(a|A)}{P(bbba)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(bbba)} \approx \frac{0.025}{P(bbba)} \end{align}\)
\(\begin{align} P(B|bbba) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(a|B)}{P(bbba)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(bbba)} \approx \frac{0.026}{P(bbba)} \end{align}\)
\(\Rightarrow P(A|bbba) < P(B|bbba)\), so we predict that the class of the data point is \(B\).
\(bccbba\): this data point contains the new word \(c\), which was not in the vocabulary during training. Ignoring the unseen word reduces the data point to \(bbba\), so we can reuse the probabilities calculated for \(bbba\) to predict its class.
\(\Rightarrow\) \(bccbba\) is predicted to be in class \(B\) (similar to \(bbba\)).
\(bbbb\):
\(\begin{align} P(A|bbbb) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(b|A)}{P(bbbb)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8}}{P(bbbb)} \approx \frac{0.015}{P(bbbb)} \end{align}\)
\(\begin{align} P(B|bbbb) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(b|B)}{P(bbbb)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4}}{P(bbbb)} \approx \frac{0.079}{P(bbbb)} \end{align}\)
\(\Rightarrow bbbb\) is predicted to be in class \(B\).
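Combining the smoothed estimates with the scoring rule, here is a minimal sketch of the prediction step in Python; the probabilities are hard-coded from part (a), and out-of-vocabulary words are skipped, matching how the unseen word \(c\) is handled above.

```python
# Unnormalized posterior: P(class) times the product of P(word | class)
# over the in-vocabulary words of the document.
priors = {"A": 3 / 4, "B": 1 / 4}
likelihoods = {
    "A": {"a": 5 / 8, "b": 3 / 8},  # Laplace-smoothed estimates from part (a)
    "B": {"a": 1 / 4, "b": 3 / 4},
}

def predict(doc):
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in doc:
            if word in likelihoods[cls]:  # skip out-of-vocabulary words such as 'c'
                score *= likelihoods[cls][word]
        scores[cls] = score
    return max(scores, key=scores.get), scores

for doc in ["aaba", "a", "bbba", "bccbba", "bbbb"]:
    label, scores = predict(doc)
    print(doc, label, {c: round(s, 3) for c, s in scores.items()})
# Predicted classes: aaba -> A, a -> A, bbba -> B, bccbba -> B, bbbb -> B
```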
Note
The probabilities \(P(aaba)\), \(P(a)\), \(P(bbba)\), and \(P(bbbb)\) are the probabilities of the input features; they do not need to be calculated, because they cancel out when comparing the posterior probabilities of the classes.
However, they can be calculated as follows:
\(\begin{align} P(aaba) = & P(aaba|A) \times P(A) + P(aaba|B) \times P(B) \\ = & P(a|A) \times P(a|A) \times P(b|A) \times P(a|A) \times P(A) \\ & + P(a|B) \times P(a|B) \times P(b|B) \times P(a|B) \times P(B) \end{align}\)
\(\ldots\)
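For example, substituting the numbers already computed gives \(P(aaba) \approx 0.069 + 0.003 \approx 0.072\); the other evidence probabilities follow in the same way.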