Naive Bayes#

Intuition#

Pen and Paper Exercises#

Exercise 1#

Problem#

Suppose we have the following training data to predict whether a person buys a computer or not:

| RID | age   | income | student | credit_rating | Class: buys_computer |
|-----|-------|--------|---------|---------------|----------------------|
| 1   | <=30  | high   | no      | fair          | no                   |
| 2   | <=30  | high   | no      | excellent     | no                   |
| 3   | 31…40 | high   | no      | fair          | yes                  |
| 4   | >40   | medium | no      | fair          | yes                  |
| 5   | >40   | low    | yes     | fair          | yes                  |
| 6   | >40   | low    | yes     | excellent     | no                   |
| 7   | 31…40 | low    | yes     | excellent     | yes                  |
| 8   | <=30  | medium | no      | fair          | no                   |
| 9   | <=30  | low    | yes     | fair          | yes                  |
| 10  | >40   | medium | yes     | fair          | yes                  |
| 11  | <=30  | medium | yes     | excellent     | yes                  |
| 12  | 31…40 | medium | no      | excellent     | yes                  |
| 13  | 31…40 | high   | yes     | fair          | yes                  |
| 14  | >40   | medium | no      | excellent     | no                   |

Use the Naive Bayes algorithm to:

a. Calculate the prior probabilities of the classes.

b. Predict whether a person buys a computer or not given age<=30, income=medium, student=yes, credit_rating=fair.

Answer#

a. Let \(B\) be the event that a person buys a computer, and \(\bar{B}\) be the event that a person does not buy a computer. Then the prior probabilities are:

\[ P(B) = \frac{count(B)}{count(B) + count(\bar{B})} = \frac{9}{14} \]
\[ P(\bar{B}) = \frac{count(\bar{B})}{count(B) + count(\bar{B})} = \frac{5}{14} \]
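As a sanity check, here is a minimal Python sketch of this counting step, assuming the table is encoded as a list of dictionaries (the names `data`, `class_counts`, and `priors` are illustrative, not part of the exercise):

```python
from collections import Counter

# The 14 training rows, one dictionary per row (RID is implicit in the order).
data = [
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "fair",      "buys": "no"},
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "excellent", "buys": "no"},
    {"age": "31…40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "excellent", "buys": "no"},
    {"age": "31…40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys": "yes"},
    {"age": "<=30",  "income": "medium", "student": "no",  "credit_rating": "fair",      "buys": "no"},
    {"age": "<=30",  "income": "low",    "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": "<=30",  "income": "medium", "student": "yes", "credit_rating": "excellent", "buys": "yes"},
    {"age": "31…40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys": "yes"},
    {"age": "31…40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "excellent", "buys": "no"},
]

# Prior = class count divided by total number of rows.
class_counts = Counter(row["buys"] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}
print(priors)  # 'yes': 9/14 ≈ 0.643, 'no': 5/14 ≈ 0.357
```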

b. To compute the posterior probabilities of the classes, first define the evidence events:

Let \(A_1 = \text{age} \leq 30\), \(A_2 = \text{income} = \text{medium}\), \(A_3 = \text{student} = \text{yes}\), and \(A_4 = \text{credit\_rating} = \text{fair}\), then \(A = \{A_1, A_2, A_3, A_4\}\).

Then, for each event \(A_i \in A\), we have:

  • \( P(A_1 | B) = \frac{count(A_1 \cap B)}{count(B)} = \frac{2}{9} \) and \( P(A_1 | \bar{B}) = \frac{count(A_1 \cap \bar{B})}{count(\bar{B})} = \frac{3}{5} \)

  • \( P(A_2 | B) = \frac{count(A_2 \cap B)}{count(B)} = \frac{4}{9} \) and \(P(A_2 | \bar{B}) = \frac{count(A_2 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)

  • \( P(A_3 | B) = \frac{count(A_3 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_3 | \bar{B}) = \frac{count(A_3 \cap \bar{B})}{count(\bar{B})} = \frac{1}{5} \)

  • \( P(A_4 | B) = \frac{count(A_4 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_4 | \bar{B}) = \frac{count(A_4 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)

Under the naive conditional-independence assumption, the likelihood of the evidence \(A\) for each class factorizes into a product of the individual conditionals:

  • \( P(A | B) = P(A_1 | B) \times P(A_2 | B) \times P(A_3 | B) \times P(A_4 | B) = \frac{2}{9} \times \frac{4}{9} \times \frac{6}{9} \times \frac{6}{9} = \frac{32}{729} \)

  • \( P(A | \bar{B}) = P(A_1 | \bar{B}) \times P(A_2 | \bar{B}) \times P(A_3 | \bar{B}) \times P(A_4 | \bar{B}) = \frac{3}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{2}{5} = \frac{12}{625} \)

The posterior probabilities of the classes are given by:

\[ P(B | A) = \frac{P(A | B) \times P(B)}{P(A)} = \frac{\frac{32}{729} \times \frac{9}{14}}{P(A)} \approx \frac{0.028}{P(A)} \]

and

\[ P(\bar{B} | A) = \frac{P(A | \bar{B}) \times P(\bar{B})}{P(A)} = \frac{\frac{12}{625} \times \frac{5}{14}}{P(A)} \approx \frac{0.007}{P(A)} \]

where \(P(A)\) is the probability of the evidence \(A\).

Since \(P(A)\) is the same for both classes, it suffices to compare the numerators of the posterior probabilities. Because \(0.028 > 0.007\), we have \(P(B | A) > P(\bar{B} | A)\), so we predict that the person buys a computer.
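The whole comparison can also be scripted. A minimal sketch, continuing from the snippet above (it reuses `data`, `class_counts`, and `priors`; like the hand calculation, it applies no smoothing, so a zero count for any attribute value would zero out the whole product):

```python
# Query: age<=30, income=medium, student=yes, credit_rating=fair.
query = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

scores = {}
for c, n_c in class_counts.items():
    score = priors[c]                      # start from the prior P(class)
    for attr, value in query.items():
        # Conditional P(A_i | class) = count(A_i and class) / count(class).
        n_match = sum(1 for row in data if row[attr] == value and row["buys"] == c)
        score *= n_match / n_c
    scores[c] = score                      # proportional to P(class | A); P(A) cancels

print(scores)                              # 'yes' ≈ 0.028, 'no' ≈ 0.007
print(max(scores, key=scores.get))         # 'yes': the person buys a computer
```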

Exercise 2#

Problem#

We are training a document classifier with Naive Bayes. Each document is represented by a sequence of characters (each character is a word). We have the following training data:

Training data:

| d  | c |
|----|---|
| aa | A |
| ab | A |
| ba | A |
| bb | B |

a. Calculate \(P(A)\), \(P(B)\), \(P(a|A)\), \(P(b|A)\), \(P(a|B)\), and \(P(b|B)\) using Laplace smoothing for the conditional probabilities.

b. For each of the following new data points, predict the class of the data point:

  • \(aaba\)

  • \(a\)

  • \(bbba\)

  • \(bccbba\)

  • \(bbbb\)

Answer#

a. The probabilities are:

  • \(P(A) = \dfrac{count(A)}{count(A) + count(B)} = \dfrac{3}{4}\) and \(P(B) = \dfrac{count(B)}{count(A) + count(B)} = \dfrac{1}{4}\)

There are two words in the vocabulary, \(a\) and \(b\). Using Laplace smoothing, we have:

  • \(P(a|A) = \dfrac{count(a|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{4 + 1}{6 + 2} = \dfrac{5}{8}\)

  • \(P(b|A) = \dfrac{count(b|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{2 + 1}{6 + 2} = \dfrac{3}{8}\)

and

  • \(P(a|B) = \dfrac{count(a|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{0 + 1}{2 + 2} = \dfrac{1}{4}\)

  • \(P(b|B) = \dfrac{count(b|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{2 + 1}{2 + 2} = \dfrac{3}{4}\)

where \(V\) is the vocabulary, \(\sum_{w \in V} count(w|A) = 6\) is the total count of words in class \(A\), \(\sum_{w \in V} count(w|B) = 2\) is the total count of words in class \(B\), and \(|V|\) is the size of the vocabulary.
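These smoothed estimates can be reproduced with a short standalone Python sketch (the names `docs`, `vocab`, `priors`, and `cond` are mine, not from the exercise):

```python
from collections import Counter

# Training documents as (text, class) pairs; each character is a word.
docs = [("aa", "A"), ("ab", "A"), ("ba", "A"), ("bb", "B")]
vocab = {"a", "b"}

# Priors: P(A) = 3/4, P(B) = 1/4.
class_counts = Counter(c for _, c in docs)
priors = {c: n / len(docs) for c, n in class_counts.items()}

# Per-class word counts, then Laplace-smoothed conditionals:
# P(w|c) = (count(w|c) + 1) / (total words in c + |V|).
word_counts = {c: Counter() for c in class_counts}
for text, c in docs:
    word_counts[c].update(text)

cond = {
    c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
        for w in vocab}
    for c in class_counts
}
print(cond)  # P(a|A)=5/8, P(b|A)=3/8, P(a|B)=1/4, P(b|B)=3/4
```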

b. Considering each of the new data points in turn (a programmatic check of all five predictions follows the list):

  • \(aaba\):

    \(\begin{align} P(A|aaba) &= \frac{P(A) \times P(a|A) \times P(a|A) \times P(b|A) \times P(a|A)}{P(aaba)} \\ &= \frac{\frac{3}{4} \times \frac{5}{8} \times \frac{5}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(aaba)} \approx \frac{0.069}{P(aaba)} \end{align}\)

    \(\begin{align} P(B|aaba) &= \frac{P(B) \times P(a|B) \times P(a|B) \times P(b|B) \times P(a|B)}{P(aaba)} \\ &= \frac{\frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(aaba)} \approx \frac{0.003}{P(aaba)} \end{align}\)

    \(\Rightarrow P(A|aaba) > P(B|aaba)\), so we predict that the class of the data point is \(A\).

  • \(a\):

    \(\begin{align} P(A|a) &= \frac{P(A) \times P(a|A)}{P(a)} = \frac{\frac{3}{4} \times \frac{5}{8}}{P(a)} \approx \frac{0.469}{P(a)} \end{align}\)

    \(\begin{align} P(B|a) &= \frac{P(B) \times P(a|B)}{P(a)} = \frac{\frac{1}{4} \times \frac{1}{4}}{P(a)} \approx \frac{0.063}{P(a)} \end{align}\)

    \(\Rightarrow P(A|a) > P(B|a)\), so we predict that the class of the data point is \(A\).

  • \(bbba\):

    \(\begin{align} P(A|bbba) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(a|A)}{P(bbba)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(bbba)} \approx \frac{0.025}{P(bbba)} \end{align}\)

    \(\begin{align} P(B|bbba) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(a|B)}{P(bbba)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(bbba)} \approx \frac{0.026}{P(bbba)} \end{align}\)

    \(\Rightarrow P(A|bbba) < P(B|bbba)\), so we predict that the class of the data point is \(B\).

  • \(bccbba\): the word ‘c’ was not in the vocabulary during training, so it is ignored at prediction time. The data point thus reduces to \(bbba\), and we can reuse the probabilities calculated for \(bbba\) to predict the class.

    \(\Rightarrow\) \(bccbba\) is predicted to be in class \(B\) (similar to \(bbba\)).

  • \(bbbb\):

    \(\begin{align} P(A|bbbb) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(b|A)}{P(bbbb)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8}}{P(bbbb)} \approx \frac{0.015}{P(bbbb)} \end{align}\)

    \(\begin{align} P(B|bbbb) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(b|B)}{P(bbbb)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4}}{P(bbbb)} \approx \frac{0.079}{P(bbbb)} \end{align}\)

    \(\Rightarrow bbbb\) is predicted to be in class \(B\).
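All five predictions can be checked programmatically. A minimal sketch, reusing `priors`, `cond`, and `vocab` from the snippet in part a (summing log-probabilities is a standard safeguard against underflow on longer documents, not something the exercise requires):

```python
import math

def predict(text):
    """Return the class with the highest posterior score for `text`."""
    scores = {}
    for c in priors:
        log_score = math.log(priors[c])
        for w in text:
            if w in vocab:                 # drop unseen words such as 'c'
                log_score += math.log(cond[c][w])
        scores[c] = log_score              # log P(c) + sum of log P(w|c)
    return max(scores, key=scores.get)

for text in ["aaba", "a", "bbba", "bccbba", "bbbb"]:
    print(text, "->", predict(text))       # A, A, B, B, B
```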

Note

The probabilities \(P(aaba)\), \(P(a)\), \(P(bbba)\), and \(P(bbbb)\) are the probabilities of the evidence. They do not need to be calculated, because they are identical across classes and cancel when comparing the posterior probabilities.

However, they can be calculated as follows:

\(\begin{align} P(aaba) = & P(aaba|A) \times P(A) + P(aaba|B) \times P(B) \\ = & P(a|A) \times P(a|A) \times P(b|A) \times P(a|A) \times P(A) \\ & + P(a|B) \times P(a|B) \times P(b|B) \times P(a|B) \times P(B) \end{align}\)

\(\ldots\)
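For instance, the marginal \(P(aaba)\) can be computed by the law of total probability, again reusing `priors`, `cond`, and `vocab` from the sketches above:

```python
import math

def marginal(text):
    """P(text) = sum over classes of P(text | c) * P(c), ignoring unseen words."""
    return sum(
        priors[c] * math.prod(cond[c][w] for w in text if w in vocab)
        for c in priors
    )

print(marginal("aaba"))  # ≈ 0.069 + 0.003 ≈ 0.072
```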

Practice#