Naive Bayes#

Intuition#

Pen and Paper Exercises#

Exercise 1#

Problem#

Suppose we have the following training data to predict whether a person buys a computer or not:

| RID | age   | income | student | credit_rating | Class: buys_computer |
|-----|-------|--------|---------|---------------|----------------------|
| 1   | <=30  | high   | no      | fair          | no                   |
| 2   | <=30  | high   | no      | excellent     | no                   |
| 3   | 31…40 | high   | no      | fair          | yes                  |
| 4   | >40   | medium | no      | fair          | yes                  |
| 5   | >40   | low    | yes     | fair          | yes                  |
| 6   | >40   | low    | yes     | excellent     | no                   |
| 7   | 31…40 | low    | yes     | excellent     | yes                  |
| 8   | <=30  | medium | no      | fair          | no                   |
| 9   | <=30  | low    | yes     | fair          | yes                  |
| 10  | >40   | medium | yes     | fair          | yes                  |
| 11  | <=30  | medium | yes     | excellent     | yes                  |
| 12  | 31…40 | medium | no      | excellent     | yes                  |
| 13  | 31…40 | high   | yes     | fair          | yes                  |
| 14  | >40   | medium | no      | excellent     | no                   |

Use the Naive Bayes algorithm to:

a. Calculate the prior probabilities of the classes.

b. Predict whether a person buys a computer or not given age<=30, income=medium, student=yes, credit_rating=fair.

Answer#

a. Let \(B\) be the event that a person buys a computer, and \(\bar{B}\) be the event that a person does not buy a computer. Then the prior probabilities are:

\[ P(B) = \frac{count(B)}{count(B) + count(\bar{B})} = \frac{9}{14} \]
\[ P(\bar{B}) = \frac{count(\bar{B})}{count(B) + count(\bar{B})} = \frac{5}{14} \]
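As a sanity check, here is a minimal Python sketch of this counting step, assuming the table is encoded as a list of dictionaries (the names `data`, `class_counts`, and `priors` are illustrative, not part of the exercise):

```python
from collections import Counter

# The 14 training rows, one dictionary per row (RID is implicit in the order).
data = [
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "fair",      "buys": "no"},
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "excellent", "buys": "no"},
    {"age": "31…40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "excellent", "buys": "no"},
    {"age": "31…40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys": "yes"},
    {"age": "<=30",  "income": "medium", "student": "no",  "credit_rating": "fair",      "buys": "no"},
    {"age": "<=30",  "income": "low",    "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": "<=30",  "income": "medium", "student": "yes", "credit_rating": "excellent", "buys": "yes"},
    {"age": "31…40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys": "yes"},
    {"age": "31…40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys": "yes"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "excellent", "buys": "no"},
]

# Prior = class count divided by total number of rows.
class_counts = Counter(row["buys"] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}
print(priors)  # 'yes': 9/14 ≈ 0.643, 'no': 5/14 ≈ 0.357
```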

b. To compute the posterior probabilities of the classes, first define the evidence events:

Let \(A_1 = \text{age} \leq 30\), \(A_2 = \text{income} = \text{medium}\), \(A_3 = \text{student} = \text{yes}\), and \(A_4 = \text{credit\_rating} = \text{fair}\), then \(A = \{A_1, A_2, A_3, A_4\}\).

Then, for each event \(A_i \in A\), we have:

  • \( P(A_1 | B) = \frac{count(A_1 \cap B)}{count(B)} = \frac{2}{9} \) and \( P(A_1 | \bar{B}) = \frac{count(A_1 \cap \bar{B})}{count(\bar{B})} = \frac{3}{5} \)

  • \( P(A_2 | B) = \frac{count(A_2 \cap B)}{count(B)} = \frac{4}{9} \) and \(P(A_2 | \bar{B}) = \frac{count(A_2 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)

  • \( P(A_3 | B) = \frac{count(A_3 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_3 | \bar{B}) = \frac{count(A_3 \cap \bar{B})}{count(\bar{B})} = \frac{1}{5} \)

  • \( P(A_4 | B) = \frac{count(A_4 \cap B)}{count(B)} = \frac{6}{9} \) and \(P(A_4 | \bar{B}) = \frac{count(A_4 \cap \bar{B})}{count(\bar{B})} = \frac{2}{5} \)

Under the naive conditional-independence assumption, the likelihood of the evidence \(A\) for each class factorizes into a product of the individual conditionals:

  • \( P(A | B) = P(A_1 | B) \times P(A_2 | B) \times P(A_3 | B) \times P(A_4 | B) = \frac{2}{9} \times \frac{4}{9} \times \frac{6}{9} \times \frac{6}{9} = \frac{32}{729} \)

  • \( P(A | \bar{B}) = P(A_1 | \bar{B}) \times P(A_2 | \bar{B}) \times P(A_3 | \bar{B}) \times P(A_4 | \bar{B}) = \frac{3}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{2}{5} = \frac{12}{625} \)

The posterior probabilities of the classes are given by:

\[ P(B | A) = \frac{P(A | B) \times P(B)}{P(A)} = \frac{\frac{32}{729} \times \frac{9}{14}}{P(A)} \approx \frac{0.028}{P(A)} \]

and

\[ P(\bar{B} | A) = \frac{P(A | \bar{B}) \times P(\bar{B})}{P(A)} = \frac{\frac{12}{625} \times \frac{5}{14}}{P(A)} \approx \frac{0.007}{P(A)} \]

where \(P(A)\) is the probability of the evidence \(A\).

Since \(P(A)\) is the same for both classes, it suffices to compare the numerators of the posterior probabilities. Because \(0.028 > 0.007\), we have \(P(B | A) > P(\bar{B} | A)\), so we predict that the person buys a computer.
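The whole comparison can also be scripted. A minimal sketch, continuing from the snippet above (it reuses `data`, `class_counts`, and `priors`; like the hand calculation, it applies no smoothing, so a zero count for any attribute value would zero out the whole product):

```python
# Query: age<=30, income=medium, student=yes, credit_rating=fair.
query = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

scores = {}
for c, n_c in class_counts.items():
    score = priors[c]                      # start from the prior P(class)
    for attr, value in query.items():
        # Conditional P(A_i | class) = count(A_i and class) / count(class).
        n_match = sum(1 for row in data if row[attr] == value and row["buys"] == c)
        score *= n_match / n_c
    scores[c] = score                      # proportional to P(class | A); P(A) cancels

print(scores)                              # 'yes' ≈ 0.028, 'no' ≈ 0.007
print(max(scores, key=scores.get))         # 'yes': the person buys a computer
```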

Exercise 2#

Problem#

We are training a document classifier with Naive Bayes. Each document is represented by a sequence of characters (each character is a word). We have the following training data:

Training data:

| d  | c |
|----|---|
| aa | A |
| ab | A |
| ba | A |
| bb | B |

a. Calculate \(P(A)\), \(P(B)\), \(P(a|A)\), \(P(b|A)\), \(P(a|B)\), and \(P(b|B)\) using Laplace smoothing for the conditional probabilities.

b. For each of the following new data points, predict the class of the data point:

  • \(aaba\)

  • \(a\)

  • \(bbba\)

  • \(bccbba\)

  • \(bbbb\)

Answer#

a. The probabilities are:

  • \(P(A) = \dfrac{count(A)}{count(A) + count(B)} = \dfrac{3}{4}\) and \(P(B) = \dfrac{count(B)}{count(A) + count(B)} = \dfrac{1}{4}\)

There are two words in the vocabulary, \(a\) and \(b\). Using Laplace smoothing, we have:

  • \(P(a|A) = \dfrac{count(a|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{4 + 1}{6 + 2} = \dfrac{5}{8}\)

  • \(P(b|A) = \dfrac{count(b|A) + 1}{\sum_{w \in V} count(w|A) + |V|} = \dfrac{2 + 1}{6 + 2} = \dfrac{3}{8}\)

and

  • \(P(a|B) = \dfrac{count(a|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{0 + 1}{2 + 2} = \dfrac{1}{4}\)

  • \(P(b|B) = \dfrac{count(b|B) + 1}{\sum_{w \in V} count(w|B) + |V|} = \dfrac{2 + 1}{2 + 2} = \dfrac{3}{4}\)

where \(V\) is the vocabulary, \(\sum_{w \in V} count(w|A) = 6\) is the total count of words in class \(A\), \(\sum_{w \in V} count(w|B) = 2\) is the total count of words in class \(B\), and \(|V|\) is the size of the vocabulary.
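These smoothed estimates can be reproduced with a short standalone Python sketch (the names `docs`, `vocab`, `priors`, and `cond` are mine, not from the exercise):

```python
from collections import Counter

# Training documents as (text, class) pairs; each character is a word.
docs = [("aa", "A"), ("ab", "A"), ("ba", "A"), ("bb", "B")]
vocab = {"a", "b"}

# Priors: P(A) = 3/4, P(B) = 1/4.
class_counts = Counter(c for _, c in docs)
priors = {c: n / len(docs) for c, n in class_counts.items()}

# Per-class word counts, then Laplace-smoothed conditionals:
# P(w|c) = (count(w|c) + 1) / (total words in c + |V|).
word_counts = {c: Counter() for c in class_counts}
for text, c in docs:
    word_counts[c].update(text)

cond = {
    c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
        for w in vocab}
    for c in class_counts
}
print(cond)  # P(a|A)=5/8, P(b|A)=3/8, P(a|B)=1/4, P(b|B)=3/4
```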

b. Considering each of the new data points in turn (a programmatic check of all five predictions follows the list):

  • \(aaba\):

    \(\begin{align} P(A|aaba) &= \frac{P(A) \times P(a|A) \times P(a|A) \times P(b|A) \times P(a|A)}{P(aaba)} \\ &= \frac{\frac{3}{4} \times \frac{5}{8} \times \frac{5}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(aaba)} \approx \frac{0.069}{P(aaba)} \end{align}\)

    \(\begin{align} P(B|aaba) &= \frac{P(B) \times P(a|B) \times P(a|B) \times P(b|B) \times P(a|B)}{P(aaba)} \\ &= \frac{\frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(aaba)} \approx \frac{0.003}{P(aaba)} \end{align}\)

    \(\Rightarrow P(A|aaba) > P(B|aaba)\), so we predict that the class of the data point is \(A\).

  • \(a\):

    \(\begin{align} P(A|a) &= \frac{P(A) \times P(a|A)}{P(a)} = \frac{\frac{3}{4} \times \frac{5}{8}}{P(a)} \approx \frac{0.469}{P(a)} \end{align}\)

    \(\begin{align} P(B|a) &= \frac{P(B) \times P(a|B)}{P(a)} = \frac{\frac{1}{4} \times \frac{1}{4}}{P(a)} \approx \frac{0.063}{P(a)} \end{align}\)

    \(\Rightarrow P(A|a) > P(B|a)\), so we predict that the class of the data point is \(A\).

  • \(bbba\):

    \(\begin{align} P(A|bbba) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(a|A)}{P(bbba)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{5}{8}}{P(bbba)} \approx \frac{0.025}{P(bbba)} \end{align}\)

    \(\begin{align} P(B|bbba) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(a|B)}{P(bbba)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{1}{4}}{P(bbba)} \approx \frac{0.026}{P(bbba)} \end{align}\)

    \(\Rightarrow P(A|bbba) < P(B|bbba)\), so we predict that the class of the data point is \(B\).

  • \(bccbba\): the word ‘c’ was not in the vocabulary during training, so it is ignored at prediction time. The data point thus reduces to \(bbba\), and we can reuse the probabilities calculated for \(bbba\) to predict the class.

    \(\Rightarrow\) \(bccbba\) is predicted to be in class \(B\) (similar to \(bbba\)).

  • \(bbbb\):

    \(\begin{align} P(A|bbbb) &= \frac{P(A) \times P(b|A) \times P(b|A) \times P(b|A) \times P(b|A)}{P(bbbb)} \\ &= \frac{\frac{3}{4} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8} \times \frac{3}{8}}{P(bbbb)} \approx \frac{0.015}{P(bbbb)} \end{align}\)

    \(\begin{align} P(B|bbbb) &= \frac{P(B) \times P(b|B) \times P(b|B) \times P(b|B) \times P(b|B)}{P(bbbb)} \\ &= \frac{\frac{1}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4} \times \frac{3}{4}}{P(bbbb)} \approx \frac{0.079}{P(bbbb)} \end{align}\)

    \(\Rightarrow bbbb\) is predicted to be in class \(B\).
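All five predictions can be checked programmatically. A minimal sketch, reusing `priors`, `cond`, and `vocab` from the snippet in part a (summing log-probabilities is a standard safeguard against underflow on longer documents, not something the exercise requires):

```python
import math

def predict(text):
    """Return the class with the highest posterior score for `text`."""
    scores = {}
    for c in priors:
        log_score = math.log(priors[c])
        for w in text:
            if w in vocab:                 # drop unseen words such as 'c'
                log_score += math.log(cond[c][w])
        scores[c] = log_score              # log P(c) + sum of log P(w|c)
    return max(scores, key=scores.get)

for text in ["aaba", "a", "bbba", "bccbba", "bbbb"]:
    print(text, "->", predict(text))       # A, A, B, B, B
```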

Note

The probabilities \(P(aaba)\), \(P(a)\), \(P(bbba)\), and \(P(bbbb)\) are the probabilities of the evidence. They do not need to be calculated, because they are identical across classes and cancel when comparing the posterior probabilities.

However, they can be calculated as follows:

\(\begin{align} P(aaba) = & P(aaba|A) \times P(A) + P(aaba|B) \times P(B) \\ = & P(a|A) \times P(a|A) \times P(b|A) \times P(a|A) \times P(A) \\ & + P(a|B) \times P(a|B) \times P(b|B) \times P(a|B) \times P(B) \end{align}\)

\(\ldots\)
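For instance, the marginal \(P(aaba)\) can be computed by the law of total probability, again reusing `priors`, `cond`, and `vocab` from the sketches above:

```python
import math

def marginal(text):
    """P(text) = sum over classes of P(text | c) * P(c), ignoring unseen words."""
    return sum(
        priors[c] * math.prod(cond[c][w] for w in text if w in vocab)
        for c in priors
    )

print(marginal("aaba"))  # ≈ 0.069 + 0.003 ≈ 0.072
```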

Practice#