# A Naive Bayes Spam Classification System
In this article, we use the naive Bayes method to implement a relatively simple English spam classification system in Python, training and testing the model on the small labeled email dataset provided in the text.
## Bayes' Theorem
Mathematically, Bayes' theorem is a general theorem relating the conditional and marginal probabilities of two random variables \(A\) and \(B\). It is expressed as follows.
$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$
In probability and statistics, this property makes the theorem well suited to probabilistic inference: it converts a known conditional probability into the statistical quantity we want, and, given prior knowledge and observed evidence, it measures the confidence in a hypothesis under new facts.
$$P(Hypothesis|Data)=\frac{P(Data|Hypothesis)P(Hypothesis)}{P(Data)}$$
The terms are interpreted as follows:
- \(P(Hypothesis|Data)\), the posterior: the confidence in the hypothesis given the observed data; this is the quantity we want to compute.
- \(P(Data|Hypothesis)\), the likelihood: the probability of observing the data given the hypothesis, generally obtained from objective data.
- \(P(Hypothesis)\), the prior: the confidence in the hypothesis before any data is observed; it must be specified in advance.
- \(P(Data)\), the evidence: the probability of the observed data. As the denominator it also normalizes the posterior, and it is usually computed over all possible hypotheses via the law of total probability.
For binary classification data (such as the email classification in this example), the evidence is determined by only two cases, so the law of total probability contains just two terms. Denote one class by \(S\) and the other by \(\neg S\); the total-probability expression for the evidence can then be written as:
$$P(Data)=P(Data|S)P(S)+P(Data|\neg S)P(\neg S)$$
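To make the formula concrete, here is a minimal Python sketch with made-up numbers (none of these values come from the dataset) showing how a prior, two likelihoods, and the two-term evidence combine into a posterior:

```python
# Hypothetical numbers for illustration only -- not from the dataset
prior_s = 0.3            # P(S): prior probability of class S
likelihood_s = 0.2       # P(Data|S)
likelihood_not_s = 0.01  # P(Data|¬S)

# Two-term law of total probability for the evidence P(Data)
evidence = likelihood_s * prior_s + likelihood_not_s * (1 - prior_s)

# Bayes' theorem: posterior P(S|Data)
posterior_s = likelihood_s * prior_s / evidence
print(posterior_s)  # ≈ 0.896
```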
What sets naive Bayes apart is the conditional-independence assumption it imposes on the likelihood. This is a strong assumption, but it effectively reduces the complexity of computing the conditional probabilities and shrinks the conditional probability space. For an \(n\)-dimensional \(Data\) vector \(x\) with components \(x_1, x_2, \ldots, x_n\), writing the hypothesis \(Hypothesis\) as \(c\), the conditional-independence assumption gives the following likelihood:
$$\begin{align}P(x|c)&=P(x_1|c)P(x_2|c)\cdots P(x_n|c)\\ &=\prod_{i=1}^n{P(x_i|c)}\end{align}$$
That is, the product of the individual conditional probabilities of the components of \(x\). If each component \(x_i\) takes 2 values, the conditional-independence assumption reduces the computational complexity from \(2^n\) to \(n\); the accompanying drawback is that relationships among the components of the vector cannot be represented at all (conditional independence).
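A minimal sketch of this factorization (the per-component probabilities below are made-up values for illustration):

```python
import math

# Hypothetical P(x_i|c) values for an n = 4 component vector
p_xi_given_c = [0.5, 0.1, 0.8, 0.3]

# Conditional independence: P(x|c) is the product over the components
p_x_given_c = math.prod(p_xi_given_c)
print(p_x_given_c)  # 0.012
```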
Naive Bayes is a generative model.
## Variable Definitions
Building on the explanation of Bayes' theorem above, in this example we can compute the probability that an email message is spam \(S\) given event \(E\) with the following expression, where \(E\) denotes the event: a particular email message composed of many words.
$$P(S|E)=\frac{P(E|S)P(S)}{P(E|S)P(S)+P(E|\neg S)P(\neg S)}$$
Specifically:
- \(P(S|E)\): the probability that a message is spam, given that the event occurred.
- \(P(S)\): the prior probability that a message is spam.
- \(P(\neg S)\): the prior probability that a message is not spam.
- \(P(E|S)\): the probability of event \(E\) occurring in spam messages.
- \(P(E|\neg S)\): the probability of event \(E\) occurring in normal (ham) messages.
## Reading the Data
The dataset used here is an email dataset in CSV format. The first column is the label column, where `ham` marks a normal message and `spam` marks a spam message; all remaining columns contain the message content.
We read the dataset into a `DataFrame` object with the `pandas` library, using `latin-1` encoding to avoid decoding errors (spam messages may contain odd Latin characters).
```python
import pandas as pd

df = pd.read_csv("spam.csv", keep_default_na=False, encoding='latin-1')
df.head(20)
```
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only … | | | |
| 1 | ham | Ok lar… Joking wif u oni… | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina… | | | |
| 3 | ham | U dun say so early hor… U c already then say… | | | |
| 4 | ham | Nah I don’t think he goes to usf, he lives aro… | | | |
| 5 | spam | FreeMsg Hey there darling it’s been 3 week’s n… | | | |
| 6 | ham | Even my brother is not like to speak with me. … | | | |
| 7 | ham | As per your request ‘Melle Melle (Oru Minnamin… | | | |
| 8 | spam | WINNER!! As a valued network customer you have… | | | |
| 9 | spam | Had your mobile 11 months or more? U R entitle… | | | |
| 10 | ham | I’m gonna be home soon and i don’t want to tal… | | | |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po… | | | |
| 12 | spam | URGENT! You have won a 1 week FREE membership … | | | |
| 13 | ham | I’ve been searching for the right words to tha… | | | |
| 14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | | | |
| 15 | spam | XXXMobileMovieClub: To use your credit, click … | | | |
| 16 | ham | Oh k…i’m watching here:) | | | |
| 17 | ham | Eh u remember how 2 spell his name… Yes i di… | | | |
| 18 | ham | Fine if thatåÕs the way u feel. ThatåÕs the wa… | | | |
| 19 | spam | England v Macedonia – dont miss the goals/team… | | | |
## Data Cleaning
We are only interested in the words in each message, so we remove all punctuation and digits from the dataset, leaving only alphabetic characters (converted to lowercase). At the same time, we merge all of the content columns into one and rename the columns to something meaningful.
```python
# combine column 1 to 4 (messages)
df["v2"] += df[df.columns[2]] + df[df.columns[3]] + df[df.columns[4]]
df = df.drop(columns=df.columns[[2, 3, 4]])

# rename columns
clean = df.rename(columns={"v1": "Category", "v2": "Message"})

# filter and clean messages
for i in clean.index:
    s = ""
    for c in clean.loc[i, "Message"]:
        if c.isalpha() or c == " ":
            s += c
    clean.loc[i, "Message"] = s.lower()

clean.head(5)
```
| | Category | Message |
|---|---|---|
| 0 | ham | go until jurong point crazy available only in … |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry in a wkly comp to win fa cup final… |
| 3 | ham | u dun say so early hor u c already then say |
| 4 | ham | nah i dont think he goes to usf he lives aroun… |
## Splitting the Dataset
Split the full dataset into two random samples, a training set and a test set, containing data in a 3:1 ratio.
```python
# split dataset into 3:1
train_data = clean.sample(frac=0.75)
test_data = clean[~clean.index.isin(train_data.index)]
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

print("size of train_data:", len(train_data))
print("size of test_data:", len(test_data))
```
```
size of train_data: 4179
size of test_data: 1393
```
## Computing Word Frequencies
For each distinct word in the training set, count the number of spam and ham messages that contain it, and save the counts as the word-frequency statistics. The frequencies not only let us see the characteristic words of each message type at a glance; they will also serve as the basis for computing the likelihoods and the evidence later.
```python
# count word frequency
freq_count = {}
for i in train_data.index:
    # is a spam message
    isSpam = train_data.loc[i, "Category"] == "spam"
    # split words and use set to remove duplicated words
    words = set(train_data.loc[i, "Message"].split())
    for w in words:
        # skip empty words
        if not w:
            continue
        # record frequency with [#Spam, #Ham]
        if w not in freq_count.keys():
            freq_count[w] = [0, 0]
        freq_count[w][0 if isSpam else 1] += 1

# transform into 2-D list
freq_count = [[w] + freq_count[w] for w in freq_count]

# generate dataframe
word_freq = pd.DataFrame(freq_count, columns=["Word", "#Spam", "#Ham"])
word_freq
```
| | Word | #Spam | #Ham |
|---|---|---|---|
| 0 | until | 4 | 16 |
| 1 | back | 14 | 93 |
| 2 | me | 18 | 502 |
| 3 | i | 29 | 1200 |
| 4 | im | 9 | 315 |
| … | … | … | … |
| 7383 | greeting | 0 | 1 |
| 7384 | tosend | 0 | 1 |
| 7385 | treats | 0 | 1 |
| 7386 | meive | 0 | 1 |
| 7387 | gotany | 0 | 1 |

7388 rows × 3 columns
## Data Visualization (Word Cloud)
Use the `wordcloud` library to visualize the words that appear most frequently in spam messages.
```python
import wordcloud

wordcloud.WordCloud(width=500, height=300).fit_words(
    dict((word_freq.loc[i, "Word"], word_freq.loc[i, "#Spam"]) for i in word_freq.index)
).to_image()
```
## Computing the Conditional Probabilities
That is, computing \(P(E|S)\) and \(P(E|\neg S)\). For the former, divide each word's saved spam frequency by the total number of spam messages; similarly, for the latter, divide each word's ham frequency by the total number of ham messages. (An alternative method is to divide each word's raw occurrence count by the total number of words in the corpus.)
Note that if a word has never appeared in ham or spam messages, the computed result is 0. This would make the subsequent likelihood computation evaluate to 0 as well, rendering it useless. We therefore add a constant \(k\) to the frequency counts of each class; when \(k=1\), this operation is called Laplace smoothing (the Laplace estimator).
For example, for \(P(E|S)\):
$$P(E|S) = \frac{\text{number of spams containing the word} + k}{\text{total number of spams} + 2k}$$
In the computation below, I take \(k=0.5\).
```python
# set k value (The Laplace Estimator -> k = 1)
k = 0.5

# calculate probabilities
prob_list = []
spam_msg_num = len(train_data[train_data["Category"] == "spam"])
ham_msg_num = len(train_data[train_data["Category"] == "ham"])
for i in word_freq.index:
    prob_list.append((word_freq.loc[i, "Word"],
                      (word_freq.loc[i, "#Spam"] + k) / (spam_msg_num + 2 * k),
                      (word_freq.loc[i, "#Ham"] + k) / (ham_msg_num + 2 * k)))

# generate dataframe
word_prob = pd.DataFrame(prob_list, columns=["Word", "P(E|S)", "P(E|¬S)"])
word_prob
```
| | Word | P(E\|S) | P(E\|¬S) |
|---|---|---|---|
| 0 | until | 0.008108 | 0.004550 |
| 1 | back | 0.026126 | 0.025786 |
| 2 | me | 0.033333 | 0.138582 |
| 3 | i | 0.053153 | 0.331081 |
| 4 | im | 0.017117 | 0.087010 |
| … | … | … | … |
| 7383 | greeting | 0.000901 | 0.000414 |
| 7384 | tosend | 0.000901 | 0.000414 |
| 7385 | treats | 0.000901 | 0.000414 |
| 7386 | meive | 0.000901 | 0.000414 |
| 7387 | gotany | 0.000901 | 0.000414 |

7388 rows × 3 columns
## Computing the Posterior for a Single Word
Before computing the posterior, we still need the prior probability that a message is spam or ham. We can assign a reasonable prior by hand based on experience, or use the proportions of spam and ham messages in the dataset as the priors. Note that the smoothing parameter \(k\) must be added when computing the priors as well, to avoid zero values.
At this point the model-training portion is entirely finished, and we have all the parameters needed for the computation. Let's try computing, from the formula, the posterior probabilities that a single word belongs to spam and to ham. This stage is meant to confirm the model's viability rather than to test it, so we still use the training set.
```python
# use proportion in the dataset as the prior value (do not forget pseudo-counts)
prior_spam = (spam_msg_num + k) / (len(train_data) + 2 * k)
prior_ham = (ham_msg_num + k) / (len(train_data) + 2 * k)

# choose word
word = 'free'

# calculate evidence (total probability)
evidence = (word_prob[word_prob['Word'] == word]['P(E|S)'] * prior_spam) \
    + (word_prob[word_prob['Word'] == word]['P(E|¬S)'] * prior_ham)

# calculate posterior
spamliness = (word_prob[word_prob['Word'] == word]['P(E|S)'] * prior_spam) / evidence
hamliness = (word_prob[word_prob['Word'] == word]['P(E|¬S)'] * prior_ham) / evidence

# generate output
print("Output")
print("Word = ['" + word + "']")
print("P(E|S) = [" + str(float(word_prob[word_prob['Word'] == word]['P(E|S)'])) + "]")
print("P(E|¬S) = [" + str(float(word_prob[word_prob['Word'] == word]['P(E|¬S)'])) + "]")
print("P(S|E) = [" + str(float(spamliness)) + "]")
print("P(¬S|E) = [" + str(float(hamliness)) + "]")
```
```
Output
Word = ['free']
P(E|S) = [0.22612612612612612]
P(E|¬S) = [0.013375620518477661]
P(S|E) = [0.7211108654760914]
P(¬S|E) = [0.2788891345239086]
```
## Computing the Posterior for a Group of Words
An email message usually consists of many words, so we also need to be able to compute the posterior probabilities that a message composed of a group of words is spam or ham.
Here we adopt the conditional-independence assumption of the naive Bayes method: the individual word events are mutually independent and do not affect one another, so the overall likelihood is simply the product of all the words' conditional probabilities. (Given the grammar and context of natural language, this independence assumption may not hold strictly in practice; we ignore this model deficiency for now.)
$$P(E|S)=P(x_1,x_2,\cdots,x_n|S)=\prod_{i=1}^n{P(x_i|S)}$$
If the probability that a message is spam is greater than the probability that it is ham, we classify it as spam, and vice versa.
Words encountered during the computation that are not in the vocabulary `word_prob` are simply ignored.
```python
# set and clean input message
myMessage = "Because I don't count so good, I put some podcasts in Week 10 instead of Week 11. These end the topic of machine learning. Ethics is the last topic, which Stewart put in the correct week."
s = ""
for c in myMessage:
    if c.isalpha() or c == " ":
        s += c
myMessage = s.lower().split()

# calculate prob_spam & prob_ham (numerators)
prob_spam = float(prior_spam)
prob_ham = float(prior_ham)
for word in myMessage:
    # skip words which are not in word_prob
    if not (word_prob['Word'] == word).any():
        continue
    prob_spam *= float(word_prob[word_prob['Word'] == word]['P(E|S)'])
    prob_ham *= float(word_prob[word_prob['Word'] == word]['P(E|¬S)'])

print("prob_spam (numerator):", prob_spam)
print("prob_ham (numerator):", prob_ham)

# calculate evidence
evidence = prob_spam + prob_ham

# calculate probability
prob_spam /= evidence
prob_ham /= evidence

# output
print("prob_spam:", prob_spam)
print("prob_ham:", prob_ham)
if float(prob_spam) > float(prob_ham):
    print("This is a spam message.")
else:
    print("This is not a spam message.")
```
```
prob_spam (numerator): 1.6114026051818035e-44
prob_ham (numerator): 1.6612941946990633e-43
prob_spam: 0.08842033251559195
prob_ham: 0.9115796674844081
This is not a spam message.
```
## Model Simplification and Avoiding Underflow
A careful look at the posterior formula shows that the denominators (the evidence) of the two classes' posteriors are identical, and our ultimate goal is only to compare the posteriors. The denominator can therefore be dropped from both computations, and the new results are guaranteed to remain proportional to the true posteriors.
$$P(S|x_1,x_2,\cdots,x_n) \propto P(S)\cdot P(x_1|S)\cdots P(x_n|S)$$
The results from the previous section show that the numerators are extremely small (on the order of \(10^{-44}\)); in practice they can easily exceed floating-point precision and cause underflow errors. To handle this, we take the logarithm of the formula above: by the properties of logarithms, products become sums, neatly avoiding the appearance of tiny values.
$$\begin{align} \log(P(S|x_1,x_2,\cdots,x_n)) &\propto \log(P(S)\cdot P(x_1|S)\cdots P(x_n|S)) \\ &= \log(P(S))+\log(P(x_1|S))+\cdots+\log(P(x_n|S)) \end{align}$$
```python
from math import log2

# set and clean input message
myMessage = "Because I don't count so good, I put some podcasts in Week 10 instead of Week 11. These end the topic of machine learning. Ethics is the last topic, which Stewart put in the correct week."
s = ""
for c in myMessage:
    if c.isalpha() or c == " ":
        s += c
myMessage = s.lower()

def spam_filter(message, showDetail=False):
    message = message.split()
    # calculate log_prob_spam & log_prob_ham
    log_prob_spam = log2(float(prior_spam))
    log_prob_ham = log2(float(prior_ham))
    for word in message:
        # skip words which are not in word_prob
        if not (word_prob['Word'] == word).any():
            continue
        # sum the log value
        log_prob_spam += log2(float(word_prob[word_prob['Word'] == word]['P(E|S)']))
        log_prob_ham += log2(float(word_prob[word_prob['Word'] == word]['P(E|¬S)']))
    # output detail info
    if showDetail:
        print("log_prob_spam (numerator):", log_prob_spam)
        print("log_prob_ham (numerator):", log_prob_ham)
    return 'spam' if float(log_prob_spam) > float(log_prob_ham) else 'ham'

# output
if spam_filter(myMessage, showDetail=True) == 'spam':
    print("This is a spam message.")
else:
    print("This is not a spam message.")
```
```
log_prob_spam (numerator): -145.4765191819722
log_prob_ham (numerator): -142.11060050074536
This is not a spam message.
```
## Model Testing
Everything is now in place; let's run the final evaluation on the test set and print the statistics.
```python
match_spam = match_ham = thought_ham_is_spam = thought_spam_is_ham = 0

# iterate over test data
for i in test_data.index:
    if spam_filter(test_data.loc[i, 'Message']) == 'spam':
        if test_data.loc[i, 'Category'] == 'spam':
            match_spam += 1
        else:
            thought_ham_is_spam += 1
    else:
        if test_data.loc[i, 'Category'] == 'ham':
            match_ham += 1
        else:
            thought_spam_is_ham += 1

# output
print('match_spam', match_spam)
print('match_ham', match_ham)
print('thought_ham_is_spam', thought_ham_is_spam)
print('thought_spam_is_ham', thought_spam_is_ham)
print('Accuracy:', (match_spam + match_ham) / len(test_data.index))
```
```
match_spam 187
match_ham 1104
thought_ham_is_spam 96
thought_spam_is_ham 6
Accuracy: 0.9267767408470926
```
As we can see, the model achieves a fairly satisfactory accuracy.
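The four counts printed above form a small confusion matrix, so metrics beyond accuracy can be read off as well. As a quick follow-up (not part of the original evaluation), here is a sketch that derives precision and recall for the spam class from those counts:

```python
# Counts copied from the output above (spam treated as the positive class)
tp, tn = 187, 1104   # match_spam, match_ham
fp, fn = 96, 6       # thought_ham_is_spam, thought_spam_is_ham

precision = tp / (tp + fp)  # fraction of predicted spam that really is spam
recall = tp / (tp + fn)     # fraction of actual spam that is caught
print("precision:", precision)  # ≈ 0.661
print("recall:", recall)        # ≈ 0.969
```

The 96 ham messages flagged as spam show that the headline accuracy hides a sizable false-positive rate.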
## Improvements and Enhancements
- Use a larger dataset to expand the vocabulary and obtain more accurate conditional-probability parameters.
- Experiment with different prior values to tune the model.
- Exclude short, common function words that do not help classification, such as "a" and "to", from the vocabulary (see the sketch after this list).
- Consider more features typical of spam, such as phone numbers, URLs, and email addresses.
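For the stop-word item above, a minimal sketch of one way to implement it (the stop-word set here is an illustrative assumption, not part of the original model):

```python
# Hypothetical stop-word set -- illustration only
STOP_WORDS = {"a", "to", "the", "and", "in", "of", "u", "i"}

# Drop stop words from the trained vocabulary before classifying
word_prob = word_prob[~word_prob["Word"].isin(STOP_WORDS)].reset_index(drop=True)
```

Because `spam_filter` already skips any word missing from `word_prob`, removing the rows is sufficient; no other change is needed.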