Coding With Fun

Python spam identification


Jun 01, 2021 · Blog article



This article is reproduced from the Zhihu personal column of Charles (Bai Lu).


Introduction

Use simple machine learning algorithms to identify spam.

Let's get started.


Related documents

Baidu Netdisk download link: https://pan.baidu.com/s/1Hsno4oREMROxWwcC_jYAOA

Password: qa49

The dataset comes from the internet; it will be removed upon request if it infringes any rights.


Development tools

Python version: 3.6.4

Related modules:

scikit-learn module;

Jieba module;

numpy module;

and some modules from Python's standard library.


Environment setup

Install Python, add it to your PATH environment variable, and use pip to install the required modules.
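Assuming a standard pip setup, the third-party modules listed above can be installed in one command (exact versions are up to you; the article itself only pins Python 3.6.4):

```shell
# Install the three third-party modules the article relies on.
pip install scikit-learn jieba numpy
```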

Step-by-step implementation

(1) Divide the data set

Most of the datasets used online for spam identification consist of English messages, so to show some sincerity I took the time to find a Chinese email dataset. It is divided as follows:

Training dataset:

7063 normal messages (under the data/normal folder);

7775 spam messages (under the data/spam folder).

Test dataset:

A total of 392 messages (under the data/test folder).

(2) Create a dictionary

The message content in the dataset is generally as follows:

[Figure: sample message content]

First, we use a regular expression to filter out non-Chinese characters, then use the jieba library to segment the sentences and remove stop words, and finally use those results to build a dictionary in the following format:

{"word 1": frequency of word 1, "word 2": frequency of word 2, ...}

The specific implementation lives in the "utils.py" file and is called from the main program (train.py):

[Figure: dictionary-building code called from train.py]

The final result is saved in the "results.pkl" file.
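The original code for this step is not reproduced here, but the described pipeline (regex-filter Chinese characters, segment with jieba, drop stop words, count frequencies) might look roughly like the sketch below. The function name and the injectable `tokenizer` parameter are my own additions, not from the article's utils.py:

```python
import re
from collections import Counter

def build_word_freq(texts, tokenizer=None, stop_words=frozenset()):
    """Count word frequencies over a list of message bodies."""
    if tokenizer is None:
        import jieba  # the article uses jieba segmentation
        tokenizer = jieba.lcut
    freq = Counter()
    for text in texts:
        # Keep only Chinese characters, as the article describes.
        text = "".join(re.findall(r"[\u4e00-\u9fa5]+", text))
        # Segment the text and drop stop words.
        freq.update(w for w in tokenizer(text) if w not in stop_words)
    return dict(freq)  # {"word 1": count, "word 2": count, ...}
```

The result could then be pickled to a file such as "results.pkl" with the standard `pickle` module.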

Are we done? Of course not!

The dictionary now contains 52,113 words, which is obviously too many: some appear only once or twice, and it would clearly be unwise to waste a feature dimension on them during later feature extraction. Therefore we keep only the 4,000 most frequent words as the final dictionary:

[Figure: code trimming the dictionary to the 4,000 most frequent words]

The final result is saved in the "wordsDict.pkl" file.
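Trimming to the top 4,000 words is a one-liner with `collections.Counter`; a possible sketch (the helper name and index mapping are assumptions, not the article's code):

```python
from collections import Counter

def top_k_dictionary(word_freq, k=4000):
    """Keep only the k most frequent words, mapping each to a fixed column index."""
    most_common = Counter(word_freq).most_common(k)
    return {word: idx for idx, (word, _count) in enumerate(most_common)}
```

Mapping each surviving word to a stable index makes the next step, feature extraction, straightforward.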

(3) Feature extraction

Once the dictionary is ready, we can convert the contents of each message into a word vector. Its dimension is obviously 4000, and each dimension records how often one of the high-frequency words appears in the message. Finally, we stack these word vectors into one large feature matrix of size:

(7063+7775)×4000

That is, the first 7063 rows are the feature vectors of the normal messages, and the remaining rows are the feature vectors of the spam.

The implementation of this step is again in the "utils.py" file and is called from the main program as follows:

[Figure: feature-extraction code called from train.py]

The final result is saved in the "fvs_%d_%d.npy" file, where the first formatting character is the number of normal messages and the second is the number of spam messages.
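A minimal sketch of the word-vector conversion described above (the function name is hypothetical; `word_index` is assumed to map each of the 4,000 dictionary words to a column index):

```python
import numpy as np

def message_to_vector(words, word_index):
    """Turn a tokenized message into a frequency vector over the trimmed dictionary."""
    vec = np.zeros(len(word_index), dtype=np.float64)
    for w in words:
        idx = word_index.get(w)
        if idx is not None:  # words outside the dictionary are ignored
            vec[idx] += 1
    return vec

# Stacking one such vector per message with np.vstack yields the
# (7063+7775) x 4000 feature matrix, which np.save can write to a .npy file.
```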

(4) Train classifiers

We use the scikit-learn machine learning library to train the classifiers; for models we choose a naive Bayes classifier and an SVM (support vector machine):

[Figure: classifier-training code]
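With scikit-learn, training the two models the article names could look like the sketch below. The concrete estimator classes (`MultinomialNB`, `LinearSVC`), the label encoding, and all hyperparameters are my assumptions; the article does not show its exact choices:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_models(feature_matrix, labels):
    """Fit a multinomial naive Bayes model and a linear SVM on word-count features."""
    # MultinomialNB suits non-negative count features like word frequencies.
    nb = MultinomialNB().fit(feature_matrix, labels)
    # LinearSVC is a common SVM choice for high-dimensional sparse text data.
    svm = LinearSVC().fit(feature_matrix, labels)
    return nb, svm
```

Here `labels` might encode normal mail as 1 and spam as 0, matching the row order of the feature matrix.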

(5) Performance testing

Test the model with a test dataset:

[Figure: model-testing code]
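The evaluation step could be sketched as follows; the helper name is hypothetical, and the accuracy/confusion-matrix pair is one reasonable way to compare the two models, not necessarily the article's exact metrics:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(model, test_features, test_labels):
    """Report accuracy and the confusion matrix on the held-out test set."""
    preds = model.predict(test_features)
    return accuracy_score(test_labels, preds), confusion_matrix(test_labels, preds)
```

The confusion matrix is what reveals the tendency noted below: a model leaning toward spam shows more normal messages misclassified as spam than the reverse.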

The results are as follows:

[Figures: test results for naive Bayes and for SVM]

We can see that the two models perform similarly (the SVM is slightly better than naive Bayes), but the SVM is more inclined to classify messages as spam.

That's all~

The full source code can be found in the relevant files.


More

The principles behind the models are not explained here; a follow-up series may describe the common machine learning algorithms in more complete detail. So let's leave it at that for now.