Coding With Fun

Python spam identification


Jun 01, 2021 · Blog article



This article is reproduced from the Zhihu personal column of Charles (Bai Lu).


Introduction

Use simple machine learning algorithms to identify spam.

Let's get started.


Related documents

Baidu Netdisk download link: https://pan.baidu.com/s/1Hsno4oREMROxWwcC_jYAOA

Password: qa49

The dataset comes from the internet; it will be removed upon request if it infringes any rights.


Development tools

Python version: 3.6.4

Related modules:

scikit-learn module;

Jieba module;

numpy module;

and some modules from Python's standard library.


Environment setup

Install Python, add it to your PATH environment variable, and use pip to install the required modules.
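Assuming a standard pip setup, the third-party modules listed above can be installed in one command (exact versions are up to you; the article itself only pins Python 3.6.4):

```shell
# Install the three third-party modules the article relies on.
pip install scikit-learn jieba numpy
```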

Step-by-step implementation

(1) Divide the data set

Most of the datasets used online for spam identification consist of English messages, so to show some sincerity I took the time to find a Chinese email dataset. It is divided as follows:

Training dataset:

7063 normal messages (under the data/normal folder);

7775 spam messages (under the data/spam folder).

Test dataset:

A total of 392 messages (under the data/test folder).

(2) Create a dictionary

The message content in the dataset is generally as follows:

[Figure: sample message content]

First, we use a regular expression to filter out non-Chinese characters, then use the jieba library to segment the sentences and remove stop words, and finally use those results to build a dictionary in the following format:

{"word 1": frequency of word 1, "word 2": frequency of word 2, ...}

The specific implementation lives in the "utils.py" file and is called from the main program (train.py):

[Figure: dictionary-building code called from train.py]

The final result is saved in the "results.pkl" file.
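The original code for this step is not reproduced here, but the described pipeline (regex-filter Chinese characters, segment with jieba, drop stop words, count frequencies) might look roughly like the sketch below. The function name and the injectable `tokenizer` parameter are my own additions, not from the article's utils.py:

```python
import re
from collections import Counter

def build_word_freq(texts, tokenizer=None, stop_words=frozenset()):
    """Count word frequencies over a list of message bodies."""
    if tokenizer is None:
        import jieba  # the article uses jieba segmentation
        tokenizer = jieba.lcut
    freq = Counter()
    for text in texts:
        # Keep only Chinese characters, as the article describes.
        text = "".join(re.findall(r"[\u4e00-\u9fa5]+", text))
        # Segment the text and drop stop words.
        freq.update(w for w in tokenizer(text) if w not in stop_words)
    return dict(freq)  # {"word 1": count, "word 2": count, ...}
```

The result could then be pickled to a file such as "results.pkl" with the standard `pickle` module.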

Are we done? Of course not!

The dictionary now contains 52,113 words, which is obviously too many: some appear only once or twice, and it would clearly be unwise to waste a feature dimension on them during later feature extraction. Therefore we keep only the 4,000 most frequent words as the final dictionary:

[Figure: code trimming the dictionary to the 4,000 most frequent words]

The final result is saved in the "wordsDict.pkl" file.
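Trimming to the top 4,000 words is a one-liner with `collections.Counter`; a possible sketch (the helper name and index mapping are assumptions, not the article's code):

```python
from collections import Counter

def top_k_dictionary(word_freq, k=4000):
    """Keep only the k most frequent words, mapping each to a fixed column index."""
    most_common = Counter(word_freq).most_common(k)
    return {word: idx for idx, (word, _count) in enumerate(most_common)}
```

Mapping each surviving word to a stable index makes the next step, feature extraction, straightforward.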

(3) Feature extraction

Once the dictionary is ready, we can convert the contents of each message into a word vector. Its dimension is obviously 4000, and each dimension records how often one of the high-frequency words appears in the message. Finally, we stack these word vectors into one large feature matrix of size:

(7063+7775)×4000

That is, the first 7063 rows are the feature vectors of the normal messages, and the remaining rows are the feature vectors of the spam.

The implementation of this step is again in the "utils.py" file and is called from the main program as follows:

[Figure: feature-extraction code called from train.py]

The final result is saved in the "fvs_%d_%d.npy" file, where the first formatting character is the number of normal messages and the second is the number of spam messages.
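A minimal sketch of the word-vector conversion described above (the function name is hypothetical; `word_index` is assumed to map each of the 4,000 dictionary words to a column index):

```python
import numpy as np

def message_to_vector(words, word_index):
    """Turn a tokenized message into a frequency vector over the trimmed dictionary."""
    vec = np.zeros(len(word_index), dtype=np.float64)
    for w in words:
        idx = word_index.get(w)
        if idx is not None:  # words outside the dictionary are ignored
            vec[idx] += 1
    return vec

# Stacking one such vector per message with np.vstack yields the
# (7063+7775) x 4000 feature matrix, which np.save can write to a .npy file.
```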

(4) Train classifiers

We use the scikit-learn machine learning library to train the classifiers; for models we choose a naive Bayes classifier and an SVM (support vector machine):

[Figure: classifier-training code]
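With scikit-learn, training the two models the article names could look like the sketch below. The concrete estimator classes (`MultinomialNB`, `LinearSVC`), the label encoding, and all hyperparameters are my assumptions; the article does not show its exact choices:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_models(feature_matrix, labels):
    """Fit a multinomial naive Bayes model and a linear SVM on word-count features."""
    # MultinomialNB suits non-negative count features like word frequencies.
    nb = MultinomialNB().fit(feature_matrix, labels)
    # LinearSVC is a common SVM choice for high-dimensional sparse text data.
    svm = LinearSVC().fit(feature_matrix, labels)
    return nb, svm
```

Here `labels` might encode normal mail as 1 and spam as 0, matching the row order of the feature matrix.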

(5) Performance testing

Test the model with a test dataset:

[Figure: model-testing code]
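The evaluation step could be sketched as follows; the helper name is hypothetical, and the accuracy/confusion-matrix pair is one reasonable way to compare the two models, not necessarily the article's exact metrics:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(model, test_features, test_labels):
    """Report accuracy and the confusion matrix on the held-out test set."""
    preds = model.predict(test_features)
    return accuracy_score(test_labels, preds), confusion_matrix(test_labels, preds)
```

The confusion matrix is what reveals the tendency noted below: a model leaning toward spam shows more normal messages misclassified as spam than the reverse.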

The results are as follows:

[Figures: test results for naive Bayes and for SVM]

We can see that the two models perform similarly (the SVM is slightly better than naive Bayes), but the SVM is more inclined to classify messages as spam.

That's all~

The full source code can be found in the relevant files.


More

The principles behind the models are not explained here; a follow-up series may describe the common machine learning algorithms in more complete detail. So let's leave it at that for now.