May 31, 2021 Article blog
Hello everyone, I'm your dear w3cschool editor.
Recently the hottest TV series has been "Daqin Fu". Your editor has been following it since it premiered on December 1, and the early word of mouth was great. But after three episodes, all I could say was "good grief". I'm sure those of you who have watched it feel the same way. The show's rating on Douban has also dropped from an initial 8.9 to 6.0 today.
So I couldn't resist using Python to crawl the comment data for "Daqin Fu" on Douban, run a round of analysis, and share the approach with you.
Since we're going to do analysis, we need to get the data first. I chose Douban for a simple reason: its data is fairly complete and its anti-crawling measures are not hard to deal with. The information we mainly want is the star rating of each review and the time of the comment; we also pull down the commenter and the comment text.
On the technical side, we use five libraries: requests, bs4, pandas, time, and matplotlib, with PyCharm and Anaconda's Jupyter Notebook as the tools.
Before we start writing code, let's look at how the page URL changes from page to page:
https://movie.douban.com/subject/26413293/comments?status=P
https://movie.douban.com/subject/26413293/comments?start=20&limit=20&status=P&sort=new_score
https://movie.douban.com/subject/26413293/comments?start=40&limit=20&status=P&sort=new_score
As you can see, the only thing that changes between pages is the start parameter, so to turn pages later on we only need to modify start. Here is a question for you: how should the URL of the first page be handled?
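The URL pattern above can be sketched as a small helper. This is only a sketch: the function name build_comment_url is made up for illustration, and treating the first page as start=0 is an assumption that the article does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/26413293/comments"
PAGE_SIZE = 20  # start advances by 20 per page in the URLs above


def build_comment_url(page):
    """Return the comments URL for a zero-based page number.

    Assumption: page 0 can be requested with start=0, in the same
    form as the later pages.
    """
    params = {
        "start": page * PAGE_SIZE,
        "limit": PAGE_SIZE,
        "status": "P",
        "sort": "new_score",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# URLs for the first three pages
urls = [build_comment_url(page) for page in range(3)]
```

Building the query string from a dict keeps the paging logic in one place, so only the page number has to change in the crawling loop.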
About anti-crawling:
As far as Douban's anti-crawling measures go, it is very easy to find the real short-review link. But I suggest that you log in and send your cookie when crawling, otherwise Douban will identify you as a crawler and block your IP. As for what to put in the headers, here is a reference:
headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "movie.douban.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Cookie": "Here is your own cookie",
}
So where do you find the cookie? Open the web page and press F12 (or right-click and choose Inspect), click Network and refresh the page, select the first request that appears, and scroll down through its request headers until you find the Cookie.
One more note: I intended to pull down all of the short reviews for "Daqin Fu", but in the end only managed to crawl 500. This is presumably another of Douban's anti-crawling measures (if any experts out there work out how to get around it, please let me know).
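Given that cap, a paging loop only needs to run until a page comes back empty. Below is a minimal sketch of such a loop, not the code actually used for this article: crawl_pages and fetch_page are hypothetical names, and fetch_page stands in for whatever function you write with requests and bs4 to download and parse one page.

```python
import time


def crawl_pages(fetch_page, max_pages=25, delay=2.0):
    """Collect comment rows page by page until a page comes back empty.

    fetch_page(start) is a hypothetical callable that returns the list of
    rows found at that offset. max_pages=25 corresponds to 500 comments
    at 20 per page.
    """
    rows = []
    for page in range(max_pages):
        batch = fetch_page(page * 20)
        if not batch:  # empty page: we reached the end or the site's limit
            break
        rows.extend(batch)
        time.sleep(delay)  # pause between requests to go easy on the site
    return rows
```

Passing the fetch function in as a parameter makes the loop easy to try out with fake data before pointing it at the real site.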
Second, data processing
1, remove duplicate values
View the number of duplicates with the following command:
np.sum(df.duplicated())  # count the duplicated rows (requires import numpy as np)
Delete rows in which every column is duplicated with the following command:
df.drop_duplicates(keep=False)  # keep=False drops every copy of a duplicated row
The results are as follows:
As you can see, the result looks pretty clean.
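To make the effect of the keep argument concrete, here is a small sketch on made-up data. The rows and the column names user, star, and comment are illustrative assumptions, not the real crawled data.

```python
import pandas as pd

# Hypothetical sample rows; the second and third rows are identical
df = pd.DataFrame(
    {
        "user": ["a", "b", "b", "c"],
        "star": [1, 2, 2, 5],
        "comment": ["too slow", "so-so", "so-so", "great"],
    }
)

# Rows that repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# keep=False removes every copy of a duplicated row (both "b" rows go)
all_copies_removed = df.drop_duplicates(keep=False)

# keep="first" keeps one copy of each duplicated row
one_copy_kept = df.drop_duplicates(keep="first")
```

If you want each repeated comment to still count once, keep="first" is the option to use; keep=False throws away the comment entirely.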
Third, visual analysis
As the saying goes: words are not as good as tables, and tables are not as good as charts. The data we crawl ultimately needs a visual presentation, so that we can analyze the patterns behind it and understand it clearly. Below, your editor visualizes the data we obtained in the following ways.
1, trend in the number of comments over time
It's not hard to see from the chart that the number of short reviews kept rising until December 4, when it peaked. That matches how my own feelings about the show changed, from anticipation to disappointment.
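A chart like this can be built by counting comments per day. The sketch below uses made-up timestamps, and the column name time and the function names are assumptions for illustration, not the article's actual code.

```python
import pandas as pd

# Hypothetical comment timestamps
df = pd.DataFrame(
    {
        "time": [
            "2020-12-01 20:15:00",
            "2020-12-02 09:30:00",
            "2020-12-02 21:05:00",
            "2020-12-04 22:40:00",
        ]
    }
)


def comments_per_day(frame):
    """Count comments per calendar day, sorted by date."""
    days = pd.to_datetime(frame["time"]).dt.floor("D")
    return days.value_counts().sort_index()


daily = comments_per_day(df)


def plot_daily(daily_counts):
    """Draw the daily counts as a line chart."""
    import matplotlib.pyplot as plt

    ax = daily_counts.plot(kind="line", marker="o")
    ax.set_xlabel("date")
    ax.set_ylabel("number of comments")
    plt.show()
```

Call plot_daily(daily) to draw the chart; keeping the counting separate from the plotting makes the numbers easy to check on their own.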
2, star rating pie chart
As the figure shows, ratings for the show are very low: 1-star and 2-star reviews take up almost the whole pie chart, which means the show has not won viewers over. Well, enough said: your editor is one of the one-star contributors.
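The pie chart comes down to the share of each star rating. Here is a minimal sketch with made-up ratings; the values and the function names are illustrative assumptions, not the crawled data.

```python
import pandas as pd

# Hypothetical star ratings (1 to 5)
stars = pd.Series([1, 1, 2, 1, 5, 2, 1, 3], name="star")


def star_shares(ratings):
    """Return the fraction of reviews for each star level from 1 to 5."""
    counts = ratings.value_counts().reindex(list(range(1, 6)), fill_value=0)
    return counts / counts.sum()


shares = star_shares(stars)


def plot_star_pie(star_fractions):
    """Draw the rating shares as a pie chart, skipping empty slices."""
    import matplotlib.pyplot as plt

    nonzero = star_fractions[star_fractions > 0]
    labels = [f"{star} star" for star in nonzero.index]
    plt.pie(nonzero, labels=labels, autopct="%.1f%%")
    plt.title("Star rating distribution")
    plt.show()
```

Reindexing to all five star levels keeps a rating that nobody gave in the table as zero instead of silently dropping it.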