May 30, 2021 · Blog article
Getting a data set is the first step in data analysis. The main ways to obtain data today are: using an off-the-shelf data set; writing your own crawler to collect data; or using an existing crawler tool to grab what you need and save it to a database or a local file.
Recommended reading: Python static crawlers, Python Scrapy web crawlers
The basic crawling workflow:
1. Initiate a request: use an HTTP library to send a Request to the target site. The request can carry additional headers and other information; then wait for the server to respond.
2. Get the response content: if the server responds normally, you get a Response whose body is the page content to be obtained. The type may be HTML, a JSON string, or binary data (such as images or video).
3. Parse the content: HTML can be parsed with regular expressions or a page-parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further. For each parsed item:
   a. if it is the data needed, save it;
   b. if it is another URL on the page, go back to step 2.
4. Save the data: in any of several forms, such as plain text, a database, or a file in a specific format.
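The request and parse steps described above can be sketched in a few lines. This is a minimal illustration, not the article's actual crawler: the `fetch`/`parse_title` names are made up here, and `html.parser` stands in for whatever parser is preferred.

```python
import requests
from bs4 import BeautifulSoup

def fetch(url):
    # Initiate the Request; extra headers could be passed via `headers=`.
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast if the server did not respond normally
    return response.text         # the Response content (HTML, JSON, ...)

def parse_title(html):
    # Parse HTML content with a page-parsing library and extract the <title> text.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.string.strip() if soup.title else None
```

Binary responses (images, video) would use `response.content` instead of `response.text` and be written straight to disk.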
Once the data is in hand, it needs to be cleaned to pave the way for later analysis; if the cleaning is not thorough, it will inevitably affect the analysis that follows. Below, the data is normalized for units, empty values, and format.
Recommended course: Python 3 Advanced: Data Analysis and Visualization
Remove whitespace from the data by calling strip() on each crawled string, and convert Chinese number units to Arabic numerals, since the counts crawled from the site use the unit 万 (ten thousand). For example, 1.2万 becomes 12000; the code is as follows:
def get_int(s):
    # Convert a count string such as "1.2万" (1.2 ten-thousands) to an integer.
    if s[-1] == "万":
        s = s[0:-1]
        s = int(float(s) * 10000)
    else:
        s = int(s)
    return s
The results are as follows:
if __name__ == '__main__':
    s = "1.2万"
    price = get_int(s)
    print(price)  # 12000
When crawling, if a value does not exist on the page an error is raised, so the exception-handling statement try/except: pass is used to skip videos whose information is missing (the code below crawls the video information).
try:
    html = requests.get(Link).text
    doc = BeautifulSoup(html, 'html.parser')
    List = doc.find('div', {'class': 'ops'}).findAll('span')
    like = List[0].text.strip()        # likes
    like = self.getint(like)
    coin = List[1].text.strip()        # coins
    coin = self.getint(coin)
    collection = List[2].text.strip()  # favorites
    collection = self.getint(collection)
    print('likes', like)
    print('coins', coin)
    print('favorites', collection)
    # Combine the fields into a dictionary
    data = {
        'Title': Title,
        'link': Link,
        'Up': Up,
        'Play': Play,
        'Like': like,
        'Coin': coin,
        'Collection': collection,
    }
    # Save to a CSV file
    self.write_dictionary_to_csv(data, 'blibli2.csv')
except:
    pass
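The helper `self.write_dictionary_to_csv` is not shown in the article. A stand-alone sketch of what such a helper could look like follows; the name matches the call above, but the implementation here is an assumption.

```python
import csv
import os

def write_dictionary_to_csv(data, filename):
    # Hypothetical helper: append one video's fields as a CSV row,
    # writing the header only when the file is first created.
    file_exists = os.path.isfile(filename)
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=list(data.keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)
```

The `utf-8-sig` encoding keeps Chinese titles readable when the CSV is opened in Excel.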
The resulting table's fields are shown in the figure.
Analyze the play counts of Bilibili's popular 2020 videos in four rating tiers: more than 10 million plays is one tier; 5 million to 10 million plays is a tier; 1 million to 5 million plays is a tier; fewer than 1 million plays is a tier.
l1=len(data[data['Play'] >= 10000000])
l2=len(data[(data['Play'] < 10000000) & (data['Play'] >=5000000)])
l3=len(data[(data['Play'] < 5000000) & (data['Play'] >=1000000)])
l4=len(data[data['Play'] < 1000000])
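The four boolean filters above can also be expressed with `pandas.cut`, which bins the play counts in a single pass. The small DataFrame below is made-up sample data standing in for the scraped table.

```python
import pandas as pd

# Made-up play counts standing in for the scraped DataFrame's 'Play' column.
data = pd.DataFrame({'Play': [12_000_000, 7_500_000, 2_000_000, 900_000, 300_000]})

# Bin edges matching the four tiers described in the text.
bins = [0, 1_000_000, 5_000_000, 10_000_000, float('inf')]
labels = ['<1M', '1M-5M', '5M-10M', '>10M']
tiers = pd.cut(data['Play'], bins=bins, labels=labels)
counts = tiers.value_counts()  # number of videos in each tier
```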
The data is then visualized with the matplotlib library to produce the figure below.
plt.figure(figsize=(9, 13))  # set the figure size
labels = ['大于一千万', '一千万到五百万', '五百万到一百万', '小于一百万']  # tier labels (>10M, 5M-10M, 1M-5M, <1M)
sizes = [l1, l2, l3, l4]  # value of each slice
colors = ['green', 'yellow', 'blue', 'red']  # color of each slice
explode = (0, 0, 0, 0)  # offset of each slice; a larger value pulls the slice further out
# Handle garbled Chinese characters and the minus sign on the axes
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
patches, text1, text2 = plt.pie(sizes,
                                explode=explode,
                                labels=labels,
                                colors=colors,
                                autopct='%3.2f%%',  # percentage text with fixed decimal places
                                shadow=False,        # no shadow
                                startangle=90,       # counter-clockwise starting angle
                                pctdistance=0.6)     # radial distance of the percentage text from the center
# patches: the pie wedges; text1: the labels outside the pie; text2: the text inside the pie
# Use equal x/y scales so the pie is drawn as a circle
plt.axis('equal')
plt.title("B站热门播放量分布图")  # "Distribution of play counts of Bilibili popular videos"
plt.legend()  # legend shown in the upper-right corner
plt.show()
As the figure shows, most of the videos in Bilibili's weekly must-watch recommendations have between one million and five million plays; videos with fewer than one million plays are rarely seen in the weekly top recommendations, and very few videos over the whole year reached 10 million plays. Next, let's look at which videos make up the top 10 by play count.
d = data.nlargest(10, columns='Play')
The data is then visualized with the matplotlib library to produce the figure below.
d.plot.bar(figsize=(10, 8), x='Title', y='Play', title='Play top 10')
plt.xticks(rotation=60)  # rotate the x-axis labels by 60 degrees
plt.show()
From the figure it can be seen that the Bilibili New Year's Gala was the most-played video, far ahead of the others, indicating that Bilibili's 2020 New Year's Gala program was a success.
Which uploader's work is the most popular? Use data analysis to determine the most popular uploader of 2020: group the data by author and count how many times each appears.
d2=data.loc[:,'Up'].value_counts()
d2=d2.head(10)
The data is then visualized with the matplotlib library to produce the figure below.
d2.plot.bar(figsize = (10,8),title='UP top 10')
plt.show()
This shows that the uploader who appeared most often in Bilibili's weekly must-watch list is Cool Wind Kaze, trending 48 times across the 52 weeks of the year, so almost every week featured one of his videos. Based on this data, the most popular uploader of 2020 is Cool Wind Kaze.
Analyze the average like, coin, and favorite ratios of popular videos.
data['点赞比例'] = data['Like'] / data['Play']        # like ratio
data['投币比例'] = data['Coin'] / data['Play']        # coin ratio
data['收藏比例'] = data['Collection'] / data['Play']  # favorite ratio
d3 = data.iloc[:, 8:11]  # select the three ratio columns just added
d3 = d3.mean()
The data is then visualized with the matplotlib library to produce the figure below.
d3.plot.bar(figsize=(10, 8), title='Average ratios')
plt.show()
The like ratio is the highest, at about 9 percent in 2020, meaning that on average roughly one in ten people who watch a video on Bilibili likes it. On average, only about one in twenty gives a video a coin.
Extract high-frequency words from the titles to see which kinds of titles are more popular. First, traverse all the titles and concatenate them into the string s.
d4 = data['Title']
s = ''
for i in d4:
    s = s + i
Then visualize the string with a word cloud.
The prominent words are uploader names such as "Zhu Yidan, Half-Buddha, Luo Xiang" and popular games such as "League of Legends and Genshin Impact."