May 30, 2021 · Blog article
Getting a data set is the first step in data analysis. The main ways to obtain data today are: using an off-the-shelf data set; writing your own crawler to collect data; or using an existing crawler tool to grab what you need and save it to a database or a local file.
Recommended reading: Python static crawlers, Python Scrapy web crawlers
The basic crawling workflow:
1. Initiate a request: use an HTTP library to send a Request to the target site. The request can carry additional headers and other information; then wait for the server to respond.
2. Get the response content: if the server responds normally, you get a Response whose body is the page content to be obtained. The type may be HTML, a JSON string, or binary data (such as images or video).
3. Parse the content: HTML can be parsed with regular expressions or a page-parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further. For each parsed item:
   a. if it is the data needed, save it;
   b. if it is another URL on the page, go back to step 2.
4. Save the data: in any of several forms, such as plain text, a database, or a file in a specific format.
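The request and parse steps described above can be sketched in a few lines. This is a minimal illustration, not the article's actual crawler: the `fetch`/`parse_title` names are made up here, and `html.parser` stands in for whatever parser is preferred.

```python
import requests
from bs4 import BeautifulSoup

def fetch(url):
    # Initiate the Request; extra headers could be passed via `headers=`.
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast if the server did not respond normally
    return response.text         # the Response content (HTML, JSON, ...)

def parse_title(html):
    # Parse HTML content with a page-parsing library and extract the <title> text.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.string.strip() if soup.title else None
```

Binary responses (images, video) would use `response.content` instead of `response.text` and be written straight to disk.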
Once the data is in hand, it needs to be cleaned to pave the way for later analysis; if the cleaning is not thorough, it will inevitably affect the analysis that follows. Below, the data is normalized for units, empty values, and format.
Recommended course: Python 3 Advanced: Data Analysis and Visualization
Remove whitespace from the data by calling strip() on each crawled string, and convert Chinese number units to Arabic numerals, since the counts crawled from the site use the unit 万 (ten thousand). For example, 1.2万 becomes 12000; the code is as follows:
def get_int(s):
    # Convert a count string such as "1.2万" (1.2 ten-thousands) to an integer.
    if s[-1] == "万":
        s = s[0:-1]
        s = int(float(s) * 10000)
    else:
        s = int(s)
    return s
The results are as follows:
if __name__ == '__main__':
    s = "1.2万"
    price = get_int(s)
    print(price)  # 12000
When crawling, if a value does not exist on the page an error is raised, so the exception-handling statement try/except: pass is used to skip videos whose information is missing (the code below crawls the video information).
try:
    html = requests.get(Link).text
    doc = BeautifulSoup(html, 'html.parser')
    List = doc.find('div', {'class': 'ops'}).findAll('span')
    like = List[0].text.strip()        # likes
    like = self.getint(like)
    coin = List[1].text.strip()        # coins
    coin = self.getint(coin)
    collection = List[2].text.strip()  # favorites
    collection = self.getint(collection)
    print('likes', like)
    print('coins', coin)
    print('favorites', collection)
    # Combine the fields into a dictionary
    data = {
        'Title': Title,
        'link': Link,
        'Up': Up,
        'Play': Play,
        'Like': like,
        'Coin': coin,
        'Collection': collection,
    }
    # Save to a CSV file
    self.write_dictionary_to_csv(data, 'blibli2.csv')
except:
    pass
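The helper `self.write_dictionary_to_csv` is not shown in the article. A stand-alone sketch of what such a helper could look like follows; the name matches the call above, but the implementation here is an assumption.

```python
import csv
import os

def write_dictionary_to_csv(data, filename):
    # Hypothetical helper: append one video's fields as a CSV row,
    # writing the header only when the file is first created.
    file_exists = os.path.isfile(filename)
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=list(data.keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)
```

The `utf-8-sig` encoding keeps Chinese titles readable when the CSV is opened in Excel.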
The resulting table's fields are shown in the figure.
Analyze the play counts of Bilibili's popular 2020 videos in four rating tiers: more than 10 million plays is one tier; 5 million to 10 million plays is a tier; 1 million to 5 million plays is a tier; fewer than 1 million plays is a tier.
l1=len(data[data['Play'] >= 10000000])
l2=len(data[(data['Play'] < 10000000) & (data['Play'] >=5000000)])
l3=len(data[(data['Play'] < 5000000) & (data['Play'] >=1000000)])
l4=len(data[data['Play'] < 1000000])
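The four boolean filters above can also be expressed with `pandas.cut`, which bins the play counts in a single pass. The small DataFrame below is made-up sample data standing in for the scraped table.

```python
import pandas as pd

# Made-up play counts standing in for the scraped DataFrame's 'Play' column.
data = pd.DataFrame({'Play': [12_000_000, 7_500_000, 2_000_000, 900_000, 300_000]})

# Bin edges matching the four tiers described in the text.
bins = [0, 1_000_000, 5_000_000, 10_000_000, float('inf')]
labels = ['<1M', '1M-5M', '5M-10M', '>10M']
tiers = pd.cut(data['Play'], bins=bins, labels=labels)
counts = tiers.value_counts()  # number of videos in each tier
```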
The data is then visualized with the matplotlib library to produce the figure below.
plt.figure(figsize=(9, 13))  # set the figure size
labels = ['大于一千万', '一千万到五百万', '五百万到一百万', '小于一百万']  # tier labels (>10M, 5M-10M, 1M-5M, <1M)
sizes = [l1, l2, l3, l4]  # value of each slice
colors = ['green', 'yellow', 'blue', 'red']  # color of each slice
explode = (0, 0, 0, 0)  # offset of each slice; a larger value pulls the slice further out
# Handle garbled Chinese characters and the minus sign on the axes
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
patches, text1, text2 = plt.pie(sizes,
                                explode=explode,
                                labels=labels,
                                colors=colors,
                                autopct='%3.2f%%',  # percentage text with fixed decimal places
                                shadow=False,        # no shadow
                                startangle=90,       # counter-clockwise starting angle
                                pctdistance=0.6)     # radial distance of the percentage text from the center
# patches: the pie wedges; text1: the labels outside the pie; text2: the text inside the pie
# Use equal x/y scales so the pie is drawn as a circle
plt.axis('equal')
plt.title("B站热门播放量分布图")  # "Distribution of play counts of Bilibili popular videos"
plt.legend()  # legend shown in the upper-right corner
plt.show()
As the figure shows, most of the videos in Bilibili's weekly must-watch recommendations have between one million and five million plays; videos with fewer than one million plays are rarely seen in the weekly top recommendations, and very few videos over the whole year reached 10 million plays. Next, let's look at which videos make up the top 10 by play count.
d = data.nlargest(10, columns='Play')
The data is then visualized with the matplotlib library to produce the figure below.
d.plot.bar(figsize=(10, 8), x='Title', y='Play', title='Play top 10')
plt.xticks(rotation=60)  # rotate the x-axis labels by 60 degrees
plt.show()
From the figure it can be seen that the Bilibili New Year's Gala was the most-played video, far ahead of the others, indicating that Bilibili's 2020 New Year's Gala program was a success.
Which uploader's work is the most popular? Use data analysis to determine the most popular uploader of 2020: group the data by author and count how many times each appears.
d2=data.loc[:,'Up'].value_counts()
d2=d2.head(10)
The data is then visualized with the matplotlib library to produce the figure below.
d2.plot.bar(figsize = (10,8),title='UP top 10')
plt.show()
This shows that the uploader who appeared most often in Bilibili's weekly must-watch list is Cool Wind Kaze, trending 48 times across the 52 weeks of the year, so almost every week featured one of his videos. Based on this data, the most popular uploader of 2020 is Cool Wind Kaze.
Analyze the average like, coin, and favorite ratios of popular videos.
data['点赞比例'] = data['Like'] / data['Play']        # like ratio
data['投币比例'] = data['Coin'] / data['Play']        # coin ratio
data['收藏比例'] = data['Collection'] / data['Play']  # favorite ratio
d3 = data.iloc[:, 8:11]  # select the three ratio columns just added
d3 = d3.mean()
The data is then visualized with the matplotlib library to produce the figure below.
d3.plot.bar(figsize=(10, 8), title='Average ratios')
plt.show()
The like ratio is the highest, at about 9 percent in 2020, meaning that on average roughly one in ten people who watch a video on Bilibili likes it. On average, only about one in twenty gives a video a coin.
Extract high-frequency words from the titles to see which kinds of titles are more popular. First, traverse all the titles and concatenate them into the string s.
d4 = data['Title']
s = ''
for i in d4:
    s = s + i
Then visualize the string with a word cloud.
The prominent words are uploader names such as "Zhu Yidan, Half-Buddha, Luo Xiang" and popular games such as "League of Legends and Genshin Impact."