
Practical applications of Python crawlers


May 30, 2021



Web crawling is an "old" network technology that was born and evolved alongside the Internet. As the Internet enters the era of big data, crawler technology is experiencing a new wave of revitalization. This article uses two scenarios, the enterprise and the Internet, to show you the important role crawlers play. It is excerpted from the book Bugs - Python Stunt.

Data collection and data storage play an extremely important role in big data architecture and can fairly be called the core foundation of big data. Crawler technology accounts for a large share of both of these core layers. Why? Let's look at a real-world scenario to see what role a crawler actually plays. Two characteristics define a crawler:

Active - a crawler's focus is on crawling, which is an active behavior. In other words, it is an application that can run independently, according to certain rules.

Automation - because the data to be processed can be very scattered, and the data is only fresh for a limited time, a crawler is an unattended, automated program.
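To make "active" and "automated" concrete, here is a minimal sketch of an unattended crawler loop that fetches a page on a fixed schedule. The URL, interval, and file naming are placeholders for illustration, not details from the original case.

```python
import time
from datetime import datetime

import requests  # third-party HTTP client

TARGET_URL = "https://example.com/report"  # placeholder target; replace with a real data source
INTERVAL_SECONDS = 60 * 60                 # crawl once an hour

def crawl_once() -> None:
    """Fetch the target page once and persist the raw response."""
    resp = requests.get(TARGET_URL, timeout=30)
    resp.raise_for_status()
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    with open(f"snapshot_{stamp}.html", "w", encoding="utf-8") as f:
        f.write(resp.text)

if __name__ == "__main__":
    # Unattended loop: the crawler acts on its own schedule, no human trigger needed.
    while True:
        crawl_once()
        time.sleep(INTERVAL_SECONDS)
```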

In my nearly 20 years in IT, enterprise management systems have made up the largest share of the projects and products I have worked on. During their development, I observed that many data-processing scenarios inside the enterprise can in fact be handled with crawler technology, replacing manual operations with remarkable efficiency.

Take the e-commerce companies I have seen in recent years. Alibaba has demonstrated its global strength and status in e-commerce, to the point that e-commerce is almost equated with Ali, and Ali offers a wide range of excellent operating tools for stores and businesses. We take it for granted that the level of information management inside e-commerce enterprises must be very high, don't we? On the contrary, most of the small and medium-sized e-commerce enterprises I have seen, and even some listed on the New Third Board, remain very backward in this respect, and many still rely on Excel and large amounts of manual form processing. So the question is: since Alibaba, JD.com and the other platforms provide so many high-quality operating tools, why do e-commerce enterprises still operate at such a low level of informatization and in such labor-intensive ways?

First, e-commerce companies do not open stores on just one platform; they usually open multiple stores on several platforms at the same time to broaden their sales channels.

Second, e-commerce companies can usually monitor price fluctuations and sales of a product only through the specialized tools provided by each particular platform, rather than gaining a comprehensive, unified view of how the products they sell perform across every platform. Yet this need is clearly urgent, because only by understanding changes in sales data can they adjust their sales strategy in real time. The most common practice I have seen is that the enterprise assigns someone to export transaction data from each of the major e-commerce platforms, merge it into Excel, and then compile various statistical reports by hand as the basis for analysis, and this often has to be done product by product! This manual approach brings several problems:

(1) Lack of a unified data source - this is irreconcilable, because e-commerce channels are inherently diverse.

(2) Structured and unstructured data coexist - the most common format for exchanging data between enterprises is Excel, and the exchange tools are WeChat and QQ.

(3) A single dataset exists in many time versions - when the same file is repeatedly modified and re-sent over QQ or WeChat, you end up with data.xlsx, data(1).xlsx, ... data(n).xlsx.

(4) The data structure may be arbitrary - columns named in English are rarely seen in these Excel files, and even columns with the same meaning are likely to carry different Chinese names.

(5) Data retrieval becomes difficult - finding the same copy of data for a given period between an e-commerce company and a supplier can be a nightmare.

Let's boldly imagine what would happen if these tasks were handed over to crawlers.

(1) Every day, at a fixed time, the crawler automatically downloads the merchant's current business data from Taobao, JD.com and the other e-commerce platforms.

(3) It downloads the Excel files sent daily by other resellers from a designated folder on an intranet PC, organizes them and saves them to a database (see the sketch after this list).

(4) When it finds that stock of certain goods is running low, it automatically generates an order form in the supplier's prescribed format and sends it by e-mail.

(5) Decision makers (operations managers or bosses) view daily statistics on their phones or PCs with data-visualization tools, or the crawler system sends statistical reports directly to their mailboxes.
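As an illustration of step (3), the sketch below gathers the resellers' daily Excel files from a shared folder and appends them to a local SQLite database. The folder path, column names, and table name are assumptions for the example, not details from the project described.

```python
from pathlib import Path
import sqlite3

import pandas as pd  # reading .xlsx files also requires openpyxl

INBOX = Path(r"\\intranet-pc\shared\reseller_reports")  # hypothetical shared folder
DB_PATH = "sales.db"                                     # hypothetical local database

def import_daily_reports() -> None:
    """Load every Excel file in the inbox, normalize columns, append to the database."""
    conn = sqlite3.connect(DB_PATH)
    for xlsx in INBOX.glob("*.xlsx"):
        df = pd.read_excel(xlsx)
        # Hypothetical mapping from the resellers' Chinese column names to unified names.
        df = df.rename(columns={"商品名称": "product", "销量": "quantity", "单价": "price"})
        df["source_file"] = xlsx.name  # keep provenance so file versions can be traced
        df.to_sql("daily_sales", conn, if_exists="append", index=False)
    conn.close()

if __name__ == "__main__":
    import_daily_reports()
```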

At this point you might wonder: don't crawlers just crawl for data? Why can they handle so many things? Is this still the domain of crawler technology? The answer is yes: the example above is a simplified version of a real case from a project I worked on, and the crawler's behavior combines post-processing of the crawled data with Python automation. In fact, crawlers can do even more; the specific implementation depends on the actual needs within the enterprise. And on the Internet, a crawler behaves more like a robot with "intelligence".

Enterprise intranet crawlers cover only a small slice of Internet crawling; they are a combined application of crawler technology and automation technology, and the automation may well account for a larger share than the crawling itself.


Internet crawlers are more focused and more common than corporate crawlers, and in an age when data is readily available, using them to mine for gold is not unusual. The search engine itself is a "master of crawler craft": there is almost no site it wants to crawl that it cannot reach. The hottest content apps in the app stores are always aggregation apps for some category of news, and most web developers know that such an app is merely a platform aggregating links to various news sites; its content relies on "bots" to fetch first-hand news from the major news portals. What's more, that news is "free" and can easily be accessed by any user on the Internet, which of course includes crawlers.

There is a great deal of free content on the Internet, such as news and information, or data shared by governments, businesses, third-party organizations, groups and even individuals. For example, we can easily visit the meteorological bureau's website to obtain rainfall figures for an area over the last decade, fetch that day's price movements for every stock on the exchange, or pull the details of one of the day's most widely shared events from a social platform. In other words, as long as you have a clear target data source, and you have access to it, you can let a crawler fetch all the data you want from that source for you.
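As a simple illustration in the spirit of the rainfall example above, here is a sketch that downloads a public CSV dataset. The URL is a placeholder, and a real crawler should of course respect the site's terms of use and robots.txt.

```python
import csv
import io

import requests

# Hypothetical open-data endpoint; substitute a real URL you are allowed to crawl.
RAINFALL_CSV_URL = "https://example.gov/open-data/rainfall_2011_2020.csv"

def fetch_rainfall() -> list[dict]:
    """Download a public CSV dataset and return it as a list of row dictionaries."""
    resp = requests.get(RAINFALL_CSV_URL, timeout=30)
    resp.raise_for_status()
    reader = csv.DictReader(io.StringIO(resp.text))
    return list(reader)

if __name__ == "__main__":
    rows = fetch_rainfall()
    print(f"Fetched {len(rows)} rows")
```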


To successfully crawl data from the Internet, you need to understand the characteristics of the data and then take targeted measures. In general, the crawlable data on the Internet falls into the following categories:

(1) Ordinary web pages - web pages that comply with the W3C specifications can be regarded as semi-structured content, and page-element analysis tools can be used to read specific data from them; but because web pages are developed with great freedom, almost no two sites have exactly the same structure. There are also many variables: reading a page may require passing a permission check, the page may be rendered by client-side JavaScript to produce its final appearance, or the page may come from a CDN, in which case its content may not be up to date but merely a copy from a network cache, and so on. But don't worry: once you have mastered crawling, none of this will stop you. (See the sketch after this list.)

(2) API resources - API resources are the most suitable data sources to crawl, bar none, because a RESTful API is called and returns structured data in XML or JSON form, which can be read even without an API reference manual.

(3) File resources - file resources are the most troublesome data sources. Unless the crawled file is in a structured data format, it is free text; because it is unstructured, we need to post-process the text so that the crawler can "read" its content and decide which parts are the targets of acquisition.

(4) Media resources - such as pictures and videos. The crawling action is basically the same as for files, but because pictures, videos and similar resources are generally large, it may also be necessary to analyze the file's metadata first to decide whether it is worth crawling, so as to avoid letting the crawler waste excessive, unnecessary network traffic and crawl time.
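To make categories (1) and (2) concrete, here is a minimal sketch that reads one element from an ordinary HTML page and one field from a JSON API. The URLs, CSS selector, and field name are placeholders for illustration only.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

PAGE_URL = "https://example.com/product/123"      # placeholder semi-structured web page
API_URL = "https://example.com/api/products/123"  # placeholder RESTful JSON endpoint

def scrape_price_from_html() -> str:
    """Category (1): read a specific element out of an ordinary web page."""
    html = requests.get(PAGE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("span.price")  # placeholder selector; every site differs
    return node.get_text(strip=True) if node else ""

def fetch_price_from_api() -> float:
    """Category (2): structured JSON from an API needs no HTML parsing at all."""
    data = requests.get(API_URL, timeout=30).json()
    return float(data["price"])  # placeholder field name
```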

Crawling spans a wide range of technical areas and draws on a wide range of technologies, from basic network access to complex machine learning, which can easily discourage beginners. To give everyone a comprehensive overview, we have summarized the techniques to learn and use at the beginner, intermediate and advanced levels in the following diagram for reference.

Apart from the contest between security experts and hackers, the Internet's fiercest battleground is probably the field of crawlers versus anti-crawlers. According to some statistics, crawler traffic has already exceeded the traffic from real human requests. The Internet is full of crawlers of all kinds; users of every size, in the cloud and in traditional industries, are targeted by crawler enthusiasts. Where do these crawlers come from? Whose data is being crawled? Where will the data be used?