
What do web crawlers do? Which language is good for writing crawlers?


May 31, 2021



Before we explain further, let's look at how Baidu Encyclopedia defines web crawlers:

A web crawler (also known as a web spider or web robot, and often called a web wanderer in the FOAF community) is a program or script that automatically harvests information from the World Wide Web according to certain rules. Other, less commonly used names include ants, automatic indexers, emulators, or worms.

In layman's terms, a web crawler is a program or script that automatically visits the Internet and downloads the content we want from a website, much like a robot. It fetches other sites' information onto our own computer, where we can then filter, summarize, organize, and sort it.

The languages most commonly used for crawler development are PHP, Java, Python, and C. So why has Python taken off so quickly when crawlers can be written in so many languages? As for the reasons, the editor has a few thoughts to share with you here.

1. The ever-changing nature of web crawlers

Anyone who has written a crawler has probably had this experience: a crawler that ran perfectly well yesterday suddenly breaks today and stops working. The reasons are nothing unusual: the page layout was redesigned, the site started blocking requests, and so on. When this happens, we have to debug quickly, find out what the problem is, and fix it as fast as possible to get the crawler running again.

2. Flexible, adaptable Python

Today's crawlers need to change anywhere, anytime, and in complicated ways, so writing a web crawler calls for a language that supports fast, flexible development and is backed by a complete, rich set of libraries. These requirements point straight at Python. That's why Python is rightly the language of choice for developing web crawlers.

3. Simple and rich Python

After hearing the editor say so much, you may be wondering: what exactly are Python's natural advantages for web crawler development? Let me explain:

First, simple syntax

Python's syntax is very concise. It advocates being simple without being simplistic, and the Python philosophy is that there should be one, and preferably only one, obvious way to do something. This makes Python code less idiosyncratic, so it is easy for one developer to read another's code. Python's conciseness also lets developers implement functionality in just a few lines of code; for the same functionality, Java may require dozens or hundreds of lines, and C++ may require hundreds. Even Bruce Eckel, a heavyweight in the C++ world, has said:

Life is short, you need Python!

Python's clean syntax makes crawlers easy to implement and easy to modify. Simply put, it lets you write them remarkably smoothly.

Second, rich Python modules

I'm sure you've been exposed to Python's modules (libraries) to some degree, but you probably haven't had the time or opportunity to explore many kinds of them. You could say that almost any feature you want to implement already has a Python module for it. "That's crazy" may be what you're thinking at this point. But it is fair to say that Python's existing modules can meet 90 percent of your needs without any problem. So keep this in mind: in future development, if you need to implement some basic functionality, search first and check whether someone has already implemented the feature and uploaded it to PyPI. If so, congratulations: all you have to do is pip install it, and of course verify that it does what you need!

Next, here are some of the features a web crawler commonly needs and the modules that provide them, with a short sketch after each one:

Downloading web pages: the standard-library module urllib.request, or the third-party open-source module requests.
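For example, a minimal sketch of fetching a page with either module might look like this (the URL is just a placeholder for illustration):

import urllib.request

import requests  # third-party: pip install requests

url = "https://example.com"

# Standard-library approach
with urllib.request.urlopen(url) as response:
    html_stdlib = response.read().decode("utf-8")

# requests approach: shorter, and it decodes the response for you
html_requests = requests.get(url, timeout=10).text

print(len(html_stdlib), len(html_requests))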

Processing URLs: the standard-library module urllib.parse.
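A few common urllib.parse operations, as a small sketch (the URLs are placeholders):

from urllib.parse import urlparse, urljoin, urlencode

base = "https://example.com/articles/index.html"

# Split a URL into its components
parts = urlparse(base)
print(parts.scheme, parts.netloc, parts.path)      # https example.com /articles/index.html

# Resolve a relative link found in a page against the page's own URL
print(urljoin(base, "../about.html"))              # https://example.com/about.html

# Build a query string safely
print(urlencode({"q": "web crawler", "page": 2}))  # q=web+crawler&page=2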

Parsing HTML: the standard-library class html.parser.HTMLParser, or the third-party open-source module BeautifulSoup.
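To compare the two, here is a sketch that extracts every link from a small HTML snippet, first with the standard-library parser and then with BeautifulSoup (assuming beautifulsoup4 is installed):

from html.parser import HTMLParser   # standard library
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4

html = '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>'

# Standard-library approach: subclass HTMLParser and override the handlers
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkParser()
parser.feed(html)
print(parser.links)                              # ['/a', '/b']

# BeautifulSoup: much less code for the same job
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.find_all("a")])   # ['/a', '/b']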

Mature crawler frameworks: the well-established Scrapy, and the up-and-coming pyspider.
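To give a feel for Scrapy, here is a minimal spider sketch that scrapes the public practice site quotes.toscrape.com (assuming Scrapy is installed; save it as quotes_spider.py and run it with: scrapy runspider quotes_spider.py):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)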

The above are some of the basic modules commonly used to develop web crawlers. When you run into a feature you need during development, search first; someone may already have built the functionality you need. Or, as the saying goes, almost any feature you want to implement already has a Python module for it. All you need to do is combine the modules like building blocks, as in the sketch below, to get the functionality you want.
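As a closing sketch of that building-block idea, the modules above can be combined into a tiny link collector (the start URL is a placeholder, and requests and beautifulsoup4 are assumed to be installed):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def collect_links(url):
    # Download the page, parse it, and return every link as an absolute URL
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    for link in collect_links("https://example.com"):
        print(link)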

With all that, what reason is there not to choose Python and build your web crawler block by block? Recommended reading: Python static crawlers, Python Scrapy web crawlers.