
Common website anti-crawler methods


May 31, 2021



Today we'll walk through how websites defend against crawlers. To start, here are three small questions; think about them carefully!

  • What methods do websites use to block crawlers?
  • Why do websites need to block crawlers?
  • How can a crawler deal with these defenses?

Website anti-crawler methods:

First, controlling access through the User-Agent:

Whether the client is a browser or a crawler, every request it sends to the server carries a set of headers that identify it, and for crawlers the most important field is User-Agent. Many websites maintain a User-Agent whitelist, and only requests whose User-Agent falls within the normal range are served.

Workaround: You can set the User-Agent yourself, or better yet, randomly pick a standards-compliant one from a list of common User-Agents, as in the sketch below.
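A minimal sketch using the requests library; the User-Agent strings and the target URL are placeholders, not values from the original article:

import random
import requests

# A small pool of common, standards-compliant User-Agent strings (placeholders).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def fetch(url):
    # Pick a random User-Agent for every request so the traffic looks like ordinary browsers.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

response = fetch("https://example.com")  # hypothetical target site
print(response.status_code)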


Second, blocking crawlers with JavaScript checks:

For example, a site you want to crawl may serve a verification page that checks whether you are a crawler before it serves the real content. How does it work? The server generates a large random number in JS code, asks the browser to compute a result from that string of digits (for example, summing them) through a JS operation, and then send the result back to the server.

Solution: Use PhantomJS! PhantomJS is a headless browser with no graphical interface that fully simulates a real browser and can be driven from Python, so JS-script verification is no longer a problem.
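As a sketch, here is the same idea using Selenium with headless Chrome, since PhantomJS is no longer maintained and recent Selenium releases have dropped its driver; the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run the browser without a visible window
driver = webdriver.Chrome(options=options)  # needs a matching chromedriver installed

driver.get("https://example.com")  # hypothetical page protected by a JS check
html = driver.page_source          # HTML after the page's JS has executed
print(len(html))
driver.quit()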


Third, blocking crawlers with IP restrictions:

If a fixed IP makes a large number of rapid requests to a website within a short period, it will naturally attract attention. The administrator can block that IP with a few simple measures, and the crawler can then do nothing.

Workaround: The mature approach is to route requests through IP proxies so that they come from different IPs and no single IP gets blocked. But obtaining proxy IPs is itself troublesome; there are free and paid ones online, but their quality is uneven.

If your business needs proxies at scale, you can build your own proxy pool or purchase a proxy pool as a cluster cloud service.

Here's a simple example:

import random

def get_ip_poll():
    '''
    Simulates a proxy pool: returns a dict mapping the protocol
    to a randomly chosen proxy address.
    '''
    ip_poll = ["http://xx.xxx.xxx.xxx:9999",
               "http://xx.xxx.xxx.xxx:8000",
               "http://xx.xxx.xxx.xxx:8080",
               "http://xx.xxx.xxx.xxx:9922",
               "http://xx.xxx.xxx.xxx:8090"]
    addresses = {}
    # random.choice avoids the off-by-one index error of randint(0, len(ip_poll))
    addresses['http'] = random.choice(ip_poll)
    return addresses
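The returned dictionary can then be handed to an HTTP client, for example the requests library; since the proxy addresses above are placeholders, this is only a sketch:

import requests

proxies = get_ip_poll()
# requests routes plain-HTTP traffic through the proxy listed under the 'http' key.
response = requests.get("http://example.com", proxies=proxies, timeout=10)
print(response.status_code)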


Fourth, limiting crawlers with robots.txt:

The biggest and best crawler in the world is Google; a search engine is itself a super-sized crawler. Google's crawlers run 24 hours a day, continuously crawling new information on the web and feeding it back into the database. But these search engine crawlers all adhere to one protocol: robots.txt. robots.txt (always lowercase) is an ASCII-encoded text file stored in the site's root directory. It usually tells the search engine's robots (also known as web spiders) which content on the site should not be fetched by the crawler and which may be fetched. The robots.txt protocol is not a formal specification, only a convention, so it does not guarantee a website's privacy.

Note that robots.txt uses string comparison to decide whether a URL may be fetched, so a directory URL with a trailing slash "/" and the same URL without it are treated as different.

robots.txt allows wildcards, such as "Disallow: *.gif". Because URLs on some systems are case sensitive, the robots.txt file name should be lowercase. robots.txt should be placed in the root of the site. If you want to define the behavior of a search engine crawler for a particular subdirectory individually, you can merge your own settings into the robots.txt under the root directory, or use robots metadata (also known as meta tags).
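In Python, the standard library's urllib.robotparser can be used to respect these rules before crawling; a minimal sketch, where the site URL and user-agent name are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()  # download and parse the robots.txt file

# can_fetch() returns True if the given user agent may fetch the URL.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt")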

Of course, in certain situations, for example when our crawler fetches pages at roughly the speed a human browses, it causes no great performance loss to the server, and in such cases we may choose not to adhere to the robots protocol.


That's all for today's share. Recommended lessons: Python3 Getting Started, Python3 Advanced.