
The best crawler library for beginners: a detailed explanation of the common functions of the urllib library


May 31, 2021




First, what is the urllib library?

Python 3 consolidates the urllib and urllib2 libraries from Python 2 into a single urllib library. So what is the urllib library in Python 3? How do you use it? What are its common functions?

Urllib is divided into the following four functional modules:

  • urllib.request (request module)
  • urllib.parse (parsing module)
  • urllib.error (exception handling module)
  • urllib.robotparser (robots.txt parsing module)

Urllib is Python's built-in HTTP request library: it requires no installation, can be used directly, and is one of the libraries crawler developers use most often.

Second, urllib usage explained

1. urllib.request.urlopen() function

urlopen() creates a file-like object that represents the remote url; you can then operate on this object as you would on a local file to fetch the remote data. The syntax is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • url: the url to request;
  • data: the request body; if this value is set, the request becomes a POST request (see the sketch after the example below);
  • timeout: the timeout, in seconds, for blocking operations such as the connection attempt;
  • cafile and capath: used in HTTPS requests to set the CA certificate and its path.

example

from urllib import request

response = request.urlopen('http://www.baidu.com')  # GET request

print(response.read().decode('utf-8'))  # read the response body and decode it
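
As noted in the parameter list above, supplying data turns the call into a POST request, and timeout limits how long the call may block. Below is a minimal sketch of both; httpbin.org is assumed here as a public test endpoint that echoes the request back.

from urllib import parse, request

# encode the form fields and convert them to bytes: passing data= makes this a POST
data = parse.urlencode({'name': 'W3CSchool'}).encode('utf-8')
# httpbin.org is an assumed echo service, used purely for illustration
response = request.urlopen('http://httpbin.org/post', data=data, timeout=10)
print(response.read().decode('utf-8'))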

The object returned by urlopen() provides a number of methods and attributes (see the sketch after this list):

  1. read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data;
  2. info(): returns an HTTPMessage object representing the header information returned by the remote server;
  3. getcode(): returns the HTTP status code;
  4. geturl(): returns the requested url;
  5. getheaders(): returns the response headers;
  6. getheader('Server'): returns the value of the specified response header, here Server;
  7. status: the status code;
  8. reason: the status text that accompanies the code.
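
A minimal sketch exercising the methods and attributes listed above:

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.status)               # status code, e.g. 200
print(response.reason)               # status text, e.g. 'OK'
print(response.geturl())             # the url that was actually retrieved
print(response.getheader('Server'))  # value of the Server response header
print(response.getheaders())         # all response headers as (name, value) pairs
response.close()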


2. urllib.request.urlretrieve() function

This function makes it easy to save a file from a Web page locally. The syntax is as follows:

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

  • url: the address of the remote data;
  • filename: the path to save the file to; if empty, the data is downloaded to a temporary file;
  • reporthook: a hook function, called once when the connection to the server is established and once for each block of data received. It takes 3 arguments: the number of blocks transferred so far, the size of each block, and the total size of the file. It can be used to show download progress (see the sketch after the example below);
  • data: data to POST to the server.

example

from urllib import request

request.urlretrieve('http://www.baidu.com/', 'baidu.html')  # save Baidu's home page locally
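
As mentioned in the parameter list, reporthook can be used to show download progress. A minimal sketch follows; show_progress is a hypothetical hook name introduced here purely for illustration.

from urllib import request

def show_progress(block_num, block_size, total_size):
    # hypothetical hook: called once on connect and once per block; total_size is -1 if unknown
    if total_size > 0:
        percent = min(100.0, block_num * block_size * 100.0 / total_size)
        print('downloaded %.1f%%' % percent)

request.urlretrieve('http://www.baidu.com/', 'baidu.html', reporthook=show_progress)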


3. urllib.parse.urlencode() function

urlencode converts dictionary data to URL-encoded data. The syntax is as follows:

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

  • query: the query parameters;
  • doseq: whether sequence elements are converted separately;
  • safe: characters that should not be quoted;
  • encoding: the character encoding;
  • errors: how encoding errors are handled;
  • quote_via: the function used to quote string components; safe, encoding and errors are passed on to it. It defaults to quote_plus(), a variant of quote() that additionally encodes spaces as '+' (see the sketch after the example below).

example

from urllib import parse

data = {'Name': 'W3CSchool', 'Say': 'hello W3CSchool', 'Age': 100}

qs = parse.urlencode(data)

print(qs)

# Name=W3CSchool&Say=hello+W3CSchool&Age=100
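
To see the effect of quote_via, compare the default quote_plus() with quote(); this sketch assumes nothing beyond the standard library:

from urllib import parse

data = {'q': 'hello world/python'}
print(parse.urlencode(data))                                   # q=hello+world%2Fpython (quote_plus: space -> '+')
print(parse.urlencode(data, quote_via=parse.quote))            # q=hello%20world%2Fpython (quote: space -> '%20')
print(parse.urlencode(data, safe='/', quote_via=parse.quote))  # q=hello%20world/python ('/' preserved via safe)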


4. urllib.parse.parse_qs() function

It decodes encoded url parameters back into a dictionary. The syntax is as follows:

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

  • keep_blank_values: whether keys whose value is empty should be kept in the result, defaulting to False (see the sketch after the example below);
  • strict_parsing: a flag indicating how parsing errors are handled. If False (the default), errors are silently ignored; otherwise a ValueError exception is raised.

example

from urllib import parse

data = {'Name': 'W3CSchool', 'Say': 'hello W3CSchool', 'Age': 100}

qs = parse.urlencode(data)

print(qs)

# Name=W3CSchool&Say=hello+W3CSchool&Age=100

print(parse.parse_qs(qs))

# {'Name': ['W3CSchool'], 'Say': ['hello W3CSchool'], 'Age': ['100']}
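
A quick sketch of the effect of keep_blank_values:

from urllib import parse

print(parse.parse_qs('a=1&b='))                          # {'a': ['1']} - blank values are dropped
print(parse.parse_qs('a=1&b=', keep_blank_values=True))  # {'a': ['1'], 'b': ['']}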


5. urllib.parse.parse_qsl() function

The basic usage is the same as for the parse_qs() function; the only difference is that urllib.parse.parse_qs() returns a dictionary while urllib.parse.parse_qsl() returns a list. The syntax is as follows:

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

example

from urllib import parse

data = {'Name': 'W3CSchool', 'Say': 'hello W3CSchool', 'Age': 100}

qs = parse.urlencode(data)

print(parse.parse_qsl(qs))

# [('Name', 'W3CSchool'), ('Say', 'hello W3CSchool'), ('Age', '100')]
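
The list form matters when a key appears more than once: parse_qs() merges duplicate keys into one list, while parse_qsl() keeps every pair. A quick sketch:

from urllib import parse

print(parse.parse_qs('a=1&a=2'))   # {'a': ['1', '2']}
print(parse.parse_qsl('a=1&a=2'))  # [('a', '1'), ('a', '2')]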


6. urllib.parse.urlparse() and urllib.parse.urlsplit() functions

Sometimes, when you get a url and want to split it into its components, you can use urlparse or urlsplit. The syntax is as follows:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)

urlparse and urlsplit are basically the same; the only difference is that the result of urlparse has a params attribute, while the result of urlsplit does not.

example

from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'

result = parse.urlparse(url)

# result = parse.urlsplit(url)

print(result)

print(result.scheme)

print(result.netloc)

print(result.path)

print(result.params)  # urlparse has a params attribute; urlsplit does not
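
For reference, the two calls on the url above produce the following results; note where the ';user' part ends up:

# urlparse: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#                       params='user', query='id=S', fragment='comment')
# urlsplit: SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user',
#                       query='id=S', fragment='comment')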


7. urllib.error module

The error module of urllib defines the exceptions raised by urllib.request. If a request fails, urllib.request raises an exception defined in the error module.

  • URLError
The URLError class comes from urllib's error module. It inherits from OSError, is the base class of the module's exceptions, and has a reason attribute that returns the cause of the error.

example

from urllib import request, error

try:
    resp = request.urlopen('https://w3cschool.c/index.html')
except error.URLError as e:
    print(e.reason)

# [Errno 11001] getaddrinfo failed


  • HTTPError

It is a subclass of URLError, designed to handle HTTP request errors, and has three properties:

  1. code: returns the HTTP status code;
  2. reason: the cause of the exception;
  3. headers: the response headers.

example

from urllib import request, error

try:
    # request a page that does not exist (a hypothetical path, purely for illustration)
    response = request.urlopen('http://www.baidu.com/no-such-page.html')
except error.HTTPError as e:
    print(e.code)

# 404

Of course, most of the time URLError and HTTPError are combined to handle exceptions: first catch HTTPError to get the status code, the reason for the exception, and the response headers; if the error is not of that type, catch URLError and print the cause; finally, use an else clause to handle the normal flow. Note that HTTPError must be caught first, since it is a subclass of URLError.

example

from urllib.request import Request, urlopen

from urllib.error import URLError, HTTPError

req = Request('http://www.baidu.cnc/')
try:
    response = urlopen(req)
except HTTPError as e:
    print('(www.baidu.cnc) The server could not fulfil the request.')
    print('Error code:', e.code)
except URLError as e:
    print('We failed to reach the server.')
    print('Reason:', e.reason)
else:
    print('The link was successful!')
    print(response.read().decode('utf-8'))

The above are the commonly used functions in the urllib library. I hope you will practice along in front of the screen: combining theory with practice is the best way to learn! Recommended reading: Python static crawlers, Python Scrapy web crawlers.

Finally, let's summarize the meanings of the most common status codes:

  • 200: the request succeeded and the server returned the data normally;
  • 301: permanent redirect. For example, visiting www.jingdong.com redirects you to www.jd.com;
  • 302: temporary redirect. For example, when you visit a page that requires login while not logged in, you are redirected to the login page;
  • 404: the requested url could not be found on the server; in other words, the request url is wrong;
  • 403: the server denies access; the permissions are insufficient;
  • 500: internal server error; there may be a bug on the server.