First things first, import the urllib3 module:

import urllib3

Making requests

To make requests, first create a PoolManager instance. This object handles all of the details of connection pooling and thread safety:

http = urllib3.PoolManager()

To make a request, use request():

r = http.request('GET', 'http://httpbin.org/robots.txt')
r.data

request() returns an HTTPResponse object.

You can use request() to make requests using any HTTP verb:

>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     fields={'hello': 'world'})

The Request data section covers sending other kinds of request data, including JSON, files, and binary data.

Response content

The HTTPResponse object provides status, data, and header attributes:

>>> r = http.request('GET', 'http://httpbin.org/ip')
>>> r.status
200
>>> r.data
b'{\n "origin": "104.232.115.37"\n}\n'
>>> r.headers
HTTPHeaderDict({'Content-Length': '33', ...})

JSON content

JSON content can be loaded by decoding and deserializing the data attribute of the response:

>>> import json
>>> r = http.request('GET', 'http://httpbin.org/ip')
>>> json.loads(r.data.decode('utf-8'))
{'origin': '127.0.0.1'}

Binary content

The data attribute of the response is always set to a byte string representing the response content:

>>> r = http.request('GET', 'http://httpbin.org/bytes/8')
>>> r.data
b'\xaa\xa5H?\x95\xe9\x9b\x11'

Note: For larger responses, it’s sometimes better to stream the response.

Request data

Headers

You can specify headers as a dictionary in the headers argument to request():

>>> r = http.request(
...     'GET',
...     'http://httpbin.org/headers',
...     headers={
...         'X-Something': 'value'
...     })
>>> json.loads(r.data.decode('utf-8'))['headers']
{'X-Something': 'value', ...}

Query parameters

For GET, HEAD, and DELETE requests, you can simply pass the arguments as a dictionary in the fields argument to request():

>>> r = http.request(
...     'GET',
...     'http://httpbin.org/get',
...     fields={'arg': 'value'})
>>> json.loads(r.data.decode('utf-8'))['args']
{'arg': 'value'}

For POST and PUT requests, you need to manually encode query parameters in the URL:

>>> from urllib.parse import urlencode
>>> encoded_args = urlencode({'arg': 'value'})
>>> url = 'http://httpbin.org/post?' + encoded_args
>>> r = http.request('POST', url)
>>> json.loads(r.data.decode('utf-8'))['args']
{'arg': 'value'}

Form data

For PUT and POST requests, urllib3 will automatically form-encode the dictionary provided in the fields argument to request():

>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     fields={'field': 'value'})
>>> json.loads(r.data.decode('utf-8'))['form']
{'field': 'value'}

multipart/form-data requests:

  • 1. multipart/form-data is built on top of POST; in other words, it is implemented as a specially composed POST request.
  • 2. multipart/form-data differs from a plain POST in two places: the request headers and the request body.
  • 3. A multipart/form-data request must carry a special header, Content-Type, whose value must be multipart/form-data, together with a boundary string used to separate the individual parts of the request body. File content and text content, for example, have to be kept apart, otherwise the receiver cannot parse and reconstruct the file.
    The header looks like this:
    Content-Type: multipart/form-data; boundary=${bound}
    where ${bound} is a placeholder for the boundary you choose. It can be anything, but it should be distinctive enough not to collide with the normal content.
  • 4. The multipart/form-data request body is also a string, but it is constructed differently from a plain POST body: a POST body is a simple chain of name=value pairs, while a multipart/form-data body interleaves the parts with the boundary separators, as the sketch after this list shows.
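
As a rough illustration of how the boundary and the body fit together, you can call the helper that urllib3 uses internally for fields encoding; this is a minimal sketch assuming the urllib3.filepost module, and the field names are arbitrary examples:

from urllib3.filepost import encode_multipart_formdata

# encode_multipart_formdata() returns the encoded body and the matching
# Content-Type header value (including the generated boundary).
body, content_type = encode_multipart_formdata({
    'field': 'value',
    'filefield': ('example.txt', b'file contents', 'text/plain'),
})
print(content_type)   # multipart/form-data; boundary=<generated boundary>
print(body[:80])      # the body interleaves each part with the boundary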

JSON

You can send JSON data in a request by specifying the encoded data as the body argument and setting the Content-Type header when calling request():

>>> import json
>>> data = {'attribute': 'value'}
>>> encoded_data = json.dumps(data).encode('utf-8')
>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     body=encoded_data,
...     headers={'Content-Type': 'application/json'})
>>> json.loads(r.data.decode('utf-8'))['json']
{'attribute': 'value'}

Files & binary data

For uploading files using multipart/form-data encoding, you can use the same approach as Form data and specify the file field as a tuple of (file_name, file_data):

>>> with open('example.txt') as fp:
...     file_data = fp.read()
>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     fields={
...         'filefield': ('example.txt', file_data),
...     })
>>> json.loads(r.data.decode('utf-8'))['files']
{'filefield': '...'}

While specifying the filename is not strictly required, it’s recommended in order to match browser behavior. You can also pass a third item in the tuple to specify the file’s MIME type explicitly:

>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     fields={
...         'filefield': ('example.txt', file_data, 'text/plain'),
...     })

For sending raw binary data simply specify the body argument. It's also recommended to set the Content-Type header:

>>> with open('example.jpg', 'rb') as fp:
...     binary_data = fp.read()
>>> r = http.request(
...     'POST',
...     'http://httpbin.org/post',
...     body=binary_data,
...     headers={'Content-Type': 'image/jpeg'})
>>> json.loads(r.data.decode('utf-8'))['data']
b'...'

Certificate verification

SSL certificate verification:
In short, it checks two things:

  • 1. That the certificate is trusted and valid. "Trusted" means the web server's certificate was issued by one of the root certificate authorities built into the browser (or trust store), or by an intermediate CA chained to such a root. "Valid" means the certificate is within its validity period and has not been revoked.
  • 2. That the other party is the legitimate holder of the certificate, i.e. that it holds the matching private key. There are two ways to check this: have the other party sign something and verify the signature against the certificate, or encrypt something with the certificate (like a sealed envelope) and see whether the other party can open it. The sketch after this list shows where these checks happen in Python.
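
For reference, here is a minimal sketch (not part of urllib3, and the host name is an arbitrary example) of where both checks happen when using the standard library ssl module: the TLS handshake performed by wrap_socket() verifies the chain of trust and the server's possession of the private key in one step.

import socket
import ssl

# create_default_context() loads the system's trusted roots and sets
# verify_mode=CERT_REQUIRED plus hostname checking.
ctx = ssl.create_default_context()

with socket.create_connection(('example.org', 443)) as sock:
    # The handshake fails here if the chain is untrusted, the certificate is
    # expired, or the server cannot prove possession of the private key.
    with ctx.wrap_socket(sock, server_hostname='example.org') as tls:
        print(tls.getpeercert()['notAfter'])  # validity period of the leaf certificate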

It is highly recommended to always use SSL certificate verification. By default, urllib3 does not verify HTTPS requests.

In order to enable verification you will need a set of root certificates. The easiest and most reliable method is to use the certifi package which provides Mozilla’s root certificate bundle.

pip install certifi

Once you have certificates, you can create a PoolManager that verifies certificates when making requests.

>>> import certifi
>>> import urllib3
>>> http = urllib3.PoolManager(
...     cert_reqs='CERT_REQUIRED',
...     ca_certs=certifi.where())

The PoolManager will automatically handle certificate verification and will raise SSLError if verification fails.

>>> http.request('GET', 'https://google.com')
(No exception)
>>> http.request('GET', 'https://expired.badssl.com')
urllib3.exceptions.SSLError ...

Note:
You can use OS-provided certificates if desired. Just specify the full path to the certificate bundle as the ca_certs argument instead of certifi.where(). For example, most Linux systems store the certificates at /etc/ssl/certs/ca-certificates.crt.
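
For example, a sketch using the bundle path mentioned above (the exact path varies by distribution):

>>> http = urllib3.PoolManager(
...     cert_reqs='CERT_REQUIRED',
...     ca_certs='/etc/ssl/certs/ca-certificates.crt')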

Using timeouts

Timeouts allow you to control how long requests are allowed to run before being aborted. In simple cases, you can specify a timeout as a float to request():

>>> http.request(
...     'GET', 'http://httpbin.org/delay/3', timeout=4.0)
<urllib3.response.HTTPResponse>
>>> http.request(
...     'GET', 'http://httpbin.org/delay/3', timeout=2.5)
MaxRetryError caused by ReadTimeoutError

For more granular control you can use a Timeout instance, which lets you specify separate connect and read timeouts:

>>> http.request(
...     'GET',
...     'http://httpbin.org/delay/3',
...     timeout=urllib3.Timeout(connect=1.0))
<urllib3.response.HTTPResponse>
>>> http.request(
...     'GET',
...     'http://httpbin.org/delay/3',
...     timeout=urllib3.Timeout(connect=1.0, read=2.0))
MaxRetryError caused by ReadTimeoutError

If you want all requests to be subject to the same timeout, you can specify the timeout at the PoolManager level. You can still override this pool-level timeout by specifying timeout to request():

>>> http = urllib3.PoolManager(timeout=3.0)
>>> http = urllib3.PoolManager(
...     timeout=urllib3.Timeout(connect=1.0, read=2.0))
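
A minimal sketch of the per-request override (the values are arbitrary examples): the timeout passed to request() takes precedence over the pool-level default.

>>> http = urllib3.PoolManager(timeout=1.0)
>>> http.request(
...     'GET', 'http://httpbin.org/delay/3', timeout=4.0)
<urllib3.response.HTTPResponse>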

Retrying requests

urllib3 can automatically retry idempotent requests. This same mechanism also handles redirects. You can control the retries using the retries parameter to request(). By default, urllib3 will retry requests 3 times and follow up to 3 redirects.

To change the number of retries, just specify an integer:

>>> http.request('GET', 'http://httpbin.org/ip', retries=10)

To disable all retry and redirect logic, specify retries=False:

>>> http.request(
...     'GET', 'http://nxdomain.example.com', retries=False)
NewConnectionError
>>> r = http.request(
...     'GET', 'http://httpbin.org/redirect/1', retries=False)
>>> r.status
302

To disable redirects but keep the retrying logic, specify redirect=False:

>>> r = http.request(
...     'GET', 'http://httpbin.org/redirect/1', redirect=False)
>>> r.status
302

For more granular control you can use a Retry instance. This class allows you far greater control over how requests are retried. For example, to do a total of 3 retries, but limit to only 2 redirects:

>>> http.request(
...     'GET',
...     'http://httpbin.org/redirect/3',
...     retries=urllib3.Retry(3, redirect=2))
MaxRetryError

You can also disable exceptions for too many redirects and just return the 302 response:

>>> r = http.request(
...     'GET',
...     'http://httpbin.org/redirect/3',
...     retries=urllib3.Retry(
...         redirect=2, raise_on_redirect=False))
>>> r.status
302

If you want all requests to be subject to the same retry policy, you can specify the retry policy at the PoolManager level. You can still override this pool-level policy by specifying retries to request():

>>> http = urllib3.PoolManager(retries=False)
>>> http = urllib3.PoolManager(
...     retries=urllib3.Retry(5, redirect=2))
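
A minimal sketch of the per-request override (again, arbitrary values): retries passed to request() takes precedence over the pool-level policy, so here the redirect is followed even though the pool disables retries.

>>> http = urllib3.PoolManager(retries=False)
>>> r = http.request(
...     'GET', 'http://httpbin.org/redirect/1',
...     retries=urllib3.Retry(redirect=2))
>>> r.status
200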

Errors & Exceptions

urllib3 wraps lower-level exceptions in its own exception types, for example:

>>> try:
...     http.request('GET', 'nx.example.com', retries=False)
... except urllib3.exceptions.NewConnectionError:
...     print('Connection failed.')

Logging

If you are using the standard library logging module, urllib3 will emit several logs. In some cases this can be undesirable. You can use the standard logger interface to change the log level for urllib3's logger:

>>> import logging
>>> logging.getLogger("urllib3").setLevel(logging.WARNING)

Customizing pool behavior

PoolManager

The PoolManager class automatically handles creating ConnectionPool instances for each host as needed.

By default, it will keep a maximum of 10 ConnectionPool instances.

If you’re making requests to many different hosts it might improve performance to increase this number:

>>> import urllib3
>>> http = urllib3.PoolManager(num_pools=50)

However, keep in mind that this does increase memory and socket consumption.

ConnectionPool

Similarly, the ConnectionPool class keeps a pool of individual HTTPConnection instances.

These connections are used during an individual request and returned to the pool when the request is complete.

By default only one connection will be saved for re-use.

If you are making many requests to the same host simultaneously it might improve performance to increase this number:

>>> import urllib3
>>> http = urllib3.PoolManager(maxsize=10)
# Alternatively
>>> http = urllib3.HTTPConnectionPool('google.com', maxsize=10)

The behavior of the pooling for ConnectionPool is different from PoolManager.

By default, if a new request is made and there is no free connection in the pool then a new connection will be created. However, this connection will not be saved if more than maxsize connections exist.

This means that maxsize does not determine the maximum number of connections that can be open to a particular host, just the maximum number of connections to keep in the pool.

However, if you specify block=True then there can be at most maxsize connections open to a particular host:

>>> http = urllib3.PoolManager(maxsize=10, block=True)
# Alternatively
>>> http = urllib3.HTTPConnectionPool('google.com', maxsize=10, block=True)

Any new requests will block until a connection is available from the pool.

This is a great way to prevent flooding a host with too many connections in multi-threaded applications.
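
A minimal sketch of this (worker count, pool size, and URL are arbitrary examples): with block=True and maxsize=2, the eight workers share at most two connections to the host, and the extra workers simply wait for a connection to be returned to the pool.

import concurrent.futures
import urllib3

http = urllib3.PoolManager(maxsize=2, block=True)

def fetch(i):
    # Each worker borrows a connection from the pool; with block=True it
    # waits if both connections are already in use.
    r = http.request('GET', 'http://httpbin.org/get', fields={'i': str(i)})
    return r.status

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    print(list(executor.map(fetch, range(8))))  # e.g. [200, 200, ...]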

Streaming and IO

When dealing with large responses it's often better to stream the response content:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request(
...     'GET',
...     'http://httpbin.org/bytes/1024',
...     preload_content=False)
>>> for chunk in r.stream(32):
...     print(chunk)
b'...'
b'...'
...
>>> r.release_conn()

Setting preload_content to False means that urllib3 will stream the response content.

stream() lets you iterate over chunks of the response content.

Note:
When using preload_content=False, you should call release_conn() to release the http connection back to the connection pool so that it can be re-used.
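
For example, here is a sketch of streaming a response straight to a file so the whole body never has to sit in memory (the filename and chunk size are arbitrary):

import urllib3

http = urllib3.PoolManager()
r = http.request(
    'GET',
    'http://httpbin.org/bytes/1024',
    preload_content=False)

# Write each chunk as it arrives instead of holding the whole body in memory.
with open('downloaded.bin', 'wb') as fp:
    for chunk in r.stream(64):
        fp.write(chunk)

r.release_conn()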

However, you can also treat the HTTPResponse instance as a file-like object. This allows you to do buffering:

>>> r = http.request(
...     'GET',
...     'http://httpbin.org/bytes/1024',
...     preload_content=False)
>>> r.read(4)
b'\x88\x1f\x8b\xe5'

Calls to read() will block until more response data is available.

>>> import io
>>> reader = io.BufferedReader(r, 8)
>>> reader.read(4)
>>> r.release_conn()

You can use this file-like object to do things like decode the content using codecs.

>>> import codecs
>>> reader = codecs.getreader('utf-8')
>>> r = http.request(
...     'GET',
...     'http://httpbin.org/ip',
...     preload_content=False)
>>> json.load(reader(r))
{'origin': '127.0.0.1'}
>>> r.release_conn()

Proxies

You can use ProxyManager to tunnel requests through an HTTP proxy:

>>> import urllib3
>>> proxy = urllib3.ProxyManager('http://localhost:3128/')
>>> proxy.request('GET', 'http://google.com/')

The usage of ProxyManager is the same as PoolManager.

You can use SOCKSProxyManager to connect to SOCKS4 or SOCKS5 proxies. In order to use SOCKS proxies you will need to install PySocks or install urllib3 with the socks extra:

pip install urllib3[socks]

Once PySocks is installed, you can use SOCKSProxyManager:

>>> from urllib3.contrib.socks import SOCKSProxyManager
>>> proxy = SOCKSProxyManager('socks5://localhost:8889/')
>>> proxy.request('GET', 'http://google.com/')