HTML Document Charset

Why does charset matter?

By default, Grab automatically detects the charset of the body of the HTML document. It uses this detected charset to

  • build a DOM tree
  • convert the bytes from the body of the document into a unicode stream
  • search for some unicode string in the body of the document
  • convert unicode into bytes data, then some unicode data needs to be sent to the server from which the response was received.

The original content of the network response is always accessible at response.body attribute. A unicode representation of the document body can be obtained by calling response.unicode_body():

>>> g.go('http://mail.ru/')
<grab.response.Response object at 0x7f7d38af8940>
>>> type(g.response.body)
<type 'str'>
>>> type(g.response.unicode_body())
<type 'unicode'>
>>> g.response.charset
'utf-8'

Charset Detection Algorithm

Grab checks multiple sources to find out the real charset of the document’s body. The order of sources (from most important to less):

  • HTML meta tag:

    <meta name="http-equiv" content="text/html; charset=cp1251" >
    
  • XML declaration (in case of XML document):

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    
  • Content-Type HTTP header:

    Content-Type: text/html; charset=koi8-r
    

If no source indicates the charset, or if the found charset has an invalid value, then grab falls back to a default of UTF-8.

Setting the charset manually

You can bypass automatic charset detection and specify it manually with charset option.