Module grab.spider.task

class grab.spider.task.Task(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, disable_cache=False, refresh_cache=False, valid_status=None, use_proxylist=True, cache_timeout=None, delay=None, raw=False, callback=None, fallback_name=None, **kwargs)

Task for spider.

__init__(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, disable_cache=False, refresh_cache=False, valid_status=None, use_proxylist=True, cache_timeout=None, delay=None, raw=False, callback=None, fallback_name=None, **kwargs)

Create Task object.

If more than one of the url, grab and grab_config options is non-empty, they are processed in the following order:
  • grab overwrites grab_config
  • grab_config overwrites url
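The precedence described above can be illustrated with a small stand-alone sketch. It does not import Grab: effective_url is a hypothetical helper, FakeGrab is a made-up stand-in for a configured Grab instance, and a plain dict stands in for grab_config. The real constructor may apply additional validation that is not shown here.

```python
class FakeGrab:
    """Hypothetical stand-in for a configured Grab instance."""

    def __init__(self, url):
        # Mirrors only the grab.config['url'] lookup mentioned in the docs.
        self.config = {"url": url}


def effective_url(url=None, grab=None, grab_config=None):
    """Pick the URL in the documented order: grab, then grab_config, then url."""
    if grab:
        return grab.config["url"]
    if grab_config:
        return grab_config["url"]
    return url


only_url = effective_url(url="http://example.com/a")
with_config = effective_url(
    url="http://example.com/a",
    grab_config={"url": "http://example.com/b"},
)
with_grab = effective_url(
    url="http://example.com/a",
    grab_config={"url": "http://example.com/b"},
    grab=FakeGrab("http://example.com/c"),
)
print(only_url)
print(with_config)
print(with_grab)
```

Each call adds one more option, and the option that ranks higher in the order determines the resulting URL.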

Args:
param name:

name of the task. After a successful network operation, the task's result is passed to the task_<name> method.

param url:

URL of the network document. Every task requires the url or grab option to be specified.

param grab:

configured Grab instance. Use this option when the url option is not enough. Do not forget to set the url option of the Grab instance, because in this case the url option of the Task constructor is overwritten with grab.config['url'].

param priority:

priority of the Task. Tasks with a lower priority value are processed earlier. By default, each new task is assigned a random priority from the (80, 100) range.

param priority_set_explicitly:

internal flag that tells whether the task priority was assigned manually or generated by the spider according to its priority generation rules.

param network_try_count:

you will probably not need to use it. It is used internally to track how many times this task was restarted due to network errors. The Spider instance has a network_try_limit option. When the network_try_count attribute of the task exceeds network_try_limit, processing of the task is abandoned.

param task_try_count:

the same as network_try_count, but it is increased only when you use the clone method. You can also set it manually. It is useful if you want to restart the task after it was cancelled due to multiple network errors. As you might have guessed, the Spider instance has a task_try_limit option. Together, the network_try_limit and task_try_limit options guarantee that you will not get an infinite loop of restarting a task.

param disable_cache:

if True, the cache subsystem is disabled: the document is fetched from the network and is not saved to the cache.

param refresh_cache:

if True, the document is fetched from the network and saved to the cache.

param valid_status:

extra status codes that count as valid

param use_proxylist:

whether to use the proxy list that was configured via the setup_proxylist method of the spider

param delay:

if specified, tells the spider to schedule the task and execute it after delay seconds

param raw:

if raw is True, the network response is forwarded to the corresponding handler without any check of the HTTP status code or network error. If raw is False (the default), a failed response is put back into the task queue or, if the tries limit is reached, processing of the request is finished.

param callback:

if you pass a function in the callback option, the network response is passed to this callback, the usual 'task_*' handler is ignored, and no error is raised if such a 'task_*' handler does not exist.

param fallback_name:

the name of the method that is called when the spider gives up on the task (due to multiple network errors)
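How the name, callback and fallback_name options choose a handler can be sketched without Grab. Everything below is hypothetical: SketchSpider is not Grab's Spider class, tasks are plain dicts rather than Task objects, and the handler arguments (response, task) are an assumption made for illustration.

```python
class SketchSpider:
    """Hypothetical stand-in for a spider with one named handler."""

    def task_page(self, response, task):
        return "task_page got " + response

    def task_page_fallback(self, task):
        return "gave up on " + task["name"]


def pick_handler(spider, task):
    """Return the callable that receives a successful response."""
    if task.get("callback") is not None:
        # A callback wins, so the task_<name> method need not exist.
        return task["callback"]
    return getattr(spider, "task_" + task["name"])


def pick_fallback(spider, task):
    """Return the method named by fallback_name, or None if it is unset."""
    name = task.get("fallback_name")
    return getattr(spider, name) if name else None


spider = SketchSpider()
by_name = {"name": "page"}
by_callback = {
    "name": "missing",  # no task_missing method exists on the spider
    "callback": lambda response, task: "callback got " + response,
}
gave_up = {"name": "page", "fallback_name": "task_page_fallback"}

print(pick_handler(spider, by_name)("<html>", by_name))
print(pick_handler(spider, by_callback)("<html>", by_callback))
print(pick_fallback(spider, gave_up)(gave_up))
```

The second task shows why no error is raised for a missing 'task_*' method: the lookup by name is never attempted once a callback is present.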

Any non-standard named arguments passed to the Task constructor are saved as attributes of the object. You can read their values later as attributes or with the get method, which lets you supply a default value if the attribute does not exist.
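A minimal sketch of this behaviour, using a hypothetical AttrSketchTask class instead of the real Task so that it runs without Grab installed. The argument names page_number and category are invented for the example.

```python
class AttrSketchTask:
    """Hypothetical stand-in that mimics only the extra-argument handling."""

    def __init__(self, name=None, url=None, **kwargs):
        self.name = name
        self.url = url
        # Extra keyword arguments become plain attributes of the object.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def get(self, key, default=None):
        """Return the attribute value, or default if it does not exist."""
        return getattr(self, key, default)


task = AttrSketchTask(name="page", url="http://example.com/", page_number=3)
print(task.page_number)                 # attribute access
print(task.get("page_number"))          # same value through get
print(task.get("category", "unknown"))  # missing attribute, default used
```

Plain attribute access on a missing name raises AttributeError in this sketch, which is why get with a default is the safer way to read optional values.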

clone(**kwargs)

Clone Task instance.

Resets network_try_count and increases task_try_count. Resets the priority attribute if it was not set explicitly.
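The counter and priority handling described for clone can be sketched with a hypothetical CloneSketchTask class. Two points are assumptions rather than documented behaviour: that a reset priority is represented as None (to be regenerated by the spider), and that keyword arguments passed to clone override attributes of the copy.

```python
import copy


class CloneSketchTask:
    """Hypothetical stand-in that mimics only the documented clone rules."""

    def __init__(self, name, priority=None, priority_set_explicitly=True,
                 network_try_count=0, task_try_count=1):
        self.name = name
        self.priority = priority
        self.priority_set_explicitly = priority_set_explicitly
        self.network_try_count = network_try_count
        self.task_try_count = task_try_count

    def clone(self, **kwargs):
        new_task = copy.copy(self)
        # A clone starts with a fresh network error counter...
        new_task.network_try_count = 0
        # ...but records that the task itself has been tried once more.
        new_task.task_try_count = self.task_try_count + 1
        if not new_task.priority_set_explicitly:
            # Assumption: None means "let the spider generate a new priority".
            new_task.priority = None
        # Assumption: keyword arguments override attributes of the copy.
        for key, value in kwargs.items():
            setattr(new_task, key, value)
        return new_task


failed = CloneSketchTask("page", priority=90, priority_set_explicitly=False,
                         network_try_count=3)
retry = failed.clone()
pinned = CloneSketchTask("page", priority=5).clone()

print(retry.network_try_count, retry.task_try_count, retry.priority)
print(pinned.priority)
```

The original object is left unchanged, so failed.network_try_count is still 3 after the call.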

get(key, default=None)

Return the value of the attribute, or default (None unless specified) if the attribute does not exist.