Task Object

Any Grab::Spider crawler is a set of handlers that process network responses. Each handler can spawn new network requests or just process/save data. The spider adds each new request to the task queue and processes it when a free network stream becomes available. Each task is assigned a name that defines its type, and each type of task is handled by a specific handler. To find the handler, the spider takes the name of the task and looks for a method called task_<name>.

For example, to handle the result of a task named “contact_page”, we need to define a “task_contact_page” method:

...
self.add_task(Task('contact_page', url='http://domain.com/contact.html'))
...

def task_contact_page(self, grab, task):
    ...

Constructor of Task Class

The constructor of the Task class accepts multiple arguments. At a minimum, you have to specify the name of the task and either a URL or a Request instance. Below are examples of different ways to create a task. Both examples do the same thing:

# Using the `url` argument
t = Task('wikipedia', url='http://wikipedia.org/')

# Using a Request instance
t = Task('wikipedia', Request(url="http://wikipedia.org/"))

Task Object as Data Storage

If you pass an argument that the constructor does not know about, it will be saved in the Task object. This allows you to pass data between a network request and the handler that processes its response.

There is also a get method that returns the value of a task attribute, or None if that attribute has not been defined.

t = Task('bing', url='http://bing.com/', disable_cache=True, foo='bar')
t.foo # == "bar"
t.get('foo') # == "bar"
t.get('asdf') # == None
t.get('asdf', 'qwerty') # == "qwerty"
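
Custom attributes are handy for passing data from one handler to the next. A minimal sketch (the task names, URL, and the company attribute are invented for illustration):

from grab.spider import Spider, Task

class ExampleSpider(Spider):
    def task_search(self, grab, task):
        # Pass extracted data along as a custom task attribute.
        yield Task('contact_page',
                   url='http://domain.com/contact.html',
                   company='ACME')

    def task_contact_page(self, grab, task):
        # The custom attribute is available on the task object here.
        print(task.company)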

Cloning Task Object

Sometimes it is useful to create a copy of a Task object, for example, to retry a failed request or to move on to the next page of a paginated listing. The clone method copies the task; keyword arguments passed to clone override the corresponding attributes of the copy. A minimal sketch (the listing URL and the custom page attribute are invented for illustration):
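
def task_listing(self, grab, task):
    # ... process the current listing page ...

    # Schedule the next page: the clone keeps the task name and custom
    # attributes, while url and page are overridden.
    page = task.get('page', 1)
    yield task.clone(url='http://domain.com/listing?page=%d' % (page + 1),
                     page=page + 1)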

Setting Up Initial Tasks

When you call the run method of your spider, it starts working from the initial tasks. There are a few ways to set up initial tasks.

initial_urls

You can specify a list of URLs in self.initial_urls. For each URL in this list, the spider will create a Task object named “initial” (its responses are handled by the task_initial method):

class ExampleSpider(Spider):
    initial_urls = ['http://google.com/', 'http://yahoo.com/']

task_generator

A more flexible way to define initial tasks is the task_generator method. Its interface is simple: you just have to yield new Task objects.

A common use case is processing a large number of URLs stored in a file. With task_generator you can iterate over the lines of the file and yield new tasks. This saves memory because the whole file is never read into memory at once: the spider consumes only a portion of tasks from task_generator, and when free network resources become available it consumes the next portion, and so on.

Example:

class ExampleSpider(Spider):
    def task_generator(self):
        with open('var/urls.txt') as inp:
            for line in inp:
                yield Task('download', url=line.strip())

Explicit Ways to Add a New Task

Adding Tasks With the add_task Method

You can use the add_task method anywhere, even before the spider has started working:

bot = ExampleSpider()
bot.add_task(Task('google', url='http://google.com'))
bot.run()

Yield New Tasks

You can use the yield statement to add new tasks in two places: first, in task_generator; second, in any handler. Using yield is completely equivalent to using the add_task method; yielding is just a bit more elegant:

class ExampleSpider(Spider):
    initial_urls = ['http://google.com']

    def task_initial(self, grab, task):
        # Google page was fetched
        # Now let's download yahoo page
        yield Task('yahoo', url='http://yahoo.com')

    def task_yahoo(self, grab, task):
        pass
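
For comparison, the task_initial handler above could schedule the same task with an explicit add_task call instead of yield; the effect is identical:

def task_initial(self, grab, task):
    # Equivalent to `yield Task('yahoo', url='http://yahoo.com')`.
    self.add_task(Task('yahoo', url='http://yahoo.com'))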

Default Grab Instance

You can control the default config of Grab instances used in spider tasks. Define the create_grab_instance method in your spider class:

class TestSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(TestSpider, self).create_grab_instance(**kwargs)
        g.setup(timeout=20)
        return g

Be aware that this method allows you to control only those Grab instances that are created automatically. If you create a task with an explicit Grab instance or Grab config, it will not be affected by the create_grab_instance method:

class TestSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = Grab(**kwargs)
        g.setup(timeout=20)
        return g

    def task_generator(self):
        g = Grab(url='http://example.com')
        yield Task('page', grab_config=g.dump_config())
        # The grab instance in the yielded task
        # will not be affected by `create_grab_instance` method.