Module grab.spider

class grab.spider.base.Spider(thread_number=None, network_try_limit=None, task_try_limit=None, request_pause=<object object>, priority_mode='random', meta=None, only_cache=False, config=None, args=None, parser_requests_per_process=10000, parser_pool_size=1, http_api_port=None, network_service='multicurl', grab_transport='pycurl', transport=None)[source]

Asynchronous scraping framework.
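
Example (a minimal sketch; the spider class name, task name and URL are illustrative):

from grab.spider import Spider, Task

class ExampleSpider(Spider):
    def task_generator(self):
        yield Task('page', url='https://example.com/')

    def task_page(self, grab, task):
        # Handler for tasks named 'page'
        print(task.url)

bot = ExampleSpider(thread_number=2)
bot.run()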

add_task(task, queue=None, raise_error=False)[source]

Add task to the task queue.
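
Example (a hypothetical handler enqueueing a follow-up request; the task name and URL are illustrative):

def task_page(self, grab, task):
    # Task is imported from grab.spider
    self.add_task(Task('comments', url='https://example.com/comments/1'))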

check_task_limits(task)[source]

Check that task’s network & try counters do not exceed limits.

Returns:
  • if success: (True, None)
  • if error: (False, reason)
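
Example (a sketch of consuming the returned tuple; printing the reason is illustrative):

ok, reason = self.check_task_limits(task)
if not ok:
    print('Task %s rejected: %s' % (task.name, reason))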

is_valid_network_response_code(code, task)[source]

Answer the question of whether the response can be handled by the usual task handler, or whether the task has failed and should be processed as an error.

load_proxylist(source, source_type=None, proxy_type='http', auto_init=True, auto_change=True)[source]

Load proxy list.

Parameters:
  • source – Proxy source. Accepts a string (file path or URL) or a BaseProxySource instance.
  • source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.
  • proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.
  • auto_change – If set to True then automatic random proxy rotation will be used.
Proxy source format should be one of the following (for each line):
  • ip:port
  • ip:port:login:password
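
Example (a sketch loading proxies from a local text file in prepare(); the file path is illustrative):

def prepare(self):
    self.load_proxylist('/var/data/proxies.txt', 'text_file', proxy_type='http')
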
prepare()[source]

You can do additional spider customization here before it starts working. Simply redefine this method in your Spider class.
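
Example (a sketch opening a file that handlers will write to; the attribute and file names are illustrative):

def prepare(self):
    self.result_file = open('result.txt', 'w')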

process_grab_proxy(task, grab)[source]

Assign a new proxy from the proxy list to the task.

process_next_page(grab, task, xpath, resolve_base=False, **kwargs)[source]

Generate task for next page.

Parameters:
  • grab – Grab instance
  • task – Task object which should be assigned to the next page URL
  • xpath – xpath expression which calculates the list of URLs
  • **kwargs – extra settings for the new task object

Example:

self.process_next_page(grab, task, '//a[@class="next-page"]/@href')

setup_cache(backend='mongodb', database=None, **kwargs)[source]

Setup cache.

Parameters:
  • backend – Backend name. Should be one of the following: ‘mongo’, ‘mysql’ or ‘postgresql’.
  • database – Database name.
  • kwargs – Additional credentials for backend.
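
Example (a sketch enabling the cache before the spider starts; the database name is illustrative and the default MongoDB backend is assumed):

bot = ExampleSpider()
bot.setup_cache(database='example_cache')
bot.run()
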
setup_queue(backend='memory', **kwargs)[source]

Setup queue.

Parameters:
  • backend – Backend name. Should be one of the following: ‘memory’, ‘redis’ or ‘mongo’.
  • kwargs – Additional credentials for backend.
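
Example (a sketch switching to a MongoDB-backed queue; the database name is illustrative and passing it as a keyword is assumed from the kwargs description above):

bot = ExampleSpider()
bot.setup_queue(backend='mongo', database='example_queue_db')
bot.run()
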
shutdown()[source]

You can override this method to do some final actions after parsing has been done.
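
Example (a sketch closing a file opened in prepare(); the attribute is illustrative):

def shutdown(self):
    self.result_file.close()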

stop()[source]

This method sets an internal flag which signals the spider to stop processing new tasks and shut down.
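
Example (a sketch stopping the spider from a handler after a page limit; the counter attribute, handler name and limit are illustrative):

def task_page(self, grab, task):
    self.pages_fetched = getattr(self, 'pages_fetched', 0) + 1
    if self.pages_fetched >= 100:
        self.stop()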

task_generator()[source]

You can override this method to load new tasks smoothly.

It is called each time the number of tasks in the task queue drops below the number of threads multiplied by two. This allows you to avoid loading all tasks into memory when the total number of tasks is big.
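
Example (a sketch yielding tasks lazily from a file of URLs; the file name and task name are illustrative):

def task_generator(self):
    # Task is imported from grab.spider
    with open('urls.txt') as inp:
        for line in inp:
            yield Task('page', url=line.strip())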

update_grab_instance(grab)[source]

Use this method to automatically update the config of any Grab instance created by the spider.
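
Example (a sketch applying a network timeout to every request; the timeout value is illustrative):

def update_grab_instance(self, grab):
    grab.setup(timeout=30)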