Module grab.spider¶
- class grab.spider.base.Spider(thread_number=None, network_try_limit=None, task_try_limit=None, request_pause=<object object>, priority_mode='random', meta=None, only_cache=False, config=None, args=None, parser_requests_per_process=10000, parser_pool_size=1, http_api_port=None, network_service='multicurl', grab_transport='pycurl', transport=None)[source]¶
  Asynchronous scraping framework.
- check_task_limits(task)[source]¶
  Check that the task's network and task try counters do not exceed the limits.
  Returns:
  - on success: (True, None)
  - on error: (False, reason)
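The (True, None) / (False, reason) return contract can be sketched as follows. This is an illustrative mock of the behaviour described above, not grab's actual implementation; FakeTask and the limit defaults are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class FakeTask:
    # Stand-in for grab.spider.Task; only the two retry counters matter here.
    task_try_count: int = 0
    network_try_count: int = 0


def check_task_limits(task, task_try_limit=10, network_try_limit=10):
    """Return (True, None) on success, (False, reason) on error."""
    if task.task_try_count > task_try_limit:
        return False, 'task-try-count'
    if task.network_try_count > network_try_limit:
        return False, 'network-try-count'
    return True, None
```

A caller branches on the first element of the tuple and uses the second to decide how to log or reschedule the failed task.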
- is_valid_network_response_code(code, task)[source]¶
  Answer the question: can the response be handled by the usual task handler, or has the task failed and should it be processed as an error?
- load_proxylist(source, source_type=None, proxy_type='http', auto_init=True, auto_change=True)[source]¶
  Load proxy list.
  Parameters:
  - source – Proxy source. Accepts a string (file path or URL) or a BaseProxySource instance.
  - source_type – The type of the specified source. Should be one of the following: 'text_file' or 'url'.
  - proxy_type – Should be one of the following: 'socks4', 'socks5' or 'http'.
  - auto_change – If set to True, automatic random proxy rotation will be used.
  The proxy source format should be one of the following (one proxy per line):
  - ip:port
  - ip:port:login:password
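The two line formats listed above can be parsed with a few lines of Python. This is an illustrative sketch of the format, not grab's internal parsing code; the dict keys are assumptions for the example.

```python
def parse_proxy_line(line):
    """Parse 'ip:port' or 'ip:port:login:password' into a dict."""
    parts = line.strip().split(':')
    if len(parts) == 2:
        host, port = parts
        return {'host': host, 'port': int(port),
                'login': None, 'password': None}
    if len(parts) == 4:
        host, port, login, password = parts
        return {'host': host, 'port': int(port),
                'login': login, 'password': password}
    raise ValueError('Invalid proxy line: %s' % line)
```

Lines that do not match either format raise ValueError rather than being silently skipped.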
- prepare()[source]¶
  You can do additional spider customization here before it starts working. Simply redefine this method in your Spider class.
- process_next_page(grab, task, xpath, resolve_base=False, **kwargs)[source]¶
  Generate a task for the next page.
  Parameters:
  - grab – Grab instance
  - task – Task object which should be assigned to the next page URL
  - xpath – xpath expression which calculates the list of URLs
  - **kwargs – extra settings for the new task object
  Example:
  self.process_next_page(grab, task, '//div[@class="topic"]/a/@href')
- setup_cache(backend='mongodb', database=None, **kwargs)[source]¶
  Setup cache.
  Parameters:
  - backend – Backend name. Should be one of the following: 'mongo', 'mysql' or 'postgresql'.
  - database – Database name.
  - kwargs – Additional credentials for the backend.
- setup_queue(backend='memory', **kwargs)[source]¶
  Setup queue.
  Parameters:
  - backend – Backend name. Should be one of the following: 'memory', 'redis' or 'mongo'.
  - kwargs – Additional credentials for the backend.
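setup_queue selects a queue implementation by backend name and forwards the extra keyword arguments to it. A minimal sketch of that dispatch pattern, using a hypothetical MemoryQueue class rather than grab's actual backend classes:

```python
class MemoryQueue:
    # Hypothetical in-process queue backend; grab's real backends live
    # in separate modules and are looked up by the backend name.
    def __init__(self, **kwargs):
        self.config = kwargs


BACKENDS = {'memory': MemoryQueue}


def setup_queue(backend='memory', **kwargs):
    """Instantiate the named queue backend, passing credentials via kwargs."""
    try:
        cls = BACKENDS[backend]
    except KeyError:
        raise ValueError('Unknown queue backend: %s' % backend)
    return cls(**kwargs)
```

Dispatching on a string key keeps the caller's code identical regardless of which backend (and which credential kwargs) are in use.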
- shutdown()[source]¶
  You can override this method to do some final actions after parsing has been done.
- stop()[source]¶
  This method sets an internal flag which signals the spider to stop processing new tasks and shut down.