grab.spider

Package Contents

Classes

Spider

Asynchronous scraping framework.

Task

Task for spider.

class grab.spider.Spider(task_queue: None | grab.spider.queue_backend.base.BaseTaskQueue = None, thread_number: None | int = None, network_try_limit: None | int = None, task_try_limit: None | int = None, priority_mode: str = 'random', meta: None | dict[str, Any] = None, config: None | dict[str, Any] = None, parser_requests_per_process: int = 10000, parser_pool_size: int = 1, network_service: None | grab.spider.service.network.BaseNetworkService = None, grab_transport: None | grab.base.BaseTransport[grab.request.HttpRequest, grab.document.Document] | type[grab.base.BaseTransport[grab.request.HttpRequest, grab.document.Document]] = None)

Asynchronous scraping framework.

spider_name
initial_urls: list[str] = []
collect_runtime_event(name: str, value: None | str) -> None
setup_queue(*_args: Any, **_kwargs: Any) -> None

Set up queue.

add_task(task: grab.spider.task.Task, queue: None | grab.spider.queue_backend.base.BaseTaskQueue = None, raise_error: bool = False) -> bool

Add task to the task queue.

stop() -> None

Instruct spider to stop processing new tasks and start shutting down.

load_proxylist(source: str | proxylist.base.BaseProxySource, source_type: None | str = None, proxy_type: str = 'http', auto_init: bool = True, auto_change: bool = True) -> None

Load proxy list.

Parameters
  • source – Proxy source. Accepts string (file path, url) or BaseProxySource instance.

  • source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.

  • proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.

  • auto_change – If set to True, proxies are rotated automatically in random order.

Each line of the proxy source should use one of the following formats:

  • ip:port

  • ip:port:login:password
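The two line formats above can be illustrated with a small standalone parser. This is a sketch only and not the parser used by the proxylist package; `ProxyEntry` and `parse_proxy_line` are hypothetical names introduced for the example.

```python
from typing import NamedTuple, Optional


class ProxyEntry(NamedTuple):
    """Hypothetical container for one parsed proxy line (not a grab type)."""

    host: str
    port: int
    login: Optional[str] = None
    password: Optional[str] = None


def parse_proxy_line(line: str) -> ProxyEntry:
    """Parse `ip:port` or `ip:port:login:password` into a ProxyEntry."""
    parts = line.strip().split(":")
    if len(parts) == 2:
        host, port = parts
        return ProxyEntry(host, int(port))
    if len(parts) == 4:
        host, port, login, password = parts
        return ProxyEntry(host, int(port), login, password)
    # Any other number of fields matches neither documented format.
    raise ValueError(f"unsupported proxy line format: {line!r}")
```

A line with credentials fills all four fields, while a plain `ip:port` line leaves `login` and `password` as None.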

render_stats() -> str
prepare() -> None

Do additional spider customization here.

This method runs before spider has started working.

shutdown() -> None

Override this method to do some final actions after parsing has been done.

create_grab_instance(**kwargs: Any) -> grab.Grab
task_generator() -> collections.abc.Iterator[grab.spider.task.Task]

You can override this method to load new tasks.

It is called each time the number of tasks in the task queue drops below the number of threads multiplied by 2. This keeps memory usage bounded when the total number of tasks is large.
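The refill rule can be sketched without the library. The example below is an illustration under assumptions, not the spider's actual scheduling code: `refill_queue` is a hypothetical helper, the queue is a plain `deque`, the URLs are placeholders, and topping the queue up to exactly `thread_number * 2` items is an assumed policy.

```python
from collections import deque
from typing import Deque, Iterator


def task_generator() -> Iterator[str]:
    """Lazily yield task URLs (placeholder example.com addresses)."""
    for page in range(1, 101):
        yield f"https://example.com/page/{page}"


def refill_queue(queue: Deque[str], generator: Iterator[str], thread_number: int) -> int:
    """Pull tasks while the queue holds fewer than thread_number * 2 items.

    Returns the number of tasks added during this call.
    """
    added = 0
    while len(queue) < thread_number * 2:
        try:
            queue.append(next(generator))
        except StopIteration:
            break  # generator exhausted, nothing more to load
        added += 1
    return added
```

Because the generator is consumed lazily, only a handful of the 100 tasks exist in memory at any moment, which is the benefit the description points to.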

check_task_limits(task: grab.spider.task.Task) -> tuple[bool, str]

Check that task’s network & try counters do not exceed limits.

Returns:

  • on success: (True, None)

  • on error: (False, reason)
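The return convention can be shown with a standalone sketch. It assumes the limits are passed in as arguments and that each counter is compared with a simple "greater than" test; `TaskCounters` and the reason strings are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TaskCounters:
    """Hypothetical stand-in holding only the two counters being checked."""

    network_try_count: int = 0
    task_try_count: int = 1


def check_task_limits(
    task: TaskCounters, network_try_limit: int, task_try_limit: int
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) when within limits, else (False, reason)."""
    if task.task_try_count > task_try_limit:
        return False, "task-try-count"  # illustrative reason string
    if task.network_try_count > network_try_limit:
        return False, "network-try-count"  # illustrative reason string
    return True, None
```

A caller would typically unpack the pair and pass the reason on to something like log_rejected_task when the first element is False.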

generate_task_priority() -> int
process_initial_urls() -> None
get_task_from_queue() -> None | Literal[True] | grab.spider.task.Task
is_valid_network_response_code(code: int, task: grab.spider.task.Task) -> bool

Test if response is valid.

A valid response is handled by the associated task handler. A failed response is processed by the error handler.
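The routing described above can be sketched as follows. The validity rule used here (codes below 400, plus any code listed in the task's valid_status) is an assumption made for illustration and may differ from the library's rule; `is_valid_code` and `route_response` are hypothetical names.

```python
from typing import Callable, Iterable, Optional


def is_valid_code(code: int, valid_status: Optional[Iterable[int]] = None) -> bool:
    """Assumed rule: codes below 400 are valid, plus any code in valid_status."""
    return code < 400 or code in set(valid_status or ())


def route_response(
    code: int,
    valid_status: Optional[Iterable[int]],
    task_handler: Callable[[int], str],
    error_handler: Callable[[int], str],
) -> str:
    """Send valid responses to the task handler, all others to the error handler."""
    if is_valid_code(code, valid_status):
        return task_handler(code)
    return error_handler(code)
```

Overriding is_valid_network_response_code in a Spider subclass is the documented hook for changing this decision for specific tasks.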

process_parser_error(func_name: str, task: grab.spider.task.Task, exc_info: tuple[type[Exception], Exception, types.TracebackType]) -> None
find_task_handler(task: grab.spider.task.Task) -> collections.abc.Callable[Ellipsis, Any]
log_network_result_stats(res: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) -> None
process_grab_proxy(task: grab.spider.task.Task, grab: grab.Grab) -> None

Assign new proxy from proxylist to the task.

change_active_proxy(task: grab.spider.task.Task, grab: grab.Grab) -> None
get_task_queue() -> grab.spider.queue_backend.base.BaseTaskQueue
is_idle_estimated() -> bool
is_idle_confirmed(services: list[grab.spider.service.base.BaseService]) -> bool

Test if spider is fully idle.

WARNING: As a side effect, it stops all services so that the state of the queues can be read unaffected by the services.

The spider is fully idle when all of these conditions are met:

  • all services are paused, i.e. they do not change queues

  • all queues are empty

  • the task generator is completed
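The three conditions can be combined as in the sketch below. It is a simplified model, not the spider's implementation: `ServiceStub` is a hypothetical class with a pause flag and a list as its queue, and the completion state of the task generator is passed in as a boolean.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceStub:
    """Hypothetical service with a pause flag and its own queue."""

    paused: bool = False
    queue: List[object] = field(default_factory=list)

    def pause(self) -> None:
        self.paused = True


def is_idle_confirmed(services: List[ServiceStub], task_generator_done: bool) -> bool:
    """Pause every service, then test the three idle conditions."""
    for service in services:
        service.pause()  # side effect: queues can no longer change
    all_paused = all(service.paused for service in services)
    queues_empty = all(not service.queue for service in services)
    return all_paused and queues_empty and task_generator_done
```

Pausing first is what makes the queue check trustworthy: a queue that looks empty cannot be refilled by a service between the check and the final answer.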

run() -> None
shutdown_services(services: list[grab.spider.service.base.BaseService]) -> None
log_failed_network_result(res: grab.spider.service.network.NetworkResult) -> None
log_rejected_task(task: grab.spider.task.Task, reason: str) -> None
get_fallback_handler(task: grab.spider.task.Task) -> None | collections.abc.Callable[Ellipsis, Any]
srv_process_service_result(result: grab.spider.task.Task | None | Exception | dict[str, Any], task: grab.spider.task.Task, meta: None | dict[str, Any] = None) -> None

Process result submitted from any service to task dispatcher service.

The result can be one of:

  • Task instance

  • None

  • ResponseNotValidError-based exception

  • Arbitrary exception

  • Network response: {ok, ecode, emsg, exc, grab, grab_config_backup}

An exception can come only from parser_service, and it always carries meta {"from": "parser", "exc_info": <…>}.
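One way to picture the dispatch over these result kinds is a chain of type checks. The sketch below is illustrative only: `TaskStub` and `ResponseNotValidErrorStub` are hypothetical stand-ins for the real classes, and the returned strings merely name the branches rather than describe what the spider does in each case.

```python
from typing import Any, Dict, Optional


class TaskStub:
    """Hypothetical stand-in for a spider task."""


class ResponseNotValidErrorStub(Exception):
    """Hypothetical stand-in for a ResponseNotValidError-based exception."""


def classify_result(result: Any, meta: Optional[Dict[str, Any]] = None) -> str:
    """Name the branch that matches the kind of result received."""
    if result is None:
        return "none"
    if isinstance(result, TaskStub):
        return "task"
    # The specific exception type must be tested before the generic one.
    if isinstance(result, ResponseNotValidErrorStub):
        return "response-not-valid"
    if isinstance(result, Exception):
        origin = (meta or {}).get("from", "unknown")
        return f"exception from {origin}"
    if isinstance(result, dict):
        return "network-response"
    raise TypeError(f"unsupported result type: {type(result).__name__}")
```

The order of the checks matters because a ResponseNotValidError-based exception is also an Exception; testing the generic type first would hide the more specific branch.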

srv_process_network_result(result: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) -> None
srv_process_task(task: grab.spider.task.Task) -> None
class grab.spider.Task(name: None | str = None, url: None | str | grab.request.HttpRequest = None, request: None | grab.request.HttpRequest = None, priority: None | int = None, priority_set_explicitly: bool = True, network_try_count: int = 0, task_try_count: int = 1, valid_status: None | list[int] = None, use_proxylist: bool = True, delay: None | float = None, raw: bool = False, callback: None | collections.abc.Callable[Ellipsis, None] = None, fallback_name: None | str = None, store: None | dict[str, Any] = None, **kwargs: Any)

Bases: BaseTask

Task for spider.

check_init_kwargs(kwargs: collections.abc.Mapping[str, Any]) -> None
get(key: str, default: Any = None) -> Any

Return the value of the attribute, or default (None unless specified) if the attribute does not exist.

process_delay_option(delay: None | float) -> None
clone(url: None | str = None, request: None | grab.request.HttpRequest = None, **kwargs: Any) -> Task

Clone Task instance.

Reset network_try_count, increase task_try_count. Reset priority attribute if it was not set explicitly.
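The cloning rules can be modelled with a small dataclass. This is a sketch under assumptions rather than the library's code: `TaskSketch` is a hypothetical class carrying only the fields mentioned above, "increase" is taken to mean "add 1", and resetting the priority is modelled as setting it to None.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass
class TaskSketch:
    """Hypothetical stand-in with only the fields the cloning rules touch."""

    url: str
    priority: Optional[int] = None
    priority_set_explicitly: bool = True
    network_try_count: int = 0
    task_try_count: int = 1

    def clone(self, **overrides: Any) -> "TaskSketch":
        """Copy the task, applying the counter and priority rules."""
        changes: Dict[str, Any] = {
            "network_try_count": 0,  # reset
            "task_try_count": self.task_try_count + 1,  # assumed increment of 1
        }
        if not self.priority_set_explicitly:
            changes["priority"] = None  # priority was generated, so drop it
        changes.update(overrides)  # caller-supplied values win
        return replace(self, **changes)
```

The original object is left untouched, so a task can be cloned for a retry while the failed attempt is still available for logging.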

__repr__() -> str

Return repr(self).

__lt__(other: Task) -> bool

Return self<value.

__eq__(other: object) -> bool

Return self==value.