grab.spider.base

Module Contents

Classes

Spider

Asynchronous scraping framework.

Attributes

DEFAULT_TASK_PRIORITY

DEFAULT_NETWORK_STREAM_NUMBER

DEFAULT_TASK_TRY_LIMIT

DEFAULT_NETWORK_TRY_LIMIT

RANDOM_TASK_PRIORITY_RANGE

logger

system_random

HTTP_STATUS_ERROR

HTTP_STATUS_NOT_FOUND

WAIT_SERVICE_SHUTDOWN_SEC

grab.spider.base.DEFAULT_TASK_PRIORITY = 100[source]
grab.spider.base.DEFAULT_NETWORK_STREAM_NUMBER = 3[source]
grab.spider.base.DEFAULT_TASK_TRY_LIMIT = 5[source]
grab.spider.base.DEFAULT_NETWORK_TRY_LIMIT = 5[source]
grab.spider.base.RANDOM_TASK_PRIORITY_RANGE = (50, 100)[source]
grab.spider.base.logger[source]
grab.spider.base.system_random[source]
grab.spider.base.HTTP_STATUS_ERROR = 400[source]
grab.spider.base.HTTP_STATUS_NOT_FOUND = 404[source]
grab.spider.base.WAIT_SERVICE_SHUTDOWN_SEC = 10[source]
class grab.spider.base.Spider(task_queue: None | grab.spider.queue_backend.base.BaseTaskQueue = None, thread_number: None | int = None, network_try_limit: None | int = None, task_try_limit: None | int = None, priority_mode: str = 'random', meta: None | dict[str, Any] = None, config: None | dict[str, Any] = None, parser_requests_per_process: int = 10000, parser_pool_size: int = 1, network_service: None | grab.spider.service.network.BaseNetworkService = None, grab_transport: None | grab.base.BaseTransport[grab.request.HttpRequest, grab.document.Document] | type[grab.base.BaseTransport[grab.request.HttpRequest, grab.document.Document]] = None)[source]

Asynchronous scraping framework.

spider_name[source]
initial_urls: list[str] = [][source]
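
For orientation, a minimal sketch of a Spider subclass is shown below. It assumes the classic grab convention that a handler named task_<name> is called for completed tasks with that name; the exact handler signature can differ between grab versions, and the class name and URL are illustrative.

  # Minimal sketch, assuming the classic handler convention
  # (task_<name> receives the grab object and the task).
  from grab.spider import Spider, Task

  class ExampleSpider(Spider):
      # Every URL listed here is wrapped into a Task named "initial"
      # when the spider starts (see process_initial_urls below).
      initial_urls = ["https://example.com/"]

      def task_initial(self, grab, task):
          # Called for each completed "initial" task.
          print("fetched:", task.url)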
collect_runtime_event(name: str, value: None | str) None[source]
setup_queue(*_args: Any, **_kwargs: Any) None[source]

Set up queue.

add_task(task: grab.spider.task.Task, queue: None | grab.spider.queue_backend.base.BaseTaskQueue = None, raise_error: bool = False) bool[source]

Add task to the task queue.
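
A sketch of enqueuing a task manually; bot is assumed to be a Spider instance, and the task name and URL are illustrative.

  from grab.spider import Task

  # add_task() returns False if the task was rejected,
  # e.g. its try counters exceeded the configured limits.
  ok = bot.add_task(Task("page", url="https://example.com/page/1"))
  if not ok:
      print("task rejected")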

stop() None[source]

Instruct spider to stop processing new tasks and start shutting down.
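
For example, a handler might stop the crawl once enough pages have been collected; the counter and threshold below are illustrative, and the handler convention follows the sketch above.

  from grab.spider import Spider

  class ExampleSpider(Spider):
      def prepare(self):
          self.pages_seen = 0

      def task_page(self, grab, task):
          self.pages_seen += 1
          if self.pages_seen >= 100:
              # No new tasks will be processed; the spider shuts down.
              self.stop()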

load_proxylist(source: str | proxylist.base.BaseProxySource, source_type: None | str = None, proxy_type: str = 'http', auto_init: bool = True, auto_change: bool = True) None[source]

Load proxy list.

Parameters
  • source – Proxy source. Accepts string (file path, url) or BaseProxySource instance.

  • source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.

  • proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.

  • auto_change – If set to True, random proxy rotation will be applied automatically.

Proxy source format should be one of the following (for each line):

  • ip:port

  • ip:port:login:password
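
For example, assuming a local file proxies.txt with one proxy per line in one of the formats above:

  # A sketch: load proxies from a text file and rotate them randomly.
  bot.load_proxylist("proxies.txt", source_type="text_file",
                     proxy_type="http", auto_change=True)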

render_stats() str[source]
prepare() None[source]

Do additional spider customization here.

This method runs before the spider has started working.

shutdown() None[source]

Override this method to do some final actions after parsing has been done.
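
A sketch of both customization hooks, assuming a spider that writes results to a file (the file name is illustrative):

  from grab.spider import Spider

  class ExampleSpider(Spider):
      def prepare(self):
          # Runs once, before any task is processed.
          self.result_file = open("result.txt", "w", encoding="utf-8")

      def shutdown(self):
          # Runs once, after all parsing is done.
          self.result_file.close()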

create_grab_instance(**kwargs: Any) grab.Grab[source]
task_generator() collections.abc.Iterator[grab.spider.task.Task][source]

You can override this method to load new tasks.

It is called whenever the number of tasks in the task queue drops below twice the number of threads. This allows you to avoid loading all tasks into memory when their total number is large.
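
A sketch of the intended use, assuming a large list of URLs stored in a file (the file name and task name are illustrative):

  from grab.spider import Spider, Task

  class ExampleSpider(Spider):
      def task_generator(self):
          # Tasks are pulled lazily, so the whole file never has
          # to be loaded into memory at once.
          with open("urls.txt", encoding="utf-8") as inp:
              for line in inp:
                  yield Task("page", url=line.strip())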

check_task_limits(task: grab.spider.task.Task) tuple[bool, str][source]

Check that task’s network & try counters do not exceed limits.

Returns:

  • on success: (True, None)

  • on error: (False, reason)

generate_task_priority() int[source]
process_initial_urls() None[source]
get_task_from_queue() None | Literal[True] | grab.spider.task.Task[source]
is_valid_network_response_code(code: int, task: grab.spider.task.Task) bool[source]

Test if response is valid.

A valid response is handled with the associated task handler. A failed response is processed with the error handler.

process_parser_error(func_name: str, task: grab.spider.task.Task, exc_info: tuple[type[Exception], Exception, types.TracebackType]) None[source]
find_task_handler(task: grab.spider.task.Task) collections.abc.Callable[Ellipsis, Any][source]
log_network_result_stats(res: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) None[source]
process_grab_proxy(task: grab.spider.task.Task, grab: grab.Grab) None[source]

Assign new proxy from proxylist to the task.

change_active_proxy(task: grab.spider.task.Task, grab: grab.Grab) None[source]
get_task_queue() grab.spider.queue_backend.base.BaseTaskQueue[source]
is_idle_estimated() bool[source]
is_idle_confirmed(services: list[grab.spider.service.base.BaseService]) bool[source]

Test if spider is fully idle.

WARNING: As a side effect, it stops all services in order to read the state of the queues unaffected by the services.

The spider is fully idle when all of the following conditions are met:

  • all services are paused, i.e. they do not change queues

  • all queues are empty

  • the task generator is completed

run() None[source]
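
Typical entry point, assuming the ExampleSpider sketches above; run() blocks until all tasks are processed.

  bot = ExampleSpider(thread_number=3)
  bot.run()
  # render_stats() returns a human-readable summary of counters
  # collected during the crawl.
  print(bot.render_stats())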
shutdown_services(services: list[grab.spider.service.base.BaseService]) None[source]
log_failed_network_result(res: grab.spider.service.network.NetworkResult) None[source]
log_rejected_task(task: grab.spider.task.Task, reason: str) None[source]
get_fallback_handler(task: grab.spider.task.Task) None | collections.abc.Callable[Ellipsis, Any][source]
srv_process_service_result(result: grab.spider.task.Task | None | Exception | dict[str, Any], task: grab.spider.task.Task, meta: None | dict[str, Any] = None) None[source]

Process result submitted from any service to task dispatcher service.

Result could be:

  • Task instance

  • None

  • ResponseNotValidError-based exception

  • arbitrary exception

  • network response: {ok, ecode, emsg, exc, grab, grab_config_backup}

An exception can come only from parser_service, and it always has meta {“from”: “parser”, “exc_info”: <…>}.

srv_process_network_result(result: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) None[source]
srv_process_task(task: grab.spider.task.Task) None[source]