socio4health.utils.standard_spider.StandardSpider
- class socio4health.utils.standard_spider.StandardSpider(*args: Any, **kwargs: Any)
A standard spider for scraping links from a given URL.
- ext
A list of file extensions to filter links. Default includes common document formats.
- Type:
list, optional
- key_words
A list of keywords or regex conditions to filter links by filename. By default it is an empty list.
- Type:
list, optional
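The two attributes above can be pictured as a pair of filters applied to each link's filename. The following is a minimal sketch using only the standard library; `filter_links` is a hypothetical helper, not part of socio4health, and the fallback extension list, the leading-dot convention, and the case-insensitive matching are all assumptions, since the page does not spell them out.

```python
import re
from posixpath import splitext
from urllib.parse import unquote, urlparse


def filter_links(links, ext=None, key_words=None):
    """Keep links whose filename has an allowed extension and matches a keyword.

    Hypothetical helper: it mirrors the documented `ext` and `key_words`
    attributes but is not StandardSpider's actual implementation.
    """
    # Assumed fallback extensions; the real defaults are not listed on this page.
    allowed = [e.lower() for e in (ext or [".csv", ".xlsx", ".pdf", ".zip"])]
    patterns = [re.compile(k, re.IGNORECASE) for k in (key_words or [])]

    kept = []
    for link in links:
        # Use only the URL path so query strings do not hide the extension.
        filename = unquote(urlparse(link).path).rsplit("/", 1)[-1]
        _, suffix = splitext(filename)
        if suffix.lower() not in allowed:
            continue
        # An empty key_words list is assumed to mean "no keyword filtering".
        if patterns and not any(p.search(filename) for p in patterns):
            continue
        kept.append(link)
    return kept


links = [
    "https://example.org/data/census_2020.csv",
    "https://example.org/data/census_2020.html",
    "https://example.org/docs/survey_2019.pdf?download=1",
    "https://example.org/about",
]
csv_and_pdf = filter_links(links, ext=[".csv", ".pdf"])
census_only = filter_links(links, ext=[".csv", ".pdf"], key_words=[r"census_\d{4}"])
```

In this sketch `csv_and_pdf` keeps the first and third links, and `census_only` keeps only the first, because the keyword is treated as a regular expression searched against the filename.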
- __init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)
Initialize the spider with parameters.
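The page does not describe how `depth` is interpreted. One plausible reading, given that `parse` accepts a `current_depth` argument, is that links are followed only while the current depth is below the configured limit. The sketch below illustrates that reading on a hypothetical in-memory site; `SITE` and `crawl` are illustrative names, no network access or Scrapy machinery is involved, and the real spider may behave differently.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/report.pdf", "/c"],
    "/b": ["/b/data.csv"],
    "/c": ["/c/deep.csv"],
}


def crawl(start, depth=0):
    """Collect links from `start` and from pages up to `depth` hops away."""
    found, seen = [], {start}
    queue = deque([(start, 0)])  # (page, current_depth)
    while queue:
        page, current_depth = queue.popleft()
        for link in SITE.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            found.append(link)
            # Assumed rule: follow a link only while the depth budget remains.
            if current_depth < depth:
                queue.append((link, current_depth + 1))
    return found


shallow = crawl("/", depth=0)
deeper = crawl("/", depth=2)
```

Under this assumption, `depth=0` collects only the links on the start page, while larger values also visit the pages those links point to.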
Methods
__delattr__(name, /): Implement delattr(self, name).
__dir__(): Default dir() implementation.
__eq__(value, /): Return self==value.
__format__(format_spec, /): Default object formatter.
__ge__(value, /): Return self>=value.
__getattribute__(name, /): Return getattr(self, name).
__getstate__(): Helper for pickle.
__gt__(value, /): Return self>value.
__hash__(): Return hash(self).
__init__([url, depth, ext, key_words]): Initialize the spider with parameters.
__init_subclass__: This method is called when a class is subclassed.
__le__(value, /): Return self<=value.
__lt__(value, /): Return self<value.
__ne__(value, /): Return self!=value.
__new__(cls, *args, **kwargs)
__reduce__(): Helper for pickle.
__reduce_ex__(protocol, /): Helper for pickle.
__repr__(): Return repr(self).
__setattr__(name, value, /): Implement setattr(self, name, value).
__sizeof__(): Size of object in memory, in bytes.
__str__(): Return str(self).
__subclasshook__: Abstract classes can override this to customize issubclass().
_parse(response, **kwargs)
_set_crawler(crawler)
close(spider, reason)
closed(reason): Handle actions to perform when the spider is closed.
from_crawler(crawler, *args, **kwargs)
handles_request(request)
log(message[, level]): Log the given message at the given log level.
parse(response[, current_depth]): Parse the response to extract links based on criteria.
start(): Yield the initial Request objects to send.
start_requests()
update_settings(settings)
Attributes
__annotations__
__dict__
__doc__
__module__
__slots__
__weakref__: list of weak references to the object
custom_settings: Start URLs.