socio4health.utils.standard_spider.StandardSpider
- class socio4health.utils.standard_spider.StandardSpider(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)
A standard spider for scraping links from a given URL.
- ext
A list of file extensions to filter links. Default includes common document formats.
- Type:
list, optional
- key_words
A list of keywords or regex conditions to filter links by filename. By default it is an empty list.
- Type:
list, optional
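The two filters described above can be illustrated with a small standalone sketch. It does not reproduce the library's code: `link_matches` is a hypothetical helper, and the matching rules (extensions written with a leading dot, case-insensitive comparison, a link passing when any keyword pattern matches its filename) are assumptions, since this page does not state them.

```python
import re
from os.path import basename, splitext
from urllib.parse import urlparse


def link_matches(link, ext=None, key_words=None):
    """Return True if the link's filename passes both filters (hypothetical rule)."""
    filename = basename(urlparse(link).path)
    suffix = splitext(filename)[1].lower()

    # Assumption: extensions carry a leading dot and are compared case-insensitively.
    if ext and suffix not in {e.lower() for e in ext}:
        return False

    # Assumption: an empty or missing key_words list disables keyword filtering,
    # and a link passes if any pattern matches somewhere in its filename.
    if key_words and not any(re.search(p, filename) for p in key_words):
        return False

    return True


links = [
    "https://example.org/data/census_2020.csv",
    "https://example.org/data/census_2020.pdf",
    "https://example.org/about.html",
]
selected = [
    link
    for link in links
    if link_matches(link, ext=[".csv", ".pdf"], key_words=[r"census_\d{4}"])
]
print(selected)
```

Only the two `census_2020` links are kept here; `about.html` fails the extension filter before the keyword patterns are tried.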
- __init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)
Initialize the spider with parameters.
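A possible way to pass these parameters is sketched below. It assumes that StandardSpider is a Scrapy spider (the `parse` and `closed` methods suggest this, but this page does not confirm it) and that it can be run through Scrapy's `CrawlerProcess`. The helper name `run_standard_spider` is hypothetical and not part of socio4health.

```python
def run_standard_spider(url, depth=0, ext=None, key_words=None):
    """Run StandardSpider once in a blocking Scrapy process (hypothetical helper)."""
    # Imported lazily so this module loads even without Scrapy or socio4health installed.
    from scrapy.crawler import CrawlerProcess
    from socio4health.utils.standard_spider import StandardSpider

    process = CrawlerProcess()
    # Assumption: keyword arguments given to crawl() reach StandardSpider.__init__,
    # as they do for ordinary Scrapy spiders.
    process.crawl(StandardSpider, url=url, depth=depth, ext=ext, key_words=key_words)
    process.start()  # blocks until the crawl finishes
```

A call such as `run_standard_spider("https://example.org/data/", depth=1, key_words=["census"])` would then start a single crawl. The URL is a placeholder, and the expected format of `ext` entries should be checked against the library source.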
Methods
__delattr__(name, /): Implement delattr(self, name).
__dir__(): Default dir() implementation.
__eq__(value, /): Return self==value.
__format__(format_spec, /): Default object formatter.
__ge__(value, /): Return self>=value.
__getattribute__(name, /): Return getattr(self, name).
__getstate__(): Helper for pickle.
__gt__(value, /): Return self>value.
__hash__(): Return hash(self).
__init__([url, depth, ext, key_words]): Initialize the spider with parameters.
__init_subclass__: This method is called when a class is subclassed.
__le__(value, /): Return self<=value.
__lt__(value, /): Return self<value.
__ne__(value, /): Return self!=value.
__new__(*args, **kwargs)
__reduce__(): Helper for pickle.
__reduce_ex__(protocol, /): Helper for pickle.
__repr__(): Return repr(self).
__setattr__(name, value, /): Implement setattr(self, name, value).
__sizeof__(): Size of object in memory, in bytes.
__str__(): Return str(self).
__subclasshook__: Abstract classes can override this to customize issubclass().
closed(reason): Handle actions to perform when the spider is closed.
parse(response[, current_depth]): Parse the response to extract links based on criteria.
parse_item(response): Extract a simple item from a response.
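The `current_depth` argument of `parse` suggests a depth-limited traversal. The sketch below shows that general idea using only the standard library and an in-memory set of pages. It is not the library's implementation: `LinkCollector`, `PAGES`, and `crawl` are hypothetical names, and treating `depth=0` as "scan only the starting page" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical in-memory site standing in for real HTTP responses.
PAGES = {
    "https://example.org/": '<a href="/reports/">Reports</a> <a href="intro.pdf">Intro</a>',
    "https://example.org/reports/": '<a href="2020.csv">2020</a> <a href="archive/">Archive</a>',
    "https://example.org/reports/archive/": '<a href="1990.csv">1990</a>',
}


def crawl(url, depth, current_depth=0, found=None):
    """Return file links reachable from url, following pages up to `depth` levels."""
    found = [] if found is None else found
    collector = LinkCollector()
    collector.feed(PAGES.get(url, ""))
    for href in collector.hrefs:
        target = urljoin(url, href)  # resolve relative links against the current page
        if target in PAGES:
            # Another page: follow it only while the depth budget allows.
            if current_depth < depth:
                crawl(target, depth, current_depth + 1, found)
        else:
            # Not a known page: treat it as a candidate file link.
            found.append(target)
    return found


print(crawl("https://example.org/", depth=0))
print(crawl("https://example.org/", depth=1))
```

Raising `depth` by one lets the traversal follow one more level of page links, so more file links are collected. A real crawler would also track visited URLs to avoid revisiting pages.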
Attributes
__annotations__
__dict__
__doc__
__module__
__weakref__: list of weak references to the object