socio4health.Extractor

class socio4health.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)

A class for extracting data from online sources (via web scraping) or from local files. It can download files, unpack compressed archives, and read fixed-width and CSV files; the reader methods listed below also cover Excel, JSON, Parquet, text, and geospatial files. The same processing logic is shared by the online and local modes.

input_path

The path to the input data source, which can be a URL or a local directory.

Type:

str
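
Because input_path accepts either a URL or a local directory, the two modes can be told apart by inspecting the value. The following is a minimal sketch, assuming the mode is chosen by URL scheme; the check actually performed by the class is not shown on this page, and detect_mode is a hypothetical helper, not part of socio4health.

```python
from urllib.parse import urlparse


def detect_mode(input_path: str) -> str:
    """Hypothetical helper: classify input_path as 'online' or 'local'.

    Assumption: only http(s) URLs trigger online scraping.
    """
    scheme = urlparse(input_path).scheme.lower()
    return "online" if scheme in ("http", "https") else "local"


online_example = detect_mode("https://example.org/data/")
local_example = detect_mode("/tmp/surveys")
```

Matching on the scheme rather than on a "://" substring keeps Windows paths such as C:\data in local mode.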

depth

The depth of web scraping to perform when input_path is a URL.

Type:

int

down_ext

A list of file extensions to look for when downloading files. Supported options include compressed formats (.zip, .7z, .tar, .gz, .tgz) and other file types such as .csv and .txt. Customize the list to match the expected file types.

Type:

list

output_path

The directory where downloaded files will be saved. Defaults to the user’s data directory.

Type:

str

key_words

A list of keywords to filter downloadable files during web scraping.

Type:

list
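
To illustrate how down_ext and key_words might narrow the set of scraped links, here is a minimal sketch. It assumes a case-insensitive suffix match on the extension and a substring match of each keyword against the file name; the library's real matching rules are not documented on this page. select_links and the example.org URLs are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def select_links(links, down_ext, key_words=None):
    """Hypothetical helper: keep links whose file name has a wanted
    extension and, if key_words is given, contains at least one keyword."""
    selected = []
    for link in links:
        # Compare against the file name only, ignoring host and query string.
        name = PurePosixPath(urlparse(link).path).name.lower()
        if not any(name.endswith(ext.lower()) for ext in down_ext):
            continue
        if key_words and not any(kw.lower() in name for kw in key_words):
            continue
        selected.append(link)
    return selected


links = [
    "https://example.org/files/census_2020.zip",
    "https://example.org/files/census_2020_notes.pdf",
    "https://example.org/files/health_2020.csv",
]
census_files = select_links(links, down_ext=[".zip", ".csv"], key_words=["census"])
all_data_files = select_links(links, down_ext=[".zip", ".csv"])
```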

encoding

The character encoding to use when reading files. Defaults to 'latin1'.

Type:

str
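
The choice of encoding matters for files containing accented characters. The short sketch below uses only the Python standard library, with an illustrative sample string, to show bytes that decode cleanly as 'latin1' but fail as UTF-8.

```python
# Illustrative sample: "ã" is stored as the single byte 0xE3 in latin1.
raw = "São Paulo".encode("latin1")

decoded = raw.decode("latin1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE3 starts a multi-byte UTF-8 sequence, but "o" is not a valid
    # continuation byte, so decoding fails.
    utf8_ok = False
```

If the source files are known to be UTF-8, passing encoding='utf-8' instead of the default avoids the reverse problem of mis-decoded characters.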

is_fwf

Whether the files to be processed are fixed-width files (FWF). Defaults to False.

Type:

bool

colnames

Column names to use when reading fixed-width files. Required if is_fwf is True.

Type:

list

colspecs

Column specifications for fixed-width files, defining the width of each column. Required if is_fwf is True.

Type:

list

sep

The separator to use when reading CSV files. Defaults to ',' when left as None.

Type:

str

ddtype

The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to 'object'.

Type:

Union[str, Dict]

dtype

Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.

Type:

Union[str, Dict]

engine

The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.

Type:

str

sheet_name

The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).

Type:

Union[str, int, list, None]

geodriver

The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile' or 'KML'). Optional.

Type:

str

Important

If is_fwf is True, both colnames and colspecs must be provided.
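
To make the pairing of colnames and colspecs concrete, here is a minimal standard-library sketch. It assumes each colspecs entry is a half-open (start, end) character range, the convention used by pandas.read_fwf; this page does not state the exact format the class expects. check_fwf_args and parse_fwf_line are hypothetical helpers, not socio4health functions.

```python
def check_fwf_args(is_fwf, colnames, colspecs):
    """Hypothetical check mirroring the rule above."""
    if is_fwf and (not colnames or not colspecs):
        raise ValueError("colnames and colspecs are both required when is_fwf is True")


def parse_fwf_line(line, colnames, colspecs):
    """Slice one fixed-width record into a dict of stripped string fields.

    Assumption: colspecs holds half-open (start, end) character offsets.
    """
    if len(colnames) != len(colspecs):
        raise ValueError("colnames and colspecs must have the same length")
    return {
        name: line[start:end].strip()
        for name, (start, end) in zip(colnames, colspecs)
    }


colnames = ["region", "age", "sex"]
colspecs = [(0, 2), (2, 5), (5, 6)]  # widths 2, 3 and 1

check_fwf_args(True, colnames, colspecs)
record = parse_fwf_line("11 342", colnames, colspecs)
```

Fields are kept as strings here, which matches the spirit of the 'object' default and preserves codes with leading zeros.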

See also

Extractor.extract

Extracts data from the specified input path, either by scraping online or processing local files.

Extractor.delete_download_folder

Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.
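
The kind of safety check described for delete_download_folder can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes the protected locations are the filesystem root, the user's home directory, and the current working directory. safe_delete is a hypothetical name.

```python
import shutil
import tempfile
from pathlib import Path


def safe_delete(folder_path) -> bool:
    """Hypothetical sketch: delete a folder unless it is a protected location.

    Returns True if the folder was removed, False if the request was refused.
    """
    target = Path(folder_path).resolve()
    protected = {
        Path(target.anchor),      # filesystem root (or drive root)
        Path.home().resolve(),    # user's home directory
        Path.cwd().resolve(),     # current working directory
    }
    if target in protected or not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demonstrate on a throwaway directory so nothing real is touched.
scratch = Path(tempfile.mkdtemp(prefix="s4h_demo_"))
(scratch / "file.csv").write_text("a,b\n1,2\n")
deleted = safe_delete(scratch)
refused_home = safe_delete(Path.home())
```

Resolving the path before comparing it guards against relative paths and symlinks that point at a protected directory.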

__init__(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)

Methods

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__getstate__()

Helper for pickle.

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__([input_path, depth, down_ext, ...])

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(*args, **kwargs)

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

_extract_local_mode()

Run local-mode extraction using the shared processing logic.

_extract_online_mode()

Run online-mode extraction with error handling and progress tracking.

_process_downloaded_files(downloaded_files)

Process downloaded files using the local-mode logic.

_process_files_locally(files)

Shared local processing logic used by both modes.

_read_csv(filepath)

_read_excel(filepath)

_read_file(filepath)

_read_geospatial(filepath)

_read_json(filepath)

_read_parquet(filepath)

_read_txt(filepath)

delete_download_folder([folder_path])

Safely delete the download folder and all its contents.

extract()

Extracts data from the specified input path, either by scraping online sources or processing local files.

Attributes

__annotations__

__dict__

__doc__

__module__

__weakref__

list of weak references to the object