Modules#

Extractor#

Extractor class for downloading and processing data files from various sources. This class supports both online scraping and local file processing, handling compressed files, fixed-width files, and CSV formats. It includes methods for downloading files, extracting data, and cleaning up after processing.

class socio4health.extractor.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)[source]#

Bases: object

A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.

input_path#

The path to the input data source, which can be a URL or a local directory.

Type:

str

depth#

The depth of web scraping to perform when input_path is a URL.

Type:

int

down_ext#

A list of file extensions to look for when downloading files. Supported options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as plain file types such as .csv and .txt. The list can be customized to match the expected file types.

Type:

list

output_path#

The directory where downloaded files will be saved. Defaults to the user’s data directory.

Type:

str

key_words#

A list of keywords to filter downloadable files during web scraping.

Type:

list

encoding#

The character encoding to use when reading files. Defaults to 'latin1'.

Type:

str

is_fwf#

Whether the files to be processed are fixed-width files (FWF). Defaults to False.

Type:

bool

colnames#

Column names to use when reading fixed-width files. Required if is_fwf is True.

Type:

list

colspecs#

Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.

Type:

list

sep#

The separator to use when reading CSV files. Defaults to None, in which case ',' is used.

Type:

str

ddtype#

The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.

Type:

Union[str, Dict]

dtype#

Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.

Type:

Union[str, Dict]

engine#

The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.

Type:

str

sheet_name#

The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).

Type:

Union[str, int, list, None]

geodriver#

The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile', 'KML', etc.). Optional.

Type:

str

Important

If is_fwf is True, both colnames and colspecs must be provided.
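
Example

A minimal construction sketch for the fixed-width case. The directory, column names, and offsets below are hypothetical, and the colspecs are written as pandas-style (start, end) offsets on the assumption that they are forwarded to pandas.read_fwf.

from socio4health.extractor import Extractor

# Hypothetical fixed-width layout: colnames and colspecs are both required
# whenever is_fwf is True.
fwf_extractor = Extractor(
    input_path="./data/raw",               # hypothetical local directory
    down_ext=[".txt"],
    is_fwf=True,
    colnames=["id", "age", "region"],      # hypothetical column names
    colspecs=[(0, 8), (8, 11), (11, 15)],  # assumed (start, end) offsets
)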

See also

Extractor.extract

Extracts data from the specified input path, either by scraping online or processing local files.

Extractor.delete_download_folder

Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.

delete_download_folder(folder_path: str | None = None) → bool[source]#

Safely delete the download folder and all its contents.

Parameters:

folder_path (str or None) – Optional path to delete; defaults to the download_dir used in extraction.

Returns:

True if deletion was successful, False otherwise

Return type:

bool

Raises:
  • ValueError – If no folder path is provided and no download_dir exists

  • OSError – If folder deletion fails
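
Example

A cleanup sketch, assuming an Extractor whose extract() call created a download folder (the URL is hypothetical and network access is assumed):

from socio4health.extractor import Extractor

ext = Extractor(input_path="https://example.org/data",  # hypothetical URL
                depth=0, down_ext=[".csv"])
ddfs = ext.extract()

try:
    ext.delete_download_folder()  # defaults to the download_dir from extract()
except ValueError:
    pass  # raised when no path is given and no download_dir exists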

extract()[source]#

Extracts data from the specified input path, either by scraping online sources or processing local files.

This method determines the operation mode based on the input path:

  • If the input path is a URL, it performs online scraping to find downloadable files.

  • If the input path is a local directory, it processes files directly from that directory.

Returns:

List of Dask DataFrames containing the extracted data.

Return type:

list of dask.dataframe.DataFrame

Raises:

ValueError – If extraction fails due to an invalid input path, missing column specifications for fixed-width files, or if no valid data files are found after processing.
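
Example

A hedged sketch of the two operation modes; the URL, keywords, and local directory are hypothetical:

from socio4health.extractor import Extractor

# Online mode: a URL triggers scraping for files matching down_ext/key_words.
online = Extractor(
    input_path="https://example.org/census",  # hypothetical URL
    depth=1,
    down_ext=[".csv", ".zip"],
    key_words=["2020"],
)
online_ddfs = online.extract()  # list of dask.dataframe.DataFrame

# Local mode: a directory path processes matching files in place.
local = Extractor(input_path="./downloads", down_ext=[".csv"])
local_ddfs = local.extract()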

socio4health.extractor.get_default_data_dir()[source]#

Returns the default data directory for storing downloaded files.

Returns:

pathlib.Path object representing the default data directory.

Return type:

Path

Note

This function ensures that the directory exists by creating it if necessary.
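
Example

A minimal example:

from socio4health.extractor import get_default_data_dir

data_dir = get_default_data_dir()  # pathlib.Path; created if missing
print(data_dir)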

Harmonizer#

Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.

class socio4health.harmonizer.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)[source]#

Bases: object

A class for harmonizing and processing Dask DataFrames in health data integration workflows.

min_common_columns#

Minimum number of common columns required for vertical merge (default is 1).

Type:

int

similarity_threshold#

Column-name similarity threshold used for vertical merging (default is 1).

Type:

float

nan_threshold#

Proportion of NaN values above which a column is dropped (default is 1.0).

Type:

float

sample_frac#

Fraction of rows to sample for NaN detection (default is None).

Type:

float or None

column_mapping#

Column mapping configuration (default is None).

Type:

Enum, dict, str or Path

value_mappings#

Categorical value mapping configuration (default is None).

Type:

Enum, dict, str or Path

theme_info#

Theme/category information (default is None).

Type:

dict, str or Path

default_country#

Default country for mapping (default is None).

Type:

str

strict_mapping#

Whether to enforce strict mapping of columns and values (default is False).

Type:

bool

dict_df#

DataFrame with variable dictionary (default is None).

Type:

pandas.DataFrame

categories#

Categories for data selection (default is None).

Type:

list of str

key_col#

Key column for data selection (default is None).

Type:

str

key_val#

Key values for data selection (default is None).

Type:

list of str, int or float

extra_cols#

Extra columns for data selection (default is None).

Type:

list of str
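
Example

A construction sketch using only the parameters documented above; the country code and mapping file paths are hypothetical:

from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(
    min_common_columns=3,
    similarity_threshold=0.8,
    nan_threshold=0.9,
    default_country="COL",                   # hypothetical country code
    column_mapping="mappings/columns.json",  # hypothetical mapping file
    value_mappings="mappings/values.json",   # hypothetical mapping file
    strict_mapping=False,
)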

compare_with_dict(ddfs: List[DataFrame]) → DataFrame[source]#

Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame listing the columns that do not match in either direction (columns present in the data but missing from the dictionary, and vice versa).

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of DataFrames to evaluate.

Returns:

DataFrame with mismatched columns.

Return type:

pd.DataFrame
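
Example

A hedged sketch; dictionary.csv is a hypothetical variable-dictionary file, and the sample DataFrame is illustrative only:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddfs = [dd.from_pandas(pd.DataFrame({"age": [30], "sexo": ["F"]}),
                       npartitions=1)]
h = Harmonizer(dict_df=pd.read_csv("dictionary.csv"))  # hypothetical file
mismatches = h.compare_with_dict(ddfs)  # columns missing on either side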

data_selector(ddfs: List[DataFrame]) → List[DataFrame][source]#

Select rows from Dask DataFrames based on the instance parameters.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to filter.

Returns:

List of filtered Dask DataFrames according to the key column, key values, categories, and extra columns.

Return type:

list of dask.dataframe.DataFrame

Raises:

KeyError – If the key column is not found in a DataFrame.
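
Example

A filtering sketch; the key column, key values, and extra column are hypothetical:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"household_id": [101, 102, 202], "region": ["a", "b", "c"]}),
    npartitions=1,
)
selector = Harmonizer(key_col="household_id",  # hypothetical key column
                      key_val=[101, 202],
                      extra_cols=["region"])
filtered = selector.data_selector([ddf])  # keeps rows with matching key values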

drop_nan_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → DataFrame | List[DataFrame][source]#

Drop columns whose proportion of NaN values exceeds nan_threshold, using the instance parameters.

Parameters:

ddf_or_ddfs (dask.dataframe.DataFrame or list of dask.dataframe.DataFrame) – The Dask DataFrame or list of Dask DataFrames to process.

Returns:

The DataFrame(s) with columns dropped where the proportion of NaN values is greater than nan_threshold.

Return type:

dask.dataframe.DataFrame or list of dask.dataframe.DataFrame

Raises:

ValueError – If nan_threshold is not between 0 and 1, or if sample_frac is not None or a float between 0 and 1.
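
Example

A sketch with an illustrative threshold (no sampling, so sample_frac stays None):

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, 4]}),
    npartitions=1,
)
cleaned = Harmonizer(nan_threshold=0.5).drop_nan_columns(ddf)
# column "a" (75% NaN) exceeds the 0.5 threshold and should be dropped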

static get_available_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → List[str][source]#

Get a list of unique column names from a single Dask DataFrame or a list of Dask DataFrames.

Parameters:

ddf_or_ddfs (dask.dataframe.DataFrame or list of dask.dataframe.DataFrame) – A single Dask DataFrame or a list of Dask DataFrames to extract column names from.

Returns:

Sorted list of unique column names across all provided Dask DataFrames.

Return type:

list of str

Raises:

TypeError – If the input is not a Dask DataFrame or a list of Dask DataFrames.
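
Example

A small sketch with illustrative frames:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf1 = dd.from_pandas(pd.DataFrame({"age": [1], "sex": ["f"]}), npartitions=1)
ddf2 = dd.from_pandas(pd.DataFrame({"age": [2], "region": ["x"]}), npartitions=1)
print(Harmonizer.get_available_columns([ddf1, ddf2]))
# expected: ['age', 'region', 'sex'] (sorted unique names)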

harmonize_dataframes(country_dfs: Dict[str, List[DataFrame]]) → Dict[str, List[DataFrame]][source]#

Harmonize Dask DataFrames using the instance parameters.

Parameters:

country_dfs (dict of str to list of dask.dataframe.DataFrame) – Dictionary mapping country names to lists of Dask DataFrames to be harmonized.

Returns:

Dictionary mapping country names to lists of harmonized Dask DataFrames.

Return type:

dict of str to list of dask.dataframe.DataFrame

Note

  • Column and value mappings are applied per country using the provided configuration.

  • If strict_mapping is enabled, unmapped columns or values will raise a ValueError.

  • Column renaming and categorical value harmonization are performed in-place.
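
Example

A hedged sketch; the nested-dict shape for column_mapping (country → {source column → harmonized name}) is an assumption, as are the country key and column names:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

# Assumed mapping shape: {country: {source column: harmonized name}}.
h = Harmonizer(column_mapping={"COL": {"edad": "age"}})
country_dfs = {
    "COL": [dd.from_pandas(pd.DataFrame({"edad": [30, 41]}), npartitions=1)],
}
harmonized = h.harmonize_dataframes(country_dfs)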

join_data(ddfs: List[DataFrame]) → DataFrame[source]#

Join multiple Dask DataFrames on a specified key column, removing duplicate columns.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to join.

Returns:

Merged DataFrame with duplicate columns removed.

Return type:

pandas.DataFrame
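
Example

A join sketch; the key column name is hypothetical, and it is assumed that join_data uses the join_key constructor parameter:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

left = dd.from_pandas(pd.DataFrame({"person_id": [1, 2], "age": [30, 41]}),
                      npartitions=1)
right = dd.from_pandas(pd.DataFrame({"person_id": [1, 2],
                                     "region": ["a", "b"]}), npartitions=1)
joined = Harmonizer(join_key="person_id").join_data([left, right])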

vertical_merge(ddfs: List[DataFrame]) → List[DataFrame][source]#

Merge a list of Dask DataFrames vertically using instance parameters.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to be merged.

Returns:

List of merged Dask DataFrames, where each group contains DataFrames with sufficient column overlap and compatible data types.

Return type:

list of dask.dataframe.DataFrame

Important

  • DataFrames are grouped and merged if they share at least min_common_columns columns and their column similarity is above similarity_threshold.

  • Only columns with matching data types are considered compatible for merging.
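
Example

A merging sketch with two fully overlapping, dtype-compatible frames, which should therefore fall into a single merged group:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf1 = dd.from_pandas(pd.DataFrame({"age": [30], "region": ["a"]}), npartitions=1)
ddf2 = dd.from_pandas(pd.DataFrame({"age": [41], "region": ["b"]}), npartitions=1)
groups = Harmonizer(min_common_columns=2,
                    similarity_threshold=1.0).vertical_merge([ddf1, ddf2])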