Modules#

Extractor#

Extractor class for downloading and processing data files from various sources. This class supports both online scraping and local file processing, handling compressed files, fixed-width files, and CSV formats. It includes methods for downloading files, extracting data, and cleaning up after processing.

class socio4health.extractor.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)[source]#

Bases: object

A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.

input_path#

The path to the input data source, which can be a URL or a local directory.

Type:

str

depth#

The depth of web scraping to perform when input_path is a URL.

Type:

int

down_ext#

A list of file extensions to look for when downloading files. Supported options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as plain file types such as .csv and .txt. The list can be customized to match the expected file types.

Type:

list

output_path#

The directory where downloaded files will be saved. Defaults to the user’s data directory.

Type:

str

key_words#

A list of keywords to filter downloadable files during web scraping.

Type:

list

encoding#

The character encoding to use when reading files. Defaults to 'latin1'.

Type:

str

is_fwf#

Whether the files to be processed are fixed-width files (FWF). Defaults to False.

Type:

bool

colnames#

Column names to use when reading fixed-width files. Required if is_fwf is True.

Type:

list

colspecs#

Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.

Type:

list

sep#

The separator to use when reading CSV files. Defaults to None, in which case ',' is used.

Type:

str

ddtype#

The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.

Type:

Union[str, Dict]

dtype#

Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.

Type:

Union[str, Dict]

engine#

The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.

Type:

str

sheet_name#

The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).

Type:

Union[str, int, list, None]

geodriver#

The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile', 'KML', etc.). Optional.

Type:

str

Important

If is_fwf is True, both colnames and colspecs must be provided.
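
Example

A minimal construction sketch for the fixed-width case. The directory, column names, and offsets below are hypothetical, and the colspecs are written as pandas-style (start, end) offsets on the assumption that they are forwarded to pandas.read_fwf.

from socio4health.extractor import Extractor

# Hypothetical fixed-width layout: colnames and colspecs are both required
# whenever is_fwf is True.
fwf_extractor = Extractor(
    input_path="./data/raw",               # hypothetical local directory
    down_ext=[".txt"],
    is_fwf=True,
    colnames=["id", "age", "region"],      # hypothetical column names
    colspecs=[(0, 8), (8, 11), (11, 15)],  # assumed (start, end) offsets
)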

See also

Extractor.extract

Extracts data from the specified input path, either by scraping online or processing local files.

Extractor.delete_download_folder

Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.

delete_download_folder(folder_path: str | None = None) → bool[source]#

Safely delete the download folder and all its contents.

Parameters:

folder_path (str or None) – Optional path to delete; defaults to the download_dir used in extraction.

Returns:

True if deletion was successful, False otherwise

Return type:

bool

Raises:
  • ValueError – If no folder path is provided and no download_dir exists

  • OSError – If folder deletion fails
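
Example

A cleanup sketch, assuming an Extractor whose extract() call created a download folder (the URL is hypothetical and network access is assumed):

from socio4health.extractor import Extractor

ext = Extractor(input_path="https://example.org/data",  # hypothetical URL
                depth=0, down_ext=[".csv"])
ddfs = ext.extract()

try:
    ext.delete_download_folder()  # defaults to the download_dir from extract()
except ValueError:
    pass  # raised when no path is given and no download_dir exists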

extract()[source]#

Extracts data from the specified input path, either by scraping online sources or processing local files.

This method determines the operation mode based on the input path:

  • If the input path is a URL, it performs online scraping to find downloadable files.

  • If the input path is a local directory, it processes files directly from that directory.

Returns:

List of Dask DataFrames containing the extracted data.

Return type:

list of dask.dataframe.DataFrame

Raises:

ValueError – If extraction fails due to an invalid input path, missing column specifications for fixed-width files, or if no valid data files are found after processing.
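
Example

A hedged sketch of the two operation modes; the URL, keywords, and local directory are hypothetical:

from socio4health.extractor import Extractor

# Online mode: a URL triggers scraping for files matching down_ext/key_words.
online = Extractor(
    input_path="https://example.org/census",  # hypothetical URL
    depth=1,
    down_ext=[".csv", ".zip"],
    key_words=["2020"],
)
online_ddfs = online.extract()  # list of dask.dataframe.DataFrame

# Local mode: a directory path processes matching files in place.
local = Extractor(input_path="./downloads", down_ext=[".csv"])
local_ddfs = local.extract()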

socio4health.extractor.get_default_data_dir()[source]#

Returns the default data directory for storing downloaded files.

Returns:

pathlib.Path object representing the default data directory.

Return type:

Path

Note

This function ensures that the directory exists by creating it if necessary.
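
Example

A minimal example:

from socio4health.extractor import get_default_data_dir

data_dir = get_default_data_dir()  # pathlib.Path; created if missing
print(data_dir)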

Harmonizer#

Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.

class socio4health.harmonizer.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)[source]#

Bases: object

A class for harmonizing and processing Dask DataFrames in health data integration workflows.

min_common_columns#

Minimum number of common columns required for vertical merge (default is 1).

Type:

int

similarity_threshold#

Column-name similarity threshold used for vertical merging (default is 1).

Type:

float

nan_threshold#

Proportion of NaN values above which a column is dropped (default is 1.0).

Type:

float

sample_frac#

Fraction of rows to sample for NaN detection (default is None).

Type:

float or None

column_mapping#

Column mapping configuration (default is None).

Type:

Enum, dict, str or Path

value_mappings#

Categorical value mapping configuration (default is None).

Type:

Enum, dict, str or Path

theme_info#

Theme/category information (default is None).

Type:

dict, str or Path

default_country#

Default country for mapping (default is None).

Type:

str

strict_mapping#

Whether to enforce strict mapping of columns and values (default is False).

Type:

bool

dict_df#

DataFrame with variable dictionary (default is None).

Type:

pandas.DataFrame

categories#

Categories for data selection (default is None).

Type:

list of str

key_col#

Key column for data selection (default is None).

Type:

str

key_val#

Key values for data selection (default is None).

Type:

list of str, int or float

extra_cols#

Extra columns for data selection (default is None).

Type:

list of str
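
Example

A construction sketch using only the parameters documented above; the country code and mapping file paths are hypothetical:

from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(
    min_common_columns=3,
    similarity_threshold=0.8,
    nan_threshold=0.9,
    default_country="COL",                   # hypothetical country code
    column_mapping="mappings/columns.json",  # hypothetical mapping file
    value_mappings="mappings/values.json",   # hypothetical mapping file
    strict_mapping=False,
)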

compare_with_dict(ddfs: List[DataFrame]) → DataFrame[source]#

Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame listing the columns that do not match in either direction (columns present in the data but missing from the dictionary, and vice versa).

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of DataFrames to evaluate.

Returns:

DataFrame with mismatched columns.

Return type:

pd.DataFrame
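
Example

A hedged sketch; dictionary.csv is a hypothetical variable-dictionary file, and the sample DataFrame is illustrative only:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddfs = [dd.from_pandas(pd.DataFrame({"age": [30], "sexo": ["F"]}),
                       npartitions=1)]
h = Harmonizer(dict_df=pd.read_csv("dictionary.csv"))  # hypothetical file
mismatches = h.compare_with_dict(ddfs)  # columns missing on either side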

data_selector(ddfs: List[DataFrame]) → List[DataFrame][source]#

Select rows from Dask DataFrames based on the instance parameters.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to filter.

Returns:

List of filtered Dask DataFrames according to the key column, key values, categories, and extra columns.

Return type:

list of dask.dataframe.DataFrame

Raises:

KeyError – If the key column is not found in a DataFrame.
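
Example

A filtering sketch; the key column, key values, and extra column are hypothetical:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"household_id": [101, 102, 202], "region": ["a", "b", "c"]}),
    npartitions=1,
)
selector = Harmonizer(key_col="household_id",  # hypothetical key column
                      key_val=[101, 202],
                      extra_cols=["region"])
filtered = selector.data_selector([ddf])  # keeps rows with matching key values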

drop_nan_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → DataFrame | List[DataFrame][source]#

Drop columns whose proportion of NaN values exceeds nan_threshold, using the instance parameters.

Parameters:

ddf_or_ddfs (dask.dataframe.DataFrame or list of dask.dataframe.DataFrame) – The Dask DataFrame or list of Dask DataFrames to process.

Returns:

The DataFrame(s) with columns dropped where the proportion of NaN values is greater than nan_threshold.

Return type:

dask.dataframe.DataFrame or list of dask.dataframe.DataFrame

Raises:

ValueError – If nan_threshold is not between 0 and 1, or if sample_frac is not None or a float between 0 and 1.
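
Example

A sketch with an illustrative threshold (no sampling, so sample_frac stays None):

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, 4]}),
    npartitions=1,
)
cleaned = Harmonizer(nan_threshold=0.5).drop_nan_columns(ddf)
# column "a" (75% NaN) exceeds the 0.5 threshold and should be dropped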

static get_available_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → List[str][source]#

Get a list of unique column names from a single Dask DataFrame or a list of Dask DataFrames.

Parameters:

ddf_or_ddfs (dask.dataframe.DataFrame or list of dask.dataframe.DataFrame) – A single Dask DataFrame or a list of Dask DataFrames to extract column names from.

Returns:

Sorted list of unique column names across all provided Dask DataFrames.

Return type:

list of str

Raises:

TypeError – If the input is not a Dask DataFrame or a list of Dask DataFrames.
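
Example

A small sketch with illustrative frames:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf1 = dd.from_pandas(pd.DataFrame({"age": [1], "sex": ["f"]}), npartitions=1)
ddf2 = dd.from_pandas(pd.DataFrame({"age": [2], "region": ["x"]}), npartitions=1)
print(Harmonizer.get_available_columns([ddf1, ddf2]))
# expected: ['age', 'region', 'sex'] (sorted unique names)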

harmonize_dataframes(country_dfs: Dict[str, List[DataFrame]]) → Dict[str, List[DataFrame]][source]#

Harmonize Dask DataFrames using the instance parameters.

Parameters:

country_dfs (dict of str to list of dask.dataframe.DataFrame) – Dictionary mapping country names to lists of Dask DataFrames to be harmonized.

Returns:

Dictionary mapping country names to lists of harmonized Dask DataFrames.

Return type:

dict of str to list of dask.dataframe.DataFrame

Note

  • Column and value mappings are applied per country using the provided configuration.

  • If strict_mapping is enabled, unmapped columns or values will raise a ValueError.

  • Column renaming and categorical value harmonization are performed in-place.
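
Example

A hedged sketch; the nested-dict shape for column_mapping (country → {source column → harmonized name}) is an assumption, as are the country key and column names:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

# Assumed mapping shape: {country: {source column: harmonized name}}.
h = Harmonizer(column_mapping={"COL": {"edad": "age"}})
country_dfs = {
    "COL": [dd.from_pandas(pd.DataFrame({"edad": [30, 41]}), npartitions=1)],
}
harmonized = h.harmonize_dataframes(country_dfs)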

join_data(ddfs: List[DataFrame]) → DataFrame[source]#

Join multiple Dask DataFrames on a specified key column, removing duplicate columns.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to join.

Returns:

Merged DataFrame with duplicate columns removed.

Return type:

pandas.DataFrame
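
Example

A join sketch; the key column name is hypothetical, and it is assumed that join_data uses the join_key constructor parameter:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

left = dd.from_pandas(pd.DataFrame({"person_id": [1, 2], "age": [30, 41]}),
                      npartitions=1)
right = dd.from_pandas(pd.DataFrame({"person_id": [1, 2],
                                     "region": ["a", "b"]}), npartitions=1)
joined = Harmonizer(join_key="person_id").join_data([left, right])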

vertical_merge(ddfs: List[DataFrame]) → List[DataFrame][source]#

Merge a list of Dask DataFrames vertically using instance parameters.

Parameters:

ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to be merged.

Returns:

List of merged Dask DataFrames, where each group contains DataFrames with sufficient column overlap and compatible data types.

Return type:

list of dask.dataframe.DataFrame

Important

  • DataFrames are grouped and merged if they share at least min_common_columns columns and their column similarity is above similarity_threshold.

  • Only columns with matching data types are considered compatible for merging.
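
Example

A merging sketch with two fully overlapping, dtype-compatible frames, which should therefore fall into a single merged group:

import dask.dataframe as dd
import pandas as pd
from socio4health.harmonizer import Harmonizer

ddf1 = dd.from_pandas(pd.DataFrame({"age": [30], "region": ["a"]}), npartitions=1)
ddf2 = dd.from_pandas(pd.DataFrame({"age": [41], "region": ["b"]}), npartitions=1)
groups = Harmonizer(min_common_columns=2,
                    similarity_threshold=1.0).vertical_merge([ddf1, ddf2])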