Modules#
Extractor#
Extractor class for downloading and processing data files from various sources.
This class supports both online scraping and local file processing, handling compressed files, fixed-width files, and CSV formats.
It includes methods for downloading files, extracting data, and cleaning up after processing.
- class socio4health.extractor.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)[source]#
Bases: object
A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.
- down_ext#
A list of file extensions to look for when downloading files. Available options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as other file types like .csv, .txt, etc. This list can be customized based on the expected file types.
- Type:
list
- output_path#
The directory where downloaded files will be saved. Defaults to the user’s data directory.
- Type:
str
- is_fwf#
Whether the files to be processed are fixed-width files (FWF). Defaults to False.
- Type:
bool
- colnames#
Column names to use when reading fixed-width files. Required if is_fwf is True.
- Type:
list
- colspecs#
Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.
- Type:
list
- ddtype#
The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.
- Type:
Union[str, Dict]
- dtype#
Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.
- Type:
Union[str, Dict]
- engine#
The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.
- Type:
str
- sheet_name#
The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).
- geodriver#
The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile', 'KML', etc.). Optional.
- Type:
str
Important
If is_fwf is True and fixed-width files are given, both colnames and colspecs must be provided.
See also
Extractor.extract
Extracts data from the specified input path, either by scraping online or processing local files.
Extractor.delete_download_folder
Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.
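A minimal usage sketch for processing local fixed-width files; the directory, column names, and column specifications are placeholders, and the (start, end) pairs in colspecs are assumed to follow pandas.read_fwf conventions.

```python
from socio4health.extractor import Extractor

# Placeholder directory and column layout; adjust to the actual files.
extractor = Extractor(
    input_path="./raw_data",               # local directory with fixed-width .txt files
    down_ext=[".txt"],
    is_fwf=True,
    colnames=["id", "age", "region"],      # placeholder column names
    colspecs=[(0, 8), (8, 11), (11, 20)],  # placeholder specs, assumed (start, end) pairs
    output_path="./downloads",
)
ddfs = extractor.extract()                 # list of dask.dataframe.DataFrame
```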
- delete_download_folder(folder_path: str | None = None) → bool [source]#
Safely delete the download folder and all its contents.
- Parameters:
folder_path – Optional path to delete (defaults to the download_dir used in extraction)
- Returns:
True if deletion was successful, False otherwise
- Return type:
bool
- Raises:
ValueError – If no folder path is provided and no download_dir exists
OSError – If folder deletion fails
- extract()[source]#
Extracts data from the specified input path, either by scraping online sources or processing local files.
This method determines the operation mode based on the input path:
If the input path is a URL, it performs online scraping to find downloadable files.
If the input path is a local directory, it processes files directly from that directory.
- Returns:
List of Dask DataFrames containing the extracted data.
- Return type:
list of dask.dataframe.DataFrame
- Raises:
ValueError – If extraction fails due to an invalid input path, missing column specifications for fixed-width files, or if no valid data files are found after processing.
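A sketch of the online-scraping mode; the URL is hypothetical, and depth and key_words are assumed to control crawl depth and filename filtering.

```python
from socio4health.extractor import Extractor

extractor = Extractor(
    input_path="https://example.org/open-data",  # hypothetical page linking to data files
    depth=1,                                     # assumed crawl depth
    down_ext=[".csv", ".zip"],
    key_words=["census"],                        # assumed keyword filter for file names
    output_path="./downloads",
)
ddfs = extractor.extract()              # scrape, download, and read into Dask DataFrames
extractor.delete_download_folder()      # clean up the downloaded files afterwards
```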
- socio4health.extractor.get_default_data_dir()[source]#
Returns the default data directory for storing downloaded files.
- Returns:
pathlib.Path object representing the default data directory.
- Return type:
Path
Note
This function ensures that the directory exists by creating it if necessary.
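For example:

```python
from socio4health.extractor import get_default_data_dir

data_dir = get_default_data_dir()   # pathlib.Path; the directory is created if missing
print(data_dir)
```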
Harmonizer#
Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.
- class socio4health.harmonizer.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)[source]#
Bases: object
Initialize the Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.
- min_common_columns#
Minimum number of common columns required for vertical merge (default is 1).
- Type:
int
- similarity_threshold#
Similarity threshold to consider for vertical merge (default is 0.8).
- Type:
float
- sample_frac#
Fraction of rows to sample for NaN detection (default is None).
- Type:
float or None
- column_mapping#
Column mapping configuration (default is None).
- Type:
Enum, dict, str or Path
- value_mappings#
Categorical value mapping configuration (default is None).
- Type:
Enum, dict, str or Path
- theme_info#
Theme/category information (default is None).
- Type:
dict, str or Path
- strict_mapping#
Whether to enforce strict mapping of columns and values (default is False).
- Type:
bool
- dict_df#
DataFrame with variable dictionary (default is None).
- Type:
DataFrame or None
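A construction sketch using only parameters from the signature above; the mapping dictionaries are placeholders, and their outer keys are assumed to be country identifiers since mappings are documented as being applied per country.

```python
from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(
    min_common_columns=2,
    similarity_threshold=0.8,
    nan_threshold=0.9,
    column_mapping={"COL": {"EDAD": "age"}},                        # placeholder rename map
    value_mappings={"COL": {"sex": {"1": "male", "2": "female"}}},  # placeholder recoding map
    strict_mapping=False,
)
```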
- compare_with_dict(ddfs: List[DataFrame]) → DataFrame [source]#
Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame with the columns that do not match in both directions.
- Parameters:
ddfs (list of dask.dataframe.DataFrame) – List of DataFrames to evaluate.
- Returns:
DataFrame with mismatched columns.
- Return type:
pd.DataFrame
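A usage sketch, assuming ddfs is a list of Dask DataFrames from a previous extraction and dictionary.csv is a placeholder variable dictionary.

```python
import pandas as pd
from socio4health.harmonizer import Harmonizer

dict_df = pd.read_csv("dictionary.csv")          # placeholder variable dictionary
harmonizer = Harmonizer(dict_df=dict_df)
mismatches = harmonizer.compare_with_dict(ddfs)  # ddfs: list of Dask DataFrames
print(mismatches)
```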
- data_selector(ddfs: List[DataFrame]) → List[DataFrame] [source]#
Select rows from Dask DataFrames based on the instance parameters.
- Parameters:
ddfs – List of Dask DataFrames to filter.
- Returns:
List of filtered Dask DataFrames according to the key column, key values, categories, and extra columns.
- Return type:
list of dask.dataframe.DataFrame
- Raises:
KeyError – If the key column is not found in a DataFrame.
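A filtering sketch; the key column, key values, and extra columns are hypothetical, and ddfs is assumed to come from an earlier extraction.

```python
from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(
    key_col="region",                # hypothetical key column
    key_val=["north", "south"],      # hypothetical key values to keep
    extra_cols=["survey_year"],      # hypothetical extra columns to retain
)
selected = harmonizer.data_selector(ddfs)
```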
- drop_nan_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → DataFrame | List[DataFrame] [source]#
Drop columns where the majority of values are NaN using instance parameters.
- Parameters:
ddf_or_ddfs – A single Dask DataFrame or a list of Dask DataFrames.
- Returns:
The DataFrame(s) with columns dropped where the proportion of NaN values is greater than nan_threshold.
- Return type:
dask.dataframe.DataFrame or list of dask.dataframe.DataFrame
- Raises:
ValueError – If nan_threshold is not between 0 and 1, or if sample_frac is not None or a float between 0 and 1.
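A small self-contained sketch; the toy DataFrame and threshold are illustrative only.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"keep": [1, 2, 3], "sparse": [None, None, 3.0]}),
    npartitions=1,
)
harmonizer = Harmonizer(nan_threshold=0.5)   # drop columns with more than 50% NaN
cleaned = harmonizer.drop_nan_columns(ddf)
print(list(cleaned.columns))                 # 'sparse' is expected to be dropped
```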
- static get_available_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) → List[str] [source]#
Get a list of unique column names from a single Dask DataFrame or a list of Dask DataFrames.
- Parameters:
ddf_or_ddfs (dask.dataframe.DataFrame or list of dask.dataframe.DataFrame) – A single Dask DataFrame or a list of Dask DataFrames to extract column names from.
- Returns:
Sorted list of unique column names across all provided Dask DataFrames.
- Return type:
list of str
- Raises:
TypeError – If the input is not a Dask DataFrame or a list of Dask DataFrames.
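A small self-contained sketch with toy DataFrames:

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

ddf1 = dd.from_pandas(pd.DataFrame({"age": [30, 41], "sex": ["f", "m"]}), npartitions=1)
ddf2 = dd.from_pandas(pd.DataFrame({"age": [25], "region": ["north"]}), npartitions=1)
cols = Harmonizer.get_available_columns([ddf1, ddf2])
# Expected: ['age', 'region', 'sex']
```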
- harmonize_dataframes(country_dfs: Dict[str, List[DataFrame]]) → Dict[str, List[DataFrame]] [source]#
Harmonize Dask DataFrames using the instance parameters.
- Parameters:
country_dfs – Dictionary mapping country names to lists of Dask DataFrames to be harmonized.
- Returns:
Dictionary mapping country names to lists of harmonized Dask DataFrames.
- Return type:
dict of str to list of dask.dataframe.DataFrame
Note
Column and value mappings are applied per country using the provided configuration.
If strict_mapping is enabled, unmapped columns or values will raise a ValueError.
Column renaming and categorical value harmonization are performed in-place.
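A sketch of the expected call shape; col_ddfs and per_ddfs are placeholder lists of Dask DataFrames, and harmonizer is assumed to be configured with column and value mappings as in the construction sketch above.

```python
country_dfs = {"COL": col_ddfs, "PER": per_ddfs}   # placeholder per-country DataFrame lists
harmonized = harmonizer.harmonize_dataframes(country_dfs)
```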
- join_data(ddfs: List[DataFrame]) → DataFrame [source]#
Join multiple Dask DataFrames on a specified key column, removing duplicate columns.
- Parameters:
ddfs (list of dask.dataframe.DataFrame) – List of Dask DataFrames to join.
- Returns:
Merged DataFrame with duplicate columns removed.
- Return type:
dask.dataframe.DataFrame
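A sketch assuming a shared key column; household_id and the two input DataFrames are placeholders.

```python
from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(join_key="household_id")               # hypothetical join key
merged = harmonizer.join_data([ddf_households, ddf_persons])   # placeholder Dask DataFrames
```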
- vertical_merge(ddfs: List[DataFrame]) → List[DataFrame] [source]#
Merge a list of Dask DataFrames vertically using instance parameters.
- Parameters:
ddfs – List of Dask DataFrames to be merged.
- Returns:
List of merged Dask DataFrames, where each group contains DataFrames with sufficient column overlap and compatible data types.
- Return type:
list of dask.dataframe.DataFrame
Important
DataFrames are grouped and merged if they share at least min_common_columns columns and their column similarity is above similarity_threshold.
Only columns with matching data types are considered compatible for merging.
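A sketch assuming ddfs is a list of Dask DataFrames with overlapping columns:

```python
from socio4health.harmonizer import Harmonizer

harmonizer = Harmonizer(min_common_columns=2, similarity_threshold=0.8)
groups = harmonizer.vertical_merge(ddfs)
# Each element of `groups` concatenates DataFrames that share enough
# columns of matching data types.
```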