socio4health.Harmonizer

class socio4health.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)

Harmonizes and processes Dask DataFrames for health data integration.

min_common_columns

Minimum number of common columns required for vertical merge (default is 1).

Type:

int

similarity_threshold

Similarity threshold applied when deciding on a vertical merge (default is 1, as given in the class signature).

Type:

float

nan_threshold

Fraction of NaN values used as the threshold for dropping columns (default is 1.0).

Type:

float

sample_frac

Fraction of rows to sample for NaN detection (default is None).

Type:

float or None

column_mapping

Column mapping configuration (default is None).

Type:

Enum, dict, str or Path

value_mappings

Categorical value mapping configuration (default is None).

Type:

Enum, dict, str or Path

theme_info

Theme/category information (default is None).

Type:

dict, str or Path

default_country

Default country for mapping (default is None).

Type:

str

strict_mapping

Whether to enforce strict mapping of columns and values (default is False).

Type:

bool

dict_df

DataFrame with variable dictionary (default is None).

Type:

pandas.DataFrame

categories

Categories for data selection (default is an empty list; the constructor argument defaults to None).

Type:

list of str

key_col

Key column for data selection (default is None).

Type:

str

key_val

Key values for data selection (default is an empty list; the constructor argument defaults to None).

Type:

list of str, int or float

extra_cols

Extra columns for data selection (default is an empty list; the constructor argument defaults to None).

Type:

list of str
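To make the roles of nan_threshold and sample_frac more concrete, the following is a minimal sketch in plain Python, not the library's implementation. It works on a dict of lists instead of a Dask DataFrame, the helper names nan_fraction and drop_sparse_columns are hypothetical, and the choice to drop a column when its NaN fraction is greater than or equal to the threshold is an assumption; the library may compare differently.

```python
import math
import random


def nan_fraction(values):
    """Return the share of entries that are None or float NaN."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def drop_sparse_columns(table, nan_threshold=1.0, sample_frac=None, seed=0):
    """Keep only the columns whose NaN fraction stays below nan_threshold.

    table maps column names to equally long lists of values.
    When sample_frac is given, the NaN fraction is estimated on a random
    subset of rows instead of on every row.
    """
    n_rows = len(next(iter(table.values()), []))
    row_ids = list(range(n_rows))
    if sample_frac is not None and n_rows:
        sample_size = max(1, int(n_rows * sample_frac))
        row_ids = random.Random(seed).sample(row_ids, sample_size)

    kept = {}
    for name, values in table.items():
        sampled = [values[i] for i in row_ids]
        # Assumption: a column is dropped once its NaN share reaches the threshold.
        if nan_fraction(sampled) < nan_threshold:
            kept[name] = values
    return kept


table = {
    "age": [34, 51, None, 28],
    "income": [None, None, None, None],
    "region": ["N", "S", "S", None],
}
cleaned = drop_sparse_columns(table, nan_threshold=1.0)
print(list(cleaned))
```

Under this reading, the default of 1.0 removes only columns that are entirely empty, and lowering the threshold makes the filter stricter.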

__init__(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)

Initialize a Harmonizer instance. Every parameter is optional and falls back to the default shown in the signature.
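Because every parameter has a default, a Harmonizer can be configured with only the keyword arguments that matter for a given task. The sketch below uses keyword names taken from the signature above; the values are illustrative, not recommendations. The import is guarded so the snippet also runs where socio4health is not installed, in which case no instance is created.

```python
# Keyword names come from the documented signature; the values are illustrative.
harmonizer_kwargs = {
    "min_common_columns": 2,
    "similarity_threshold": 0.9,
    "nan_threshold": 0.8,
    "sample_frac": 0.1,
    "strict_mapping": False,
}

try:
    from socio4health import Harmonizer
except ImportError:
    # The package is not available here, so the sketch stops at the configuration.
    Harmonizer = None

harmonizer = Harmonizer(**harmonizer_kwargs) if Harmonizer is not None else None
```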

Methods

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__getstate__()

Helper for pickle.

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__([min_common_columns, ...])

Initialize the Harmonizer class with default parameters.

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(*args, **kwargs)

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

compare_with_dict(ddfs)

Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame listing the mismatches in both directions.

data_selector(ddfs)

drop_nan_columns(ddf_or_ddfs)

Drop columns with too many NaN values, as determined by the instance parameters nan_threshold and sample_frac.

get_available_columns(ddf_or_ddfs)

harmonize_dataframes(country_dfs)

join_data(ddfs)

Join multiple Dask DataFrames on a specified key column, removing duplicate columns.

vertical_merge(ddfs)
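The two-directional comparison described for compare_with_dict can be pictured as a pair of set differences. The following is a rough sketch of that idea only: it uses plain Python collections, returns a list of rows where the real method returns a DataFrame, and the function name, row fields and exact name matching are all assumptions.

```python
def compare_columns_with_dictionary(columns_by_table, dictionary_variables):
    """List names found on one side but not the other.

    columns_by_table maps a table name to its column names;
    dictionary_variables holds the variable names from the dictionary.
    """
    dictionary_set = set(dictionary_variables)
    rows = []
    for table_name, columns in columns_by_table.items():
        column_set = set(columns)
        # Direction 1: columns present in the data but missing from the dictionary.
        for name in sorted(column_set - dictionary_set):
            rows.append(
                {"table": table_name, "name": name, "issue": "column not in dictionary"}
            )
        # Direction 2: dictionary variables that this table does not contain.
        for name in sorted(dictionary_set - column_set):
            rows.append(
                {"table": table_name, "name": name, "issue": "variable not in data"}
            )
    return rows


columns_by_table = {
    "persons": ["id", "age", "sex"],
    "households": ["id", "rooms"],
}
dictionary_variables = ["id", "age", "rooms"]

mismatches = compare_columns_with_dictionary(columns_by_table, dictionary_variables)
for row in mismatches:
    print(row)
```

Reporting both directions separates two different problems: data columns that the dictionary does not describe, and documented variables that a given table does not carry.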

Attributes

__annotations__

__dict__

__doc__

__module__

__weakref__

list of weak references to the object