socio4health.Harmonizer
- class socio4health.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)
Initialize the Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.
- min_common_columns
Minimum number of common columns required for vertical merge (default is 1).
- Type: int
- similarity_threshold
Similarity threshold to consider for vertical merge (default is 1).
- Type: float
- sample_frac
Fraction of rows to sample for NaN detection (default is None).
- Type: float or None
- column_mapping
Column mapping configuration (default is None).
- Type: Enum, dict, str or Path
- value_mappings
Categorical value mapping configuration (default is None).
- Type: Enum, dict, str or Path
- theme_info
Theme/category information (default is None).
- Type: dict, str or Path
- strict_mapping
Whether to enforce strict mapping of columns and values (default is False).
- Type: bool
- dict_df
DataFrame with variable dictionary (default is None).
- Type: DataFrame or None
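The two merge thresholds can be read as a compatibility test between the column sets of two DataFrames. The sketch below is only an illustration of that idea, not the library's implementation: the similarity measure (Jaccard similarity over column names) and the helper names `jaccard` and `can_merge_vertically` are assumptions made for this example.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of column names that two column sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def can_merge_vertically(
    cols_a,
    cols_b,
    min_common_columns: int = 1,
    similarity_threshold: float = 1.0,
) -> bool:
    """Hypothetical check: enough shared columns AND similar enough overall."""
    a, b = set(cols_a), set(cols_b)
    enough_common = len(a & b) >= min_common_columns
    similar_enough = jaccard(a, b) >= similarity_threshold
    return enough_common and similar_enough


# Identical column sets pass even with the strictest threshold.
same = can_merge_vertically(["id", "age"], ["age", "id"])
# Two of three columns shared: fails at 1.0, passes at 0.6.
strict = can_merge_vertically(["id", "age", "sex"], ["id", "age"], similarity_threshold=1.0)
relaxed = can_merge_vertically(["id", "age", "sex"], ["id", "age"], similarity_threshold=0.6)
```

Under this reading, a threshold of 1 only allows merging frames whose column names match exactly, while lower values tolerate partially overlapping schemas.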
- __init__(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)
Initialize the Harmonizer class with default parameters.
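As a usage sketch, the constructor can be called with keyword arguments taken from the signature above. The values shown are illustrative only, and the helper `build_harmonizer` is a hypothetical wrapper introduced for this example; it imports the library lazily so the configuration can be inspected without `socio4health` installed. The exact effect of `nan_threshold` is not described on this page, so treat its value here as a placeholder.

```python
# Keyword arguments named exactly as in the documented signature.
# The values are illustrative, not recommended settings.
harmonizer_kwargs = {
    "min_common_columns": 2,
    "similarity_threshold": 0.8,
    "nan_threshold": 0.9,   # placeholder; semantics are not documented on this page
    "sample_frac": 0.1,     # sample 10% of rows for NaN detection
    "strict_mapping": False,
}


def build_harmonizer(**overrides):
    """Hypothetical helper: create a Harmonizer, importing the library on demand."""
    from socio4health import Harmonizer

    return Harmonizer(**{**harmonizer_kwargs, **overrides})
```

Calling `build_harmonizer(strict_mapping=True)` would then override a single setting while keeping the rest of the configuration.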
Methods
- __delattr__(name, /): Implement delattr(self, name).
- __dir__(): Default dir() implementation.
- __eq__(value, /): Return self==value.
- __format__(format_spec, /): Default object formatter.
- __ge__(value, /): Return self>=value.
- __getattribute__(name, /): Return getattr(self, name).
- __getstate__(): Helper for pickle.
- __gt__(value, /): Return self>value.
- __hash__(): Return hash(self).
- __init__([min_common_columns, ...]): Initialize the Harmonizer class with default parameters.
- __init_subclass__: This method is called when a class is subclassed.
- __le__(value, /): Return self<=value.
- __lt__(value, /): Return self<value.
- __ne__(value, /): Return self!=value.
- __new__(*args, **kwargs)
- __reduce__(): Helper for pickle.
- __reduce_ex__(protocol, /): Helper for pickle.
- __repr__(): Return repr(self).
- __setattr__(name, value, /): Implement setattr(self, name, value).
- __sizeof__(): Size of object in memory, in bytes.
- __str__(): Return str(self).
- __subclasshook__: Abstract classes can override this to customize issubclass().
- compare_with_dict(ddfs): Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame with the columns that do not match in both directions.
- data_selector(ddfs)
- drop_nan_columns(ddf_or_ddfs): Drop columns where the majority of values are NaN using instance parameters.
- get_available_columns(ddf_or_ddfs)
- harmonize_dataframes(country_dfs)
- join_data(ddfs): Join multiple Dask DataFrames on a specified key column, removing duplicate columns.
- vertical_merge(ddfs)
(ddfs)Attributes
__annotations__
__dict__
__doc__
__module__
__weakref__
list of weak references to the object
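The behaviour described for drop_nan_columns, together with the sample_frac attribute, suggests a check of the share of missing values per column, optionally on a row sample. The sketch below illustrates that idea on plain Python records. It is an assumption-laden illustration, not the library's implementation: the helper names `is_missing` and `mostly_missing_columns`, the `threshold` and `seed` parameters, and the strict "greater than" comparison are all choices made for this example.

```python
import math
import random


def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mostly_missing_columns(rows, threshold=0.5, sample_frac=None, seed=0):
    """Return column names whose share of missing values exceeds `threshold`."""
    if sample_frac is not None and rows:
        # Inspect only a random subset of rows to keep the check cheap.
        k = max(1, int(len(rows) * sample_frac))
        rows = random.Random(seed).sample(rows, k)
    if not rows:
        return []
    columns = {key for row in rows for key in row}
    flagged = []
    for col in sorted(columns):
        missing = sum(is_missing(row.get(col)) for row in rows)
        if missing / len(rows) > threshold:
            flagged.append(col)
    return flagged


rows = [
    {"id": 1, "age": 30.0, "income": float("nan")},
    {"id": 2, "age": float("nan"), "income": float("nan")},
    {"id": 3, "age": 41.0, "income": None},
]
to_drop = mostly_missing_columns(rows)
```

Sampling trades accuracy for speed: on large datasets a small `sample_frac` gives only an estimate of each column's missing share.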