API Reference

This section documents the application programming interface (API) of socio4health. Its content is generated automatically from the docstrings in the Python source code.

Extractor

Methods

Extractor([input_path, depth, down_ext, ...])

A class for extracting data from various sources, including online scraping and local file processing.

extractor.get_default_data_dir()

Returns the default data directory for storing downloaded files.

Extractor.extract()

Extracts data from the specified input path, either by scraping online sources or processing local files.

Extractor.delete_download_folder([folder_path])

Safely delete the download folder and all its contents.
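
To make "safely delete" concrete, here is a minimal standard-library sketch of a guarded recursive delete. It is not the code behind Extractor.delete_download_folder; the function name safe_delete_folder and the specific guards (refusing the filesystem root and the current working directory) are assumptions chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def safe_delete_folder(folder_path):
    """Delete folder_path and its contents; return True if a folder was removed.

    Hypothetical helper: the guards below are illustrative assumptions,
    not the checks performed by socio4health.
    """
    target = Path(folder_path).resolve()
    # Refuse to delete the filesystem root or the current working directory.
    protected = {Path(target.anchor), Path.cwd().resolve()}
    if target in protected:
        raise ValueError(f"Refusing to delete protected path: {target}")
    # Nothing to do if the folder is missing or is not a directory.
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demonstration on a throwaway temporary directory.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "survey.csv").write_text("id,value\n1,2\n")
removed = safe_delete_folder(demo_dir)
removed_again = safe_delete_folder(demo_dir)  # folder is already gone
```

Returning False for a missing folder, rather than raising, makes the cleanup step safe to call more than once.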

Harmonizer

Methods

Harmonizer([min_common_columns, ...])

Initialize the Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.

Harmonizer.vertical_merge(ddfs)

Harmonizer.drop_nan_columns(ddf_or_ddfs)

Drop columns where the majority of values are NaN using instance parameters.

Harmonizer.get_available_columns(ddf_or_ddfs)

Harmonizer.harmonize_dataframes(country_dfs)

Harmonizer.data_selector(ddfs)

Harmonizer.compare_with_dict(ddfs)

Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame with the columns that do not match in both directions.

Harmonizer.join_data(ddfs)

Join multiple Dask DataFrames on a specified key column, removing duplicate columns.
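
The rule described for Harmonizer.drop_nan_columns (drop a column when most of its values are NaN) can be sketched without Dask on a plain dictionary of columns. This is an illustration only: the function name drop_mostly_missing, the threshold parameter, and its default of 0.5 are assumptions, and the real method reads its settings from the Harmonizer instance.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_mostly_missing(columns, threshold=0.5):
    """Return the columns whose share of missing values is at most `threshold`.

    `columns` maps a column name to a list of values. The 0.5 default is an
    assumed reading of "the majority of values", not a documented default.
    """
    kept = {}
    for name, values in columns.items():
        # An empty column is treated as entirely missing.
        if values:
            missing_share = sum(is_missing(v) for v in values) / len(values)
        else:
            missing_share = 1.0
        if missing_share <= threshold:
            kept[name] = values
    return kept


table = {
    "age": [34, 51, float("nan"), 29],      # 1 of 4 missing: kept
    "income": [None, None, None, 1200.0],   # 3 of 4 missing: dropped
    "region": ["N", "S", "S", "E"],         # none missing: kept
}
cleaned = drop_mostly_missing(table)
```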

Utils

Extractor

utils.extractor_utils.compressed2files(...)

Extract files from a compressed archive and return the paths of the extracted files.

utils.extractor_utils.download_request(url, ...)

Download a file from the specified URL and save it to the given directory.

utils.extractor_utils.run_standard_spider(...)

Run the Scrapy spider to extract data from the given URL.
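
As a rough picture of what utils.extractor_utils.compressed2files describes (extract an archive, return the extracted paths), the sketch below handles ZIP archives with the standard library. The function name extract_zip and its two parameters are hypothetical; the real helper's arguments are elided in this reference, and it may support other archive formats.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(archive_path, output_dir):
    """Extract every file in a ZIP archive and return the extracted paths.

    Hypothetical stand-in for illustration; ZIP-only by assumption.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            # Directory entries carry no data, so only files are reported.
            if member.is_dir():
                continue
            extracted.append(Path(archive.extract(member, output_dir)))
    return extracted


# Build a small archive in a temporary directory, then extract it.
work_dir = Path(tempfile.mkdtemp())
archive_path = work_dir / "bundle.zip"
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("data/households.csv", "id,size\n1,4\n")
    archive.writestr("readme.txt", "demo archive\n")

paths = extract_zip(archive_path, work_dir / "out")
names = sorted(p.name for p in paths)
```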

Harmonizer

utils.harmonizer_utils.classify_rows(data, ...)

Classify each row using a fine-tuned multiclass classification BERT model.

utils.harmonizer_utils.get_classifier(MODEL_PATH)

Load the BERT fine-tuned model for classification only once.

utils.harmonizer_utils.standardize_dict(raw_dict)

Cleans and structures a dictionary-like DataFrame of variables by standardizing text fields, grouping possible answers, and removing duplicates.

utils.harmonizer_utils.translate_column(...)

Translates the content of selected columns in a DataFrame using Google Translate.
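
The "only once" behaviour described for utils.harmonizer_utils.get_classifier is a caching pattern. One common way to get it in Python is functools.lru_cache, sketched below with a placeholder loader. Whether socio4health uses lru_cache, a module-level variable, or something else is not stated here; get_model, load_count, and the returned dictionary are hypothetical stand-ins for a real BERT model load.

```python
from functools import lru_cache

# Counts how many times the expensive load actually runs.
load_count = 0


@lru_cache(maxsize=None)
def get_model(model_path):
    """Pretend to load an expensive model; the body runs once per distinct path."""
    global load_count
    load_count += 1
    # Placeholder object standing in for a loaded classifier.
    return {"path": model_path, "labels": ["placeholder"]}


first = get_model("models/demo")
second = get_model("models/demo")  # served from the cache, no second load
```

Because the cache is keyed on the argument, a different path would trigger a separate load while repeated calls with the same path reuse the first result.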

Spider

utils.standard_spider.StandardSpider(*args, ...)

A standard spider for scraping links from a given URL.

utils.standard_spider.StandardSpider.parse(...)

Parse the response to extract links based on criteria.
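
To illustrate what "extract links based on criteria" can look like, the sketch below collects anchor links whose targets end in a given set of file extensions. It uses the standard library's html.parser instead of Scrapy so that it runs on its own; the class name LinkCollector, the extension filter, and the example URL are assumptions and do not reproduce StandardSpider.parse.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> links that end in one of the given extensions."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        # str.endswith accepts a tuple, so normalise the extensions once.
        self.extensions = tuple(ext.lower() for ext in extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-insensitive match on the file extension.
        if href and href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))


page = (
    '<a href="/files/2020.zip">2020</a> '
    '<a href="about.html">About</a> '
    '<a href="data/2021.CSV">2021</a>'
)
collector = LinkCollector("https://example.org/surveys/", [".zip", ".csv"])
collector.feed(page)
```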

Enums