API Reference

This section documents the application programming interface (API) of socio4health. Its content is generated automatically from the docstrings in the Python source code.

Extractor

Methods

Extractor

A class for extracting data from various sources, including online scraping and local file processing.

extractor.s4h_get_default_data_dir

Returns the default data directory for storing downloaded files.

Extractor.s4h_extract

Extracts data from the specified input path, either by scraping online sources or processing local files.

Extractor.s4h_delete_download_folder

Safely delete the download folder and all its contents.
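As an illustration of what "safely" can mean for a folder deletion, the following is a minimal sketch and not the socio4health implementation. The helpers `default_data_dir` and `delete_folder_safely`, the `example_app` name, and the rule "only delete directories strictly inside an allowed root" are all assumptions made for this example.

```python
import shutil
import tempfile
from pathlib import Path


def default_data_dir(app_name: str = "example_app") -> Path:
    """Hypothetical helper: a per-user data directory under the home folder."""
    return Path.home() / ".cache" / app_name


def delete_folder_safely(folder: Path, allowed_root: Path) -> bool:
    """Delete `folder` only if it is a directory strictly inside `allowed_root`."""
    folder = folder.resolve()
    allowed_root = allowed_root.resolve()
    # Refuse to delete the root itself or anything outside it.
    if folder == allowed_root or allowed_root not in folder.parents:
        return False
    if not folder.is_dir():
        return False
    shutil.rmtree(folder)
    return True


# Demonstration in a throwaway temporary directory.
root = Path(tempfile.mkdtemp())
downloads = root / "downloads"
downloads.mkdir()
(downloads / "file.csv").write_text("a,b\n1,2\n")

deleted = delete_folder_safely(downloads, root)  # inside the root: removed
refused = delete_folder_safely(root, root)       # the root itself: refused
```

Checking the resolved path against an allowed root guards against accidentally removing a parent directory when a path is misconfigured.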

Harmonizer

Methods

Harmonizer

A class for harmonizing and processing Dask DataFrames in health data integration.

Harmonizer.s4h_vertical_merge

Harmonizer.s4h_get_available_columns

Get a list of unique column names from a single DataFrame or a list of DataFrames.

Harmonizer.s4h_harmonize_dataframes

Harmonizer.s4h_data_selector

Harmonizer.s4h_compare_with_dict

Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame with the columns that do not match in both directions.
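The two-directional comparison described above can be pictured with plain sets. This is a simplified sketch, not the library's code: it takes lists of column names instead of Dask DataFrames, returns a dict instead of a DataFrame, and the function name `compare_columns` and the result keys are hypothetical.

```python
def compare_columns(frame_columns, dictionary_variables):
    """Return the names that appear on only one side of the comparison.

    frame_columns: a list of column-name lists, one per table.
    dictionary_variables: the variable names listed in the data dictionary.
    """
    available = set()
    for columns in frame_columns:
        available.update(columns)  # unique column names across all tables
    variables = set(dictionary_variables)
    return {
        "in_data_not_in_dict": sorted(available - variables),
        "in_dict_not_in_data": sorted(variables - available),
    }


result = compare_columns(
    [["id", "age", "sex"], ["id", "income"]],
    ["id", "age", "region"],
)
# result["in_data_not_in_dict"] is ["income", "sex"]
# result["in_dict_not_in_data"] is ["region"]
```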

Harmonizer.s4h_join_data

Utils

Extractor

utils.extractor_utils.compressed2files

Extract files from a compressed archive and return the paths of the extracted files.
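For readers unfamiliar with the pattern, here is a minimal sketch of extracting an archive and returning the resulting paths. It handles ZIP files only and uses the standard library's `zipfile` module; the function name `extract_archive` is hypothetical, and the formats supported by `compressed2files` itself are not stated in this reference.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: Path, target_dir: Path) -> list[Path]:
    """Extract a ZIP archive and return the paths of the extracted files."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        # Skip directory entries, which end with a slash in the name list.
        return [
            target_dir / name
            for name in archive.namelist()
            if not name.endswith("/")
        ]


# Build a small archive in a temporary directory, then extract it.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "sample.zip"
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("survey/households.csv", "id,size\n1,4\n")
    archive.writestr("survey/persons.csv", "id,age\n1,37\n")

paths = extract_archive(archive_path, workdir / "extracted")
```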

utils.extractor_utils.download_request

Download a file from the specified URL and save it to the given directory.

utils.extractor_utils.s4h_parse_fwf_dict

Parse a dictionary DataFrame to extract column names and fixed-width format specifications.
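To show what "fixed-width format specifications" typically look like, the sketch below converts dictionary rows into column names and half-open `(start, end)` character ranges. It is an assumption-laden illustration: the row keys `name`, `start` (1-based) and `width`, and the function `parse_fwf_rows`, are hypothetical and do not reflect the dictionary layout that `s4h_parse_fwf_dict` expects.

```python
def parse_fwf_rows(rows):
    """Turn dictionary rows into column names and 0-based half-open ranges."""
    names, colspecs = [], []
    for row in sorted(rows, key=lambda r: r["start"]):
        begin = row["start"] - 1  # convert a 1-based start to a 0-based index
        names.append(row["name"])
        colspecs.append((begin, begin + row["width"]))
    return names, colspecs


rows = [
    {"name": "age", "start": 6, "width": 3},
    {"name": "id", "start": 1, "width": 5},
]
names, colspecs = parse_fwf_rows(rows)

# Slice one fixed-width record using the computed ranges.
line = "00042 37"
record = {n: line[a:b].strip() for n, (a, b) in zip(names, colspecs)}
# record is {"id": "00042", "age": "37"}
```

Ranges in this half-open form match what `pandas.read_fwf` accepts through its `colspecs` parameter.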

utils.extractor_utils.run_standard_spider

Run the Scrapy spider to extract data from the given URL.

Harmonizer

utils.harmonizer_utils.s4h_classify_rows

Classify each row using a fine-tuned multiclass classification BERT model.

utils.harmonizer_utils.s4h_get_classifier

Load the BERT fine-tuned model for classification only once.
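Loading an expensive model "only once" is commonly done by caching the loader's result. The sketch below shows that pattern with `functools.lru_cache`; it is not how socio4health is known to do it, and a trivial keyword function stands in for the BERT model so the example runs without any model files. The names `get_classifier`, `classify` and `load_count` are hypothetical.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=1)
def get_classifier():
    """Build the classifier on the first call and reuse it afterwards."""
    global load_count
    load_count += 1

    # Stand-in for a real model: a trivial keyword rule.
    def classify(text: str) -> str:
        return "health" if "hospital" in text.lower() else "other"

    return classify


first = get_classifier()
second = get_classifier()  # served from the cache, no second load
label = first("Distance to the nearest hospital")
```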

utils.harmonizer_utils.s4h_standardize_dict

Cleans and structures a dictionary-like DataFrame of variables by standardizing text fields, grouping possible answers, and removing duplicates.

utils.harmonizer_utils.s4h_translate_column

Translates the content of selected columns in a DataFrame using Google Translate.

Spider

utils.standard_spider.StandardSpider

A standard spider for scraping links from a given URL.

utils.standard_spider.StandardSpider.parse

Parse the response to extract links based on criteria.
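The idea of extracting links that meet a criterion can be sketched without Scrapy. The example below swaps in the standard library's `html.parser` and filters anchors by file extension; the class `LinkCollector`, the extension-based criterion, and the `example.org` URL are assumptions for illustration, not the behaviour of `StandardSpider.parse`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> links whose href ends with an extension."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        self.extensions = tuple(ext.lower() for ext in extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-insensitive match on the file extension.
        if href and href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))


page = (
    '<a href="/data/2020.zip">2020</a> '
    '<a href="about.html">About</a> '
    '<a href="files/2021.CSV">2021</a>'
)
collector = LinkCollector("https://example.org/surveys/", [".zip", ".csv"])
collector.feed(page)
links = collector.links
```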

Enums