{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Harmonization of data",
"id": "422e92e63d201714"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**Run the tutorial via free cloud platforms:** [notebook on GitHub](https://github.com/harmonize-tools/socio4health/blob/main/docs/source/notebooks/extractor.ipynb)\n"
],
"id": "1599a585fa8204d6"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"This notebook is a tutorial on processing sociodemographic and economic data from online data sources for **Brazil**. It assumes an **intermediate** or **advanced** understanding of **Python** and data manipulation.\n",
"\n",
"## Setting up the environment\n",
"\n",
"To run this notebook, you need the following prerequisite:\n",
"\n",
"- Python 3.10+\n",
"\n",
"Additionally, you need to install the `socio4health`, `pandas`, and `ipywidgets` packages, which can be done using `pip`:\n",
"\n"
],
"id": "a04649ff2f2f8680"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T16:17:52.108769Z",
"start_time": "2025-08-11T16:17:46.522060Z"
}
},
"cell_type": "code",
"source": "!pip install socio4health pandas ipywidgets -q",
"id": "59bb2e9841851c30",
"outputs": [],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Import Libraries",
"id": "af633dbea31aaaab"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T16:18:23.504993Z",
"start_time": "2025-08-11T16:18:23.496551Z"
}
},
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from socio4health import Extractor\n",
"from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum\n",
"from socio4health.harmonizer import Harmonizer\n",
"from socio4health.utils import harmonizer_utils\n",
"import tqdm\n"
],
"id": "e448c769134fe36d",
"outputs": [],
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Extracting data from Brazil\n",
"\n",
"In this example, we will extract the Brazilian National Continuous Household Sample Survey (**PNADC**) for the year 2024 from the Brazilian Institute of Geography and Statistics (**IBGE**) website."
],
"id": "8b286730445109e9"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T16:18:26.327230Z",
"start_time": "2025-08-11T16:18:26.316349Z"
}
},
"cell_type": "code",
"source": [
"bra_online_extractor = Extractor(\n",
"    input_path=\"https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/\",\n",
"    down_ext=['.txt', '.zip'],\n",
"    is_fwf=True,\n",
"    colnames=BraColnamesEnum.PNADC.value,\n",
"    colspecs=BraColspecsEnum.PNADC.value,\n",
"    output_path=\"../data\",\n",
"    depth=0,\n",
")"
],
"id": "338d2512725fe9f0",
"outputs": [],
"execution_count": 5
},
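{
"metadata": {},
"cell_type": "markdown",
"source": [
"The `is_fwf=True`, `colnames`, and `colspecs` arguments above indicate that the PNADC microdata are fixed-width text files. As a rough illustration of what fixed-width parsing means (not of how `Extractor` works internally), the sketch below uses `pandas.read_fwf` on a made-up two-line sample. The column names, positions, and values are hypothetical and do not reflect the real PNADC layout.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical fixed-width sample: year (4 chars), quarter (1 char), state code (2 chars).\n",
"sample_text = '''2024135\n",
"2024233\n",
"'''\n",
"\n",
"# Illustrative names and positions only; the real PNADC layout comes from\n",
"# BraColnamesEnum.PNADC and BraColspecsEnum.PNADC.\n",
"names = ['year', 'quarter', 'state']\n",
"colspecs = [(0, 4), (4, 5), (5, 7)]\n",
"\n",
"# Each line is sliced at the given character positions and split into columns.\n",
"df = pd.read_fwf(io.StringIO(sample_text), colspecs=colspecs, names=names)\n",
"print(df)\n",
"```\n",
"\n",
"Each `(start, end)` pair in `colspecs` is a half-open character interval, so `(0, 4)` covers the first four characters of every line."
],
"id": "fwf-parsing-sketch"
},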
{
"metadata": {},
"cell_type": "markdown",
"source": "## Providing the raw dictionary",
"id": "79484d3e09b568d6"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We need to provide the **harmonizer** with a **raw dictionary** that contains the column names and their corresponding data types. The harmonization process requires it in order to understand the structure of the data. To learn more about how to construct the raw dictionary, see the [documentation](https://harmonize-tools.github.io/socio4health/dictionary.html).",
"id": "ac3ad903ed8cb378"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T16:18:30.084645Z",
"start_time": "2025-08-11T16:18:29.613160Z"
}
},
"cell_type": "code",
"source": "raw_dict = pd.read_excel('raw_dictionary.xlsx')",
"id": "6773c8ef688101e2",
"outputs": [],
"execution_count": 6
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The raw dictionary is then standardized with the `standardize_dict` function, which puts it in a consistent format that is easier to work with during the harmonization process.",
"id": "492302dd9245be14"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T16:18:33.116476Z",
"start_time": "2025-08-11T16:18:32.773710Z"
}
},
"cell_type": "code",
"source": "dic = harmonizer_utils.standardize_dict(raw_dict)",
"id": "65ab398829a1ad1f",
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\isabe\\PycharmProjects\\socio4health\\src\\socio4health\\utils\\harmonizer_utils.py:78: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
" .apply(_process_group, include_groups=True)\\\n"
]
}
],
"execution_count": 7
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Additionally, the content of the dictionary's columns can be translated into English using the `translate_column` function from the `harmonizer_utils` module. Translation makes the data easier to understand and process.",
"id": "a2f7ece82aa5d5d"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"The `translate_column` function may take some time, depending on the size of the dictionary and the number of columns to be translated. Use it only if you need the content of the columns in English for further processing or analysis.\n",
"