
Hands-on with socio4health: effects of hydrometeorological hazards and urbanization on dengue risk in Brazil#

Run the tutorial via free cloud platforms: Binder or Google Colab.

This notebook provides a real-world example of how to use socio4health to retrieve, harmonize, and analyze socioeconomic and demographic variables in Brazil, such as the level of urbanization and access to water supply. The goal is to recreate the dataset used in the publication "Combined effects of hydrometeorological hazards and urbanisation on dengue risk in Brazil: a spatiotemporal modelling study" by Lowe et al., published in The Lancet Planetary Health in 2021 (DOI). The study evaluated how the association between hydrometeorological events and dengue risk varies with these variables. This tutorial assumes an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need:

  • Python 3.10+

You also need the socio4health and pandas packages, which can be installed with pip:

!pip install socio4health pandas -q

Import Libraries#

The socio4health library provides the Extractor class for data extraction and the Harmonizer class for harmonizing the retrieved data. pandas is used for data manipulation, dask for the larger dataframes, and matplotlib for plotting. We will also use utility functions from the socio4health.utils.harmonizer_utils and socio4health.utils.extractor_utils modules to standardize, translate, and parse the dictionary.

import re
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from socio4health import Extractor
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils, extractor_utils

1. Load and standardize the dictionary#

To harmonize the data, provide a dictionary that describes the variables in the dataset. The study retrieved data from the 2010 census, from Instituto Brasileiro de Geografia e Estatística (IBGE). The dictionary for the census data can be found here. Follow the steps in the tutorial “How to Create a Raw Dictionary for Data Harmonization” to create a raw dictionary in Excel format.

This dictionary must be standardized and translated to English. The socio4health.utils.harmonizer_utils module provides utility functions to perform these tasks. Additionally, the socio4health.utils.extractor_utils module provides utility functions to parse fixed-width file (FWF) dictionaries, which is the format used in the IBGE census data.

raw_dic = pd.read_excel("raw_dictionary_br_2010.xlsx")
dic = harmonizer_utils.s4h_standardize_dict(raw_dic)
colnames, colspecs = extractor_utils.s4h_parse_fwf_dict(dic)

This is how the standardized dictionary looks:

dic
|  | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers |
| 0 | V0402 | a responsabilidade pelo domicílio é de: | NaN | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | NaN | C | apenas um morador; mais de um morador; ignorado |
| 1 | V0209 | abastecimento de água, canalização: | NaN | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | NaN | C | sim, em pelo menos um cômodo; sim, só na propr... |
| 2 | V0208 | abastecimento de água, forma: | NaN | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | NaN | C | rede geral de distribuição; poço ou nascente n... |
| 3 | V6210 | adequação da moradia | NaN | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | NaN | C | adequada; semi-adequada; inadequada |
| 4 | V0301 | alguma pessoa que morava com você(s) estava mo... | NaN | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | NaN | C | sim; não |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | V0214 | televisão, existência: | NaN | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | NaN | C | sim; não |
| 72 | V4002 | tipo de espécie: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | NaN | C\n | casa; casa de vila ou em condomínio; apartamen... |
| 73 | V0001 | unidade da federação: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | NaN | A | rondônia; acre; amazonas; roraima; pará; amapá... |
| 74 | V2011 | valor do aluguel (em reais) | NaN | NaN | 59.0 | 64.0 | 6.0 | NaN | N | NaN |
| 75 | V0011 | área de ponderação | NaN | NaN | 8.0 | 20.0 | 13.0 | NaN | A | NaN |

76 rows × 10 columns
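
The initial_position and final_position columns in the dictionary above are what make fixed-width parsing possible. The sketch below shows one way such positions could be turned into column specifications, assuming the positions are 1-based and inclusive (the sizes in the dictionary suggest this: positions 8 to 20 give a size of 13). It is not the implementation of s4h_parse_fwf_dict; the helper name positions_to_colspecs and the sample record are made up for illustration.

```python
# Hypothetical helper: convert 1-based inclusive positions into
# 0-based half-open (start, end) slices.
def positions_to_colspecs(entries):
    names = [name for name, _, _ in entries]
    specs = [(int(start) - 1, int(end)) for _, start, end in entries]
    return names, specs

# Positions for V0001 and V0011 are taken from the dictionary shown above.
toy_entries = [
    ("V0001", 1.0, 2.0),   # state code, 2 characters
    ("V0011", 8.0, 20.0),  # weighting area, 13 characters
]
toy_names, toy_specs = positions_to_colspecs(toy_entries)

# A made-up 20-character record: state code, 5 filler characters, area code.
sample_line = "11" + "XXXXX" + "1100015000001"

# Slice the record with the computed specifications.
record = {
    name: sample_line[start:end].strip()
    for name, (start, end) in zip(toy_names, toy_specs)
}
print(toy_specs)
print(record)
```

pandas.read_fwf accepts column specifications in the same half-open (start, end) form through its colspecs argument, which is why a conversion of this kind is needed before reading the census files.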

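Next, translate the question, description, and possible_answers columns to English; each call adds a new column with the _en suffix. The translated columns are then passed to harmonizer_utils.s4h_classify_rows, which assigns a category to each variable. The classification model used in this tutorial is a BERT model fine-tuned to classify survey questions into categories. To use your own model, provide its path in the MODEL_PATH parameter of harmonizer_utils.s4h_classify_rows.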

dic = harmonizer_utils.s4h_translate_column(dic, "question", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "description", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "possible_answers", language="en")
dic = harmonizer_utils.s4h_classify_rows(dic, "question_en", "description_en", "possible_answers_en",
                                        new_column_name="category",
                                        MODEL_PATH="files/bert_finetuned_classifier")
dic
question translated
description translated
possible_answers translated
Device set to use cpu
|  | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers | question_en | description_en | possible_answers_en | category |
| 0 | V0402 | a responsabilidade pelo domicílio é de: | NaN | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | NaN | C | apenas um morador; mais de um morador; ignorado | The responsibility for the home is: | NaN | just a resident; more than one resident; ignored | Housing |
| 1 | V0209 | abastecimento de água, canalização: | NaN | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | NaN | C | sim, em pelo menos um cômodo; sim, só na propr... | water supply, channeling: | NaN | Yes, in at least one room; Yes, only on the pr... | Housing |
| 2 | V0208 | abastecimento de água, forma: | NaN | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | NaN | C | rede geral de distribuição; poço ou nascente n... | water supply, form: | NaN | General Distribution Network; well or source o... | Business |
| 3 | V6210 | adequação da moradia | NaN | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | NaN | C | adequada; semi-adequada; inadequada | Housing Adequacy | NaN | adequate; semi-adherence; inadequate | Housing |
| 4 | V0301 | alguma pessoa que morava com você(s) estava mo... | NaN | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | NaN | C | sim; não | Someone who lived with you (s) was living in a... | NaN | Yes; no | Business |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | V0214 | televisão, existência: | NaN | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | NaN | C | sim; não | television, existence: | NaN | Yes; no | Identification |
| 72 | V4002 | tipo de espécie: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | NaN | C\n | casa; casa de vila ou em condomínio; apartamen... | Type of species: | NaN | home; village house or condominium; apartment;... | Housing |
| 73 | V0001 | unidade da federação: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | NaN | A | rondônia; acre; amazonas; roraima; pará; amapá... | Federation unit: | NaN | Rondônia; acre; Amazonas; Roraima; to; Amapá; ... | Business |
| 74 | V2011 | valor do aluguel (em reais) | NaN | NaN | 59.0 | 64.0 | 6.0 | NaN | N | NaN | Rental value (in reais) | NaN | NaN | Business |
| 75 | V0011 | área de ponderação | NaN | NaN | 8.0 | 20.0 | 13.0 | NaN | A | NaN | weighing area | NaN | NaN | Identification |

76 rows × 14 columns

2. Extract data from Brazil Census 2010#

To extract data, use the Extractor class from the socio4health library. As in the publication, extract the Brazil Census 2010 dataset from the Brazilian Institute of Geography and Statistics (IBGE) website or from a local copy. The dataset is available here.

The Extractor class requires the following parameters:

  • input_path: The URL or local path to the data source.

  • down_ext: A list of file extensions to download. This can include .txt, .zip, etc.

  • output_path: The local path where the extracted data will be saved.

  • key_words: A list of keywords or regular expressions used to filter the files to be downloaded. In this case, the pattern ^[A-Z]+\.zip$ selects only the .zip files whose names consist entirely of uppercase letters, such as RO.zip or SP.zip (one archive per state).

  • depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.

  • is_fwf: A boolean indicating whether the files are in fixed-width format (FWF). In this case, the files are in FWF format, so this parameter is set to True.

  • colnames: A list of column names for the FWF files, extracted from the standardized dictionary.

  • colspecs: A list of tuples indicating the start and end positions of each column in the FWF files, extracted from the standardized dictionary.
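
The key_words pattern described in the list above can be checked in isolation with Python's re module before starting a long download. This is a minimal sketch: the candidate file names other than RO.zip and SP.zip are hypothetical, and how the Extractor applies the pattern internally is an assumption.

```python
import re

# Same pattern as the key_words entry, written as a raw string so that
# the backslash reaches the regex engine unchanged.
state_zip = re.compile(r"^[A-Z]+\.zip$")

# RO.zip and SP.zip appear in the census listing; the rest are made-up
# names used only to show what the pattern rejects.
candidates = ["RO.zip", "SP.zip", "Documentacao.zip", "RO.txt", "sp.zip"]

selected = [name for name in candidates if state_zip.search(name)]
print(selected)
```

Only names made up entirely of uppercase letters followed by .zip pass the filter, so documentation archives and loose text files are skipped.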

bra_online_extractor = Extractor(input_path="https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados",
                                 down_ext=['.txt','.zip'],
                                 output_path="../../../../input/IBGE_2010",
                                 # Raw string avoids an invalid "\." escape sequence warning
                                 key_words=[r"^[A-Z]+\.zip$"],
                                 depth=0, is_fwf=True, colnames=colnames, colspecs=colspecs)
bra_Censo_2010 = bra_online_extractor.s4h_extract()
2025-09-24 10:49:42,531 - INFO - ----------------------
2025-09-24 10:49:42,533 - INFO - Starting data extraction...
2025-09-24 10:49:42,534 - INFO - Extracting data in online mode...
2025-09-24 10:49:42,535 - INFO - Scraping URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados with depth 0
2025-09-24 10:49:54,140 - INFO - Spider completed successfully for URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados
2025-09-24 10:49:54,147 - INFO - Downloading files to: ../../../../input/IBGE_2010
Downloading files:   0%|          | 0/27 [00:00<?, ?it/s]2025-09-24 10:50:01,707 - INFO - Successfully downloaded: RO.zip
Downloading files:   4%|▎         | 1/27 [00:07<03:16,  7.54s/it]2025-09-24 10:50:04,214 - INFO - Successfully downloaded: AC.zip
Downloading files:   7%|▋         | 2/27 [00:10<01:54,  4.58s/it]2025-09-24 10:50:08,911 - INFO - Successfully downloaded: AM.zip
Downloading files:  11%|█         | 3/27 [00:14<01:51,  4.63s/it]2025-09-24 10:50:11,500 - INFO - Successfully downloaded: RR.zip
Downloading files:  15%|█▍        | 4/27 [00:17<01:28,  3.83s/it]2025-09-24 10:50:31,403 - INFO - Successfully downloaded: PA.zip
Downloading files:  19%|█▊        | 5/27 [00:37<03:31,  9.62s/it]2025-09-24 10:50:34,357 - INFO - Successfully downloaded: AP.zip
Downloading files:  22%|██▏       | 6/27 [00:40<02:34,  7.36s/it]2025-09-24 10:50:39,845 - INFO - Successfully downloaded: TO.zip
Downloading files:  26%|██▌       | 7/27 [00:45<02:14,  6.75s/it]2025-09-24 10:50:55,853 - INFO - Successfully downloaded: MA.zip
Downloading files:  30%|██▉       | 8/27 [01:01<03:04,  9.69s/it]2025-09-24 10:51:07,970 - INFO - Successfully downloaded: PI.zip
Downloading files:  33%|███▎      | 9/27 [01:13<03:08, 10.45s/it]2025-09-24 10:51:31,612 - INFO - Successfully downloaded: CE.zip
Downloading files:  37%|███▋      | 10/27 [01:37<04:06, 14.52s/it]2025-09-24 10:52:07,354 - INFO - Successfully downloaded: RN.zip
Downloading files:  41%|████      | 11/27 [02:13<05:36, 21.02s/it]2025-09-24 10:52:12,916 - INFO - Successfully downloaded: PB.zip
Downloading files:  44%|████▍     | 12/27 [02:18<04:04, 16.32s/it]2025-09-24 10:52:37,459 - INFO - Successfully downloaded: PE.zip
Downloading files:  48%|████▊     | 13/27 [02:43<04:23, 18.81s/it]2025-09-24 10:52:45,408 - INFO - Successfully downloaded: AL.zip
Downloading files:  52%|█████▏    | 14/27 [02:51<03:21, 15.53s/it]2025-09-24 10:52:52,982 - INFO - Successfully downloaded: SE.zip
Downloading files:  56%|█████▌    | 15/27 [02:58<02:37, 13.13s/it]2025-09-24 10:53:40,088 - INFO - Successfully downloaded: BA.zip
Downloading files:  59%|█████▉    | 16/27 [03:45<04:16, 23.36s/it]2025-09-24 10:55:23,631 - INFO - Successfully downloaded: MG.zip
Downloading files:  63%|██████▎   | 17/27 [05:29<07:54, 47.47s/it]2025-09-24 10:55:33,601 - INFO - Successfully downloaded: ES.zip
Downloading files:  67%|██████▋   | 18/27 [05:39<05:25, 36.20s/it]2025-09-24 10:56:19,971 - INFO - Successfully downloaded: RJ.zip
Downloading files:  70%|███████   | 19/27 [06:25<05:14, 39.26s/it]2025-09-24 10:57:04,413 - INFO - Successfully downloaded: PR.zip
Downloading files:  74%|███████▍  | 20/27 [07:10<04:45, 40.81s/it]2025-09-24 10:57:08,635 - INFO - Successfully downloaded: SC.zip
Downloading files:  78%|███████▊  | 21/27 [07:14<02:58, 29.83s/it]2025-09-24 10:58:15,827 - INFO - Successfully downloaded: RS.zip
Downloading files:  81%|████████▏ | 22/27 [08:21<03:25, 41.04s/it]2025-09-24 10:58:21,445 - INFO - Successfully downloaded: MS.zip
Downloading files:  85%|████████▌ | 23/27 [08:27<02:01, 30.41s/it]2025-09-24 10:58:33,252 - INFO - Successfully downloaded: MT.zip
Downloading files:  89%|████████▉ | 24/27 [08:39<01:14, 24.83s/it]2025-09-24 10:58:56,952 - INFO - Successfully downloaded: GO.zip
Downloading files:  93%|█████████▎| 25/27 [09:02<00:48, 24.49s/it]2025-09-24 10:59:15,593 - INFO - Successfully downloaded: DF.zip
Downloading files:  96%|█████████▋| 26/27 [09:21<00:22, 22.74s/it]2025-09-24 11:01:28,006 - INFO - Successfully downloaded: SP.zip
Downloading files: 100%|██████████| 27/27 [11:33<00:00, 25.70s/it]
2025-09-24 11:01:28,026 - INFO - Processing (depth 0): RO.zip
2025-09-24 11:01:28,916 - INFO - Extracted: a53593a5_RO_Dom11.txt
2025-09-24 11:01:28,943 - INFO - Extracted: a53593a5_RO_FAMI11.TXT
2025-09-24 11:01:29,103 - INFO - Extracted: a53593a5_RO_Pes11.txt
2025-09-24 11:01:29,118 - INFO - Processing (depth 0): AC.zip
2025-09-24 11:01:29,538 - INFO - Extracted: 92cb3342_AC_Dom12.txt
2025-09-24 11:01:29,559 - INFO - Extracted: 92cb3342_AC_FAMI12.TXT
2025-09-24 11:01:29,639 - INFO - Extracted: 92cb3342_AC_Pes12.txt
2025-09-24 11:01:29,650 - INFO - Processing (depth 0): AM.zip
2025-09-24 11:01:31,106 - INFO - Extracted: 4a72eb4f_AM_Dom13.txt
2025-09-24 11:01:31,133 - INFO - Extracted: 4a72eb4f_AM_FAMI13.TXT
2025-09-24 11:01:31,438 - INFO - Extracted: 4a72eb4f_AM_Pes13.txt
2025-09-24 11:01:31,448 - INFO - Processing (depth 0): RR.zip
2025-09-24 11:01:31,733 - INFO - Extracted: 0a13c222_RR_Dom14.txt
2025-09-24 11:01:31,760 - INFO - Extracted: 0a13c222_RR_FAMI14.TXT
2025-09-24 11:01:31,812 - INFO - Extracted: 0a13c222_RR_Pes14.txt
2025-09-24 11:01:31,822 - INFO - Processing (depth 0): PA.zip
2025-09-24 11:01:34,980 - INFO - Extracted: f5e1a42c_PA_Dom15.txt
2025-09-24 11:01:35,036 - INFO - Extracted: f5e1a42c_PA_FAMI15.TXT
2025-09-24 11:01:35,646 - INFO - Extracted: f5e1a42c_PA_Pes15.txt
2025-09-24 11:01:35,653 - INFO - Processing (depth 0): AP.zip
2025-09-24 11:01:35,971 - INFO - Extracted: ff361082_AP_Dom16.txt
2025-09-24 11:01:35,986 - INFO - Extracted: ff361082_AP_FAMI16.TXT
2025-09-24 11:01:36,034 - INFO - Extracted: ff361082_AP_Pes16.txt
2025-09-24 11:01:36,043 - INFO - Processing (depth 0): TO.zip
2025-09-24 11:01:36,739 - INFO - Extracted: 1094d344_TO_Dom17.txt
2025-09-24 11:01:36,763 - INFO - Extracted: 1094d344_TO_FAMI17.TXT
2025-09-24 11:01:36,907 - INFO - Extracted: 1094d344_TO_Pes17.txt
2025-09-24 11:01:36,917 - INFO - Processing (depth 0): MA.zip
2025-09-24 11:01:39,352 - INFO - Extracted: dda94a02_MA_DOM21.txt
2025-09-24 11:01:39,391 - INFO - Extracted: dda94a02_MA_FAMI21.TXT
2025-09-24 11:01:39,919 - INFO - Extracted: dda94a02_MA_PES21.txt
2025-09-24 11:01:39,924 - INFO - Processing (depth 0): PI.zip
2025-09-24 11:01:43,346 - INFO - Extracted: 97ba12f7_PI_DOM22.txt
2025-09-24 11:01:43,567 - INFO - Extracted: 97ba12f7_PI_FAMI22.TXT
2025-09-24 11:01:46,435 - INFO - Extracted: 97ba12f7_PI_PES22.txt
2025-09-24 11:01:46,468 - INFO - Processing (depth 0): CE.zip
2025-09-24 11:01:54,325 - INFO - Extracted: c45c6627_CE_DOM23.txt
2025-09-24 11:01:54,704 - INFO - Extracted: c45c6627_CE_FAMI23.TXT
2025-09-24 11:02:06,212 - INFO - Extracted: c45c6627_CE_PES23.txt
2025-09-24 11:02:06,255 - INFO - Processing (depth 0): RN.zip
2025-09-24 11:02:15,917 - INFO - Extracted: 51cb50d2_RN_DOM24.txt
2025-09-24 11:02:15,974 - INFO - Extracted: 51cb50d2_RN_DOM25.txt
2025-09-24 11:02:16,015 - INFO - Extracted: 51cb50d2_RN_FAMI24.TXT
2025-09-24 11:02:16,062 - INFO - Extracted: 51cb50d2_RN_FAMI25.TXT
2025-09-24 11:02:16,389 - INFO - Extracted: 51cb50d2_RN_PES24.txt
2025-09-24 11:02:16,822 - INFO - Extracted: 51cb50d2_RN_PES25.txt
2025-09-24 11:02:16,842 - INFO - Processing (depth 0): PB.zip
2025-09-24 11:02:19,063 - INFO - Extracted: 8b434ba3_PB_DOM25.txt
2025-09-24 11:02:19,101 - INFO - Extracted: 8b434ba3_PB_FAMI25.TXT
2025-09-24 11:02:19,617 - INFO - Extracted: 8b434ba3_PB_PES25.txt
2025-09-24 11:02:19,627 - INFO - Processing (depth 0): PE.zip
2025-09-24 11:02:23,676 - INFO - Extracted: 8e71e8ff_PE_DOM26.txt
2025-09-24 11:02:23,764 - INFO - Extracted: 8e71e8ff_PE_FAMI26.TXT
2025-09-24 11:02:24,526 - INFO - Extracted: 8e71e8ff_PE_PES26.txt
2025-09-24 11:02:24,555 - INFO - Processing (depth 0): AL.zip
2025-09-24 11:02:25,827 - INFO - Extracted: a235cd88_AL_DOM27.txt
2025-09-24 11:02:25,858 - INFO - Extracted: a235cd88_AL_FAMI27.TXT
2025-09-24 11:02:27,027 - INFO - Extracted: a235cd88_AL_PES27.txt
2025-09-24 11:02:27,033 - INFO - Processing (depth 0): SE.zip
2025-09-24 11:02:28,588 - INFO - Extracted: 845ee5f6_SE_DOM28.txt
2025-09-24 11:02:28,696 - INFO - Extracted: 845ee5f6_SE_FAMI28.TXT
2025-09-24 11:02:30,004 - INFO - Extracted: 845ee5f6_SE_PES28.txt
2025-09-24 11:02:30,016 - INFO - Processing (depth 0): BA.zip
2025-09-24 11:02:33,217 - INFO - Extracted: 2c0725ad_BA_DOM29.txt
2025-09-24 11:02:33,231 - INFO - Processing (depth 1): FAMI29.zip
2025-09-24 11:02:34,351 - INFO - Extracted: a5fa9632_FAMI29.TXT
2025-09-24 11:02:34,406 - INFO - Processing (depth 1): PES29.zip
2025-09-24 11:02:46,574 - INFO - Extracted: 444ff497_pes29.txt
2025-09-24 11:02:46,616 - INFO - Processing (depth 0): MG.zip
2025-09-24 11:03:17,004 - INFO - Extracted: 6c0bd45b_MG_Dom31.txt
2025-09-24 11:03:17,421 - INFO - Extracted: 6c0bd45b_MG_FAMI31.TXT
2025-09-24 11:03:19,787 - INFO - Extracted: 6c0bd45b_MG_Pes31.txt
2025-09-24 11:03:19,795 - INFO - Processing (depth 0): ES.zip
2025-09-24 11:03:21,509 - INFO - Extracted: 9375f986_ES_Dom32.txt
2025-09-24 11:03:21,540 - INFO - Extracted: 9375f986_ES_FAMI32.TXT
2025-09-24 11:03:21,849 - INFO - Extracted: 9375f986_ES_Pes32.txt
2025-09-24 11:03:21,857 - INFO - Processing (depth 0): RJ.zip
2025-09-24 11:03:31,233 - INFO - Extracted: 9e381c4d_RJ_Dom33.txt
2025-09-24 11:03:32,158 - INFO - Extracted: 9e381c4d_RJ_FAMI33.TXT
2025-09-24 11:03:42,695 - INFO - Extracted: 9e381c4d_RJ_Pes33.txt
2025-09-24 11:03:42,733 - INFO - Processing (depth 0): PR.zip
2025-09-24 11:03:54,203 - INFO - Extracted: faacb5e1_PR_DOM41.txt
2025-09-24 11:03:54,950 - INFO - Extracted: faacb5e1_PR_FAMI41.TXT
2025-09-24 11:04:02,708 - INFO - Extracted: faacb5e1_PR_PES41.txt
2025-09-24 11:04:02,731 - INFO - Processing (depth 0): SC.zip
2025-09-24 11:04:09,060 - INFO - Extracted: c9e3f561_SC_DOM42.txt
2025-09-24 11:04:09,526 - INFO - Extracted: c9e3f561_SC_FAMI42.TXT
2025-09-24 11:04:14,074 - INFO - Extracted: c9e3f561_SC_PES42.txt
2025-09-24 11:04:14,100 - INFO - Processing (depth 0): RS.zip
2025-09-24 11:04:26,180 - INFO - Extracted: 46447151_RS_DOM43.txt
2025-09-24 11:04:26,910 - INFO - Extracted: 46447151_RS_FAMI43.TXT
2025-09-24 11:04:35,068 - INFO - Extracted: 46447151_RS_PES43.txt
2025-09-24 11:04:35,092 - INFO - Processing (depth 0): MS.zip
2025-09-24 11:04:37,618 - INFO - Extracted: bd6e8822_MS_DOM50.txt
2025-09-24 11:04:37,657 - INFO - Extracted: bd6e8822_MS_FAMI50.TXT
2025-09-24 11:04:37,940 - INFO - Extracted: bd6e8822_MS_PES50.txt
2025-09-24 11:04:37,949 - INFO - Processing (depth 0): MT.zip
2025-09-24 11:04:39,347 - INFO - Extracted: 5c11ab43_MT_DOM51.txt
2025-09-24 11:04:39,377 - INFO - Extracted: 5c11ab43_MT_FAMI51.TXT
2025-09-24 11:04:39,715 - INFO - Extracted: 5c11ab43_MT_PES51.txt
2025-09-24 11:04:39,722 - INFO - Processing (depth 0): GO.zip
2025-09-24 11:04:42,426 - INFO - Extracted: b7ccb3ac_GO_DOM52.txt
2025-09-24 11:04:42,483 - INFO - Extracted: b7ccb3ac_GO_FAMI52.TXT
2025-09-24 11:04:43,013 - INFO - Extracted: b7ccb3ac_GO_PES52.txt
2025-09-24 11:04:43,020 - INFO - Processing (depth 0): DF.zip
2025-09-24 11:04:44,336 - INFO - Extracted: 326f1f07_DF_DOM53.txt
2025-09-24 11:04:44,361 - INFO - Extracted: 326f1f07_DF_FAMI53.TXT
2025-09-24 11:04:44,524 - INFO - Extracted: 326f1f07_DF_PES53.txt
2025-09-24 11:04:44,530 - INFO - Processing (depth 0): SP.zip
2025-09-24 11:05:01,711 - INFO - Extracted: 3a9a479a_SP_Dom35.txt
2025-09-24 11:05:03,572 - INFO - Extracted: 3a9a479a_SP_FAMI35.TXT
2025-09-24 11:05:33,736 - INFO - Extracted: 3a9a479a_SP_Pes35.txt
Processing files:  57%|█████▋    | 48/84 [03:55<01:41,  2.82s/it]2025-09-24 11:10:21,723 - ERROR - Error reading ../../../../input/IBGE_2010\444ff497_pes29.txt: Unable to allocate 103. MiB for an array with shape (177739, 76) and data type object
2025-09-24 11:10:22,633 - WARNING - Error processing ../../../../input/IBGE_2010\444ff497_pes29.txt: Error reading file: Unable to allocate 103. MiB for an array with shape (177739, 76) and data type object
Processing files:  58%|█████▊    | 49/84 [04:55<11:39, 19.98s/it]2025-09-24 11:10:34,623 - ERROR - Error reading ../../../../input/IBGE_2010\2c0725ad_BA_DOM29.txt: 
2025-09-24 11:10:34,624 - WARNING - Error processing ../../../../input/IBGE_2010\2c0725ad_BA_DOM29.txt: Error reading file: 
Processing files:  60%|█████▉    | 50/84 [05:00<08:42, 15.36s/it]2025-09-24 11:10:38,127 - ERROR - Error reading ../../../../input/IBGE_2010\a5fa9632_FAMI29.TXT: 
2025-09-24 11:10:38,525 - WARNING - Error processing ../../../../input/IBGE_2010\a5fa9632_FAMI29.TXT: Error reading file: 
Processing files:  61%|██████    | 51/84 [05:04<06:35, 11.99s/it]2025-09-24 11:10:44,933 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_Dom31.txt: 
2025-09-24 11:10:44,936 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_Dom31.txt: Error reading file: 
Processing files:  62%|██████▏   | 52/84 [05:10<05:29, 10.31s/it]2025-09-24 11:10:59,012 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_Pes31.txt: 
2025-09-24 11:11:03,054 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_Pes31.txt: Error reading file: 
Processing files:  63%|██████▎   | 53/84 [05:28<06:30, 12.59s/it]2025-09-24 11:11:06,153 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_FAMI31.TXT: 
2025-09-24 11:11:06,156 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_FAMI31.TXT: Error reading file: 
Processing files:  64%|██████▍   | 54/84 [05:31<04:52,  9.74s/it]2025-09-24 11:11:08,280 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_FAMI32.TXT: 
2025-09-24 11:11:08,282 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_FAMI32.TXT: Error reading file: 
Processing files:  65%|██████▌   | 55/84 [05:33<03:36,  7.46s/it]2025-09-24 11:11:10,395 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_Dom32.txt: 
2025-09-24 11:11:10,396 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_Dom32.txt: Error reading file: 
Processing files:  67%|██████▋   | 56/84 [05:36<02:43,  5.86s/it]2025-09-24 11:11:14,564 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_Pes32.txt: 
2025-09-24 11:11:14,569 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_Pes32.txt: Error reading file: 
Processing files:  68%|██████▊   | 57/84 [05:40<02:24,  5.35s/it]2025-09-24 11:11:16,669 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_FAMI33.TXT: 
2025-09-24 11:11:16,671 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_FAMI33.TXT: Error reading file: 
Processing files:  69%|██████▉   | 58/84 [05:42<01:53,  4.38s/it]2025-09-24 11:11:18,697 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_Pes33.txt: [Errno 22] Invalid argument
2025-09-24 11:11:18,698 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_Pes33.txt: Error reading file: [Errno 22] Invalid argument
Processing files:  70%|███████   | 59/84 [05:44<01:31,  3.67s/it]2025-09-24 11:11:21,886 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_Dom33.txt: 
2025-09-24 11:11:21,889 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_Dom33.txt: Error reading file: 
Processing files:  71%|███████▏  | 60/84 [05:47<01:24,  3.53s/it]2025-09-24 11:11:24,996 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_DOM41.txt: 
2025-09-24 11:11:24,999 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_DOM41.txt: Error reading file: 
Processing files:  73%|███████▎  | 61/84 [05:50<01:18,  3.40s/it]2025-09-24 11:11:30,150 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_PES41.txt: 
2025-09-24 11:11:32,175 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_PES41.txt: Error reading file: 
Processing files:  74%|███████▍  | 62/84 [06:05<02:33,  6.96s/it]2025-09-24 11:11:48,426 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_FAMI41.TXT: 
2025-09-24 11:11:48,430 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_FAMI41.TXT: Error reading file: 
Processing files:  75%|███████▌  | 63/84 [06:14<02:33,  7.32s/it]2025-09-24 11:11:52,574 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_DOM42.txt: 
2025-09-24 11:11:52,576 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_DOM42.txt: Error reading file: 
Processing files:  76%|███████▌  | 64/84 [06:18<02:07,  6.37s/it]2025-09-24 11:11:54,690 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_FAMI42.TXT: 
2025-09-24 11:11:54,693 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_FAMI42.TXT: Error reading file: 
Processing files:  77%|███████▋  | 65/84 [06:20<01:36,  5.09s/it]2025-09-24 11:12:01,840 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_PES42.txt: 
2025-09-24 11:12:05,897 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_PES42.txt: Error reading file: 
Processing files:  79%|███████▊  | 66/84 [06:31<02:04,  6.93s/it]2025-09-24 11:12:10,062 - ERROR - Error reading ../../../../input/IBGE_2010\46447151_RS_DOM43.txt: 
2025-09-24 11:12:10,070 - WARNING - Error processing ../../../../input/IBGE_2010\46447151_RS_DOM43.txt: Error reading file: 
Processing files:  80%|███████▉  | 67/84 [06:35<01:43,  6.10s/it]2025-09-24 11:12:14,216 - ERROR - Error reading ../../../../input/IBGE_2010\46447151_RS_PES43.txt: 
2025-09-24 11:12:24,324 - WARNING - Error processing ../../../../input/IBGE_2010\46447151_RS_PES43.txt: Error reading file: 
Processing files: 100%|██████████| 84/84 [10:51<00:00,  7.76s/it]
2025-09-24 11:16:26,169 - INFO - Successfully processed 64/84 files
2025-09-24 11:16:26,184 - INFO - Extraction completed successfully.
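
The log above ends with 64 of 84 files processed, so some state files never reach the harmonization step. The sketch below is one hedged way to tally which files failed, by parsing a few lines copied from the log with Python's re module. The assumption that the two digits before the file extension are the state code (for example 29 for BA) comes from the file names in the log, not from the library documentation.

```python
import re
from collections import Counter

# A few lines copied from the extraction log (raw string keeps backslashes).
sample_log = r"""
2025-09-24 11:10:21,723 - ERROR - Error reading ../../../../input/IBGE_2010\444ff497_pes29.txt: Unable to allocate 103. MiB for an array with shape (177739, 76) and data type object
2025-09-24 11:10:34,623 - ERROR - Error reading ../../../../input/IBGE_2010\2c0725ad_BA_DOM29.txt:
2025-09-24 11:10:38,127 - ERROR - Error reading ../../../../input/IBGE_2010\a5fa9632_FAMI29.TXT:
2025-09-24 11:10:44,933 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_Dom31.txt:
2025-09-24 11:16:26,169 - INFO - Successfully processed 64/84 files
"""

# Capture the two digits right before the .txt/.TXT extension of failed files.
failed_pattern = re.compile(r"ERROR - Error reading .*?(\d{2})\.txt:", re.IGNORECASE)
failed_by_state_code = Counter(failed_pattern.findall(sample_log))

# Read the final summary line to get the overall number of failures.
summary = re.search(r"Successfully processed (\d+)/(\d+) files", sample_log)
processed, total = int(summary.group(1)), int(summary.group(2))
n_failed = total - processed

print(dict(failed_by_state_code))
print(f"{n_failed} of {total} files were not processed")
```

Running the same pattern over the full log would list every state code with missing files, which is worth knowing before interpreting any per-state totals computed later.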

3. Harmonize the data#

Use the Harmonizer class from the socio4health library to harmonize the data. First, set the similarity threshold to 0.9, meaning that only variables with a similarity score of 0.9 or higher will be considered for harmonization. Next, use the s4h_vertical_merge method to merge the dataframes vertically.

har = Harmonizer()
har.similarity_threshold = 0.9
dfs = har.s4h_vertical_merge(bra_Censo_2010)
Grouping DataFrames: 100%|██████████| 64/64 [00:01<00:00, 47.99it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00,  1.06it/s]
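
To give an intuition for what a similarity threshold of 0.9 means, the sketch below groups tables by the Jaccard similarity of their column names. This is an illustrative assumption only: the similarity measure that Harmonizer actually uses is not described here, and the function names jaccard and group_by_similarity are hypothetical.

```python
def jaccard(cols_a, cols_b):
    """Share of column names that two tables have in common."""
    a, b = set(cols_a), set(cols_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_by_similarity(column_sets, threshold=0.9):
    """Greedy grouping: a table joins the first group whose first member
    is at least `threshold` similar, otherwise it starts a new group."""
    groups = []  # each group is a list of indices into column_sets
    for i, cols in enumerate(column_sets):
        for group in groups:
            if jaccard(cols, column_sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Toy tables: the first two share all columns, the third only two of five.
tables = [
    ["V0001", "V0011", "V0208", "V0209"],
    ["V0001", "V0011", "V0208", "V0209"],
    ["V0001", "V0011", "V2011"],
]
groups = group_by_similarity(tables, threshold=0.9)
print(groups)
```

Under this reading, tables that share the same column layout collapse into one group, which would be consistent with the single merged group reported in the output above.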

After merging the dataframes, set the dictionary and the categories of interest. In this case, we are interested in the "Housing" category. Then, use the s4h_data_selector method to filter the dataframes based on the dictionary, the categories, and a key column (here 'V0001', the state code). The method returns a list of filtered dataframes. Because only the key column is set here, the warning in the output below indicates that rows are not filtered, so only the column selection applies.

har.dict_df = dic
har.categories = ["Housing"]
har.key_col = 'V0001'
filtered_ddfs = har.s4h_data_selector(dfs)
2025-09-24 11:53:53,620 - WARNING - key_col or key_val not defined, row-wise size will not be reduced
len(filtered_ddfs)
1
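
The column selection step can be pictured as a lookup in the classified dictionary. The sketch below is a simplified stand-in, not the s4h_data_selector implementation: the function select_columns is hypothetical, the categories are copied from the classified dictionary shown earlier, and keeping the key column regardless of its category is an assumption.

```python
# A few rows from the classified dictionary (variable name and category).
toy_dictionary = [
    {"variable_name": "V0402", "category": "Housing"},
    {"variable_name": "V0209", "category": "Housing"},
    {"variable_name": "V0208", "category": "Business"},
    {"variable_name": "V0214", "category": "Identification"},
    {"variable_name": "V0001", "category": "Business"},
]

def select_columns(dictionary_rows, categories, key_col=None):
    """Return variables whose category is wanted, plus the key column."""
    wanted = set(categories)
    selected = [
        row["variable_name"]
        for row in dictionary_rows
        if row["category"] in wanted
    ]
    # Assumption: the key column is kept even if its category is not selected.
    if key_col is not None and key_col not in selected:
        selected.append(key_col)
    return selected

columns = select_columns(toy_dictionary, ["Housing"], key_col="V0001")
print(columns)
```

Note that V0208 (form of water supply) was labelled "Business" by the classifier, so a selection on "Housing" alone would miss it. Reviewing the category column before filtering is therefore worthwhile.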
# Preview the first rows only. Calling .compute() here would materialize the
# full census in memory; in the original run it raised a MemoryError.
filtered_ddfs[0].head()

Finally, we can perform some analysis on the harmonized data. In this case, we will calculate the total population by state (V0001) using the variable V0401, which represents the total population in each census tract. We will then create a horizontal bar plot to visualize the population distribution across states using matplotlib.

ddf = filtered_ddfs[0][["V0001", "V0401"]]

ddf = ddf.assign(
    V0001 = ddf["V0001"].astype("category"),
    V0401 = dd.to_numeric(ddf["V0401"], errors="coerce").astype("float64").fillna(0.0)
).categorize(columns=["V0001"])

pop = ddf.groupby("V0001")["V0401"].sum(split_out=8).compute()
# Look up the state codes and names for V0001 in the dictionary
row = dic.loc[dic["variable_name"] == "V0001", ["value", "possible_answers"]].iloc[0]

# Split the semicolon-separated codes and labels into lists
vals = [s for s in re.split(r"\s*;\s*", str(row["value"]).strip(" ;")) if s]
labs = [s for s in re.split(r"\s*;\s*", str(row["possible_answers"]).strip(" ;")) if s]

# Cast the codes to the same type as the index of the aggregated series
idx = pop.index
if pd.api.types.is_integer_dtype(idx):
    keys = [int(float(v)) for v in vals]
elif pd.api.types.is_float_dtype(idx):
    keys = [float(v) for v in vals]
else:
    keys = [str(int(float(v))) for v in vals]

# Fail early if codes and names do not line up one-to-one
if len(keys) != len(labs):
    raise ValueError(f"Misalignment: {len(keys)} codes vs {len(labs)} names")
code2name = dict(zip(keys, labs))

pop_named = pop.rename(index=code2name)
top = pop_named.sort_values()
top_titled = top.copy()
top_titled.index = [str(s).title() for s in top.index]

fig, ax = plt.subplots(figsize=(11,7), dpi=130)
ax.barh(top_titled.index, top_titled.values)

ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title("Population by State", pad=10)
ax.set_xlabel("Population")
ax.set_ylabel("State")
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, p: f"{x/1e6:.1f} M"))

total = pop_named.sum()
for i, v in enumerate(top_titled.values):
    ax.text(v, i, f"{v/1e6:.1f} M  ({v/total:.1%})", va="center", ha="left", fontsize=9)

ax.grid(axis="x", linestyle="--", alpha=0.3)
plt.margins(x=0.03)
plt.tight_layout()
plt.show()
[Figure: horizontal bar chart titled "Population by State", with Population on the x-axis and State on the y-axis]
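
The code-to-name lookup used for the plot labels can be tried in isolation, without the census data. The sketch below repeats the same splitting logic on short strings shaped like the value and possible_answers cells of V0001; the helper names split_semicolon_list and build_code_lookup are hypothetical, and the three codes are just the first entries of that row.

```python
import re

def split_semicolon_list(text):
    """Split a 'a; b; c' string into clean, non-empty parts."""
    return [part for part in re.split(r"\s*;\s*", text.strip(" ;")) if part]

def build_code_lookup(values, labels):
    """Pair numeric codes such as '11.0' with their labels."""
    codes = [str(int(float(v))) for v in split_semicolon_list(values)]
    names = split_semicolon_list(labels)
    if len(codes) != len(names):
        raise ValueError(f"Misalignment: {len(codes)} codes vs {len(names)} names")
    return dict(zip(codes, names))

lookup = build_code_lookup("11.0; 12.0; 13.0", "rondônia; acre; amazonas")
print(lookup)
```

Checking that the number of codes matches the number of labels matters because a single stray semicolon in the dictionary would otherwise shift every state name by one position.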