
Hands-on with socio4health: effects of hydrometeorological hazards and urbanization on dengue risk in Brazil
This notebook provides a real-world example of how to use socio4health to retrieve, harmonize and analyze socioeconomic and demographic variables, such as the level of urbanization and access to water supply in Brazil, to recreate the dataset used in the publication Combined effects of hydrometeorological hazards and urbanisation on dengue risk in Brazil: a spatiotemporal modelling study by Lowe et al., published in The Lancet Planetary Health in 2021 (DOI). The study evaluated how the association between hydrometeorological events and dengue risk varies with these variables. This tutorial assumes an intermediate or advanced understanding of Python and data manipulation.
Setting up the environment
To run this notebook, you need to have the following prerequisites:
Python 3.10+
Additionally, you need to install the socio4health and pandas packages, which can be done using pip:
!pip install socio4health pandas -q
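Before continuing, it can help to confirm that the prerequisites are in place. The check below is a minimal sketch using only the standard library; the variable names `python_ok` and `missing` are illustrative and not part of socio4health.

```python
import importlib.util
import sys

# The tutorial states Python 3.10+ as a prerequisite.
python_ok = sys.version_info >= (3, 10)

# find_spec returns None when a top-level package cannot be imported.
missing = [
    name
    for name in ("socio4health", "pandas")
    if importlib.util.find_spec(name) is None
]

print(f"Python version OK: {python_ok}")
print(f"Missing packages: {missing if missing else 'none'}")
```

If any package is reported as missing, rerun the pip command above in the same environment that the notebook kernel uses.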
Import Libraries
The socio4health library provides the Extractor class for data extraction and the Harmonizer class for harmonizing the retrieved data. pandas will be used for data manipulation. Additionally, we will use some utility functions from the socio4health.utils.harmonizer_utils module to standardize and translate the dictionary.
import re
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from socio4health import Extractor
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils, extractor_utils
1. Load and standardize the dictionary
To harmonize the data, provide a dictionary that describes the variables in the dataset. The study retrieved data from the 2010 census, from Instituto Brasileiro de Geografia e Estatística (IBGE). The dictionary for the census data can be found here. Follow the steps in the tutorial “How to Create a Raw Dictionary for Data Harmonization” to create a raw dictionary in Excel format.
This dictionary must be standardized and translated to English. The socio4health.utils.harmonizer_utils module provides utility functions to perform these tasks. Additionally, the socio4health.utils.extractor_utils module provides utility functions to parse fixed-width file (FWF) dictionaries, which is the format used in the IBGE census data.
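# Load the raw dictionary and standardize its layout
raw_dic = pd.read_excel("raw_dictionary_br_2010.xlsx")
dic = harmonizer_utils.s4h_standardize_dict(raw_dic)
# Derive the column names and column positions needed to read the FWF files
colnames, colspecs = extractor_utils.s4h_parse_fwf_dict(dic)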
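To make the role of the column names and column positions concrete, the sketch below reads a tiny fixed-width sample with plain pandas. It assumes that the dictionary positions are 1-based and inclusive, and it converts them to the 0-based, half-open spans that `pandas.read_fwf` expects. The helper `to_colspecs`, the toy dictionary, and the sample records are hypothetical; the actual output of `s4h_parse_fwf_dict` may be structured differently.

```python
import io

import pandas as pd

# Toy dictionary with 1-based, inclusive positions (assumed convention).
toy_dic = pd.DataFrame(
    {
        "variable_name": ["V0001", "V0011", "V0402"],
        "initial_position": [1, 3, 16],
        "final_position": [2, 15, 16],
    }
)


def to_colspecs(dictionary):
    """Convert 1-based inclusive positions to 0-based half-open spans."""
    return [
        (int(start) - 1, int(end))
        for start, end in zip(
            dictionary["initial_position"], dictionary["final_position"]
        )
    ]


toy_colnames = toy_dic["variable_name"].tolist()
toy_colspecs = to_colspecs(toy_dic)

# Two made-up records: state (2 chars), weighting area (13 chars), one flag.
raw = "1111000010010011\n1211000010010022\n"
toy_df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=toy_colspecs,
    names=toy_colnames,
    header=None,
    dtype=str,  # keep codes as text so leading zeros survive
)
print(toy_df)
```

Reading every field as text avoids silently dropping leading zeros in identifier columns; numeric conversion can be applied later to the columns that need it.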
This is how the standardized dictionary looks:
dic
index | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers |
---|---|---|---|---|---|---|---|---|---|---|
0 | V0402 | a responsabilidade pelo domicílio é de: | NaN | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | NaN | C | apenas um morador; mais de um morador; ignorado |
1 | V0209 | abastecimento de água, canalização: | NaN | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | NaN | C | sim, em pelo menos um cômodo; sim, só na propr... |
2 | V0208 | abastecimento de água, forma: | NaN | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | NaN | C | rede geral de distribuição; poço ou nascente n... |
3 | V6210 | adequação da moradia | NaN | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | NaN | C | adequada; semi-adequada; inadequada |
4 | V0301 | alguma pessoa que morava com você(s) estava mo... | NaN | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | NaN | C | sim; não |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71 | V0214 | televisão, existência: | NaN | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | NaN | C | sim; não |
72 | V4002 | tipo de espécie: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | NaN | C\n | casa; casa de vila ou em condomínio; apartamen... |
73 | V0001 | unidade da federação: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | NaN | A | rondônia; acre; amazonas; roraima; pará; amapá... |
74 | V2011 | valor do aluguel (em reais) | NaN | NaN | 59.0 | 64.0 | 6.0 | NaN | N | NaN |
75 | V0011 | área de ponderação | NaN | NaN | 8.0 | 20.0 | 13.0 | NaN | A | NaN |
76 rows × 10 columns
Next, the question, description and possible_answers columns are translated to English, and each row is classified into a category. The classification model used in this tutorial is a BERT model fine-tuned for the task of classifying survey questions into categories. You can use your own model by providing its path in the MODEL_PATH parameter of the harmonizer_utils.s4h_classify_rows function.
dic = harmonizer_utils.s4h_translate_column(dic, "question", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "description", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "possible_answers", language="en")
dic = harmonizer_utils.s4h_classify_rows(dic, "question_en", "description_en", "possible_answers_en",
                                         new_column_name="category",
                                         MODEL_PATH="files/bert_finetuned_classifier")
dic
question translated
description translated
possible_answers translated
Device set to use cpu
index | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers | question_en | description_en | possible_answers_en | category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | V0402 | a responsabilidade pelo domicílio é de: | NaN | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | NaN | C | apenas um morador; mais de um morador; ignorado | The responsibility for the home is: | NaN | just a resident; more than one resident; ignored | Housing |
1 | V0209 | abastecimento de água, canalização: | NaN | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | NaN | C | sim, em pelo menos um cômodo; sim, só na propr... | water supply, channeling: | NaN | Yes, in at least one room; Yes, only on the pr... | Housing |
2 | V0208 | abastecimento de água, forma: | NaN | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | NaN | C | rede geral de distribuição; poço ou nascente n... | water supply, form: | NaN | General Distribution Network; well or source o... | Business |
3 | V6210 | adequação da moradia | NaN | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | NaN | C | adequada; semi-adequada; inadequada | Housing Adequacy | NaN | adequate; semi-adherence; inadequate | Housing |
4 | V0301 | alguma pessoa que morava com você(s) estava mo... | NaN | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | NaN | C | sim; não | Someone who lived with you (s) was living in a... | NaN | Yes; no | Business |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71 | V0214 | televisão, existência: | NaN | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | NaN | C | sim; não | television, existence: | NaN | Yes; no | Identification |
72 | V4002 | tipo de espécie: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | NaN | C\n | casa; casa de vila ou em condomínio; apartamen... | Type of species: | NaN | home; village house or condominium; apartment;... | Housing |
73 | V0001 | unidade da federação: | NaN | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | NaN | A | rondônia; acre; amazonas; roraima; pará; amapá... | Federation unit: | NaN | Rondônia; acre; Amazonas; Roraima; to; Amapá; ... | Business |
74 | V2011 | valor do aluguel (em reais) | NaN | NaN | 59.0 | 64.0 | 6.0 | NaN | N | NaN | Rental value (in reais) | NaN | NaN | Business |
75 | V0011 | área de ponderação | NaN | NaN | 8.0 | 20.0 | 13.0 | NaN | A | NaN | weighing area | NaN | NaN | Identification |
76 rows × 14 columns
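Because the categories come from an automatic classifier, it is worth reviewing them before filtering on them. In the output above, for example, V0208 (water supply, form) is labelled "Business". The sketch below shows one way to inspect the label distribution and apply manual corrections with plain pandas. The toy table and the `overrides` mapping are illustrative assumptions, not part of socio4health.

```python
import pandas as pd

# Toy excerpt that mirrors three rows of the classified dictionary above.
toy_classified = pd.DataFrame(
    {
        "variable_name": ["V0209", "V0208", "V6210"],
        "question_en": [
            "water supply, channeling:",
            "water supply, form:",
            "Housing Adequacy",
        ],
        "category": ["Housing", "Business", "Housing"],
    }
)

# How many variables fall into each category?
category_counts = toy_classified["category"].value_counts()
print(category_counts)

# Hypothetical manual corrections, keyed by variable name.
overrides = {"V0208": "Housing"}
toy_classified["category"] = (
    toy_classified["variable_name"]
    .map(overrides)
    .fillna(toy_classified["category"])
)
print(toy_classified[["variable_name", "category"]])
```

The same pattern can be applied to the full dictionary before it is passed to the Harmonizer, so that variables of interest are not dropped by a misclassification.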
2. Extract data from Brazil Census 2010
To extract data, use the Extractor class from the socio4health library. As in the publication, extract the Brazil Census 2010 dataset from the Brazilian Institute of Geography and Statistics (IBGE) website or from a local copy. The dataset is available here.
The Extractor class requires the following parameters:
input_path: The URL or local path to the data source.
down_ext: A list of file extensions to download. This can include .txt, .zip, etc.
output_path: The local path where the extracted data will be saved.
key_words: A list of keywords to filter the files to be downloaded. In this case, a regular expression selects only the .zip files whose names consist entirely of uppercase letters, such as RO.zip.
depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.
is_fwf: A boolean indicating whether the files are in fixed-width format (FWF). The census files are in FWF format, so this parameter is set to True.
colnames: A list of column names for the FWF files, extracted from the standardized dictionary.
colspecs: A list of tuples indicating the start and end positions of each column in the FWF files, extracted from the standardized dictionary.
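The effect of the regular expression passed in key_words can be checked in isolation with the standard re module. The sketch below assumes the pattern is matched against the bare file name; how the Extractor applies it internally is not shown in this tutorial. The candidate file names are made up for illustration.

```python
import re

# Same pattern as in key_words, written as a raw string.
state_zip = re.compile(r"^[A-Z]+\.zip$")

# Made-up file names to illustrate what the pattern accepts and rejects.
candidates = [
    "RO.zip",
    "SP.zip",
    "Documentacao.zip",
    "sp.zip",
    "RJ.txt",
    "FAMI29.zip",
]

selected = [name for name in candidates if state_zip.match(name)]
print(selected)
```

Only names made up entirely of uppercase letters followed by .zip pass the filter, which is how the 27 state archives are picked out from the other files on the page.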
bra_online_extractor = Extractor(
    input_path="https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados",
    down_ext=['.txt', '.zip'],
    output_path="../../../../input/IBGE_2010",
    key_words=[r"^[A-Z]+\.zip$"],  # raw string avoids an invalid escape sequence warning
    depth=0, is_fwf=True, colnames=colnames, colspecs=colspecs)
bra_Censo_2010 = bra_online_extractor.s4h_extract()
2025-09-24 10:49:42,531 - INFO - ----------------------
2025-09-24 10:49:42,533 - INFO - Starting data extraction...
2025-09-24 10:49:42,534 - INFO - Extracting data in online mode...
2025-09-24 10:49:42,535 - INFO - Scraping URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados with depth 0
2025-09-24 10:49:54,140 - INFO - Spider completed successfully for URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados
2025-09-24 10:49:54,147 - INFO - Downloading files to: ../../../../input/IBGE_2010
Downloading files: 0%| | 0/27 [00:00<?, ?it/s]2025-09-24 10:50:01,707 - INFO - Successfully downloaded: RO.zip
Downloading files: 4%|▎ | 1/27 [00:07<03:16, 7.54s/it]2025-09-24 10:50:04,214 - INFO - Successfully downloaded: AC.zip
Downloading files: 7%|▋ | 2/27 [00:10<01:54, 4.58s/it]2025-09-24 10:50:08,911 - INFO - Successfully downloaded: AM.zip
Downloading files: 11%|█ | 3/27 [00:14<01:51, 4.63s/it]2025-09-24 10:50:11,500 - INFO - Successfully downloaded: RR.zip
Downloading files: 15%|█▍ | 4/27 [00:17<01:28, 3.83s/it]2025-09-24 10:50:31,403 - INFO - Successfully downloaded: PA.zip
Downloading files: 19%|█▊ | 5/27 [00:37<03:31, 9.62s/it]2025-09-24 10:50:34,357 - INFO - Successfully downloaded: AP.zip
Downloading files: 22%|██▏ | 6/27 [00:40<02:34, 7.36s/it]2025-09-24 10:50:39,845 - INFO - Successfully downloaded: TO.zip
Downloading files: 26%|██▌ | 7/27 [00:45<02:14, 6.75s/it]2025-09-24 10:50:55,853 - INFO - Successfully downloaded: MA.zip
Downloading files: 30%|██▉ | 8/27 [01:01<03:04, 9.69s/it]2025-09-24 10:51:07,970 - INFO - Successfully downloaded: PI.zip
Downloading files: 33%|███▎ | 9/27 [01:13<03:08, 10.45s/it]2025-09-24 10:51:31,612 - INFO - Successfully downloaded: CE.zip
Downloading files: 37%|███▋ | 10/27 [01:37<04:06, 14.52s/it]2025-09-24 10:52:07,354 - INFO - Successfully downloaded: RN.zip
Downloading files: 41%|████ | 11/27 [02:13<05:36, 21.02s/it]2025-09-24 10:52:12,916 - INFO - Successfully downloaded: PB.zip
Downloading files: 44%|████▍ | 12/27 [02:18<04:04, 16.32s/it]2025-09-24 10:52:37,459 - INFO - Successfully downloaded: PE.zip
Downloading files: 48%|████▊ | 13/27 [02:43<04:23, 18.81s/it]2025-09-24 10:52:45,408 - INFO - Successfully downloaded: AL.zip
Downloading files: 52%|█████▏ | 14/27 [02:51<03:21, 15.53s/it]2025-09-24 10:52:52,982 - INFO - Successfully downloaded: SE.zip
Downloading files: 56%|█████▌ | 15/27 [02:58<02:37, 13.13s/it]2025-09-24 10:53:40,088 - INFO - Successfully downloaded: BA.zip
Downloading files: 59%|█████▉ | 16/27 [03:45<04:16, 23.36s/it]2025-09-24 10:55:23,631 - INFO - Successfully downloaded: MG.zip
Downloading files: 63%|██████▎ | 17/27 [05:29<07:54, 47.47s/it]2025-09-24 10:55:33,601 - INFO - Successfully downloaded: ES.zip
Downloading files: 67%|██████▋ | 18/27 [05:39<05:25, 36.20s/it]2025-09-24 10:56:19,971 - INFO - Successfully downloaded: RJ.zip
Downloading files: 70%|███████ | 19/27 [06:25<05:14, 39.26s/it]2025-09-24 10:57:04,413 - INFO - Successfully downloaded: PR.zip
Downloading files: 74%|███████▍ | 20/27 [07:10<04:45, 40.81s/it]2025-09-24 10:57:08,635 - INFO - Successfully downloaded: SC.zip
Downloading files: 78%|███████▊ | 21/27 [07:14<02:58, 29.83s/it]2025-09-24 10:58:15,827 - INFO - Successfully downloaded: RS.zip
Downloading files: 81%|████████▏ | 22/27 [08:21<03:25, 41.04s/it]2025-09-24 10:58:21,445 - INFO - Successfully downloaded: MS.zip
Downloading files: 85%|████████▌ | 23/27 [08:27<02:01, 30.41s/it]2025-09-24 10:58:33,252 - INFO - Successfully downloaded: MT.zip
Downloading files: 89%|████████▉ | 24/27 [08:39<01:14, 24.83s/it]2025-09-24 10:58:56,952 - INFO - Successfully downloaded: GO.zip
Downloading files: 93%|█████████▎| 25/27 [09:02<00:48, 24.49s/it]2025-09-24 10:59:15,593 - INFO - Successfully downloaded: DF.zip
Downloading files: 96%|█████████▋| 26/27 [09:21<00:22, 22.74s/it]2025-09-24 11:01:28,006 - INFO - Successfully downloaded: SP.zip
Downloading files: 100%|██████████| 27/27 [11:33<00:00, 25.70s/it]
2025-09-24 11:01:28,026 - INFO - Processing (depth 0): RO.zip
2025-09-24 11:01:28,916 - INFO - Extracted: a53593a5_RO_Dom11.txt
2025-09-24 11:01:28,943 - INFO - Extracted: a53593a5_RO_FAMI11.TXT
2025-09-24 11:01:29,103 - INFO - Extracted: a53593a5_RO_Pes11.txt
2025-09-24 11:01:29,118 - INFO - Processing (depth 0): AC.zip
2025-09-24 11:01:29,538 - INFO - Extracted: 92cb3342_AC_Dom12.txt
2025-09-24 11:01:29,559 - INFO - Extracted: 92cb3342_AC_FAMI12.TXT
2025-09-24 11:01:29,639 - INFO - Extracted: 92cb3342_AC_Pes12.txt
2025-09-24 11:01:29,650 - INFO - Processing (depth 0): AM.zip
2025-09-24 11:01:31,106 - INFO - Extracted: 4a72eb4f_AM_Dom13.txt
2025-09-24 11:01:31,133 - INFO - Extracted: 4a72eb4f_AM_FAMI13.TXT
2025-09-24 11:01:31,438 - INFO - Extracted: 4a72eb4f_AM_Pes13.txt
2025-09-24 11:01:31,448 - INFO - Processing (depth 0): RR.zip
2025-09-24 11:01:31,733 - INFO - Extracted: 0a13c222_RR_Dom14.txt
2025-09-24 11:01:31,760 - INFO - Extracted: 0a13c222_RR_FAMI14.TXT
2025-09-24 11:01:31,812 - INFO - Extracted: 0a13c222_RR_Pes14.txt
2025-09-24 11:01:31,822 - INFO - Processing (depth 0): PA.zip
2025-09-24 11:01:34,980 - INFO - Extracted: f5e1a42c_PA_Dom15.txt
2025-09-24 11:01:35,036 - INFO - Extracted: f5e1a42c_PA_FAMI15.TXT
2025-09-24 11:01:35,646 - INFO - Extracted: f5e1a42c_PA_Pes15.txt
2025-09-24 11:01:35,653 - INFO - Processing (depth 0): AP.zip
2025-09-24 11:01:35,971 - INFO - Extracted: ff361082_AP_Dom16.txt
2025-09-24 11:01:35,986 - INFO - Extracted: ff361082_AP_FAMI16.TXT
2025-09-24 11:01:36,034 - INFO - Extracted: ff361082_AP_Pes16.txt
2025-09-24 11:01:36,043 - INFO - Processing (depth 0): TO.zip
2025-09-24 11:01:36,739 - INFO - Extracted: 1094d344_TO_Dom17.txt
2025-09-24 11:01:36,763 - INFO - Extracted: 1094d344_TO_FAMI17.TXT
2025-09-24 11:01:36,907 - INFO - Extracted: 1094d344_TO_Pes17.txt
2025-09-24 11:01:36,917 - INFO - Processing (depth 0): MA.zip
2025-09-24 11:01:39,352 - INFO - Extracted: dda94a02_MA_DOM21.txt
2025-09-24 11:01:39,391 - INFO - Extracted: dda94a02_MA_FAMI21.TXT
2025-09-24 11:01:39,919 - INFO - Extracted: dda94a02_MA_PES21.txt
2025-09-24 11:01:39,924 - INFO - Processing (depth 0): PI.zip
2025-09-24 11:01:43,346 - INFO - Extracted: 97ba12f7_PI_DOM22.txt
2025-09-24 11:01:43,567 - INFO - Extracted: 97ba12f7_PI_FAMI22.TXT
2025-09-24 11:01:46,435 - INFO - Extracted: 97ba12f7_PI_PES22.txt
2025-09-24 11:01:46,468 - INFO - Processing (depth 0): CE.zip
2025-09-24 11:01:54,325 - INFO - Extracted: c45c6627_CE_DOM23.txt
2025-09-24 11:01:54,704 - INFO - Extracted: c45c6627_CE_FAMI23.TXT
2025-09-24 11:02:06,212 - INFO - Extracted: c45c6627_CE_PES23.txt
2025-09-24 11:02:06,255 - INFO - Processing (depth 0): RN.zip
2025-09-24 11:02:15,917 - INFO - Extracted: 51cb50d2_RN_DOM24.txt
2025-09-24 11:02:15,974 - INFO - Extracted: 51cb50d2_RN_DOM25.txt
2025-09-24 11:02:16,015 - INFO - Extracted: 51cb50d2_RN_FAMI24.TXT
2025-09-24 11:02:16,062 - INFO - Extracted: 51cb50d2_RN_FAMI25.TXT
2025-09-24 11:02:16,389 - INFO - Extracted: 51cb50d2_RN_PES24.txt
2025-09-24 11:02:16,822 - INFO - Extracted: 51cb50d2_RN_PES25.txt
2025-09-24 11:02:16,842 - INFO - Processing (depth 0): PB.zip
2025-09-24 11:02:19,063 - INFO - Extracted: 8b434ba3_PB_DOM25.txt
2025-09-24 11:02:19,101 - INFO - Extracted: 8b434ba3_PB_FAMI25.TXT
2025-09-24 11:02:19,617 - INFO - Extracted: 8b434ba3_PB_PES25.txt
2025-09-24 11:02:19,627 - INFO - Processing (depth 0): PE.zip
2025-09-24 11:02:23,676 - INFO - Extracted: 8e71e8ff_PE_DOM26.txt
2025-09-24 11:02:23,764 - INFO - Extracted: 8e71e8ff_PE_FAMI26.TXT
2025-09-24 11:02:24,526 - INFO - Extracted: 8e71e8ff_PE_PES26.txt
2025-09-24 11:02:24,555 - INFO - Processing (depth 0): AL.zip
2025-09-24 11:02:25,827 - INFO - Extracted: a235cd88_AL_DOM27.txt
2025-09-24 11:02:25,858 - INFO - Extracted: a235cd88_AL_FAMI27.TXT
2025-09-24 11:02:27,027 - INFO - Extracted: a235cd88_AL_PES27.txt
2025-09-24 11:02:27,033 - INFO - Processing (depth 0): SE.zip
2025-09-24 11:02:28,588 - INFO - Extracted: 845ee5f6_SE_DOM28.txt
2025-09-24 11:02:28,696 - INFO - Extracted: 845ee5f6_SE_FAMI28.TXT
2025-09-24 11:02:30,004 - INFO - Extracted: 845ee5f6_SE_PES28.txt
2025-09-24 11:02:30,016 - INFO - Processing (depth 0): BA.zip
2025-09-24 11:02:33,217 - INFO - Extracted: 2c0725ad_BA_DOM29.txt
2025-09-24 11:02:33,231 - INFO - Processing (depth 1): FAMI29.zip
2025-09-24 11:02:34,351 - INFO - Extracted: a5fa9632_FAMI29.TXT
2025-09-24 11:02:34,406 - INFO - Processing (depth 1): PES29.zip
2025-09-24 11:02:46,574 - INFO - Extracted: 444ff497_pes29.txt
2025-09-24 11:02:46,616 - INFO - Processing (depth 0): MG.zip
2025-09-24 11:03:17,004 - INFO - Extracted: 6c0bd45b_MG_Dom31.txt
2025-09-24 11:03:17,421 - INFO - Extracted: 6c0bd45b_MG_FAMI31.TXT
2025-09-24 11:03:19,787 - INFO - Extracted: 6c0bd45b_MG_Pes31.txt
2025-09-24 11:03:19,795 - INFO - Processing (depth 0): ES.zip
2025-09-24 11:03:21,509 - INFO - Extracted: 9375f986_ES_Dom32.txt
2025-09-24 11:03:21,540 - INFO - Extracted: 9375f986_ES_FAMI32.TXT
2025-09-24 11:03:21,849 - INFO - Extracted: 9375f986_ES_Pes32.txt
2025-09-24 11:03:21,857 - INFO - Processing (depth 0): RJ.zip
2025-09-24 11:03:31,233 - INFO - Extracted: 9e381c4d_RJ_Dom33.txt
2025-09-24 11:03:32,158 - INFO - Extracted: 9e381c4d_RJ_FAMI33.TXT
2025-09-24 11:03:42,695 - INFO - Extracted: 9e381c4d_RJ_Pes33.txt
2025-09-24 11:03:42,733 - INFO - Processing (depth 0): PR.zip
2025-09-24 11:03:54,203 - INFO - Extracted: faacb5e1_PR_DOM41.txt
2025-09-24 11:03:54,950 - INFO - Extracted: faacb5e1_PR_FAMI41.TXT
2025-09-24 11:04:02,708 - INFO - Extracted: faacb5e1_PR_PES41.txt
2025-09-24 11:04:02,731 - INFO - Processing (depth 0): SC.zip
2025-09-24 11:04:09,060 - INFO - Extracted: c9e3f561_SC_DOM42.txt
2025-09-24 11:04:09,526 - INFO - Extracted: c9e3f561_SC_FAMI42.TXT
2025-09-24 11:04:14,074 - INFO - Extracted: c9e3f561_SC_PES42.txt
2025-09-24 11:04:14,100 - INFO - Processing (depth 0): RS.zip
2025-09-24 11:04:26,180 - INFO - Extracted: 46447151_RS_DOM43.txt
2025-09-24 11:04:26,910 - INFO - Extracted: 46447151_RS_FAMI43.TXT
2025-09-24 11:04:35,068 - INFO - Extracted: 46447151_RS_PES43.txt
2025-09-24 11:04:35,092 - INFO - Processing (depth 0): MS.zip
2025-09-24 11:04:37,618 - INFO - Extracted: bd6e8822_MS_DOM50.txt
2025-09-24 11:04:37,657 - INFO - Extracted: bd6e8822_MS_FAMI50.TXT
2025-09-24 11:04:37,940 - INFO - Extracted: bd6e8822_MS_PES50.txt
2025-09-24 11:04:37,949 - INFO - Processing (depth 0): MT.zip
2025-09-24 11:04:39,347 - INFO - Extracted: 5c11ab43_MT_DOM51.txt
2025-09-24 11:04:39,377 - INFO - Extracted: 5c11ab43_MT_FAMI51.TXT
2025-09-24 11:04:39,715 - INFO - Extracted: 5c11ab43_MT_PES51.txt
2025-09-24 11:04:39,722 - INFO - Processing (depth 0): GO.zip
2025-09-24 11:04:42,426 - INFO - Extracted: b7ccb3ac_GO_DOM52.txt
2025-09-24 11:04:42,483 - INFO - Extracted: b7ccb3ac_GO_FAMI52.TXT
2025-09-24 11:04:43,013 - INFO - Extracted: b7ccb3ac_GO_PES52.txt
2025-09-24 11:04:43,020 - INFO - Processing (depth 0): DF.zip
2025-09-24 11:04:44,336 - INFO - Extracted: 326f1f07_DF_DOM53.txt
2025-09-24 11:04:44,361 - INFO - Extracted: 326f1f07_DF_FAMI53.TXT
2025-09-24 11:04:44,524 - INFO - Extracted: 326f1f07_DF_PES53.txt
2025-09-24 11:04:44,530 - INFO - Processing (depth 0): SP.zip
2025-09-24 11:05:01,711 - INFO - Extracted: 3a9a479a_SP_Dom35.txt
2025-09-24 11:05:03,572 - INFO - Extracted: 3a9a479a_SP_FAMI35.TXT
2025-09-24 11:05:33,736 - INFO - Extracted: 3a9a479a_SP_Pes35.txt
Processing files: 57%|█████▋ | 48/84 [03:55<01:41, 2.82s/it]2025-09-24 11:10:21,723 - ERROR - Error reading ../../../../input/IBGE_2010\444ff497_pes29.txt: Unable to allocate 103. MiB for an array with shape (177739, 76) and data type object
2025-09-24 11:10:22,633 - WARNING - Error processing ../../../../input/IBGE_2010\444ff497_pes29.txt: Error reading file: Unable to allocate 103. MiB for an array with shape (177739, 76) and data type object
Processing files: 58%|█████▊ | 49/84 [04:55<11:39, 19.98s/it]2025-09-24 11:10:34,623 - ERROR - Error reading ../../../../input/IBGE_2010\2c0725ad_BA_DOM29.txt:
2025-09-24 11:10:34,624 - WARNING - Error processing ../../../../input/IBGE_2010\2c0725ad_BA_DOM29.txt: Error reading file:
Processing files: 60%|█████▉ | 50/84 [05:00<08:42, 15.36s/it]2025-09-24 11:10:38,127 - ERROR - Error reading ../../../../input/IBGE_2010\a5fa9632_FAMI29.TXT:
2025-09-24 11:10:38,525 - WARNING - Error processing ../../../../input/IBGE_2010\a5fa9632_FAMI29.TXT: Error reading file:
Processing files: 61%|██████ | 51/84 [05:04<06:35, 11.99s/it]2025-09-24 11:10:44,933 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_Dom31.txt:
2025-09-24 11:10:44,936 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_Dom31.txt: Error reading file:
Processing files: 62%|██████▏ | 52/84 [05:10<05:29, 10.31s/it]2025-09-24 11:10:59,012 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_Pes31.txt:
2025-09-24 11:11:03,054 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_Pes31.txt: Error reading file:
Processing files: 63%|██████▎ | 53/84 [05:28<06:30, 12.59s/it]2025-09-24 11:11:06,153 - ERROR - Error reading ../../../../input/IBGE_2010\6c0bd45b_MG_FAMI31.TXT:
2025-09-24 11:11:06,156 - WARNING - Error processing ../../../../input/IBGE_2010\6c0bd45b_MG_FAMI31.TXT: Error reading file:
Processing files: 64%|██████▍ | 54/84 [05:31<04:52, 9.74s/it]2025-09-24 11:11:08,280 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_FAMI32.TXT:
2025-09-24 11:11:08,282 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_FAMI32.TXT: Error reading file:
Processing files: 65%|██████▌ | 55/84 [05:33<03:36, 7.46s/it]2025-09-24 11:11:10,395 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_Dom32.txt:
2025-09-24 11:11:10,396 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_Dom32.txt: Error reading file:
Processing files: 67%|██████▋ | 56/84 [05:36<02:43, 5.86s/it]2025-09-24 11:11:14,564 - ERROR - Error reading ../../../../input/IBGE_2010\9375f986_ES_Pes32.txt:
2025-09-24 11:11:14,569 - WARNING - Error processing ../../../../input/IBGE_2010\9375f986_ES_Pes32.txt: Error reading file:
Processing files: 68%|██████▊ | 57/84 [05:40<02:24, 5.35s/it]2025-09-24 11:11:16,669 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_FAMI33.TXT:
2025-09-24 11:11:16,671 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_FAMI33.TXT: Error reading file:
Processing files: 69%|██████▉ | 58/84 [05:42<01:53, 4.38s/it]2025-09-24 11:11:18,697 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_Pes33.txt: [Errno 22] Invalid argument
2025-09-24 11:11:18,698 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_Pes33.txt: Error reading file: [Errno 22] Invalid argument
Processing files: 70%|███████ | 59/84 [05:44<01:31, 3.67s/it]2025-09-24 11:11:21,886 - ERROR - Error reading ../../../../input/IBGE_2010\9e381c4d_RJ_Dom33.txt:
2025-09-24 11:11:21,889 - WARNING - Error processing ../../../../input/IBGE_2010\9e381c4d_RJ_Dom33.txt: Error reading file:
Processing files: 71%|███████▏ | 60/84 [05:47<01:24, 3.53s/it]2025-09-24 11:11:24,996 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_DOM41.txt:
2025-09-24 11:11:24,999 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_DOM41.txt: Error reading file:
Processing files: 73%|███████▎ | 61/84 [05:50<01:18, 3.40s/it]2025-09-24 11:11:30,150 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_PES41.txt:
2025-09-24 11:11:32,175 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_PES41.txt: Error reading file:
Processing files: 74%|███████▍ | 62/84 [06:05<02:33, 6.96s/it]2025-09-24 11:11:48,426 - ERROR - Error reading ../../../../input/IBGE_2010\faacb5e1_PR_FAMI41.TXT:
2025-09-24 11:11:48,430 - WARNING - Error processing ../../../../input/IBGE_2010\faacb5e1_PR_FAMI41.TXT: Error reading file:
Processing files: 75%|███████▌ | 63/84 [06:14<02:33, 7.32s/it]2025-09-24 11:11:52,574 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_DOM42.txt:
2025-09-24 11:11:52,576 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_DOM42.txt: Error reading file:
Processing files: 76%|███████▌ | 64/84 [06:18<02:07, 6.37s/it]2025-09-24 11:11:54,690 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_FAMI42.TXT:
2025-09-24 11:11:54,693 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_FAMI42.TXT: Error reading file:
Processing files: 77%|███████▋ | 65/84 [06:20<01:36, 5.09s/it]2025-09-24 11:12:01,840 - ERROR - Error reading ../../../../input/IBGE_2010\c9e3f561_SC_PES42.txt:
2025-09-24 11:12:05,897 - WARNING - Error processing ../../../../input/IBGE_2010\c9e3f561_SC_PES42.txt: Error reading file:
Processing files: 79%|███████▊ | 66/84 [06:31<02:04, 6.93s/it]2025-09-24 11:12:10,062 - ERROR - Error reading ../../../../input/IBGE_2010\46447151_RS_DOM43.txt:
2025-09-24 11:12:10,070 - WARNING - Error processing ../../../../input/IBGE_2010\46447151_RS_DOM43.txt: Error reading file:
Processing files: 80%|███████▉ | 67/84 [06:35<01:43, 6.10s/it]2025-09-24 11:12:14,216 - ERROR - Error reading ../../../../input/IBGE_2010\46447151_RS_PES43.txt:
2025-09-24 11:12:24,324 - WARNING - Error processing ../../../../input/IBGE_2010\46447151_RS_PES43.txt: Error reading file:
Processing files: 100%|██████████| 84/84 [10:51<00:00, 7.76s/it]
2025-09-24 11:16:26,169 - INFO - Successfully processed 64/84 files
2025-09-24 11:16:26,184 - INFO - Extraction completed successfully.
3. Harmonize the data
Use the Harmonizer class from the socio4health library to harmonize the data. First, set the similarity threshold to 0.9, meaning that only variables with a similarity score of 0.9 or higher will be considered for harmonization. Next, use the s4h_vertical_merge method to merge the dataframes vertically.
har = Harmonizer()
har.similarity_threshold = 0.9
dfs = har.s4h_vertical_merge(bra_Censo_2010)
Grouping DataFrames: 100%|██████████| 64/64 [00:01<00:00, 47.99it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00, 1.06it/s]
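The idea of a threshold-based vertical merge can be illustrated with a small pandas sketch. The similarity measure used by socio4health is not described in this tutorial, so the example assumes, purely for illustration, Jaccard similarity between the sets of column names. `jaccard` and `vertical_merge_sketch` are hypothetical helpers, not socio4health functions.

```python
import pandas as pd


def jaccard(cols_a, cols_b):
    """Jaccard similarity between two collections of column names."""
    a, b = set(cols_a), set(cols_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def vertical_merge_sketch(frames, threshold=0.9):
    """Group frames with similar columns, then stack each group row-wise."""
    groups = []
    for frame in frames:
        for group in groups:
            # Compare against the first frame of each existing group.
            if jaccard(frame.columns, group[0].columns) >= threshold:
                group.append(frame)
                break
        else:
            groups.append([frame])
    return [pd.concat(group, ignore_index=True, sort=False) for group in groups]


# Two frames share a layout; the third has unrelated columns.
ro = pd.DataFrame({"V0001": ["11"], "V0401": [3]})
ac = pd.DataFrame({"V0001": ["12"], "V0401": [4]})
other = pd.DataFrame({"X1": [1], "X2": [2]})

merged = vertical_merge_sketch([ro, ac, other], threshold=0.9)
print(len(merged))
print(merged[0])
```

With a high threshold, only files that share nearly all of their columns end up in the same merged table, which is consistent with the single merged group reported in the output above.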
After merging the dataframes, set the dictionary and the categories of interest. In this case, we are interested in the "Housing" category. Then, use the s4h_data_selector method to filter the dataframes based on the dictionary, categories, and a key column (in this case 'V0001', which represents the state code). The s4h_data_selector method returns a list of filtered dataframes.
har.dict_df = dic
har.categories = ["Housing"]
har.key_col = 'V0001'
filtered_ddfs = har.s4h_data_selector(dfs)
2025-09-24 11:53:53,620 - WARNING - key_col or key_val not defined, row-wise size will not be reduced
len(filtered_ddfs)
1
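Conceptually, selecting by category amounts to looking up the variable names that belong to the chosen categories in the dictionary and keeping those columns together with the key column. The pandas sketch below illustrates that idea on toy data. `select_columns` is a hypothetical helper and makes no claim about how s4h_data_selector is implemented.

```python
import pandas as pd

# Toy dictionary and toy data; names mirror the tutorial, values are made up.
toy_dictionary = pd.DataFrame(
    {
        "variable_name": ["V0001", "V0209", "V2011"],
        "category": ["Identification", "Housing", "Business"],
    }
)
toy_data = pd.DataFrame(
    {
        "V0001": ["11", "12"],
        "V0209": ["1", "2"],
        "V2011": ["300", "450"],
    }
)


def select_columns(data, dictionary, categories, key_col):
    """Keep the key column plus every variable in the requested categories."""
    in_category = dictionary["category"].isin(categories)
    wanted = dictionary.loc[in_category, "variable_name"]
    keep = [key_col] + [
        name for name in wanted if name in data.columns and name != key_col
    ]
    return data[keep]


housing = select_columns(toy_data, toy_dictionary, ["Housing"], "V0001")
print(housing)
```

Keeping the key column regardless of its own category preserves the state code needed for any later grouping.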
filtered_ddfs[0].compute()
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Cell In[10], line 1
----> 1 filtered_ddfs[0].compute()
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\dask\base.py:373, in DaskMethodsMixin.compute(self, **kwargs)
349 def compute(self, **kwargs):
350 """Compute this dask collection
351
352 This turns a lazy Dask collection into its in-memory equivalent.
(...) 371 dask.compute
372 """
--> 373 (result,) = compute(self, traverse=False, **kwargs)
374 return result
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\dask\base.py:681, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
678 expr = expr.optimize()
679 keys = list(flatten(expr.__dask_keys__()))
--> 681 results = schedule(expr, keys, **kwargs)
683 return repack(results)
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\dask\dataframe\io\csv.py:351, in _read_csv(block, part, columns, reader, header, dtypes, head, colname, full_columns, enforce, kwargs, blocksize)
348 rest_kwargs["usecols"] = _columns
350 # Call `pandas_read_text`
--> 351 df = pandas_read_text(
352 reader,
353 block,
354 header,
355 rest_kwargs,
356 dtypes,
357 _columns,
358 write_header,
359 enforce,
360 path_info,
361 )
362 if project_after_read:
363 return df[columns]
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\dask\dataframe\io\csv.py:77, in pandas_read_text(reader, b, header, kwargs, dtypes, columns, write_header, enforce, path)
75 bio.write(b)
76 bio.seek(0)
---> 77 df = reader(bio, **kwargs)
78 if dtypes:
79 coerce_dtypes(df, dtypes)
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\pandas\io\parsers\readers.py:1565, in read_fwf(filepath_or_buffer, colspecs, widths, infer_nrows, dtype_backend, iterator, chunksize, **kwds)
1563 check_dtype_backend(dtype_backend)
1564 kwds["dtype_backend"] = dtype_backend
-> 1565 return _read(filepath_or_buffer, kwds)
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\pandas\io\parsers\readers.py:626, in _read(filepath_or_buffer, kwds)
623 return parser
625 with parser:
--> 626 return parser.read(nrows)
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\pandas\io\parsers\readers.py:1923, in TextFileReader.read(self, nrows)
1916 nrows = validate_integer("nrows", nrows)
1917 try:
1918 # error: "ParserBase" has no attribute "read"
1919 (
1920 index,
1921 columns,
1922 col_dict,
-> 1923 ) = self._engine.read( # type: ignore[attr-defined]
1924 nrows
1925 )
1926 except Exception:
1927 self.close()
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\pandas\io\parsers\python_parser.py:252, in PythonParser.read(self, rows)
246 def read(
247 self, rows: int | None = None
248 ) -> tuple[
249 Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike]
250 ]:
251 try:
--> 252 content = self._get_lines(rows)
253 except StopIteration:
254 if self._first_chunk:
File ~\PycharmProjects\socio4health\.venv\Lib\site-packages\pandas\io\parsers\python_parser.py:1144, in PythonParser._get_lines(self, rows)
1141 rows += 1
1143 if next_row is not None:
-> 1144 new_rows.append(next_row)
1145 len_new_rows = len(new_rows)
1147 except StopIteration:
MemoryError:
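The MemoryError above is raised when the whole table is materialized in memory at once. One general way around this is to reduce the data before collecting it: read only the columns needed and aggregate piece by piece. The sketch below illustrates the idea with plain pandas chunked reading on made-up fixed-width records. It is not the socio4health or dask pipeline, and the column positions are invented for the example.

```python
import io

import pandas as pd

# Made-up fixed-width records: state code (2 chars) and a count (2 chars).
records = [(11, 3), (12, 4), (11, 2), (35, 5), (35, 1)]
raw = "".join(f"{state:02d}{count:02d}\n" for state, count in records)

reader = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 2), (2, 4)],
    names=["V0001", "V0401"],
    header=None,
    dtype=str,
    chunksize=2,  # only two rows are held in memory at a time
)

chunk_totals = {}
for chunk in reader:
    values = pd.to_numeric(chunk["V0401"], errors="coerce").fillna(0.0)
    partial = values.groupby(chunk["V0001"]).sum()
    # Fold each chunk's partial sums into the running totals.
    for state, value in partial.items():
        chunk_totals[state] = chunk_totals.get(state, 0.0) + float(value)

print(chunk_totals)
```

The same principle applies to lazy dataframes: select the required columns and aggregate first, so that only the small aggregated result is computed.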
Finally, we can perform some analysis on the harmonized data. In this case, we will calculate the total population by state (V0001) using the variable V0401, which represents the total population in each census tract. We will then create a horizontal bar plot to visualize the population distribution across states using matplotlib.
ddf = filtered_ddfs[0][["V0001", "V0401"]]
ddf = ddf.assign(
    V0001=ddf["V0001"].astype("category"),
    V0401=dd.to_numeric(ddf["V0401"], errors="coerce").astype("float64").fillna(0.0),
).categorize(columns=["V0001"])
pop = ddf.groupby("V0001")["V0401"].sum(split_out=8).compute()
row = dic.loc[dic["variable_name"]=="V0001", ["value","possible_answers"]].iloc[0]
vals = [s for s in re.split(r"\s*;\s*", str(row["value"]).strip(" ;")) if s]
labs = [s for s in re.split(r"\s*;\s*", str(row["possible_answers"]).strip(" ;")) if s]
idx = pop.index
if pd.api.types.is_integer_dtype(idx):
    keys = [int(float(v)) for v in vals]
elif pd.api.types.is_float_dtype(idx):
    keys = [float(v) for v in vals]
else:
    keys = [str(int(float(v))) for v in vals]
if len(keys) != len(labs):
    raise ValueError(f"Misalignment: {len(keys)} codes vs {len(labs)} names")
code2name = dict(zip(keys, labs))
pop_named = pop.rename(index=code2name)
top = pop_named.sort_values()
top_titled = top.copy()
top_titled.index = [str(s).title() for s in top.index]
fig, ax = plt.subplots(figsize=(11,7), dpi=130)
ax.barh(top_titled.index, top_titled.values)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_title("Population by State", pad=10)
ax.set_xlabel("Population")
ax.set_ylabel("State")
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, p: f"{x/1e6:.1f} M"))
total = pop_named.sum()
for i, v in enumerate(top_titled.values):
    ax.text(v, i, f"{v/1e6:.1f} M ({v/total:.1%})", va="center", ha="left", fontsize=9)
ax.grid(axis="x", linestyle="--", alpha=0.3)
plt.margins(x=0.03)
plt.tight_layout()
plt.show()
