Extraction of Colombia, Brazil and Peru online data#

Run the tutorial via free cloud platforms:

This notebook provides you with an introduction on how to retrieve data from online data sources through web scraping, as well as from local files from Colombia, Brazil, Peru, and the Dominican Republic. This tutorial assumes you have an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need to have the following prerequisites:

Python 3.10+

Additionally, you need to install the socio4health and pandas package, which can be done using pip:

!pip install socio4health pandas -q

[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

Import Libraries#

To perform the data extraction, the socio4health library provides the Extractor class for data extraction, and the Harmonizer class for data harmonization of the retrieved date. We will also use pandas for data manipulation.

from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum

Use case 1: Extracting data from Colombia#

To extract data from Colombia, we will use the Extractor class from the socio4health library. The Extractor class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey - GEIH - 2022 (Gran Encuesta Integrada de Hogares - GEIH - 2022) dataset from the Colombian Nacional Administration of Statistics (DANE) website

The Extractor class requires the following parameters:

input_path: The URL or local path to the data source.
down_ext: A list of file extensions to download. This can include .CSV, .csv, .zip, etc.
sep: The separator used in the data files (e.g., ; for semicolon-separated values).
output_path: The local path where the extracted data will be saved.
depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.

col_online_extractor = Extractor(input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata", down_ext=['.CSV','.csv','.zip'], sep=';', output_path="../data", depth=0)

After the instance is set up, we can call the s4h_extract method to download and extract the data. The method returns a list of pandas DataFrames containing the extracted data.

col_dfs = col_online_extractor.s4h_extract()

2025-09-24 12:24:17,077 - INFO - ----------------------
2025-09-24 12:24:17,078 - INFO - Starting data extraction...
2025-09-24 12:24:17,079 - INFO - Extracting data in online mode...
2025-09-24 12:24:17,080 - INFO - Scraping URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata with depth 0
2025-09-24 12:24:21,753 - INFO - Spider completed successfully for URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata
2025-09-24 12:24:21,755 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/12 [00:00<?, ?it/s]2025-09-24 12:24:25,589 - INFO - Successfully downloaded: GEIH_Enero_2022_Marco_2018.zip
Downloading files:   8%|▊         | 1/12 [00:03<00:41,  3.81s/it]2025-09-24 12:24:29,405 - INFO - Successfully downloaded: GEIH_Febrero_2022_Marco_2018.zip
Downloading files:  17%|█▋        | 2/12 [00:07<00:38,  3.81s/it]2025-09-24 12:24:32,439 - INFO - Successfully downloaded: GEIH_Marzo_2022_Marco_2018.zip
Downloading files:  25%|██▌       | 3/12 [00:10<00:31,  3.46s/it]2025-09-24 12:24:34,187 - INFO - Successfully downloaded: GEIH_Mayo_2022_Marco_2018.zip
Downloading files:  33%|███▎      | 4/12 [00:12<00:22,  2.78s/it]2025-09-24 12:24:37,230 - INFO - Successfully downloaded: GEIH_Junio_2022_Marco_2018.zip
Downloading files:  42%|████▏     | 5/12 [00:15<00:20,  2.88s/it]2025-09-24 12:24:40,109 - INFO - Successfully downloaded: GEIH_Julio_2022_Marco_2018.zip
Downloading files:  50%|█████     | 6/12 [00:18<00:17,  2.88s/it]2025-09-24 12:24:44,680 - INFO - Successfully downloaded: GEIH_Agosto_2022_Marco_2018.zip
Downloading files:  58%|█████▊    | 7/12 [00:22<00:17,  3.43s/it]2025-09-24 12:24:48,759 - INFO - Successfully downloaded: GEIH_Septiembre_Marco_2018.zip
Downloading files:  67%|██████▋   | 8/12 [00:26<00:14,  3.64s/it]2025-09-24 12:24:51,119 - INFO - Successfully downloaded: GEIH_Octubre_Marco_2018.zip
Downloading files:  75%|███████▌  | 9/12 [00:29<00:09,  3.24s/it]2025-09-24 12:24:54,440 - INFO - Successfully downloaded: GEIH_Diciembre_2022_Marco_2018.zip
Downloading files:  83%|████████▎ | 10/12 [00:32<00:06,  3.26s/it]2025-09-24 12:24:59,539 - INFO - Successfully downloaded: GEIH_Abril_2022_Marco_2018_Act.zip
Downloading files:  92%|█████████▏| 11/12 [00:37<00:03,  3.83s/it]2025-09-24 12:25:02,078 - INFO - Successfully downloaded: GEIH_Noviembre_2022_Marco_2018.act.zip
Downloading files: 100%|██████████| 12/12 [00:40<00:00,  3.36s/it]
2025-09-24 12:25:02,087 - INFO - Processing (depth 0): GEIH_Enero_2022_Marco_2018.zip
2025-09-24 12:25:14,100 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.csv
2025-09-24 12:25:14,118 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:14,139 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Fuerza de trabajo.csv
2025-09-24 12:25:14,160 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:14,177 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_No ocupados.csv
2025-09-24 12:25:14,210 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Ocupados.csv
2025-09-24 12:25:14,237 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otras formas de trabajo.csv
2025-09-24 12:25:14,260 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ingresos e impuestos.csv
2025-09-24 12:25:14,281 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Tipo de investigación.csv
2025-09-24 12:25:15,073 - INFO - Processing (depth 0): GEIH_Febrero_2022_Marco_2018.zip
2025-09-24 12:25:54,279 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:25:54,289 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:54,316 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:25:54,337 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:54,351 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:25:54,367 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:25:54,384 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:25:54,401 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:25:54,419 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:25:54,627 - INFO - Processing (depth 0): GEIH_Marzo_2022_Marco_2018.zip
2025-09-24 12:26:06,808 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:06,821 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:06,840 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:06,860 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Migración.CSV
2025-09-24 12:26:06,871 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_No ocupados.CSV
2025-09-24 12:26:06,890 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Ocupados.CSV
2025-09-24 12:26:06,908 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:06,934 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:06,951 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Tipo de investigación.CSV
2025-09-24 12:26:07,163 - INFO - Processing (depth 0): GEIH_Mayo_2022_Marco_2018.zip
2025-09-24 12:26:07,573 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:26:08,009 - INFO - Extracted: abcbb904_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:08,011 - INFO - Extracted: abcbb904_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:08,013 - INFO - Extracted: abcbb904_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:08,015 - INFO - Extracted: abcbb904_csv_Migración.CSV
2025-09-24 12:26:08,017 - INFO - Extracted: abcbb904_csv_No ocupados.CSV
2025-09-24 12:26:08,021 - INFO - Extracted: abcbb904_csv_Ocupados.CSV
2025-09-24 12:26:08,024 - INFO - Extracted: abcbb904_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:08,028 - INFO - Extracted: abcbb904_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:08,032 - INFO - Extracted: abcbb904_csv_Tipo de investigación.CSV
2025-09-24 12:26:08,036 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:26:13,712 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:26:13,713 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:13,714 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:26:19,948 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:26:19,949 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:19,955 - INFO - Processing (depth 0): GEIH_Junio_2022_Marco_2018.zip
2025-09-24 12:26:31,636 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:31,648 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:31,662 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Fuerza de trabajo.CSV
2025-09-24 12:26:31,678 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Migración.CSV
2025-09-24 12:26:31,688 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_No ocupados.CSV
2025-09-24 12:26:31,706 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Ocupados.CSV
2025-09-24 12:26:31,727 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otras formas de trabajo.CSV
2025-09-24 12:26:31,744 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:26:31,757 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Tipo de investigación.CSV
2025-09-24 12:26:31,990 - INFO - Processing (depth 0): GEIH_Julio_2022_Marco_2018.zip
2025-09-24 12:27:09,303 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:09,318 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:09,337 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:09,362 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:09,373 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:09,392 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:09,411 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:09,428 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:09,442 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:09,665 - INFO - Processing (depth 0): GEIH_Agosto_2022_Marco_2018.zip
2025-09-24 12:27:21,343 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:21,356 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:21,376 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:21,399 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:21,409 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:21,430 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:21,447 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:21,464 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:21,477 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:21,659 - INFO - Processing (depth 0): GEIH_Septiembre_Marco_2018.zip
2025-09-24 12:27:21,901 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:27:22,348 - INFO - Extracted: 25ad86f4_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:22,351 - INFO - Extracted: 25ad86f4_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:22,354 - INFO - Extracted: 25ad86f4_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:22,356 - INFO - Extracted: 25ad86f4_CSV_Migración.CSV
2025-09-24 12:27:22,359 - INFO - Extracted: 25ad86f4_CSV_No ocupados.CSV
2025-09-24 12:27:22,362 - INFO - Extracted: 25ad86f4_CSV_Ocupados.CSV
2025-09-24 12:27:22,367 - INFO - Extracted: 25ad86f4_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:22,370 - INFO - Extracted: 25ad86f4_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:22,373 - INFO - Extracted: 25ad86f4_CSV_Tipo de investigación.CSV
2025-09-24 12:27:22,376 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:27:28,035 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:27:28,036 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:28,037 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:27:33,899 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:27:33,900 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:33,908 - INFO - Processing (depth 0): GEIH_Octubre_Marco_2018.zip
2025-09-24 12:27:34,090 - INFO - Processing (depth 1): CSV 4.zip
2025-09-24 12:27:34,294 - INFO - Extracted: 4a4be0fc_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:34,297 - INFO - Extracted: 4a4be0fc_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:34,300 - INFO - Extracted: 4a4be0fc_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:34,303 - INFO - Extracted: 4a4be0fc_CSV_Migración.CSV
2025-09-24 12:27:34,307 - INFO - Extracted: 4a4be0fc_CSV_No ocupados.CSV
2025-09-24 12:27:34,312 - INFO - Extracted: 4a4be0fc_CSV_Ocupados.CSV
2025-09-24 12:27:34,316 - INFO - Extracted: 4a4be0fc_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:34,319 - INFO - Extracted: 4a4be0fc_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:34,323 - INFO - Extracted: 4a4be0fc_CSV_Tipo de investigación.CSV
2025-09-24 12:27:34,326 - INFO - Processing (depth 1): DTA 4.zip
2025-09-24 12:27:39,914 - WARNING - No matches in DTA 4.zip. Contents:
2025-09-24 12:27:39,915 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:39,916 - INFO - Processing (depth 1): SAV 4.zip
2025-09-24 12:27:45,730 - WARNING - No matches in SAV 4.zip. Contents:
2025-09-24 12:27:45,731 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:45,739 - INFO - Processing (depth 0): GEIH_Diciembre_2022_Marco_2018.zip
2025-09-24 12:27:57,984 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:57,994 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:58,009 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Fuerza de trabajo.CSV
2025-09-24 12:27:58,022 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Migración.CSV
2025-09-24 12:27:58,049 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_No ocupados.CSV
2025-09-24 12:27:58,082 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Ocupados.CSV
2025-09-24 12:27:58,108 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otras formas de trabajo.CSV
2025-09-24 12:27:58,130 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otros ingresos e impuestos.CSV
2025-09-24 12:27:58,143 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Tipo de investigación.CSV
2025-09-24 12:27:58,370 - INFO - Processing (depth 0): GEIH_Abril_2022_Marco_2018_Act.zip
2025-09-24 12:28:10,618 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:10,632 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:10,650 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:10,672 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:28:10,686 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:28:10,706 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:28:10,726 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:10,743 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:10,755 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:28:10,978 - INFO - Processing (depth 0): GEIH_Noviembre_2022_Marco_2018.act.zip
2025-09-24 12:28:22,841 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:22,851 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:22,866 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:22,882 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Migración.CSV
2025-09-24 12:28:22,892 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_No ocupados.CSV
2025-09-24 12:28:22,913 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Ocupados.CSV
2025-09-24 12:28:22,956 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:22,978 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:22,993 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Tipo de investigación.CSV
Processing files: 100%|██████████| 108/108 [00:28<00:00,  3.78it/s]
2025-09-24 12:28:51,874 - INFO - Successfully processed 108/108 files
2025-09-24 12:28:51,875 - INFO - Extraction completed successfully.

col_dfs[0].head()

	PERIODO	DIRECTORIO	SECUENCIA_P	ORDEN	HOGAR	P7495	P7500S1	P7500S1A1	P7500S2	P7500S2A1	...	P3371S1	P3371S2	P3371S3	P3371S4	P3372	P3372S1	FEX_C18	PER	REGIS	filename
0	20220104	5000000	1	1	1	2	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	2	<NA>	1432.4633227	2022	90	c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
1	20220104	5000000	1	2	1	2	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	2	<NA>	1432.4633227	2022	90	c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
2	20220104	5000000	1	6	1	2	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	2	<NA>	1432.4633227	2022	90	c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
3	20220104	5000001	1	1	1	2	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	2	<NA>	1088.7962663	2022	90	c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
4	20220104	5000001	1	2	1	2	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	2	<NA>	1088.7962663	2022	90	c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...

5 rows × 60 columns

Use case 2: Extracting data from Brazil#

We are downloading the Brazilian data from the Brazilian Institute of Geography and Statistics (IBGE) website. The Extractor class is used to download the data. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024

⚠️

Important: is_fwf parameter is set to True, which indicates that the data files are in fixed-width format. The colnames and colspecs parameters must be provided. In this example, they are set to the corresponding available enums for PNADC data, which define the column names and specifications for the dataset. See more details in socio4health.enums.data_info_enum documentation .

bra_online_extractor = Extractor(input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/", down_ext=['.txt','.zip'], is_fwf=True, colnames=BraColnamesEnum.PNADC.value, colspecs=BraColspecsEnum.PNADC.value, output_path="../data", depth=0)

bra_dfs = bra_online_extractor.s4h_extract()

2025-09-24 12:30:08,711 - INFO - ----------------------
2025-09-24 12:30:08,713 - INFO - Starting data extraction...
2025-09-24 12:30:08,713 - INFO - Extracting data in online mode...
2025-09-24 12:30:08,715 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-24 12:30:13,261 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-24 12:30:13,264 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/4 [00:00<?, ?it/s]2025-09-24 12:30:28,678 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files:  25%|██▌       | 1/4 [00:15<00:46, 15.41s/it]2025-09-24 12:41:31,509 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files:  50%|█████     | 2/4 [11:18<13:12, 396.25s/it]2025-09-24 12:41:57,057 - INFO - Successfully downloaded: PNADC_032024_20250815.zip
Downloading files:  75%|███████▌  | 3/4 [11:43<03:46, 226.98s/it]2025-09-24 12:42:33,947 - INFO - Successfully downloaded: PNADC_042024_20250815.zip
Downloading files: 100%|██████████| 4/4 [12:20<00:00, 185.17s/it]
2025-09-24 12:42:33,956 - INFO - Processing (depth 0): PNADC_012024_20250815.zip
2025-09-24 12:42:49,933 - INFO - Extracted: a7db871d_PNADC_012024.txt
2025-09-24 12:42:49,942 - INFO - Processing (depth 0): PNADC_022024_20250815.zip
2025-09-24 12:43:32,038 - INFO - Extracted: 946bef7f_PNADC_022024.txt
2025-09-24 12:43:32,041 - INFO - Processing (depth 0): PNADC_032024_20250815.zip
2025-09-24 12:44:02,915 - INFO - Extracted: c8ebcb57_PNADC_032024.txt
2025-09-24 12:44:02,916 - INFO - Processing (depth 0): PNADC_042024_20250815.zip
2025-09-24 12:44:16,725 - INFO - Extracted: c1bffc2a_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [04:57<00:00, 74.45s/it]
2025-09-24 12:49:14,550 - INFO - Successfully processed 4/4 files
2025-09-24 12:49:14,554 - INFO - Extraction completed successfully.

bra_dfs[0].head()

	Ano	Trimestre	UF	Capital	RM_RIDE	UPA	Estrato	V1008	V1014	V1016	...	V1028192	V1028195	V1028196	V1028197	V1028198	V1028199	V1028200	filename
0	2024	1	11	11	<NA>	110000016	1110011	03	11	1	...	000242.37393247	000132.86482247	000252.85458864	000271.03799675	000122.61081652	000125.78602243	000113.09511303	a7db871d_PNADC_012024.txt
1	2024	1	11	11	<NA>	110000016	1110011	06	11	1	...	000405.66107457	000205.06572241	000410.23613176	000437.83686366	000190.08927267	000200.15696949	000182.15329508	a7db871d_PNADC_012024.txt
2	2024	1	11	11	<NA>	110000016	1110011	06	11	1	...	000405.66107457	000205.06572241	000410.23613176	000437.83686366	000190.08927267	000200.15696949	000182.15329508	a7db871d_PNADC_012024.txt
3	2024	1	11	11	<NA>	110000016	1110011	08	11	1	...	000485.38386591	000242.53160028	000474.75504741	000520.88948037	000223.17316781	000229.81795045	000213.22589782	a7db871d_PNADC_012024.txt
4	2024	1	11	11	<NA>	110000016	1110011	08	11	1	...	000485.38386591	000242.53160028	000474.75504741	000520.88948037	000223.17316781	000229.81795045	000213.22589782	a7db871d_PNADC_012024.txt

5 rows × 421 columns

Use case 3: Extracting data from Peru#

Peruvian data is extracted from the National Institute of Statistics and Informatics (INEI) website. In this case, we are extracting the National Household Survey (ENAHO) for the year 2022. The down_ext parameter is set to download .csv and .zip files, and the sep parameter is set to ;, indicating that the data files are semicolon-separated values.

per_online_extractor = Extractor(input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip", down_ext=['.csv','.zip'], output_path="../data", depth=0)

per_dfs = per_online_extractor.s4h_extract()

2025-09-24 12:50:18,463 - INFO - ----------------------
2025-09-24 12:50:18,464 - INFO - Starting data extraction...
2025-09-24 12:50:18,465 - INFO - Extracting data in online mode...
2025-09-24 12:50:18,466 - INFO - Detected direct file download URL - skipping scraping
2025-09-24 12:50:18,467 - INFO - Downloading large file (2022.zip)...
0.00B [00:00, ?B/s]2025-09-24 12:57:51,195 - INFO - Successfully downloaded: 2022.zip
0.00B [07:32, ?B/s]
2025-09-24 12:57:51,201 - INFO - Processing (depth 0): 2022.zip
2025-09-24 12:58:01,073 - INFO - Extracted: a0643d91_784-Modulo01_Enaho01-2022-100.csv
2025-09-24 12:58:01,083 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,090 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,112 - INFO - Extracted: a0643d91_784-Modulo02_Enaho01-2022-200.csv
2025-09-24 12:58:01,261 - INFO - Extracted: a0643d91_784-Modulo03_Enaho01a-2022-300.csv
2025-09-24 12:58:01,265 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-PAISES.csv
2025-09-24 12:58:01,270 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-UBIGEO.csv
2025-09-24 12:58:01,545 - INFO - Extracted: a0643d91_784-Modulo04_Enaho01a-2022-400.csv
2025-09-24 12:58:01,551 - INFO - Extracted: a0643d91_784-Modulo04_TABLA-UBIGEO-1874 DISTRITOS.csv
2025-09-24 12:58:01,557 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV3.csv
2025-09-24 12:58:01,564 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV4.csv
2025-09-24 12:58:01,572 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,580 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,890 - INFO - Extracted: a0643d91_784-Modulo05_Enaho01a-2022-500.csv
2025-09-24 12:58:04,521 - INFO - Extracted: a0643d91_784-Modulo07_Enaho01-2022-601.csv
2025-09-24 12:58:04,602 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602.csv
2025-09-24 12:58:04,618 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602A.csv
2025-09-24 12:58:05,144 - INFO - Extracted: a0643d91_784-Modulo09_Enaho01-2022-603.csv
2025-09-24 12:58:06,157 - INFO - Extracted: a0643d91_784-Modulo10_Enaho01-2022-604.csv
2025-09-24 12:58:06,679 - INFO - Extracted: a0643d91_784-Modulo11_Enaho01-2022-605.csv
2025-09-24 12:58:07,338 - INFO - Extracted: a0643d91_784-Modulo12_Enaho01-2022-606.csv
2025-09-24 12:58:08,099 - INFO - Extracted: a0643d91_784-Modulo13_Enaho01-2022-607.csv
2025-09-24 12:58:08,837 - INFO - Extracted: a0643d91_784-Modulo15_Enaho01-2022-609.csv
2025-09-24 12:58:09,513 - INFO - Extracted: a0643d91_784-Modulo16_Enaho01-2022-610.csv
2025-09-24 12:58:10,013 - INFO - Extracted: a0643d91_784-Modulo17_Enaho01-2022-611.csv
2025-09-24 12:58:10,753 - INFO - Extracted: a0643d91_784-Modulo18_Enaho01-2022-612.csv
2025-09-24 12:58:10,778 - INFO - Extracted: a0643d91_784-Modulo22_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:10,783 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000.csv
2025-09-24 12:58:10,938 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000A.csv
2025-09-24 12:58:11,015 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2100.csv
2025-09-24 12:58:11,104 - INFO - Extracted: a0643d91_784-Modulo23_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,174 - INFO - Extracted: a0643d91_784-Modulo23_Enaho02-2022-2200.csv
2025-09-24 12:58:11,303 - INFO - Extracted: a0643d91_784-Modulo24_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,375 - INFO - Extracted: a0643d91_784-Modulo24_Enaho02-2022-2300.csv
2025-09-24 12:58:11,473 - INFO - Extracted: a0643d91_784-Modulo25_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,602 - INFO - Extracted: a0643d91_784-Modulo25_Enaho02-2022-2400.csv
2025-09-24 12:58:11,799 - INFO - Extracted: a0643d91_784-Modulo26_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,816 - INFO - Extracted: a0643d91_784-Modulo26_Enaho02-2022-2500.csv
2025-09-24 12:58:12,003 - INFO - Extracted: a0643d91_784-Modulo27_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,078 - INFO - Extracted: a0643d91_784-Modulo27_Enaho02-2022-2600.csv
2025-09-24 12:58:12,111 - INFO - Extracted: a0643d91_784-Modulo28_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,126 - INFO - Extracted: a0643d91_784-Modulo28_Enaho02-2022-2700.csv
2025-09-24 12:58:12,241 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022-12g.csv
2025-09-24 12:58:12,269 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022.csv
2025-09-24 12:58:12,343 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700.csv
2025-09-24 12:58:12,377 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700A.csv
2025-09-24 12:58:12,494 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700B.csv
2025-09-24 12:58:12,701 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-1-Preg-1-a-13.csv
2025-09-24 12:58:12,719 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-2-Preg-14-a-22.csv
2025-09-24 12:58:12,894 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-3-Preg-23.csv
2025-09-24 12:58:12,981 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-4-Preg-24.csv
2025-09-24 12:58:13,023 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-5-Preg-25.csv
2025-09-24 12:58:13,814 - INFO - Extracted: a0643d91_784-Modulo78_Enaho01-2022-606D.csv
2025-09-24 12:58:13,878 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800A.csv
2025-09-24 12:58:13,948 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800B.csv
2025-09-24 12:58:14,253 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-1.csv
2025-09-24 12:58:14,415 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-2.csv
Processing files: 100%|██████████| 57/57 [00:59<00:00,  1.05s/it]
2025-09-24 12:59:14,152 - INFO - Successfully processed 57/57 files
2025-09-24 12:59:14,153 - INFO - Extraction completed successfully.

per_dfs[0].head()

	AÑO	MES	CONGLOME	VIVIENDA	HOGAR	UBIGEO	DOMINIO	ESTRATO	P612N	P612	...	P612H	TICUEST01	D612H	I612H	FACTOR07	NCONGLOME	filename
0	2022	01	005030	008	11	010201	7	4	1	2	...		2			106.890243530273	006618	a0643d91_784-Modulo18_Enaho01-2022-612.csv
1	2022	01	005030	008	11	010201	7	4	2	1	...	1502	2	1526.17028808594	152.617034912109	106.890243530273	006618	a0643d91_784-Modulo18_Enaho01-2022-612.csv
2	2022	01	005030	008	11	010201	7	4	3	2	...		2			106.890243530273	006618	a0643d91_784-Modulo18_Enaho01-2022-612.csv
3	2022	01	005030	008	11	010201	7	4	4	1	...	120	2	121.931053161621	10.1609210968018	106.890243530273	006618	a0643d91_784-Modulo18_Enaho01-2022-612.csv
4	2022	01	005030	008	11	010201	7	4	5	2	...		2			106.890243530273	006618	a0643d91_784-Modulo18_Enaho01-2022-612.csv

5 rows × 25 columns

Further steps#

Harmonize the extracted data using the Harmonizer class from the socio4health library. You can follow the Harmonization tutorial for more details.