
Extraction of Colombia, Brazil and Peru online data#

Run the tutorial via free cloud platforms: Binder or Google Colab.

This notebook introduces how to retrieve data from online data sources through web scraping, using household survey data from Colombia, Brazil, and Peru. It assumes an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need to have the following prerequisites:

  • Python 3.10+

Additionally, you need to install the socio4health and pandas packages, which can be done using pip:

!pip install socio4health pandas -q

Import Libraries#

The socio4health library provides the Extractor class for data extraction and the Harmonizer class for harmonizing the retrieved data. This notebook uses only the Extractor, which returns the extracted data as pandas DataFrames.


from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum

Use case 1: Extracting data from Colombia#

To extract data from Colombia, we will use the Extractor class from the socio4health library. The Extractor class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey - GEIH - 2022 (Gran Encuesta Integrada de Hogares - GEIH - 2022) dataset from the website of the Colombian National Administrative Department of Statistics (DANE).

In this example, the Extractor class is configured with the following parameters:

  • input_path: The URL or local path to the data source.

  • down_ext: A list of file extensions to download. This can include .CSV, .csv, .zip, etc.

  • sep: The separator used in the data files (e.g., ; for semicolon-separated values).

  • output_path: The local path where the extracted data will be saved.

  • depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.
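The roles of down_ext and sep can be pictured with a small standard-library sketch. This is not the socio4health implementation: the helper names matches_extension and parse_delimited are hypothetical, and the sketch assumes extension matching is case-sensitive, which would explain why the tutorial lists both .CSV and .csv.

```python
import csv
import io
from pathlib import PurePosixPath


def matches_extension(filename: str, down_ext: list[str]) -> bool:
    """Return True when the file's suffix is one of the requested extensions.

    Assumption: the comparison is case-sensitive (hypothetical behaviour).
    """
    return PurePosixPath(filename).suffix in down_ext


def parse_delimited(text: str, sep: str) -> list[dict[str, str]]:
    """Parse delimited text into one dict per row, keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return list(reader)


down_ext = [".CSV", ".csv", ".zip"]
candidates = ["Ocupados.CSV", "Ocupados.dta", "GEIH_Enero_2022_Marco_2018.zip"]

# Keep only the names whose extension was requested.
selected = [name for name in candidates if matches_extension(name, down_ext)]

# A made-up two-line sample using ';' as the separator.
sample = "PERIODO;DIRECTORIO;HOGAR\n20220104;5000000;1\n"
rows = parse_delimited(sample, sep=";")

print(selected)
print(rows)
```

With a comma as sep, the same sample would be read as a single column, which is why the separator has to match the files being downloaded.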

col_online_extractor = Extractor(
    input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata",
    down_ext=['.CSV', '.csv', '.zip'],
    sep=';',
    output_path="../data",
    depth=0,
)

After the instance is set up, we can call the s4h_extract method to download and extract the data. The method returns a list of pandas DataFrames containing the extracted data.

col_dfs = col_online_extractor.s4h_extract()
2025-09-24 12:24:17,077 - INFO - ----------------------
2025-09-24 12:24:17,078 - INFO - Starting data extraction...
2025-09-24 12:24:17,079 - INFO - Extracting data in online mode...
2025-09-24 12:24:17,080 - INFO - Scraping URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata with depth 0
2025-09-24 12:24:21,753 - INFO - Spider completed successfully for URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata
2025-09-24 12:24:21,755 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/12 [00:00<?, ?it/s]2025-09-24 12:24:25,589 - INFO - Successfully downloaded: GEIH_Enero_2022_Marco_2018.zip
Downloading files:   8%|▊         | 1/12 [00:03<00:41,  3.81s/it]2025-09-24 12:24:29,405 - INFO - Successfully downloaded: GEIH_Febrero_2022_Marco_2018.zip
Downloading files:  17%|█▋        | 2/12 [00:07<00:38,  3.81s/it]2025-09-24 12:24:32,439 - INFO - Successfully downloaded: GEIH_Marzo_2022_Marco_2018.zip
Downloading files:  25%|██▌       | 3/12 [00:10<00:31,  3.46s/it]2025-09-24 12:24:34,187 - INFO - Successfully downloaded: GEIH_Mayo_2022_Marco_2018.zip
Downloading files:  33%|███▎      | 4/12 [00:12<00:22,  2.78s/it]2025-09-24 12:24:37,230 - INFO - Successfully downloaded: GEIH_Junio_2022_Marco_2018.zip
Downloading files:  42%|████▏     | 5/12 [00:15<00:20,  2.88s/it]2025-09-24 12:24:40,109 - INFO - Successfully downloaded: GEIH_Julio_2022_Marco_2018.zip
Downloading files:  50%|█████     | 6/12 [00:18<00:17,  2.88s/it]2025-09-24 12:24:44,680 - INFO - Successfully downloaded: GEIH_Agosto_2022_Marco_2018.zip
Downloading files:  58%|█████▊    | 7/12 [00:22<00:17,  3.43s/it]2025-09-24 12:24:48,759 - INFO - Successfully downloaded: GEIH_Septiembre_Marco_2018.zip
Downloading files:  67%|██████▋   | 8/12 [00:26<00:14,  3.64s/it]2025-09-24 12:24:51,119 - INFO - Successfully downloaded: GEIH_Octubre_Marco_2018.zip
Downloading files:  75%|███████▌  | 9/12 [00:29<00:09,  3.24s/it]2025-09-24 12:24:54,440 - INFO - Successfully downloaded: GEIH_Diciembre_2022_Marco_2018.zip
Downloading files:  83%|████████▎ | 10/12 [00:32<00:06,  3.26s/it]2025-09-24 12:24:59,539 - INFO - Successfully downloaded: GEIH_Abril_2022_Marco_2018_Act.zip
Downloading files:  92%|█████████▏| 11/12 [00:37<00:03,  3.83s/it]2025-09-24 12:25:02,078 - INFO - Successfully downloaded: GEIH_Noviembre_2022_Marco_2018.act.zip
Downloading files: 100%|██████████| 12/12 [00:40<00:00,  3.36s/it]
2025-09-24 12:25:02,087 - INFO - Processing (depth 0): GEIH_Enero_2022_Marco_2018.zip
2025-09-24 12:25:14,100 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.csv
2025-09-24 12:25:14,118 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:14,139 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Fuerza de trabajo.csv
2025-09-24 12:25:14,160 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:14,177 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_No ocupados.csv
2025-09-24 12:25:14,210 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Ocupados.csv
2025-09-24 12:25:14,237 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otras formas de trabajo.csv
2025-09-24 12:25:14,260 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ingresos e impuestos.csv
2025-09-24 12:25:14,281 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Tipo de investigación.csv
2025-09-24 12:25:15,073 - INFO - Processing (depth 0): GEIH_Febrero_2022_Marco_2018.zip
2025-09-24 12:25:54,279 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:25:54,289 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:54,316 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:25:54,337 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:54,351 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:25:54,367 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:25:54,384 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:25:54,401 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:25:54,419 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:25:54,627 - INFO - Processing (depth 0): GEIH_Marzo_2022_Marco_2018.zip
2025-09-24 12:26:06,808 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:06,821 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:06,840 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:06,860 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Migración.CSV
2025-09-24 12:26:06,871 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_No ocupados.CSV
2025-09-24 12:26:06,890 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Ocupados.CSV
2025-09-24 12:26:06,908 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:06,934 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:06,951 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Tipo de investigación.CSV
2025-09-24 12:26:07,163 - INFO - Processing (depth 0): GEIH_Mayo_2022_Marco_2018.zip
2025-09-24 12:26:07,573 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:26:08,009 - INFO - Extracted: abcbb904_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:08,011 - INFO - Extracted: abcbb904_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:08,013 - INFO - Extracted: abcbb904_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:08,015 - INFO - Extracted: abcbb904_csv_Migración.CSV
2025-09-24 12:26:08,017 - INFO - Extracted: abcbb904_csv_No ocupados.CSV
2025-09-24 12:26:08,021 - INFO - Extracted: abcbb904_csv_Ocupados.CSV
2025-09-24 12:26:08,024 - INFO - Extracted: abcbb904_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:08,028 - INFO - Extracted: abcbb904_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:08,032 - INFO - Extracted: abcbb904_csv_Tipo de investigación.CSV
2025-09-24 12:26:08,036 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:26:13,712 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:26:13,713 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:13,714 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:26:19,948 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:26:19,949 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:19,955 - INFO - Processing (depth 0): GEIH_Junio_2022_Marco_2018.zip
2025-09-24 12:26:31,636 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:31,648 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:31,662 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Fuerza de trabajo.CSV
2025-09-24 12:26:31,678 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Migración.CSV
2025-09-24 12:26:31,688 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_No ocupados.CSV
2025-09-24 12:26:31,706 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Ocupados.CSV
2025-09-24 12:26:31,727 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otras formas de trabajo.CSV
2025-09-24 12:26:31,744 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:26:31,757 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Tipo de investigación.CSV
2025-09-24 12:26:31,990 - INFO - Processing (depth 0): GEIH_Julio_2022_Marco_2018.zip
2025-09-24 12:27:09,303 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:09,318 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:09,337 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:09,362 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:09,373 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:09,392 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:09,411 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:09,428 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:09,442 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:09,665 - INFO - Processing (depth 0): GEIH_Agosto_2022_Marco_2018.zip
2025-09-24 12:27:21,343 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:21,356 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:21,376 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:21,399 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:21,409 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:21,430 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:21,447 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:21,464 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:21,477 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:21,659 - INFO - Processing (depth 0): GEIH_Septiembre_Marco_2018.zip
2025-09-24 12:27:21,901 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:27:22,348 - INFO - Extracted: 25ad86f4_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:22,351 - INFO - Extracted: 25ad86f4_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:22,354 - INFO - Extracted: 25ad86f4_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:22,356 - INFO - Extracted: 25ad86f4_CSV_Migración.CSV
2025-09-24 12:27:22,359 - INFO - Extracted: 25ad86f4_CSV_No ocupados.CSV
2025-09-24 12:27:22,362 - INFO - Extracted: 25ad86f4_CSV_Ocupados.CSV
2025-09-24 12:27:22,367 - INFO - Extracted: 25ad86f4_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:22,370 - INFO - Extracted: 25ad86f4_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:22,373 - INFO - Extracted: 25ad86f4_CSV_Tipo de investigación.CSV
2025-09-24 12:27:22,376 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:27:28,035 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:27:28,036 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:28,037 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:27:33,899 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:27:33,900 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:33,908 - INFO - Processing (depth 0): GEIH_Octubre_Marco_2018.zip
2025-09-24 12:27:34,090 - INFO - Processing (depth 1): CSV 4.zip
2025-09-24 12:27:34,294 - INFO - Extracted: 4a4be0fc_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:34,297 - INFO - Extracted: 4a4be0fc_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:34,300 - INFO - Extracted: 4a4be0fc_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:34,303 - INFO - Extracted: 4a4be0fc_CSV_Migración.CSV
2025-09-24 12:27:34,307 - INFO - Extracted: 4a4be0fc_CSV_No ocupados.CSV
2025-09-24 12:27:34,312 - INFO - Extracted: 4a4be0fc_CSV_Ocupados.CSV
2025-09-24 12:27:34,316 - INFO - Extracted: 4a4be0fc_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:34,319 - INFO - Extracted: 4a4be0fc_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:34,323 - INFO - Extracted: 4a4be0fc_CSV_Tipo de investigación.CSV
2025-09-24 12:27:34,326 - INFO - Processing (depth 1): DTA 4.zip
2025-09-24 12:27:39,914 - WARNING - No matches in DTA 4.zip. Contents:
2025-09-24 12:27:39,915 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:39,916 - INFO - Processing (depth 1): SAV 4.zip
2025-09-24 12:27:45,730 - WARNING - No matches in SAV 4.zip. Contents:
2025-09-24 12:27:45,731 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:45,739 - INFO - Processing (depth 0): GEIH_Diciembre_2022_Marco_2018.zip
2025-09-24 12:27:57,984 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:57,994 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:58,009 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Fuerza de trabajo.CSV
2025-09-24 12:27:58,022 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Migración.CSV
2025-09-24 12:27:58,049 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_No ocupados.CSV
2025-09-24 12:27:58,082 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Ocupados.CSV
2025-09-24 12:27:58,108 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otras formas de trabajo.CSV
2025-09-24 12:27:58,130 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otros ingresos e impuestos.CSV
2025-09-24 12:27:58,143 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Tipo de investigación.CSV
2025-09-24 12:27:58,370 - INFO - Processing (depth 0): GEIH_Abril_2022_Marco_2018_Act.zip
2025-09-24 12:28:10,618 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:10,632 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:10,650 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:10,672 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:28:10,686 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:28:10,706 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:28:10,726 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:10,743 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:10,755 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:28:10,978 - INFO - Processing (depth 0): GEIH_Noviembre_2022_Marco_2018.act.zip
2025-09-24 12:28:22,841 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:22,851 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:22,866 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:22,882 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Migración.CSV
2025-09-24 12:28:22,892 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_No ocupados.CSV
2025-09-24 12:28:22,913 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Ocupados.CSV
2025-09-24 12:28:22,956 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:22,978 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:22,993 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Tipo de investigación.CSV
Processing files: 100%|██████████| 108/108 [00:28<00:00,  3.78it/s]
2025-09-24 12:28:51,874 - INFO - Successfully processed 108/108 files
2025-09-24 12:28:51,875 - INFO - Extraction completed successfully.
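The log above shows that some monthly archives contain further archives (CSV.zip, DTA.zip, SAV.zip), and that only the members matching the requested extensions are extracted. The following sketch reproduces that idea with in-memory archives and the standard zipfile module. It is an illustration under assumptions, not the library's code: make_zip and collect_matching are hypothetical helpers, and the archive contents are invented.

```python
import io
import zipfile


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build a zip archive in memory from a mapping of name -> content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def collect_matching(
    zip_bytes: bytes, extensions: tuple[str, ...], level: int = 0
) -> list[tuple[int, str]]:
    """Return (nesting level, member name) for members with a wanted extension.

    Nested .zip members are opened recursively, one level deeper each time.
    """
    found: list[tuple[int, str]] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".zip"):
                nested = archive.read(name)
                found.extend(collect_matching(nested, extensions, level + 1))
            elif name.endswith(extensions):
                found.append((level, name))
    return found


# Invented layout mimicking a monthly archive that wraps per-format archives.
inner_csv = make_zip({"Ocupados.CSV": b"A;B\n1;2\n"})
inner_dta = make_zip({"Ocupados.dta": b"binary"})
monthly = make_zip({"CSV.zip": inner_csv, "DTA.zip": inner_dta})

matches = collect_matching(monthly, (".CSV", ".csv"))
print(matches)
```

The DTA.zip member yields nothing here, which corresponds to the "No files found matching the specified extensions" warnings in the log.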
col_dfs[0].head()
PERIODO DIRECTORIO SECUENCIA_P ORDEN HOGAR P7495 P7500S1 P7500S1A1 P7500S2 P7500S2A1 ... P3371S1 P3371S2 P3371S3 P3371S4 P3372 P3372S1 FEX_C18 PER REGIS filename
0 20220104 5000000 1 1 1 2 <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> 2 <NA> 1432.4633227 2022 90 c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
1 20220104 5000000 1 2 1 2 <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> 2 <NA> 1432.4633227 2022 90 c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
2 20220104 5000000 1 6 1 2 <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> 2 <NA> 1432.4633227 2022 90 c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
3 20220104 5000001 1 1 1 2 <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> 2 <NA> 1088.7962663 2022 90 c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...
4 20220104 5000001 1 2 1 2 <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> 2 <NA> 1088.7962663 2022 90 c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ...

5 rows × 60 columns
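Each returned DataFrame carries a filename column, and the log shows that the extracted names end with the survey module (for example Ocupados or Migración). A minimal sketch of grouping files by module from those names follows. The helper module_of is hypothetical, and it assumes the module name is whatever follows the last underscore, which holds for the names logged above but is not guaranteed in general.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def module_of(filename: str) -> str:
    """Return the text after the last underscore, without the extension."""
    stem = PurePosixPath(filename).stem
    return stem.rsplit("_", 1)[-1]


# File names copied from the extraction log.
filenames = [
    "c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Ocupados.csv",
    "fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Ocupados.CSV",
    "abcbb904_csv_Migración.CSV",
    "c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Migración.CSV",
]

# Group the file names by the module they belong to.
by_module: dict[str, list[str]] = defaultdict(list)
for name in filenames:
    by_module[module_of(name)].append(name)

for module, names in sorted(by_module.items()):
    print(module, len(names))
```

With the real DataFrames, the same helper could be applied to the first value of each frame's filename column to decide which frames belong together.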

Use case 2: Extracting data from Brazil#

The Brazilian data is downloaded from the Brazilian Institute of Geography and Statistics (IBGE) website, again using the Extractor class. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024.

⚠️
Important: The is_fwf parameter is set to True, which indicates that the data files are in fixed-width format. In that case, the colnames and colspecs parameters must be provided. In this example, they are set to the corresponding enums available for PNADC data, which define the column names and column specifications for the dataset. See the socio4health.enums.data_info_enum documentation for more details.
bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0,
)
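To make the fixed-width idea concrete, here is a small standard-library sketch of how column names and column specifications turn one line of text into named fields. The three-column layout is illustrative only and is not the content of the PNADC enums. The sketch also assumes that each specification is a half-open (start, end) pair of character positions, the convention used by pandas.read_fwf. The helper parse_fixed_width is hypothetical.

```python
def parse_fixed_width(
    line: str, colnames: list[str], colspecs: list[tuple[int, int]]
) -> dict[str, str]:
    """Slice one fixed-width line into named fields.

    Each spec is a half-open (start, end) pair of character positions.
    """
    return {
        name: line[start:end].strip()
        for name, (start, end) in zip(colnames, colspecs)
    }


# Illustrative layout: 4 characters, then 1, then 2.
colnames = ["Ano", "Trimestre", "UF"]
colspecs = [(0, 4), (4, 5), (5, 7)]

record = parse_fixed_width("2024111", colnames, colspecs)
print(record)
```

Because a fixed-width file has no delimiter, the column boundaries cannot be inferred reliably from the data alone, which is why both lists have to be supplied.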

bra_dfs = bra_online_extractor.s4h_extract()
2025-09-24 12:30:08,711 - INFO - ----------------------
2025-09-24 12:30:08,713 - INFO - Starting data extraction...
2025-09-24 12:30:08,713 - INFO - Extracting data in online mode...
2025-09-24 12:30:08,715 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-24 12:30:13,261 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-24 12:30:13,264 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/4 [00:00<?, ?it/s]2025-09-24 12:30:28,678 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files:  25%|██▌       | 1/4 [00:15<00:46, 15.41s/it]2025-09-24 12:41:31,509 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files:  50%|█████     | 2/4 [11:18<13:12, 396.25s/it]2025-09-24 12:41:57,057 - INFO - Successfully downloaded: PNADC_032024_20250815.zip
Downloading files:  75%|███████▌  | 3/4 [11:43<03:46, 226.98s/it]2025-09-24 12:42:33,947 - INFO - Successfully downloaded: PNADC_042024_20250815.zip
Downloading files: 100%|██████████| 4/4 [12:20<00:00, 185.17s/it]
2025-09-24 12:42:33,956 - INFO - Processing (depth 0): PNADC_012024_20250815.zip
2025-09-24 12:42:49,933 - INFO - Extracted: a7db871d_PNADC_012024.txt
2025-09-24 12:42:49,942 - INFO - Processing (depth 0): PNADC_022024_20250815.zip
2025-09-24 12:43:32,038 - INFO - Extracted: 946bef7f_PNADC_022024.txt
2025-09-24 12:43:32,041 - INFO - Processing (depth 0): PNADC_032024_20250815.zip
2025-09-24 12:44:02,915 - INFO - Extracted: c8ebcb57_PNADC_032024.txt
2025-09-24 12:44:02,916 - INFO - Processing (depth 0): PNADC_042024_20250815.zip
2025-09-24 12:44:16,725 - INFO - Extracted: c1bffc2a_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [04:57<00:00, 74.45s/it]
2025-09-24 12:49:14,550 - INFO - Successfully processed 4/4 files
2025-09-24 12:49:14,554 - INFO - Extraction completed successfully.
bra_dfs[0].head()
Ano Trimestre UF Capital RM_RIDE UPA Estrato V1008 V1014 V1016 ... V1028192 V1028193 V1028194 V1028195 V1028196 V1028197 V1028198 V1028199 V1028200 filename
0 2024 1 11 11 <NA> 110000016 1110011 03 11 1 ... 000242.37393247 000000.00000000 000000.00000000 000132.86482247 000252.85458864 000271.03799675 000122.61081652 000125.78602243 000113.09511303 a7db871d_PNADC_012024.txt
1 2024 1 11 11 <NA> 110000016 1110011 06 11 1 ... 000405.66107457 000000.00000000 000000.00000000 000205.06572241 000410.23613176 000437.83686366 000190.08927267 000200.15696949 000182.15329508 a7db871d_PNADC_012024.txt
2 2024 1 11 11 <NA> 110000016 1110011 06 11 1 ... 000405.66107457 000000.00000000 000000.00000000 000205.06572241 000410.23613176 000437.83686366 000190.08927267 000200.15696949 000182.15329508 a7db871d_PNADC_012024.txt
3 2024 1 11 11 <NA> 110000016 1110011 08 11 1 ... 000485.38386591 000000.00000000 000000.00000000 000242.53160028 000474.75504741 000520.88948037 000223.17316781 000229.81795045 000213.22589782 a7db871d_PNADC_012024.txt
4 2024 1 11 11 <NA> 110000016 1110011 08 11 1 ... 000485.38386591 000000.00000000 000000.00000000 000242.53160028 000474.75504741 000520.88948037 000223.17316781 000229.81795045 000213.22589782 a7db871d_PNADC_012024.txt

5 rows × 421 columns
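In the preview above, values such as 000242.37393247 and 03 keep their zero padding, which suggests the columns are held as text rather than numbers. A minimal sketch of converting such values before computing with them is shown below. The helper to_float is hypothetical, and treating empty strings and the literal <NA> as missing is an assumption made for this illustration.

```python
from typing import Optional


def to_float(value: Optional[str]) -> Optional[float]:
    """Convert a zero-padded numeric string to float, keeping missing as None."""
    if value is None or value.strip() in {"", "<NA>"}:
        return None
    return float(value)


# Values copied from the preview, plus one missing entry.
weights = ["000242.37393247", "000000.00000000", None]
converted = [to_float(value) for value in weights]
print(converted)
```

On a DataFrame column, pandas.to_numeric with errors="coerce" plays a similar role.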

Use case 3: Extracting data from Peru#

Peruvian data is extracted from the National Institute of Statistics and Informatics (INEI) website. In this case, we are extracting the National Household Survey (ENAHO) for the year 2022. The down_ext parameter is set to download .csv and .zip files. Because input_path points directly to a .zip file, the Extractor skips scraping and downloads the archive itself. No sep argument is passed in this call, so the Extractor's default separator applies.

per_online_extractor = Extractor(
    input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip",
    down_ext=['.csv', '.zip'],
    output_path="../data",
    depth=0,
)
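Unlike the Colombian and Brazilian URLs, this input_path points at a file rather than a page that lists files. One plausible way to tell the two cases apart is to compare the URL path's suffix with the requested extensions, as sketched here. This is an assumption about how such a check could work, not the library's actual logic, and is_direct_file_url is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_direct_file_url(url: str, down_ext: list[str]) -> bool:
    """Return True when the URL path ends with one of the wanted extensions."""
    path = urlparse(url).path
    return PurePosixPath(path).suffix in down_ext


peru_url = "https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip"
colombia_url = "https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata"

print(is_direct_file_url(peru_url, [".csv", ".zip"]))
print(is_direct_file_url(colombia_url, [".CSV", ".csv", ".zip"]))
```

A URL classified as a direct file can be downloaded as is, while a page URL first has to be scanned for links to matching files.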

per_dfs = per_online_extractor.s4h_extract()
2025-09-24 12:50:18,463 - INFO - ----------------------
2025-09-24 12:50:18,464 - INFO - Starting data extraction...
2025-09-24 12:50:18,465 - INFO - Extracting data in online mode...
2025-09-24 12:50:18,466 - INFO - Detected direct file download URL - skipping scraping
2025-09-24 12:50:18,467 - INFO - Downloading large file (2022.zip)...
0.00B [00:00, ?B/s]2025-09-24 12:57:51,195 - INFO - Successfully downloaded: 2022.zip
0.00B [07:32, ?B/s]
2025-09-24 12:57:51,201 - INFO - Processing (depth 0): 2022.zip
2025-09-24 12:58:01,073 - INFO - Extracted: a0643d91_784-Modulo01_Enaho01-2022-100.csv
2025-09-24 12:58:01,083 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,090 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,112 - INFO - Extracted: a0643d91_784-Modulo02_Enaho01-2022-200.csv
2025-09-24 12:58:01,261 - INFO - Extracted: a0643d91_784-Modulo03_Enaho01a-2022-300.csv
2025-09-24 12:58:01,265 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-PAISES.csv
2025-09-24 12:58:01,270 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-UBIGEO.csv
2025-09-24 12:58:01,545 - INFO - Extracted: a0643d91_784-Modulo04_Enaho01a-2022-400.csv
2025-09-24 12:58:01,551 - INFO - Extracted: a0643d91_784-Modulo04_TABLA-UBIGEO-1874 DISTRITOS.csv
2025-09-24 12:58:01,557 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV3.csv
2025-09-24 12:58:01,564 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV4.csv
2025-09-24 12:58:01,572 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,580 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,890 - INFO - Extracted: a0643d91_784-Modulo05_Enaho01a-2022-500.csv
2025-09-24 12:58:04,521 - INFO - Extracted: a0643d91_784-Modulo07_Enaho01-2022-601.csv
2025-09-24 12:58:04,602 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602.csv
2025-09-24 12:58:04,618 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602A.csv
2025-09-24 12:58:05,144 - INFO - Extracted: a0643d91_784-Modulo09_Enaho01-2022-603.csv
2025-09-24 12:58:06,157 - INFO - Extracted: a0643d91_784-Modulo10_Enaho01-2022-604.csv
2025-09-24 12:58:06,679 - INFO - Extracted: a0643d91_784-Modulo11_Enaho01-2022-605.csv
2025-09-24 12:58:07,338 - INFO - Extracted: a0643d91_784-Modulo12_Enaho01-2022-606.csv
2025-09-24 12:58:08,099 - INFO - Extracted: a0643d91_784-Modulo13_Enaho01-2022-607.csv
2025-09-24 12:58:08,837 - INFO - Extracted: a0643d91_784-Modulo15_Enaho01-2022-609.csv
2025-09-24 12:58:09,513 - INFO - Extracted: a0643d91_784-Modulo16_Enaho01-2022-610.csv
2025-09-24 12:58:10,013 - INFO - Extracted: a0643d91_784-Modulo17_Enaho01-2022-611.csv
2025-09-24 12:58:10,753 - INFO - Extracted: a0643d91_784-Modulo18_Enaho01-2022-612.csv
2025-09-24 12:58:10,778 - INFO - Extracted: a0643d91_784-Modulo22_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:10,783 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000.csv
2025-09-24 12:58:10,938 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000A.csv
2025-09-24 12:58:11,015 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2100.csv
2025-09-24 12:58:11,104 - INFO - Extracted: a0643d91_784-Modulo23_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,174 - INFO - Extracted: a0643d91_784-Modulo23_Enaho02-2022-2200.csv
2025-09-24 12:58:11,303 - INFO - Extracted: a0643d91_784-Modulo24_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,375 - INFO - Extracted: a0643d91_784-Modulo24_Enaho02-2022-2300.csv
2025-09-24 12:58:11,473 - INFO - Extracted: a0643d91_784-Modulo25_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,602 - INFO - Extracted: a0643d91_784-Modulo25_Enaho02-2022-2400.csv
2025-09-24 12:58:11,799 - INFO - Extracted: a0643d91_784-Modulo26_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,816 - INFO - Extracted: a0643d91_784-Modulo26_Enaho02-2022-2500.csv
2025-09-24 12:58:12,003 - INFO - Extracted: a0643d91_784-Modulo27_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,078 - INFO - Extracted: a0643d91_784-Modulo27_Enaho02-2022-2600.csv
2025-09-24 12:58:12,111 - INFO - Extracted: a0643d91_784-Modulo28_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,126 - INFO - Extracted: a0643d91_784-Modulo28_Enaho02-2022-2700.csv
2025-09-24 12:58:12,241 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022-12g.csv
2025-09-24 12:58:12,269 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022.csv
2025-09-24 12:58:12,343 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700.csv
2025-09-24 12:58:12,377 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700A.csv
2025-09-24 12:58:12,494 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700B.csv
2025-09-24 12:58:12,701 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-1-Preg-1-a-13.csv
2025-09-24 12:58:12,719 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-2-Preg-14-a-22.csv
2025-09-24 12:58:12,894 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-3-Preg-23.csv
2025-09-24 12:58:12,981 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-4-Preg-24.csv
2025-09-24 12:58:13,023 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-5-Preg-25.csv
2025-09-24 12:58:13,814 - INFO - Extracted: a0643d91_784-Modulo78_Enaho01-2022-606D.csv
2025-09-24 12:58:13,878 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800A.csv
2025-09-24 12:58:13,948 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800B.csv
2025-09-24 12:58:14,253 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-1.csv
2025-09-24 12:58:14,415 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-2.csv
Processing files: 100%|██████████| 57/57 [00:59<00:00,  1.05s/it]
2025-09-24 12:59:14,152 - INFO - Successfully processed 57/57 files
2025-09-24 12:59:14,153 - INFO - Extraction completed successfully.
per_dfs[0].head()
AÑO MES CONGLOME VIVIENDA HOGAR UBIGEO DOMINIO ESTRATO P612N P612 ... P612H TICUEST01 D612G D612H I612G I612H FACTOR07 NCONGLOME SUB_CONGLOME filename
0 2022 01 005030 008 11 010201 7 4 1 2 ... 2 106.890243530273 006618 00 a0643d91_784-Modulo18_Enaho01-2022-612.csv
1 2022 01 005030 008 11 010201 7 4 2 1 ... 1502 2 1526.17028808594 152.617034912109 106.890243530273 006618 00 a0643d91_784-Modulo18_Enaho01-2022-612.csv
2 2022 01 005030 008 11 010201 7 4 3 2 ... 2 106.890243530273 006618 00 a0643d91_784-Modulo18_Enaho01-2022-612.csv
3 2022 01 005030 008 11 010201 7 4 4 1 ... 120 2 121.931053161621 10.1609210968018 106.890243530273 006618 00 a0643d91_784-Modulo18_Enaho01-2022-612.csv
4 2022 01 005030 008 11 010201 7 4 5 2 ... 2 106.890243530273 006618 00 a0643d91_784-Modulo18_Enaho01-2022-612.csv

5 rows × 25 columns

Further steps#

  • Harmonize the extracted data using the Harmonizer class from the socio4health library. You can follow the Harmonization tutorial for more details.