
Extraction of Colombia, Brazil and Peru online data#

Run the tutorial via free cloud platforms: Binder or Google Colab.

This notebook introduces how to retrieve data from online sources through web scraping, as well as from local files, for Colombia, Brazil, and Peru. This tutorial assumes an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need to have the following prerequisites:

  • Python 3.10+

Additionally, you need to install the socio4health and pandas packages, which can be done using pip:

!pip install socio4health pandas -q

Import Libraries#

To perform the data extraction, the socio4health library provides the Extractor class; the Harmonizer class harmonizes the retrieved data. We will also use pandas for data manipulation.

import datetime
import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils

Use case 1: Extracting data from Colombia#

To extract data from Colombia, we will use the Extractor class from the socio4health library. The Extractor class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey - GEIH - 2022 (Gran Encuesta Integrada de Hogares - GEIH - 2022) dataset from the website of the Colombian National Administrative Department of Statistics (DANE).

The Extractor class requires the following parameters:

  • input_path: The URL or local path to the data source.

  • down_ext: A list of file extensions to download. This can include .CSV, .csv, .zip, etc.

  • sep: The separator used in the data files (e.g., ; for semicolon-separated values).

  • output_path: The local path where the extracted data will be saved.

  • depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.

col_online_extractor = Extractor(
    input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata",
    down_ext=['.CSV', '.csv', '.zip'],
    sep=';',
    output_path="../data",
    depth=0
)

After the instance is set up, we can call the extract method to download and extract the data. The method returns a list of pandas DataFrames containing the extracted data.

col_dfs = col_online_extractor.extract()
Note that if extraction is run inside an already-running asyncio event loop (as in some hosted notebook environments), the underlying Scrapy crawl may fail with RuntimeError: This event loop is already running. Running the notebook locally or as a standalone script avoids this limitation.
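When the crawl succeeds, extract() returns a list of pandas DataFrames, one per downloaded table. A minimal sketch of inspecting such a list, using toy stand-in tables rather than real GEIH output (the column names here are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the list of tables Extractor.extract() returns;
# real GEIH tables have survey-specific columns.
dfs = [
    pd.DataFrame({"DIRECTORIO": [1, 2], "P6020": [1, 2]}),
    pd.DataFrame({"DIRECTORIO": [3], "P6040": [25]}),
]

# Summarize how many tables were extracted and their shapes
for i, df in enumerate(dfs):
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
```

A quick pass like this helps confirm that all expected survey modules were downloaded before moving on to harmonization.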

Use case 2: Extracting data from Brazil#

Brazilian data is downloaded from the Brazilian Institute of Geography and Statistics (IBGE) website using the Extractor class. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024.

⚠️
Important: The is_fwf parameter is set to True, indicating that the data files are in fixed-width format. In that case, the colnames and colspecs parameters must be provided. In this example, they are set to the corresponding enums available for PNADC data, which define the column names and column specifications for the dataset. See the socio4health.enums.data_info_enum documentation for more details.

bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0
)

bra_dfs = bra_online_extractor.extract()
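To see what colnames and colspecs mean for fixed-width data, consider how pandas itself parses such files. The snippet below is a toy illustration with made-up names and byte offsets; the real PNADC layout comes from BraColnamesEnum.PNADC and BraColspecsEnum.PNADC:

```python
import io
import pandas as pd

# Toy fixed-width sample: each field occupies fixed character positions.
raw = "2024011\n2024022\n"
colnames = ["Ano", "Trimestre", "UF"]   # illustrative names only
colspecs = [(0, 4), (4, 6), (6, 7)]     # (start, end) character offsets

# pandas slices each line at the given offsets and labels the columns
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=colnames)
print(df)
```

The Extractor applies the same idea internally: colnames labels the columns and colspecs tells the parser where each field begins and ends on every line.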

Use case 3: Extracting data from Peru#

Peruvian data is extracted from the National Institute of Statistics and Informatics (INEI) website. In this case, we are extracting the National Household Survey (ENAHO) for the year 2022. The down_ext parameter is set to download .csv and .zip files, and the sep parameter is set to ;, indicating that the data files are semicolon-separated values.

per_online_extractor = Extractor(
    input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip",
    down_ext=['.csv', '.zip'],
    sep=';',
    output_path="../data",
    depth=0
)

per_dfs = per_online_extractor.extract()
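Household surveys such as ENAHO are typically split into modules that share household identifier keys, so the extracted DataFrames can be combined with a pandas merge. A minimal sketch with hypothetical stand-in modules and an illustrative key column:

```python
import pandas as pd

# Hypothetical stand-ins for two extracted ENAHO modules; the key
# column name is illustrative, not the survey's actual layout.
enaho_mod_a = pd.DataFrame({"CONGLOME": ["001", "002"], "P101": [1, 2]})
enaho_mod_b = pd.DataFrame({"CONGLOME": ["001", "002"], "P203": [3, 4]})

# Join the modules on the shared household key
merged = enaho_mod_a.merge(enaho_mod_b, on="CONGLOME")
print(merged.shape)  # (2, 3)
```

Merging on the shared keys yields one row per household with columns drawn from both modules, which is usually the shape you want before harmonization.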

Further steps#

  • Harmonize the extracted data using the Harmonizer class from the socio4health library. You can follow the Harmonization tutorial for more details.