{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "<img src=\"https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/image.png\" alt=\"image info\" height=\"100\" width=\"100\"/>\n",
   "id": "ce4b64d6881c2a34"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "\n",
    "\n",
    "# Extraction of Colombia, Brazil and Peru online data"
   ],
   "id": "e2b0c70b588bed1d"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "**Run the tutorial via free cloud platforms:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/harmonize-tools/socio4health/HEAD?urlpath=%2Fdoc%2Ftree%2Fdocs%2Fsource%2Fnotebooks%2Fextractor.ipynb) <a target=\"_blank\" href=\"https://colab.research.google.com/github/harmonize-tools/socio4health/blob/main/docs/source/notebooks/extractor.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>\n",
    "\n"
   ],
   "id": "bbb9be9fde0e004a"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "This notebook introduces how to retrieve data from online sources through **web scraping**, as well as from local files, for Colombia, Brazil, Peru, and the Dominican Republic. It assumes an **intermediate** or **advanced** understanding of Python and data manipulation.\n",
    "\n",
    "## Setting up the environment\n",
    "\n",
    "To run this notebook, you need to have the following prerequisites:\n",
    "\n",
    "- Python 3.10+\n",
    "\n",
    "Additionally, you need to install the `socio4health` and `pandas` packages, which can be done using `pip`:\n",
    "\n"
   ],
   "id": "695f4aa2c770640a"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-10T16:49:44.305739Z",
     "start_time": "2025-07-10T16:49:30.215946Z"
    }
   },
   "cell_type": "code",
   "source": "!pip install socio4health pandas -q",
   "id": "a29a453e4e438474",
   "outputs": [],
   "execution_count": 2
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Import Libraries\n",
    "\n",
    "The `socio4health` library provides the `Extractor` class for data extraction and the `Harmonizer` class for harmonizing the retrieved data. We will also use `pandas` for data manipulation.\n"
   ],
   "id": "a9faa7b1a0405434"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-10T16:49:44.360496Z",
     "start_time": "2025-07-10T16:49:44.349010Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import datetime\n",
    "import pandas as pd\n",
    "from socio4health import Extractor\n",
    "from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum\n",
    "from socio4health.harmonizer import Harmonizer\n",
    "from socio4health.utils import harmonizer_utils"
   ],
   "id": "d0e08601b93ce10d",
   "outputs": [],
   "execution_count": 3
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Use case 1: Extracting data from Colombia\n",
    "\n",
    "To extract data from Colombia, we will use the `Extractor` class from the `socio4health` library. The `Extractor` class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the 2022 Large Integrated Household Survey (Gran Encuesta Integrada de Hogares, **GEIH**) dataset from the website of Colombia's National Administrative Department of Statistics (**DANE**).\n",
    "\n",
    "The `Extractor` class requires the following parameters:\n",
    "- `input_path`: The `URL` or local path to the data source.\n",
    "- `down_ext`: A list of file extensions to download. This can include `.CSV`, `.csv`, `.zip`, etc.\n",
    "- `sep`: The separator used in the data files (e.g., `;` for semicolon-separated values).\n",
    "- `output_path`: The local path where the extracted data will be saved.\n",
    "- `depth`: The depth of the directory structure to traverse when downloading files. A depth of `0` means only the files in the specified directory will be downloaded.\n"
   ],
   "id": "d117d3d107ee158b"
  },
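  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "As a rough illustration of why `sep` matters, the sketch below parses a small semicolon-separated sample with plain `pandas`. The sample text and its column names are hypothetical, and the sketch does not show how `Extractor` reads files internally.\n"
   ],
   "id": "sep-sketch-md"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical semicolon-separated sample; not real GEIH records.\n",
    "raw = 'household_id;members;area\\n1;4;urban\\n2;2;rural\\n'\n",
    "\n",
    "# With the matching separator, the header splits into three columns.\n",
    "with_sep = pd.read_csv(io.StringIO(raw), sep=';')\n",
    "\n",
    "# With the pandas default (a comma), each line stays in a single column.\n",
    "default_sep = pd.read_csv(io.StringIO(raw))\n",
    "\n",
    "print(with_sep.columns.tolist())\n",
    "print(default_sep.columns.tolist())"
   ],
   "id": "sep-sketch-code",
   "outputs": [],
   "execution_count": null
  },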
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-10T16:53:40.152716Z",
     "start_time": "2025-07-10T16:53:40.107571Z"
    }
   },
   "cell_type": "code",
   "source": "col_online_extractor = Extractor(input_path=\"https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata\", down_ext=['.CSV','.csv','.zip'], sep=';', output_path=\"../data\", depth=0)",
   "id": "d881365674fef602",
   "outputs": [],
   "execution_count": 4
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "After the instance is set up, we can call the `extract` method to download and extract the data. The method returns a list of `pandas` DataFrames containing the extracted data.",
   "id": "b0583ac89ee19937"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-08-11T20:39:34.498226Z",
     "start_time": "2025-08-11T20:39:32.660687Z"
    }
   },
   "cell_type": "code",
   "source": "col_dfs = col_online_extractor.extract()",
   "id": "20265f6f8a3355ad",
   "outputs": [],
   "execution_count": 1
  },
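  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "A quick way to get an overview of such a list is to loop over it and print the shape and first rows of each table. The sketch below assumes, as described above, that `extract` returns a list of `pandas` DataFrames. It uses a small hypothetical stand-in list (`example_dfs`) so that it runs without downloading anything.\n"
   ],
   "id": "inspect-sketch-md"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the list returned by `extract()`; not GEIH data.\n",
    "example_dfs = [\n",
    "    pd.DataFrame({'id': [1, 2], 'age': [34, 51]}),\n",
    "    pd.DataFrame({'id': [1, 2, 3], 'income': [1200.0, 850.0, 430.0]}),\n",
    "]\n",
    "\n",
    "# Report the size of each table and preview its first rows.\n",
    "for i, df in enumerate(example_dfs):\n",
    "    print(f'table {i}: {df.shape[0]} rows x {df.shape[1]} columns')\n",
    "    print(df.head(2))"
   ],
   "id": "inspect-sketch-code",
   "outputs": [],
   "execution_count": null
  },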
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Use case 2: Extracting data from Brazil\n",
    "\n",
    "Brazilian data is downloaded from the Brazilian Institute of Geography and Statistics (**IBGE**) website, again using the `Extractor` class. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (**PNADC**) for the year 2024.\n",
    "\n"
   ],
   "id": "bb348d1d04c4bc92"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "<div style=\"border-left: 4px solid #e74c3c; background: #fdecea; color: #222; padding: 0.5em 1em; margin: 1em 0; display: flex; align-items: center;\">\n",
    "  <span style=\"font-size: 20px; margin-right: 10px;\">⚠️</span>\n",
    "  <div>\n",
    "    <strong>Important:</strong> The <code>is_fwf</code> parameter is set to <code>True</code>, which indicates that the data files are in fixed-width format. In that case, the <code>colnames</code> and <code>colspecs</code> parameters must be provided. In this example, they are set to the corresponding available enums for <strong>PNADC</strong> data, which define the column names and specifications for the dataset.\n",
    "    See more details in\n",
    "    <a href=\"https://harmonize-tools.github.io/socio4health/socio4health.enums.html#module-socio4health.enums.data_info_enum\" target=\"_blank\">\n",
    "      socio4health.enums.data_info_enum documentation\n",
    "    </a>.\n",
    "  </div>\n",
    "</div>"
   ],
   "id": "dc0c4a7f45353106"
  },
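  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "To make the idea of column names and column specifications concrete, here is a minimal sketch that parses a tiny fixed-width sample with `pandas.read_fwf`. The sample lines, the column names, and the character ranges are hypothetical and are not the real PNADC layout. Whether `socio4health` uses the same conventions internally is an assumption.\n"
   ],
   "id": "fwf-sketch-md"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical fixed-width sample: 4 chars year, 1 char quarter, 2 chars state code.\n",
    "raw = io.StringIO('2024111\\n2024233\\n')\n",
    "\n",
    "# Each (start, end) pair is a half-open character range, as pandas.read_fwf expects.\n",
    "example_colspecs = [(0, 4), (4, 5), (5, 7)]\n",
    "example_colnames = ['year', 'quarter', 'state_code']\n",
    "\n",
    "# header=None because the sample has no header row; names supplies the labels.\n",
    "sample = pd.read_fwf(raw, colspecs=example_colspecs, names=example_colnames, header=None)\n",
    "print(sample)"
   ],
   "id": "fwf-sketch-code",
   "outputs": [],
   "execution_count": null
  },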
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-10T17:02:08.409996Z",
     "start_time": "2025-07-10T17:02:08.338028Z"
    }
   },
   "cell_type": "code",
   "source": [
    "bra_online_extractor = Extractor(input_path=\"https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/\", down_ext=['.txt','.zip'], is_fwf=True, colnames=BraColnamesEnum.PNADC.value, colspecs=BraColspecsEnum.PNADC.value, output_path=\"../data\", depth=0)\n",
    "\n",
    "bra_dfs = bra_online_extractor.extract()\n"
   ],
   "id": "c3f72a2dfb88c5cb",
   "outputs": [],
   "execution_count": 5
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "\n",
    "## Use case 3: Extracting data from Peru\n",
    "\n",
    "Peruvian data is extracted from the National Institute of Statistics and Informatics (**INEI**) website. In this case, we are extracting the National Household Survey (**ENAHO**) for the year 2022. The `down_ext` parameter is set to download `.csv` and `.zip` files. No `sep` argument is passed in the call below, so the extractor's default separator applies. If the files turn out to be semicolon-separated, pass `sep=';'` as in the Colombian example."
   ],
   "id": "efc27ffdb9ede231"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-10T17:07:42.742302Z",
     "start_time": "2025-07-10T17:07:42.627951Z"
    }
   },
   "cell_type": "code",
   "source": [
    "per_online_extractor = Extractor(input_path=\"https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip\", down_ext=['.csv','.zip'], output_path=\"../data\", depth=0)\n",
    "\n",
    "per_dfs = per_online_extractor.extract()"
   ],
   "id": "865a1ba69ecfeb3d",
   "outputs": [],
   "execution_count": 7
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "\n",
    "## Further steps\n",
    "* Harmonize the extracted data using the `Harmonizer` class from the `socio4health` library. You can follow the [Harmonization tutorial](https://harmonize-tools.github.io/socio4health/notebooks/harmonization.html) for more details."
   ],
   "id": "4f461db517653ec9"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
     "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}