{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "\"image\n", "id": "48dfedfd3488ed92" }, { "metadata": {}, "cell_type": "markdown", "source": "# Harmonization of data", "id": "422e92e63d201714" }, { "metadata": {}, "cell_type": "markdown", "source": [ "**Run the tutorial via free cloud platforms:** [![badge](https://img.shields.io/badge/launch-binder-E66581.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://github.com/harmonize-tools/socio4health/blob/main/docs/source/notebooks/extractor.ipynb) \n", " \"Open\n", "\n", "\n" ], "id": "1599a585fa8204d6" }, { "metadata": {}, "cell_type": "markdown", "source": [ "This notebook provides you with a tutorial on how to process the sociodemographic and economic data from online data sources from **Brazil**. This tutorial assumes you have an **intermediate** or **advanced** understanding of **Python** and data manipulation.\n", "\n", "## Setting up the enviornment\n", "\n", "To run this notebook, you need to have the following prerequisites:\n", "\n", "- Python 3.10+\n", "\n", "Additionally, you need to install the `socio4health` and `pandas` package, which can be done using ``pip``:\n", "\n" ], "id": "a04649ff2f2f8680" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:17:52.108769Z", "start_time": "2025-08-11T16:17:46.522060Z" } }, "cell_type": "code", "source": "!pip install socio4health pandas ipywidgets -q", "id": "59bb2e9841851c30", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "[notice] A new release of pip is available: 25.1.1 -> 25.2\n", "[notice] To update, run: python.exe -m pip install --upgrade pip\n" ] } ], "execution_count": 2 }, { "metadata": {}, "cell_type": "markdown", "source": "## Import Libraries", "id": "af633dbea31aaaab" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:18:23.504993Z", "start_time": "2025-08-11T16:18:23.496551Z" } }, "cell_type": "code", "source": [ "import pandas as pd\n", "from socio4health import Extractor\n", "from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum\n", "from socio4health.harmonizer import Harmonizer\n", "from socio4health.utils import harmonizer_utils\n", "import tqdm as tqdm\n" ], "id": "e448c769134fe36d", "outputs": [], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Extracting data from Brazil\n", "\n", "In this example, we will extract the Brazilian National Continuous Household Sample Survey (**PNADC**) for the year 2024 from the Brazilian Institute of Geography and Statistics (**IBGE**) website." ], "id": "8b286730445109e9" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:18:26.327230Z", "start_time": "2025-08-11T16:18:26.316349Z" } }, "cell_type": "code", "source": "bra_online_extractor = Extractor(input_path=\"https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/\", down_ext=['.txt','.zip'], is_fwf=True, colnames=BraColnamesEnum.PNADC.value, colspecs=BraColspecsEnum.PNADC.value, output_path=\"../data\", depth=0)", "id": "338d2512725fe9f0", "outputs": [], "execution_count": 5 }, { "metadata": {}, "cell_type": "markdown", "source": "## Providing the raw dictionary", "id": "79484d3e09b568d6" }, { "metadata": {}, "cell_type": "markdown", "source": "We need to provide a **raw dictionary** to the **harmonizer** that contains the column names and their corresponding data types. This is necessary for the harmonization process, as it allows the harmonizer to understand the structure of the data. To know more about how to construct the raw dictionary, you can check the [documentation](https://harmonize-tools.github.io/socio4health/dictionary.html).", "id": "ac3ad903ed8cb378" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:18:30.084645Z", "start_time": "2025-08-11T16:18:29.613160Z" } }, "cell_type": "code", "source": "raw_dict = pd.read_excel('raw_dictionary.xlsx')", "id": "6773c8ef688101e2", "outputs": [], "execution_count": 6 }, { "metadata": {}, "cell_type": "markdown", "source": "The raw dictionary is then standardized using the `standardize_dict` method, which ensures that the dictionary is in a consistent format, making it easier to work with during the harmonization process.", "id": "492302dd9245be14" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:18:33.116476Z", "start_time": "2025-08-11T16:18:32.773710Z" } }, "cell_type": "code", "source": "dic = harmonizer_utils.standardize_dict(raw_dict)", "id": "65ab398829a1ad1f", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\isabe\\PycharmProjects\\socio4health\\src\\socio4health\\utils\\harmonizer_utils.py:78: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n", " .apply(_process_group, include_groups=True)\\\n" ] } ], "execution_count": 7 }, { "metadata": {}, "cell_type": "markdown", "source": "Additionally, the content of columns of the dictionary can be translated into English using `translate_column` function from `harmonizer_utils` module. Translation is performed for facilitate the understanding and processing of the data.", "id": "a2f7ece82aa5d5d" }, { "metadata": {}, "cell_type": "markdown", "source": [ "
\n", " ⚠️\n", "
\n", " Warning: translate_column method may take some time depending on the size of the dictionary and the number of columns to be translated. It is recommended to use this method only if you need the content of the columns in English for further processing or analysis.\n", "
\n", "
" ], "id": "3eb0549ddf3d1ea4" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:19:24.280658Z", "start_time": "2025-08-11T16:18:44.709846Z" } }, "cell_type": "code", "source": [ "dic = harmonizer_utils.translate_column(dic, \"question\", language=\"en\")\n", "dic = harmonizer_utils.translate_column(dic, \"description\", language=\"en\")\n", "dic = harmonizer_utils.translate_column(dic, \"possible_answers\", language=\"en\")\n" ], "id": "aea219ca608eab2a", "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mKeyboardInterrupt\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[8]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[32m----> \u001B[39m\u001B[32m1\u001B[39m dic = \u001B[43mharmonizer_utils\u001B[49m\u001B[43m.\u001B[49m\u001B[43mtranslate_column\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdic\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mquestion\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mlanguage\u001B[49m\u001B[43m=\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43men\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[32m 2\u001B[39m dic = harmonizer_utils.translate_column(dic, \u001B[33m\"\u001B[39m\u001B[33mdescription\u001B[39m\u001B[33m\"\u001B[39m, language=\u001B[33m\"\u001B[39m\u001B[33men\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m 3\u001B[39m dic = harmonizer_utils.translate_column(dic, \u001B[33m\"\u001B[39m\u001B[33mpossible_answers\u001B[39m\u001B[33m\"\u001B[39m, language=\u001B[33m\"\u001B[39m\u001B[33men\u001B[39m\u001B[33m\"\u001B[39m)\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\src\\socio4health\\utils\\harmonizer_utils.py:175\u001B[39m, in \u001B[36mtranslate_column\u001B[39m\u001B[34m(data, column, language)\u001B[39m\n\u001B[32m 172\u001B[39m data = data.copy()\n\u001B[32m 174\u001B[39m new_col = \u001B[33mf\u001B[39m\u001B[33m\"\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mcolumn\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m_\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mlanguage\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m\"\u001B[39m\n\u001B[32m--> \u001B[39m\u001B[32m175\u001B[39m data[new_col] = \u001B[43mdata\u001B[49m\u001B[43m[\u001B[49m\u001B[43mcolumn\u001B[49m\u001B[43m]\u001B[49m\u001B[43m.\u001B[49m\u001B[43mapply\u001B[49m\u001B[43m(\u001B[49m\u001B[43mtranslate_text\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 176\u001B[39m \u001B[38;5;28mprint\u001B[39m(\u001B[33mf\u001B[39m\u001B[33m\"\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mcolumn\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m translated\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m 178\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m data\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\pandas\\core\\series.py:4935\u001B[39m, in \u001B[36mSeries.apply\u001B[39m\u001B[34m(self, func, convert_dtype, args, by_row, **kwargs)\u001B[39m\n\u001B[32m 4800\u001B[39m \u001B[38;5;28;01mdef\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34mapply\u001B[39m(\n\u001B[32m 4801\u001B[39m \u001B[38;5;28mself\u001B[39m,\n\u001B[32m 4802\u001B[39m func: AggFuncType,\n\u001B[32m (...)\u001B[39m\u001B[32m 4807\u001B[39m **kwargs,\n\u001B[32m 4808\u001B[39m ) -> DataFrame | Series:\n\u001B[32m 4809\u001B[39m \u001B[38;5;250m \u001B[39m\u001B[33;03m\"\"\"\u001B[39;00m\n\u001B[32m 4810\u001B[39m \u001B[33;03m Invoke function on values of Series.\u001B[39;00m\n\u001B[32m 4811\u001B[39m \n\u001B[32m (...)\u001B[39m\u001B[32m 4926\u001B[39m \u001B[33;03m dtype: float64\u001B[39;00m\n\u001B[32m 4927\u001B[39m \u001B[33;03m \"\"\"\u001B[39;00m\n\u001B[32m 4928\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mSeriesApply\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 4929\u001B[39m \u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[32m 4930\u001B[39m \u001B[43m \u001B[49m\u001B[43mfunc\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 4931\u001B[39m \u001B[43m \u001B[49m\u001B[43mconvert_dtype\u001B[49m\u001B[43m=\u001B[49m\u001B[43mconvert_dtype\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 4932\u001B[39m \u001B[43m \u001B[49m\u001B[43mby_row\u001B[49m\u001B[43m=\u001B[49m\u001B[43mby_row\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 4933\u001B[39m \u001B[43m \u001B[49m\u001B[43margs\u001B[49m\u001B[43m=\u001B[49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 4934\u001B[39m \u001B[43m \u001B[49m\u001B[43mkwargs\u001B[49m\u001B[43m=\u001B[49m\u001B[43mkwargs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m-> \u001B[39m\u001B[32m4935\u001B[39m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\u001B[43m.\u001B[49m\u001B[43mapply\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\pandas\\core\\apply.py:1422\u001B[39m, in \u001B[36mSeriesApply.apply\u001B[39m\u001B[34m(self)\u001B[39m\n\u001B[32m 1419\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m.apply_compat()\n\u001B[32m 1421\u001B[39m \u001B[38;5;66;03m# self.func is Callable\u001B[39;00m\n\u001B[32m-> \u001B[39m\u001B[32m1422\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mapply_standard\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\pandas\\core\\apply.py:1502\u001B[39m, in \u001B[36mSeriesApply.apply_standard\u001B[39m\u001B[34m(self)\u001B[39m\n\u001B[32m 1496\u001B[39m \u001B[38;5;66;03m# row-wise access\u001B[39;00m\n\u001B[32m 1497\u001B[39m \u001B[38;5;66;03m# apply doesn't have a `na_action` keyword and for backward compat reasons\u001B[39;00m\n\u001B[32m 1498\u001B[39m \u001B[38;5;66;03m# we need to give `na_action=\"ignore\"` for categorical data.\u001B[39;00m\n\u001B[32m 1499\u001B[39m \u001B[38;5;66;03m# TODO: remove the `na_action=\"ignore\"` when that default has been changed in\u001B[39;00m\n\u001B[32m 1500\u001B[39m \u001B[38;5;66;03m# Categorical (GH51645).\u001B[39;00m\n\u001B[32m 1501\u001B[39m action = \u001B[33m\"\u001B[39m\u001B[33mignore\u001B[39m\u001B[33m\"\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(obj.dtype, CategoricalDtype) \u001B[38;5;28;01melse\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m\n\u001B[32m-> \u001B[39m\u001B[32m1502\u001B[39m mapped = \u001B[43mobj\u001B[49m\u001B[43m.\u001B[49m\u001B[43m_map_values\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 1503\u001B[39m \u001B[43m \u001B[49m\u001B[43mmapper\u001B[49m\u001B[43m=\u001B[49m\u001B[43mcurried\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mna_action\u001B[49m\u001B[43m=\u001B[49m\u001B[43maction\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mconvert\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mconvert_dtype\u001B[49m\n\u001B[32m 1504\u001B[39m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 1506\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mlen\u001B[39m(mapped) \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(mapped[\u001B[32m0\u001B[39m], ABCSeries):\n\u001B[32m 1507\u001B[39m \u001B[38;5;66;03m# GH#43986 Need to do list(mapped) in order to get treated as nested\u001B[39;00m\n\u001B[32m 1508\u001B[39m \u001B[38;5;66;03m# See also GH#25959 regarding EA support\u001B[39;00m\n\u001B[32m 1509\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m obj._constructor_expanddim(\u001B[38;5;28mlist\u001B[39m(mapped), index=obj.index)\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\pandas\\core\\base.py:925\u001B[39m, in \u001B[36mIndexOpsMixin._map_values\u001B[39m\u001B[34m(self, mapper, na_action, convert)\u001B[39m\n\u001B[32m 922\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(arr, ExtensionArray):\n\u001B[32m 923\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m arr.map(mapper, na_action=na_action)\n\u001B[32m--> \u001B[39m\u001B[32m925\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43malgorithms\u001B[49m\u001B[43m.\u001B[49m\u001B[43mmap_array\u001B[49m\u001B[43m(\u001B[49m\u001B[43marr\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmapper\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mna_action\u001B[49m\u001B[43m=\u001B[49m\u001B[43mna_action\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mconvert\u001B[49m\u001B[43m=\u001B[49m\u001B[43mconvert\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\pandas\\core\\algorithms.py:1743\u001B[39m, in \u001B[36mmap_array\u001B[39m\u001B[34m(arr, mapper, na_action, convert)\u001B[39m\n\u001B[32m 1741\u001B[39m values = arr.astype(\u001B[38;5;28mobject\u001B[39m, copy=\u001B[38;5;28;01mFalse\u001B[39;00m)\n\u001B[32m 1742\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m na_action \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[32m-> \u001B[39m\u001B[32m1743\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mlib\u001B[49m\u001B[43m.\u001B[49m\u001B[43mmap_infer\u001B[49m\u001B[43m(\u001B[49m\u001B[43mvalues\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmapper\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mconvert\u001B[49m\u001B[43m=\u001B[49m\u001B[43mconvert\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 1744\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m 1745\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m lib.map_infer_mask(\n\u001B[32m 1746\u001B[39m values, mapper, mask=isna(values).view(np.uint8), convert=convert\n\u001B[32m 1747\u001B[39m )\n", "\u001B[36mFile \u001B[39m\u001B[32mpandas/_libs/lib.pyx:2999\u001B[39m, in \u001B[36mpandas._libs.lib.map_infer\u001B[39m\u001B[34m()\u001B[39m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\src\\socio4health\\utils\\harmonizer_utils.py:167\u001B[39m, in \u001B[36mtranslate_column..translate_text\u001B[39m\u001B[34m(text)\u001B[39m\n\u001B[32m 165\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m text\n\u001B[32m 166\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mlen\u001B[39m(text) < \u001B[32m5000\u001B[39m:\n\u001B[32m--> \u001B[39m\u001B[32m167\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mGoogleTranslator\u001B[49m\u001B[43m(\u001B[49m\u001B[43msource\u001B[49m\u001B[43m=\u001B[49m\u001B[33;43m'\u001B[39;49m\u001B[33;43mauto\u001B[39;49m\u001B[33;43m'\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mtarget\u001B[49m\u001B[43m=\u001B[49m\u001B[43mlanguage\u001B[49m\u001B[43m)\u001B[49m\u001B[43m.\u001B[49m\u001B[43mtranslate\u001B[49m\u001B[43m(\u001B[49m\u001B[43mtext\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 168\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m 169\u001B[39m \u001B[38;5;28mprint\u001B[39m(\u001B[33m\"\u001B[39m\u001B[33mRows with contents longer than 5000 characters are cut off\u001B[39m\u001B[33m\"\u001B[39m)\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\deep_translator\\google.py:67\u001B[39m, in \u001B[36mGoogleTranslator.translate\u001B[39m\u001B[34m(self, text, **kwargs)\u001B[39m\n\u001B[32m 64\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m.payload_key:\n\u001B[32m 65\u001B[39m \u001B[38;5;28mself\u001B[39m._url_params[\u001B[38;5;28mself\u001B[39m.payload_key] = text\n\u001B[32m---> \u001B[39m\u001B[32m67\u001B[39m response = \u001B[43mrequests\u001B[49m\u001B[43m.\u001B[49m\u001B[43mget\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 68\u001B[39m \u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_base_url\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_url_params\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mproxies\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mproxies\u001B[49m\n\u001B[32m 69\u001B[39m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 70\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m response.status_code == \u001B[32m429\u001B[39m:\n\u001B[32m 71\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m TooManyRequests()\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\requests\\api.py:73\u001B[39m, in \u001B[36mget\u001B[39m\u001B[34m(url, params, **kwargs)\u001B[39m\n\u001B[32m 62\u001B[39m \u001B[38;5;28;01mdef\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34mget\u001B[39m(url, params=\u001B[38;5;28;01mNone\u001B[39;00m, **kwargs):\n\u001B[32m 63\u001B[39m \u001B[38;5;250m \u001B[39m\u001B[33mr\u001B[39m\u001B[33;03m\"\"\"Sends a GET request.\u001B[39;00m\n\u001B[32m 64\u001B[39m \n\u001B[32m 65\u001B[39m \u001B[33;03m :param url: URL for the new :class:`Request` object.\u001B[39;00m\n\u001B[32m (...)\u001B[39m\u001B[32m 70\u001B[39m \u001B[33;03m :rtype: requests.Response\u001B[39;00m\n\u001B[32m 71\u001B[39m \u001B[33;03m \"\"\"\u001B[39;00m\n\u001B[32m---> \u001B[39m\u001B[32m73\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mget\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[43m=\u001B[49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m*\u001B[49m\u001B[43m*\u001B[49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\requests\\api.py:59\u001B[39m, in \u001B[36mrequest\u001B[39m\u001B[34m(method, url, **kwargs)\u001B[39m\n\u001B[32m 55\u001B[39m \u001B[38;5;66;03m# By using the 'with' statement we are sure the session is closed, thus we\u001B[39;00m\n\u001B[32m 56\u001B[39m \u001B[38;5;66;03m# avoid leaving sockets open which can trigger a ResourceWarning in some\u001B[39;00m\n\u001B[32m 57\u001B[39m \u001B[38;5;66;03m# cases, and look like a memory leak in others.\u001B[39;00m\n\u001B[32m 58\u001B[39m \u001B[38;5;28;01mwith\u001B[39;00m sessions.Session() \u001B[38;5;28;01mas\u001B[39;00m session:\n\u001B[32m---> \u001B[39m\u001B[32m59\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43msession\u001B[49m\u001B[43m.\u001B[49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\u001B[43mmethod\u001B[49m\u001B[43m=\u001B[49m\u001B[43mmethod\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m=\u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m*\u001B[49m\u001B[43m*\u001B[49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\requests\\sessions.py:589\u001B[39m, in \u001B[36mSession.request\u001B[39m\u001B[34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001B[39m\n\u001B[32m 584\u001B[39m send_kwargs = {\n\u001B[32m 585\u001B[39m \u001B[33m\"\u001B[39m\u001B[33mtimeout\u001B[39m\u001B[33m\"\u001B[39m: timeout,\n\u001B[32m 586\u001B[39m \u001B[33m\"\u001B[39m\u001B[33mallow_redirects\u001B[39m\u001B[33m\"\u001B[39m: allow_redirects,\n\u001B[32m 587\u001B[39m }\n\u001B[32m 588\u001B[39m send_kwargs.update(settings)\n\u001B[32m--> \u001B[39m\u001B[32m589\u001B[39m resp = \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43msend\u001B[49m\u001B[43m(\u001B[49m\u001B[43mprep\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m*\u001B[49m\u001B[43m*\u001B[49m\u001B[43msend_kwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 591\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\requests\\sessions.py:703\u001B[39m, in \u001B[36mSession.send\u001B[39m\u001B[34m(self, request, **kwargs)\u001B[39m\n\u001B[32m 700\u001B[39m start = preferred_clock()\n\u001B[32m 702\u001B[39m \u001B[38;5;66;03m# Send the request\u001B[39;00m\n\u001B[32m--> \u001B[39m\u001B[32m703\u001B[39m r = \u001B[43madapter\u001B[49m\u001B[43m.\u001B[49m\u001B[43msend\u001B[49m\u001B[43m(\u001B[49m\u001B[43mrequest\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m*\u001B[49m\u001B[43m*\u001B[49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 705\u001B[39m \u001B[38;5;66;03m# Total elapsed time of the request (approximately)\u001B[39;00m\n\u001B[32m 706\u001B[39m elapsed = preferred_clock() - start\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\requests\\adapters.py:486\u001B[39m, in \u001B[36mHTTPAdapter.send\u001B[39m\u001B[34m(self, request, stream, timeout, verify, cert, proxies)\u001B[39m\n\u001B[32m 483\u001B[39m timeout = TimeoutSauce(connect=timeout, read=timeout)\n\u001B[32m 485\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m--> \u001B[39m\u001B[32m486\u001B[39m resp = \u001B[43mconn\u001B[49m\u001B[43m.\u001B[49m\u001B[43murlopen\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 487\u001B[39m \u001B[43m \u001B[49m\u001B[43mmethod\u001B[49m\u001B[43m=\u001B[49m\u001B[43mrequest\u001B[49m\u001B[43m.\u001B[49m\u001B[43mmethod\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 488\u001B[39m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m=\u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 489\u001B[39m \u001B[43m \u001B[49m\u001B[43mbody\u001B[49m\u001B[43m=\u001B[49m\u001B[43mrequest\u001B[49m\u001B[43m.\u001B[49m\u001B[43mbody\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 490\u001B[39m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[43m=\u001B[49m\u001B[43mrequest\u001B[49m\u001B[43m.\u001B[49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 491\u001B[39m \u001B[43m \u001B[49m\u001B[43mredirect\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[32m 492\u001B[39m \u001B[43m \u001B[49m\u001B[43massert_same_host\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[32m 493\u001B[39m \u001B[43m \u001B[49m\u001B[43mpreload_content\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[32m 494\u001B[39m \u001B[43m \u001B[49m\u001B[43mdecode_content\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[32m 495\u001B[39m \u001B[43m \u001B[49m\u001B[43mretries\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mmax_retries\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 496\u001B[39m \u001B[43m \u001B[49m\u001B[43mtimeout\u001B[49m\u001B[43m=\u001B[49m\u001B[43mtimeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 497\u001B[39m \u001B[43m \u001B[49m\u001B[43mchunked\u001B[49m\u001B[43m=\u001B[49m\u001B[43mchunked\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 498\u001B[39m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 500\u001B[39m \u001B[38;5;28;01mexcept\u001B[39;00m (ProtocolError, \u001B[38;5;167;01mOSError\u001B[39;00m) \u001B[38;5;28;01mas\u001B[39;00m err:\n\u001B[32m 501\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mConnectionError\u001B[39;00m(err, request=request)\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\connectionpool.py:787\u001B[39m, in \u001B[36mHTTPConnectionPool.urlopen\u001B[39m\u001B[34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)\u001B[39m\n\u001B[32m 784\u001B[39m response_conn = conn \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m release_conn \u001B[38;5;28;01melse\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m\n\u001B[32m 786\u001B[39m \u001B[38;5;66;03m# Make the request on the HTTPConnection object\u001B[39;00m\n\u001B[32m--> \u001B[39m\u001B[32m787\u001B[39m response = \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_make_request\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 788\u001B[39m \u001B[43m \u001B[49m\u001B[43mconn\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 789\u001B[39m \u001B[43m \u001B[49m\u001B[43mmethod\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 790\u001B[39m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 791\u001B[39m \u001B[43m \u001B[49m\u001B[43mtimeout\u001B[49m\u001B[43m=\u001B[49m\u001B[43mtimeout_obj\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 792\u001B[39m \u001B[43m \u001B[49m\u001B[43mbody\u001B[49m\u001B[43m=\u001B[49m\u001B[43mbody\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 793\u001B[39m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[43m=\u001B[49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 794\u001B[39m \u001B[43m \u001B[49m\u001B[43mchunked\u001B[49m\u001B[43m=\u001B[49m\u001B[43mchunked\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 795\u001B[39m \u001B[43m \u001B[49m\u001B[43mretries\u001B[49m\u001B[43m=\u001B[49m\u001B[43mretries\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 796\u001B[39m \u001B[43m \u001B[49m\u001B[43mresponse_conn\u001B[49m\u001B[43m=\u001B[49m\u001B[43mresponse_conn\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 797\u001B[39m \u001B[43m \u001B[49m\u001B[43mpreload_content\u001B[49m\u001B[43m=\u001B[49m\u001B[43mpreload_content\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 798\u001B[39m \u001B[43m \u001B[49m\u001B[43mdecode_content\u001B[49m\u001B[43m=\u001B[49m\u001B[43mdecode_content\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 799\u001B[39m \u001B[43m \u001B[49m\u001B[43m*\u001B[49m\u001B[43m*\u001B[49m\u001B[43mresponse_kw\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 800\u001B[39m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 802\u001B[39m \u001B[38;5;66;03m# Everything went great!\u001B[39;00m\n\u001B[32m 803\u001B[39m clean_exit = \u001B[38;5;28;01mTrue\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\connectionpool.py:464\u001B[39m, in \u001B[36mHTTPConnectionPool._make_request\u001B[39m\u001B[34m(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)\u001B[39m\n\u001B[32m 461\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m 462\u001B[39m \u001B[38;5;66;03m# Trigger any extra validation we need to do.\u001B[39;00m\n\u001B[32m 463\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m--> \u001B[39m\u001B[32m464\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_validate_conn\u001B[49m\u001B[43m(\u001B[49m\u001B[43mconn\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 465\u001B[39m \u001B[38;5;28;01mexcept\u001B[39;00m (SocketTimeout, BaseSSLError) \u001B[38;5;28;01mas\u001B[39;00m e:\n\u001B[32m 466\u001B[39m \u001B[38;5;28mself\u001B[39m._raise_timeout(err=e, url=url, timeout_value=conn.timeout)\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\connectionpool.py:1093\u001B[39m, in \u001B[36mHTTPSConnectionPool._validate_conn\u001B[39m\u001B[34m(self, conn)\u001B[39m\n\u001B[32m 1091\u001B[39m \u001B[38;5;66;03m# Force connect early to allow us to validate the connection.\u001B[39;00m\n\u001B[32m 1092\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m conn.is_closed:\n\u001B[32m-> \u001B[39m\u001B[32m1093\u001B[39m \u001B[43mconn\u001B[49m\u001B[43m.\u001B[49m\u001B[43mconnect\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 1095\u001B[39m \u001B[38;5;66;03m# TODO revise this, see https://github.com/urllib3/urllib3/issues/2791\u001B[39;00m\n\u001B[32m 1096\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m conn.is_verified \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m conn.proxy_is_verified:\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\connection.py:790\u001B[39m, in \u001B[36mHTTPSConnection.connect\u001B[39m\u001B[34m(self)\u001B[39m\n\u001B[32m 787\u001B[39m \u001B[38;5;66;03m# Remove trailing '.' from fqdn hostnames to allow certificate validation\u001B[39;00m\n\u001B[32m 788\u001B[39m server_hostname_rm_dot = server_hostname.rstrip(\u001B[33m\"\u001B[39m\u001B[33m.\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m--> \u001B[39m\u001B[32m790\u001B[39m sock_and_verified = \u001B[43m_ssl_wrap_socket_and_match_hostname\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 791\u001B[39m \u001B[43m \u001B[49m\u001B[43msock\u001B[49m\u001B[43m=\u001B[49m\u001B[43msock\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 792\u001B[39m \u001B[43m \u001B[49m\u001B[43mcert_reqs\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mcert_reqs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 793\u001B[39m \u001B[43m \u001B[49m\u001B[43mssl_version\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mssl_version\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 794\u001B[39m \u001B[43m \u001B[49m\u001B[43mssl_minimum_version\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mssl_minimum_version\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 795\u001B[39m \u001B[43m \u001B[49m\u001B[43mssl_maximum_version\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mssl_maximum_version\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 796\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_certs\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mca_certs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 797\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_cert_dir\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mca_cert_dir\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 798\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_cert_data\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mca_cert_data\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 799\u001B[39m \u001B[43m \u001B[49m\u001B[43mcert_file\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mcert_file\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 800\u001B[39m \u001B[43m \u001B[49m\u001B[43mkey_file\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mkey_file\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 801\u001B[39m \u001B[43m \u001B[49m\u001B[43mkey_password\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mkey_password\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 802\u001B[39m \u001B[43m \u001B[49m\u001B[43mserver_hostname\u001B[49m\u001B[43m=\u001B[49m\u001B[43mserver_hostname_rm_dot\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 803\u001B[39m \u001B[43m \u001B[49m\u001B[43mssl_context\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mssl_context\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 804\u001B[39m \u001B[43m \u001B[49m\u001B[43mtls_in_tls\u001B[49m\u001B[43m=\u001B[49m\u001B[43mtls_in_tls\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 805\u001B[39m \u001B[43m \u001B[49m\u001B[43massert_hostname\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43massert_hostname\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 806\u001B[39m \u001B[43m \u001B[49m\u001B[43massert_fingerprint\u001B[49m\u001B[43m=\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43massert_fingerprint\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 807\u001B[39m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 808\u001B[39m \u001B[38;5;28mself\u001B[39m.sock = sock_and_verified.socket\n\u001B[32m 810\u001B[39m \u001B[38;5;66;03m# If an error occurs during connection/handshake we may need to release\u001B[39;00m\n\u001B[32m 811\u001B[39m \u001B[38;5;66;03m# our lock so another connection can probe the origin.\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\connection.py:969\u001B[39m, in \u001B[36m_ssl_wrap_socket_and_match_hostname\u001B[39m\u001B[34m(sock, cert_reqs, ssl_version, ssl_minimum_version, ssl_maximum_version, cert_file, key_file, key_password, ca_certs, ca_cert_dir, ca_cert_data, assert_hostname, assert_fingerprint, server_hostname, ssl_context, tls_in_tls)\u001B[39m\n\u001B[32m 966\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m is_ipaddress(normalized):\n\u001B[32m 967\u001B[39m server_hostname = normalized\n\u001B[32m--> \u001B[39m\u001B[32m969\u001B[39m ssl_sock = \u001B[43mssl_wrap_socket\u001B[49m\u001B[43m(\u001B[49m\n\u001B[32m 970\u001B[39m \u001B[43m \u001B[49m\u001B[43msock\u001B[49m\u001B[43m=\u001B[49m\u001B[43msock\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 971\u001B[39m \u001B[43m \u001B[49m\u001B[43mkeyfile\u001B[49m\u001B[43m=\u001B[49m\u001B[43mkey_file\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 972\u001B[39m \u001B[43m \u001B[49m\u001B[43mcertfile\u001B[49m\u001B[43m=\u001B[49m\u001B[43mcert_file\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 973\u001B[39m \u001B[43m \u001B[49m\u001B[43mkey_password\u001B[49m\u001B[43m=\u001B[49m\u001B[43mkey_password\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 974\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_certs\u001B[49m\u001B[43m=\u001B[49m\u001B[43mca_certs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 975\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_cert_dir\u001B[49m\u001B[43m=\u001B[49m\u001B[43mca_cert_dir\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 976\u001B[39m \u001B[43m \u001B[49m\u001B[43mca_cert_data\u001B[49m\u001B[43m=\u001B[49m\u001B[43mca_cert_data\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 977\u001B[39m \u001B[43m \u001B[49m\u001B[43mserver_hostname\u001B[49m\u001B[43m=\u001B[49m\u001B[43mserver_hostname\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 978\u001B[39m \u001B[43m \u001B[49m\u001B[43mssl_context\u001B[49m\u001B[43m=\u001B[49m\u001B[43mcontext\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 979\u001B[39m \u001B[43m \u001B[49m\u001B[43mtls_in_tls\u001B[49m\u001B[43m=\u001B[49m\u001B[43mtls_in_tls\u001B[49m\u001B[43m,\u001B[49m\n\u001B[32m 980\u001B[39m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 982\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m 983\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m assert_fingerprint:\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\.venv\\Lib\\site-packages\\urllib3\\util\\ssl_.py:458\u001B[39m, in \u001B[36mssl_wrap_socket\u001B[39m\u001B[34m(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data, tls_in_tls)\u001B[39m\n\u001B[32m 456\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m ca_certs \u001B[38;5;129;01mor\u001B[39;00m ca_cert_dir \u001B[38;5;129;01mor\u001B[39;00m ca_cert_data:\n\u001B[32m 457\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m--> \u001B[39m\u001B[32m458\u001B[39m \u001B[43mcontext\u001B[49m\u001B[43m.\u001B[49m\u001B[43mload_verify_locations\u001B[49m\u001B[43m(\u001B[49m\u001B[43mca_certs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mca_cert_dir\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mca_cert_data\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 459\u001B[39m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mOSError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m e:\n\u001B[32m 460\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m SSLError(e) \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01me\u001B[39;00m\n", "\u001B[31mKeyboardInterrupt\u001B[39m: " ] } ], "execution_count": 8 }, { "metadata": {}, "cell_type": "markdown", "source": "The `classify_rows` method is then used to classify the rows of the standardized dictionary based on the content of the specified columns. This classification helps in organizing the data and making it easier to work with during the harmonization process. The `MODEL_PATH` parameter specifies the path to a pre-trained model that is used for classification. You can provide your own model or use the default one provided in the `files` folder. The model is a **fine-tuned BERT model** for text classification. You can find more details about the model in the [documentation](https://harmonize-tools.github.io/socio4health/socio4health.utils.harmonizer_utils.classify_rows.html#socio4health.utils.harmonizer_utils.classify_rows).", "id": "35a68aa2dd5f1fba" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:20:38.823181Z", "start_time": "2025-08-11T16:20:38.744864Z" } }, "cell_type": "code", "source": [ "dic = harmonizer_utils.classify_rows(dic, \"question_en\", \"description_en\", \"possible_answers_en\",\n", " new_column_name=\"category\",\n", " MODEL_PATH=\"files/bert_finetuned_classifier\")" ], "id": "e647e0ac0333a014", "outputs": [ { "ename": "ValueError", "evalue": "The column 'question_en' is not found in the DataFrame.", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mValueError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[9]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[32m----> \u001B[39m\u001B[32m1\u001B[39m dic = \u001B[43mharmonizer_utils\u001B[49m\u001B[43m.\u001B[49m\u001B[43mclassify_rows\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdic\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mquestion_en\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mdescription_en\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mpossible_answers_en\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[32m 2\u001B[39m \u001B[43m \u001B[49m\u001B[43mnew_column_name\u001B[49m\u001B[43m=\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mcategory\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[32m 3\u001B[39m \u001B[43m \u001B[49m\u001B[43mMODEL_PATH\u001B[49m\u001B[43m=\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mfiles/bert_finetuned_classifier\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\src\\socio4health\\utils\\harmonizer_utils.py:241\u001B[39m, in \u001B[36mclassify_rows\u001B[39m\u001B[34m(data, col1, col2, col3, new_column_name, MODEL_PATH)\u001B[39m\n\u001B[32m 239\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m(\u001B[33m\"\u001B[39m\u001B[33mThe parameters col1, col2 and col3 must be strings.\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m 240\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m col \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;129;01min\u001B[39;00m data.columns:\n\u001B[32m--> \u001B[39m\u001B[32m241\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\u001B[33mf\u001B[39m\u001B[33m\"\u001B[39m\u001B[33mThe column \u001B[39m\u001B[33m'\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mcol\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m'\u001B[39m\u001B[33m is not found in the DataFrame.\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m 243\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(new_column_name, \u001B[38;5;28mstr\u001B[39m) \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m new_column_name:\n\u001B[32m 244\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\u001B[33m\"\u001B[39m\u001B[33mnew_column_name must be a non-empty string.\u001B[39m\u001B[33m\"\u001B[39m)\n", "\u001B[31mValueError\u001B[39m: The column 'question_en' is not found in the DataFrame." ] } ], "execution_count": 9 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Extracting the data\n", "The `extract` method of the `Extractor` class is used retrieve the data from the specified input path. It returns a list of dataframes, each dataframe corresponding to a file extracted from the path." ], "id": "3b60a80f58e17a87" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:42:20.509879Z", "start_time": "2025-08-11T16:20:47.086777Z" } }, "cell_type": "code", "source": "dfs = bra_online_extractor.extract()", "id": "90f70dbe672991c1", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-08-11 11:20:47,090 - INFO - ----------------------\n", "2025-08-11 11:20:47,091 - INFO - Starting data extraction...\n", "2025-08-11 11:20:47,092 - INFO - Extracting data in online mode...\n", "2025-08-11 11:20:47,094 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0\n", "2025-08-11 11:20:48,941 - INFO - Successfully saved links to Output_scrap.json.\n", "2025-08-11 11:20:49,089 - INFO - Downloading files to: ../data\n", "Downloading files: 0%| | 0/4 [00:00 \u001B[39m\u001B[32m4\u001B[39m filtered_ddfs = \u001B[43mhar\u001B[49m\u001B[43m.\u001B[49m\u001B[43mdata_selector\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdfs\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~\\PycharmProjects\\socio4health\\src\\socio4health\\harmonizer.py:590\u001B[39m, in \u001B[36mHarmonizer.data_selector\u001B[39m\u001B[34m(self, ddfs)\u001B[39m\n\u001B[32m 588\u001B[39m \u001B[38;5;28;01mfor\u001B[39;00m ddf \u001B[38;5;129;01min\u001B[39;00m ddfs:\n\u001B[32m 589\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m.key_col \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;129;01min\u001B[39;00m ddf.columns:\n\u001B[32m--> \u001B[39m\u001B[32m590\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m(\u001B[33mf\u001B[39m\u001B[33m\"\u001B[39m\u001B[33mKey column \u001B[39m\u001B[33m'\u001B[39m\u001B[38;5;132;01m{\u001B[39;00m\u001B[38;5;28mself\u001B[39m.key_col\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m'\u001B[39m\u001B[33m not found in DataFrame\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m 592\u001B[39m filtered_ddf = ddf[ddf[\u001B[38;5;28mself\u001B[39m.key_col].isin(\u001B[38;5;28mself\u001B[39m.key_val)]\n\u001B[32m 593\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mlen\u001B[39m(filtered_ddf) == \u001B[32m0\u001B[39m:\n", "\u001B[31mKeyError\u001B[39m: \"Key column 'DPTO' not found in DataFrame\"" ] } ], "execution_count": 15 }, { "metadata": {}, "cell_type": "markdown", "source": "Finally, we can **join** the filtered dataframes into a **single dataframe** using the `join_data` method. This method combines the data from the **filtered dataframes** into a **single dataframe**, aligning the columns based on their names. The resulting dataframe will contain all the columns that are present in the filtered dataframes, and it will be ready for further analysis or **export** as a `CSV` file.", "id": "cefaf0588c9be884" }, { "metadata": { "ExecuteTime": { "end_time": "2025-08-11T16:43:11.090908Z", "start_time": "2025-08-11T16:43:11.043927Z" } }, "cell_type": "code", "source": [ "joined_df = har.join_data(filtered_ddfs)\n", "available_cols = joined_df.columns.tolist()\n", "print(f\"Available columns: {available_cols}\")\n", "print(f\"Shape of the joined DataFrame: {joined_df.shape}\")\n", "print(joined_df.head())\n", "joined_df.to_csv('data/GEIH_2022_harmonized.csv', index=False)" ], "id": "1ca59653ff1335ac", "outputs": [ { "ename": "NameError", "evalue": "name 'filtered_ddfs' is not defined", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mNameError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[16]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[32m----> \u001B[39m\u001B[32m1\u001B[39m joined_df = har.join_data(\u001B[43mfiltered_ddfs\u001B[49m)\n\u001B[32m 2\u001B[39m available_cols = joined_df.columns.tolist()\n\u001B[32m 3\u001B[39m \u001B[38;5;28mprint\u001B[39m(\u001B[33mf\u001B[39m\u001B[33m\"\u001B[39m\u001B[33mAvailable columns: \u001B[39m\u001B[38;5;132;01m{\u001B[39;00mavailable_cols\u001B[38;5;132;01m}\u001B[39;00m\u001B[33m\"\u001B[39m)\n", "\u001B[31mNameError\u001B[39m: name 'filtered_ddfs' is not defined" ] } ], "execution_count": 16 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }