diff --git a/dhanush submission/README.md b/dhanush submission/README.md new file mode 100644 index 0000000..7adebc1 --- /dev/null +++ b/dhanush submission/README.md @@ -0,0 +1,50 @@ +# Gradient Works Exercise + +## Part-1 + +1. How many companies are in the dataset?
+There are 75 companies in the dataset. +2. How many unique URLs are in the dataset?
+There are 530 unique URLs in the dataset. Analyzing the URL prefixes shows that they come from 77 different domains. This is 2 more than the 75 companies because two companies each appear under two domains. + +![Qn_2](assets/Qn_2.png) +3. What is the most common chunk type?
+The most common chunk type is `header` with 549 occurrences. + +![Qn_3](assets/Qn_3.png) + +4. What is the distribution of chunk types by company?
+Please refer to the Jupyter notebook under the `notebooks` folder for the distribution of chunk types by company. + +## Part-2 RAG + +### Architecture Diagram +![Architecture Diagram](assets/Architecture_diagram.png) + + +## Steps to run the code +1. Create a `.env` file and place your OpenAI and Cohere API keys in it in this format +``` +OPENAI_API_KEY = +COHERE_API_KEY = +``` +2. Install all the necessary libraries from `requirements.txt`. +``` +pip install -r requirements.txt +``` +3. Run `chunking.py` first; it converts the HTML content to text and saves the processed CSV file. +4. Run `embedding.py` next to generate embeddings and store them as a NumPy file. +5. The code is also exposed as an API using FastAPI. To start the API server, run the following command inside the `src` folder. +``` +uvicorn main:app --reload --port 8080 +``` +This will start the API server at http://localhost:8080.
+6. Run `chat.py` next, which opens Streamlit in your browser, allowing you to ask relevant questions based on the csv file provided. +``` +streamlit run src/chat.py +``` +### Demo +![Demo](assets/demo.png) + +> [!NOTE] +The code is also available as a jupyter notebook under the notebooks folder. diff --git a/dhanush submission/assets/Architecture_diagram.png b/dhanush submission/assets/Architecture_diagram.png new file mode 100644 index 0000000..fe91cc7 Binary files /dev/null and b/dhanush submission/assets/Architecture_diagram.png differ diff --git a/dhanush submission/assets/Qn_2.png b/dhanush submission/assets/Qn_2.png new file mode 100644 index 0000000..823cb8c Binary files /dev/null and b/dhanush submission/assets/Qn_2.png differ diff --git a/dhanush submission/assets/Qn_3.png b/dhanush submission/assets/Qn_3.png new file mode 100644 index 0000000..151d0f7 Binary files /dev/null and b/dhanush submission/assets/Qn_3.png differ diff --git a/dhanush submission/assets/demo.png b/dhanush submission/assets/demo.png new file mode 100644 index 0000000..988fc42 Binary files /dev/null and b/dhanush submission/assets/demo.png differ diff --git a/dhanush submission/notebooks/Dhanush GW.ipynb b/dhanush submission/notebooks/Dhanush GW.ipynb new file mode 100644 index 0000000..1d1388c --- /dev/null +++ b/dhanush submission/notebooks/Dhanush GW.ipynb @@ -0,0 +1,2652 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "32b4fb60", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from matplotlib import pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "93a352f1", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('../data/content.csv')" + ] + }, + { + "cell_type": "markdown", + "id": "25e5952c", + "metadata": {}, + "source": [ + "# Tasks - Part 1" + ] + }, + { + "cell_type": "markdown", + "id": "30a8c8bb", + "metadata": {}, + "source": [ + "Here 
are some questions that we'd like you to answer about the dataset:\n", + "\n", + "1. How many companies are in the dataset?\n", + "2. How many unique URLs are in the dataset?\n", + "3. What is the most common chunk type?\n", + "4. What is the distribution of chunk types by company?" + ] + }, + { + "cell_type": "markdown", + "id": "4a9b154c", + "metadata": {}, + "source": [ + "## 1. How many companies are in the dataset?" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "13945546", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "75" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(df['company_id']))" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "d8870337", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(df['company_id'])) == len(np.unique(df['company_name']))" + ] + }, + { + "cell_type": "markdown", + "id": "da4434ca", + "metadata": {}, + "source": [ + "The dataset contains 75 unique companies. The columns `company_name` and `company_id` have the same number of unique values (75 each), which is consistent with a one-to-one correspondence between them, although equal counts alone do not prove it." + ] + }, + { + "cell_type": "markdown", + "id": "f35dd734", + "metadata": {}, + "source": [ + "## 2. How many unique URLs are in the dataset?" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1d455aac", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "530" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(df['url']))" + ] + }, + { + "cell_type": "markdown", + "id": "5cfa3e26", + "metadata": {}, + "source": [ + "There are 530 unique URLs in this dataset. 
Let's extract the prefix URL (scheme and host) of each URL and check whether the number of distinct prefixes matches the number of unique company_id values" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "022127a9", + "metadata": {}, + "outputs": [], + "source": [ + "prefix_url = [x.split('/')[0] + '//' + x.split('/')[2] for x in df['url']]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "d7d237a4", + "metadata": {}, + "outputs": [], + "source": [ + "df['prefix_url'] = prefix_url" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "65408cba", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "77" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(prefix_url))" + ] + }, + { + "cell_type": "markdown", + "id": "379500a6", + "metadata": {}, + "source": [ + "There are 77 prefix URLs but only 75 companies, so 2 companies seem to appear under more than one prefix URL" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "66786ba5", + "metadata": {}, + "outputs": [], + "source": [ + "company_id = []\n", + "uniq_url = []\n", + "for x in range(len(df['prefix_url'])):\n", + "    if df.iloc[x]['prefix_url'] not in uniq_url:\n", + "        uniq_url.append(df.iloc[x]['prefix_url'])\n", + "        company_id.append(df.iloc[x]['company_id'])" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "de8ff277", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(company_id) == len(uniq_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c5466bbc", + "metadata": {}, + "outputs": [], + "source": [ + "temp_df = pd.DataFrame({'company_id_unique': company_id, 'unique_url': uniq_url})" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "cb5893e2", + "metadata": {}, + "outputs": [], + "source": [ + "duplicate_df = temp_df[temp_df.duplicated(subset=['company_id_unique'])]" + ] 
+ }, + { + "cell_type": "code", + "execution_count": 14, + "id": "f3c9d280", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " company_id_unique \\\n", + "4 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n", + "54 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n", + "56 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n", + "57 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n", + "\n", + " unique_url \n", + "4 https://www.bingotech.net \n", + "54 https://4wheeltravels.com \n", + "56 https://bingotech.net \n", + "57 https://sg2plcpnl0188.prod.sin2.secureserver.n... " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "temp_df[temp_df['company_id_unique'].isin(duplicate_df['company_id_unique'])]" + ] + }, + { + "cell_type": "markdown", + "id": "8ea31d37", + "metadata": {}, + "source": [ + "Bingotech duplicate domain names: https://www.bingotech.net and https://bingotech.net
\n", + "4wheeltravels duplicate domain names: https://4wheeltravels.com and https://sg2plcpnl0188.prod.sin2.secureserver.n..." + ] + }, + { + "cell_type": "markdown", + "id": "ea2849b5", + "metadata": {}, + "source": [ + "## 3. What is the most common chunk type?" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "5a26687d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "chunk_type\n", + "header 549\n", + "main 545\n", + "head 530\n", + "footer 504\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['chunk_type'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "75f39608", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "[base64-encoded PNG omitted: bar chart of chunk_type counts]", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "counts = df['chunk_type'].value_counts()\n", + "plt.bar(counts.index, counts.values) \n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "a92308fc", + "metadata": {}, + "source": [ + "**Header** is the most common chunk type." + ] + }, + { + "cell_type": "markdown", + "id": "08a0614f", + "metadata": {}, + "source": [ + "## 4. What is the distribution of chunk types by company?" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "9cd5724a", + "metadata": {}, + "outputs": [], + "source": [ + "chunk_type_distribution = df.groupby('company_id')['chunk_type'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "3dfbc7ce", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " count\n", + "company_id chunk_type \n", + "041e9ac4-d7eb-499f-8cfb-fa95dca20cd5 footer 11\n", + " head 11\n", + " header 11\n", + " main 11\n", + "057293c5-03c8-427d-9c78-d7cfee35e525 footer 9\n", + "... ...\n", + "f7958fe2-7f18-4da1-be80-1b6ebd68e427 main 1\n", + "f9d892e2-ded2-4ac3-bf2d-bc65c5c58131 footer 11\n", + " head 11\n", + " header 11\n", + " main 11\n", + "\n", + "[268 rows x 1 columns]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(chunk_type_distribution)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "558c6f6c", + "metadata": {}, + "outputs": [], + "source": [ + "def create_company_dict(group):\n", + " chunk_type_counts = group['chunk_type'].value_counts().to_dict()\n", + " return {chunk_type: count for chunk_type, count in chunk_type_counts.items()}\n", + "\n", + "chunk_type_distribution_dict = {company_id: create_company_dict(group) for company_id, group in df.groupby('company_id')}" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "9a4a4013", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'041e9ac4-d7eb-499f-8cfb-fa95dca20cd5': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '057293c5-03c8-427d-9c78-d7cfee35e525': {'head': 9,\n", + " 'header': 9,\n", + " 'footer': 9,\n", + " 'main': 9},\n", + " '0a2dd621-fc29-4d92-92af-a9f27d36b88a': {'main': 12,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'head': 11},\n", + " '1a944f93-a2ac-4352-9b65-a39c7addcb36': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '1b7e4fac-5467-4773-a330-e9ef080aa00a': {'main': 12,\n", + " 'head': 11,\n", + " 'footer': 11,\n", + " 'header': 5},\n", + " '1b9af211-c118-45dc-a7a9-c36acd5c3606': {'head': 10,\n", + " 'header': 10,\n", + " 'footer': 10,\n", + " 'main': 10},\n", + " '1c454ddb-4be8-4c6a-922e-45c877efa542': {'main': 
17,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " '1ccb0060-ed06-40d9-affd-b94b7e5041a1': {'head': 1, 'main': 1},\n", + " '1d793954-ad6e-4d64-84d6-965b6d7e6e28': {'header': 22,\n", + " 'head': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '1fd25493-2b64-4e45-9286-496d1a2899c1': {'header': 23,\n", + " 'main': 11,\n", + " 'footer': 10,\n", + " 'head': 10},\n", + " '21a77c0a-0dae-4d9d-8f16-722c4cb80fa4': {'main': 11,\n", + " 'head': 9,\n", + " 'header': 9,\n", + " 'footer': 9},\n", + " '2312afd0-a0b4-41f2-a3f5-20fcaae1cf1d': {'head': 4,\n", + " 'header': 4,\n", + " 'footer': 4,\n", + " 'main': 4},\n", + " '234793b5-f217-4692-bad8-6ac7210c4d99': {'head': 1, 'header': 1, 'main': 1},\n", + " '2445ce14-bb54-4765-89bf-1f828686cb16': {'main': 7,\n", + " 'head': 1,\n", + " 'header': 1,\n", + " 'footer': 1},\n", + " '2646717c-561c-4a07-981c-04d60ae26c5d': {'head': 1, 'main': 1},\n", + " '27587b6e-5ba1-4c57-96f4-7fc16abac658': {'header': 7, 'head': 5, 'footer': 5},\n", + " '2abc40ce-3099-44fc-b5cd-a4abc80ad196': {'head': 11,\n", + " 'header': 11,\n", + " 'main': 11,\n", + " 'footer': 10},\n", + " '2b6f6bd0-fb3b-40b1-8728-6ee500e861a7': {'main': 12,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " '311dffd7-84bc-4cb6-8bb9-0a87e2ebe0b6': {'head': 6, 'footer': 6, 'main': 6},\n", + " '33203cd4-b2f7-4a60-a949-b53c844f1288': {'head': 1, 'main': 1},\n", + " '350207ab-6df5-4624-9e11-1422ffd3d6e7': {'head': 2,\n", + " 'footer': 2,\n", + " 'main': 2,\n", + " 'header': 1},\n", + " '38ac3499-f153-42f5-a07e-ff286bb3058e': {'head': 1, 'main': 1},\n", + " '3f2be1af-115e-49fd-981c-523d7281c78e': {'header': 2,\n", + " 'head': 1,\n", + " 'footer': 1,\n", + " 'main': 1},\n", + " '47a9fbaa-0c12-442e-b030-78d3ee7b15d3': {'main': 15,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " '4c1fde18-8a40-4ee7-9c3c-19152c7d1ff8': {'main': 17,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " 
'4df58b56-4c19-4602-bdcc-a0f9bc0c1c81': {'main': 12,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " '4e8e3cf3-79fb-4251-a2e9-970c1ee97068': {'main': 14,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 10},\n", + " '4faddb9c-68bf-4bbe-b471-07d9b07fb7fe': {'head': 10,\n", + " 'header': 10,\n", + " 'main': 10,\n", + " 'footer': 9},\n", + " '506068b9-0b01-4853-b981-850fc0300d34': {'head': 1, 'main': 1},\n", + " '54481572-095b-4314-a99c-54c8d7cb3058': {'head': 10,\n", + " 'main': 10,\n", + " 'header': 8,\n", + " 'footer': 6},\n", + " '550e7ff3-6393-42f8-92c7-cb273f022445': {'main': 8,\n", + " 'head': 6,\n", + " 'header': 6,\n", + " 'footer': 6},\n", + " '5a767574-556b-439a-b1d8-89334fbdbaad': {'head': 9, 'header': 9, 'footer': 9},\n", + " '5c3a39f0-c5d6-4a24-85c3-0684b1228dce': {'head': 1,\n", + " 'header': 1,\n", + " 'footer': 1,\n", + " 'main': 1},\n", + " '5d23c4f4-1e1d-459b-847d-ea05d6f10056': {'header': 22,\n", + " 'head': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '649eac25-762d-4b81-98d2-e3a9a481c4ac': {'head': 10,\n", + " 'header': 10,\n", + " 'footer': 10,\n", + " 'main': 10},\n", + " '65cb7019-5546-4adf-8294-f03e5c35bf4d': {'head': 1,\n", + " 'header': 1,\n", + " 'footer': 1,\n", + " 'main': 1},\n", + " '66d2d1bc-73f9-4df4-855d-12330a2cde05': {'main': 17,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " '69cdaf4f-298d-42db-abef-36c040489a58': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '6f1491c1-aff5-45cf-98aa-2e07c9394f78': {'head': 11,\n", + " 'footer': 11,\n", + " 'main': 11,\n", + " 'header': 10},\n", + " '7227bcbe-e8b6-4285-8c73-4c9375d9df83': {'head': 7,\n", + " 'header': 7,\n", + " 'footer': 7,\n", + " 'main': 7},\n", + " '75c123ba-2e0a-4d3f-a7ec-f737c5f61c2e': {'head': 1, 'main': 1},\n", + " '77713668-9b82-4038-ac33-0b1a3c4d4715': {'main': 18,\n", + " 'head': 10,\n", + " 'header': 10,\n", + " 'footer': 10},\n", + " 
'784ba666-e00d-459e-9417-66f8a59f84ec': {'head': 11, 'main': 11},\n", + " '7af13eb6-125a-4e01-974b-712ea1a9dea7': {'head': 4,\n", + " 'header': 4,\n", + " 'footer': 4,\n", + " 'main': 4},\n", + " '84210586-05e5-4324-9c12-2883697d9937': {'head': 1, 'main': 1},\n", + " '8beba93c-0fd6-460b-a531-b8475bbbe6e8': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " '8db9b2e5-d6e6-4d29-ba56-347bb8507b08': {'head': 1, 'main': 1},\n", + " '9bf1ccd3-dd15-4fc2-8ee9-1c3ab1cc7a94': {'header': 10,\n", + " 'head': 5,\n", + " 'footer': 5},\n", + " '9c539408-9119-4540-8a04-68a0dc259d2d': {'head': 1,\n", + " 'header': 1,\n", + " 'footer': 1,\n", + " 'main': 1},\n", + " '9c8bf109-c7f6-4db4-ac5b-15838f692628': {'header': 18,\n", + " 'head': 9,\n", + " 'footer': 9,\n", + " 'main': 9},\n", + " 'a07fb185-9f97-498c-bcda-a4c21fe27467': {'main': 13,\n", + " 'header': 10,\n", + " 'footer': 10,\n", + " 'head': 10},\n", + " 'a0dffac7-5b73-47bb-8a31-78440a1aef33': {'head': 2, 'main': 2},\n", + " 'a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60': {'head': 2,\n", + " 'footer': 2,\n", + " 'main': 2,\n", + " 'header': 1},\n", + " 'aab2261b-7065-460d-83de-99404fb50f65': {'main': 13,\n", + " 'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11},\n", + " 'acab4569-6d20-4f45-a649-501d998e96fd': {'head': 11,\n", + " 'main': 11,\n", + " 'header': 10,\n", + " 'footer': 10},\n", + " 'ae11f430-4c8f-4d77-bb02-c2f1c392bd6d': {'head': 1,\n", + " 'header': 1,\n", + " 'footer': 1,\n", + " 'main': 1},\n", + " 'b244f8a0-cdee-42a4-988f-8d431d4d0794': {'head': 1, 'main': 1},\n", + " 'b2a1f0aa-0578-4d89-8be1-56d86414a1b4': {'header': 22,\n", + " 'head': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'b34a4aec-a2ab-46ca-abca-85b07e6646b2': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'b598d098-e179-4355-873b-a0119f658b24': {'head': 6,\n", + " 'header': 6,\n", + " 'footer': 6,\n", + " 'main': 6},\n", + " 'ba955d6d-d1bd-4e92-834f-c45f820679b6': 
{'head': 5,\n", + " 'header': 5,\n", + " 'footer': 5,\n", + " 'main': 5},\n", + " 'bb0eef13-1271-4f12-992f-209aab01bac3': {'head': 6,\n", + " 'header': 6,\n", + " 'footer': 6,\n", + " 'main': 6},\n", + " 'c4ef42c5-511c-4991-bce9-e04c739cbd22': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'c7d9376d-9eff-4ab0-8e35-230697588b93': {'head': 4,\n", + " 'main': 4,\n", + " 'header': 2,\n", + " 'footer': 2},\n", + " 'c84b92cc-28a5-4383-845d-237926a5f120': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'cd17faf2-60f5-4a27-b727-9d9337610077': {'head': 11,\n", + " 'header': 11,\n", + " 'main': 11,\n", + " 'footer': 10},\n", + " 'cd931e32-04fd-4265-8b9f-99943dbd77f4': {'header': 19,\n", + " 'head': 10,\n", + " 'footer': 10,\n", + " 'main': 1},\n", + " 'd09d7050-14d4-4bee-b7bc-709a046fa5c0': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'd90cd817-7db0-4394-a5ef-65df2bd96487': {'head': 5,\n", + " 'header': 5,\n", + " 'footer': 5,\n", + " 'main': 5},\n", + " 'ed5f1cb6-6ab1-4b61-9412-5c23e6f02cf0': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11},\n", + " 'edda975d-a01c-4e93-8245-68183fb19ced': {'head': 10,\n", + " 'header': 10,\n", + " 'footer': 10},\n", + " 'f3004596-8695-4808-8c54-d80cb63c98d0': {'head': 1, 'main': 1},\n", + " 'f5b2f14b-1186-4f74-aa14-86ec73da913f': {'footer': 18, 'head': 9, 'main': 9},\n", + " 'f7958fe2-7f18-4da1-be80-1b6ebd68e427': {'head': 1, 'footer': 1, 'main': 1},\n", + " 'f9d892e2-ded2-4ac3-bf2d-bc65c5c58131': {'head': 11,\n", + " 'header': 11,\n", + " 'footer': 11,\n", + " 'main': 11}}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chunk_type_distribution_dict" + ] + }, + { + "cell_type": "markdown", + "id": "2ff34159", + "metadata": {}, + "source": [ + "## General Analysis" + ] + }, + { + "cell_type": "markdown", + "id": "7816b4c8", + 
"metadata": {}, + "source": [ + "### Analyzing chunk_id\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "c9c57042", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2128" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(df['chunk_id']))" + ] + }, + { + "cell_type": "markdown", + "id": "897874aa", + "metadata": {}, + "source": [ + "All the elements in the `chunk_id` are unique. This can be used as the primary key for indexing/lookup." + ] + }, + { + "cell_type": "markdown", + "id": "3d0b8ffb", + "metadata": {}, + "source": [ + "### Analyzing chunk_hash\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "e8233ca8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1337" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(np.unique(df['chunk_hash']))" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e938bcd0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | company_id | company_name | url | chunk_type | chunk_hash | chunk | chunk_id | prefix_url |
|---|---|---|---|---|---|---|---|---|
| 60 | aab2261b-7065-460d-83de-99404fb50f65 | betahaus Sofia | https://betahaus.bg/blog | footer | 2f75ead3d8821183e12c322c1800afe60a175572ca3b3a... | `<footer class=\"footer\"><h4>Връзка с нас</h4>ул...` | af3a36a1-fb5a-44b1-b5ae-9b040ca5889c | https://betahaus.bg |
| 63 | 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 | The Best Bees Company | https://bestbees.com/blog/ | header | 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... | `<header class=\"fl-builder-content fl-builder-c...` | 293eeafb-09f9-4611-8d37-0e5f07dae564 | https://bestbees.com |
| 64 | 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 | The Best Bees Company | https://bestbees.com/blog/ | footer | c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... | `<footer class=\"fl-builder-content fl-builder-c...` | fe49b6bb-9a6d-4c3a-8b23-40c8071e3cd4 | https://bestbees.com |
| 87 | 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 | The Best Bees Company | https://bestbees.com/get-started/ | header | 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... | `<header class=\"fl-builder-content fl-builder-c...` | a6851616-8a65-4409-9a63-66a6eb2153cc | https://bestbees.com |
| 88 | 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 | The Best Bees Company | https://bestbees.com/get-started/ | footer | c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... | `<footer class=\"fl-builder-content fl-builder-c...` | 271d5729-d600-47f6-87e2-5b7aa065f712 | https://bestbees.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2111 | 1fd25493-2b64-4e45-9286-496d1a2899c1 | American Express Travel | https://www.americanexpress.com/en-us/support/... | header | 994f9b66932ae87ff338704924562cf913f73c0f49ae26... | `<ul><li><a href=\"https://www.americanexpress.c...` | ace4dac1-4401-4b5b-b61a-4b922d3a8362 | https://www.americanexpress.com |
| 2112 | 1fd25493-2b64-4e45-9286-496d1a2899c1 | American Express Travel | https://www.americanexpress.com/en-us/support/... | footer | 7d4855c2c301923efef11a825dca16e1c6791d6cec96dc... | `<footer class=\"axp-footer__footer__footer___32...` | 4ea2b388-7505-4e20-b57e-75eaddb4d242 | https://www.americanexpress.com |
| 2120 | 0a2dd621-fc29-4d92-92af-a9f27d36b88a | Mobile Programming LLC. | https://www.mobileprogramming.com/ | head | 023ecc39b08c78b255f966b4e7768ab00f491469e096f0... | `<head><title>App Design &amp; Development \| Mo...` | 07c06cb2-1d10-46ac-84e3-b0eaa0279fb2 | https://www.mobileprogramming.com |
| 2121 | 0a2dd621-fc29-4d92-92af-a9f27d36b88a | Mobile Programming LLC. | https://www.mobileprogramming.com/ | header | 4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4... | `<header class=\"top-header\" id=\"myHeader\"><nav>...` | b9015037-7ef1-4bb1-8607-f63fefb06226 | https://www.mobileprogramming.com |
| 2122 | 0a2dd621-fc29-4d92-92af-a9f27d36b88a | Mobile Programming LLC. | https://www.mobileprogramming.com/ | footer | f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945... | `<footer class=\"container-fluid pd-0 footer-bg\"...` | 6afb8c04-87cc-45d7-a7f4-c11df0d88d54 | https://www.mobileprogramming.com |

791 rows × 8 columns
" + ], + "text/plain": [ + " company_id company_name \\\n", + "60 aab2261b-7065-460d-83de-99404fb50f65 betahaus Sofia \n", + "63 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n", + "64 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n", + "87 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n", + "88 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n", + "... ... ... \n", + "2111 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n", + "2112 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n", + "2120 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n", + "2121 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n", + "2122 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n", + "\n", + " url chunk_type \\\n", + "60 https://betahaus.bg/blog footer \n", + "63 https://bestbees.com/blog/ header \n", + "64 https://bestbees.com/blog/ footer \n", + "87 https://bestbees.com/get-started/ header \n", + "88 https://bestbees.com/get-started/ footer \n", + "... ... ... \n", + "2111 https://www.americanexpress.com/en-us/support/... header \n", + "2112 https://www.americanexpress.com/en-us/support/... footer \n", + "2120 https://www.mobileprogramming.com/ head \n", + "2121 https://www.mobileprogramming.com/ header \n", + "2122 https://www.mobileprogramming.com/ footer \n", + "\n", + " chunk_hash \\\n", + "60 2f75ead3d8821183e12c322c1800afe60a175572ca3b3a... \n", + "63 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n", + "64 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n", + "87 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n", + "88 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n", + "... ... \n", + "2111 994f9b66932ae87ff338704924562cf913f73c0f49ae26... \n", + "2112 7d4855c2c301923efef11a825dca16e1c6791d6cec96dc... \n", + "2120 023ecc39b08c78b255f966b4e7768ab00f491469e096f0... \n", + "2121 4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4... 
\n", + "2122 f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945... \n", + "\n", + " chunk \\\n", + "60