diff --git a/dhanush submission/README.md b/dhanush submission/README.md
new file mode 100644
index 0000000..7adebc1
--- /dev/null
+++ b/dhanush submission/README.md
@@ -0,0 +1,50 @@
+# Gradient Works Exercise
+
+## Part-1
+
+1. How many companies are in the dataset?
+There are 75 companies in the dataset.
+2. How many unique URLs are in the dataset?
+There are 530 unique URLs in the dataset. Extracting the scheme and host of each URL yields 77 distinct domains for 75 companies: two companies each appear under two different domains.
+
+
+3. What is the most common chunk type?
+The most common chunk type is `header` with 549 occurrences.
+
+
+
+4. What is the distribution of chunk types by company?
+Please refer to the Jupyter notebook in the `notebooks` folder for the distribution of chunk types by company.
+
+## Part-2 RAG
+
+### Architecture Diagram
+
+
+
+## Steps to run the code
+1. Create a `.env` file and add your OpenAI and Cohere API keys in this format:
+```
+OPENAI_API_KEY =
+COHERE_API_KEY =
+```
+2. Install the required libraries from `requirements.txt`.
+```
+pip install -r requirements.txt
+```
+3. Run `chunking.py` first. It converts the HTML content to text and saves the processed CSV file.
+4. Run `embedding.py` next to generate the embeddings and store them as a NumPy file.
+5. The code is also exposed as an API using FastAPI. To start the API server, run the following command inside the `src` folder.
+```
+uvicorn main:app --reload --port 8080
+```
+This will start the API server at http://localhost:8080.
+6. Run `chat.py` next. This opens a Streamlit app in your browser where you can ask questions about the content of the provided CSV file.
+```
+streamlit run src/chat.py
+```
+### Demo
+
+
+> [!NOTE]
+> The code is also available as a Jupyter notebook in the `notebooks` folder.
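Review note on step 1 of the README: the `.env` template puts spaces around `=`, and how a loader treats that depends on the library. The sketch below is a minimal, standard-library check that both keys are present and non-empty before the scripts are run. The helper names `parse_env_text` and `missing_keys` are hypothetical and are not part of the submission, and the `.env` location is an assumption.

```python
import os

# Key names taken from the README's .env template.
REQUIRED_KEYS = ("OPENAI_API_KEY", "COHERE_API_KEY")


def parse_env_text(text):
    """Parse KEY=value lines, tolerating spaces around '=' and '#' comment lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = ".env"  # assumption: the file sits in the directory the scripts run from
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            print("Missing keys:", missing_keys(parse_env_text(handle.read())))
    else:
        print("No .env file found at", env_path)
```

Running this before `chunking.py` would surface an empty key early instead of failing later inside an API call.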
diff --git a/dhanush submission/assets/Architecture_diagram.png b/dhanush submission/assets/Architecture_diagram.png
new file mode 100644
index 0000000..fe91cc7
Binary files /dev/null and b/dhanush submission/assets/Architecture_diagram.png differ
diff --git a/dhanush submission/assets/Qn_2.png b/dhanush submission/assets/Qn_2.png
new file mode 100644
index 0000000..823cb8c
Binary files /dev/null and b/dhanush submission/assets/Qn_2.png differ
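Review note on question 2: the notebook derives each domain by splitting the URL on `/`. An alternative, sketched here with the standard library only, is to use `urllib.parse.urlsplit` and group domains by company. The function names and sample rows are illustrative assumptions; the real `(company_id, url)` pairs would come from `content.csv`.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return scheme plus host, e.g. 'https://example.com'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def companies_with_multiple_domains(rows):
    """Map each company_id to its domains, keeping only ids with more than one."""
    domains = defaultdict(set)
    for company_id, url in rows:
        domains[company_id].add(domain_of(url))
    return {cid: sorted(found) for cid, found in domains.items() if len(found) > 1}


# Illustrative rows only (hypothetical ids and URLs).
rows = [
    ("company-a", "https://www.example.com/about"),
    ("company-a", "https://example.com/contact"),
    ("company-b", "https://example.org/"),
]
print(companies_with_multiple_domains(rows))
```

Grouping by company directly answers "which companies have more than one domain" without building the intermediate first-seen table.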
diff --git a/dhanush submission/assets/Qn_3.png b/dhanush submission/assets/Qn_3.png
new file mode 100644
index 0000000..151d0f7
Binary files /dev/null and b/dhanush submission/assets/Qn_3.png differ
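Review note on questions 3 and 4: the notebook answers these with pandas `value_counts` and `groupby`. The same tallies can be sketched with `collections.Counter`, which may be handy for a quick sanity check without pandas. This is a swapped-in technique, not the submission's method, and the function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def chunk_type_counts(rows):
    """Tally chunk types overall and per company from (company_id, chunk_type) pairs."""
    overall = Counter()
    per_company = defaultdict(Counter)
    for company_id, chunk_type in rows:
        overall[chunk_type] += 1
        per_company[company_id][chunk_type] += 1
    # Convert the nested Counters to plain dicts for easier printing and comparison.
    return overall, {cid: dict(counts) for cid, counts in per_company.items()}


# Illustrative rows only (hypothetical ids).
rows = [
    ("company-a", "header"),
    ("company-a", "header"),
    ("company-a", "main"),
    ("company-b", "head"),
    ("company-b", "header"),
]
overall, per_company = chunk_type_counts(rows)
print(overall.most_common(1))
print(per_company)
```

The per-company result has the same shape as the notebook's `chunk_type_distribution_dict`: one dictionary of counts per `company_id`.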
diff --git a/dhanush submission/assets/demo.png b/dhanush submission/assets/demo.png
new file mode 100644
index 0000000..988fc42
Binary files /dev/null and b/dhanush submission/assets/demo.png differ
diff --git a/dhanush submission/notebooks/Dhanush GW.ipynb b/dhanush submission/notebooks/Dhanush GW.ipynb
new file mode 100644
index 0000000..1d1388c
--- /dev/null
+++ b/dhanush submission/notebooks/Dhanush GW.ipynb
@@ -0,0 +1,2652 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "32b4fb60",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from matplotlib import pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "93a352f1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = pd.read_csv('../data/content.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "25e5952c",
+ "metadata": {},
+ "source": [
+ "# Tasks - Part 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "30a8c8bb",
+ "metadata": {},
+ "source": [
+ "Here are some questions that we'd like you to answer about the dataset:\n",
+ "\n",
+ "1. How many companies are in the dataset?\n",
+ "2. How many unique URLs are in the dataset?\n",
+ "3. What is the most common chunk type?\n",
+ "4. What is the distribution of chunk types by company?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a9b154c",
+ "metadata": {},
+ "source": [
+ "## 1. How many companies are in the dataset?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "13945546",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "75"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['company_id']))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "d8870337",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['company_id'])) == len(np.unique(df['company_name']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da4434ca",
+ "metadata": {},
+ "source": [
+ "The dataset contains 75 unique companies. The `company_name` and `company_id` columns have the same number of unique values, which suggests a one-to-one correspondence between names and IDs (assuming `company_id` is unique per company)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f35dd734",
+ "metadata": {},
+ "source": [
+ "## 2. How many unique URLs are in the dataset?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "1d455aac",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "530"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['url']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5cfa3e26",
+ "metadata": {},
+ "source": [
+ "There are 530 unique URLs in this dataset. Next, extract the domain prefix (scheme and host) of each URL and check whether the number of distinct domains matches the number of unique `company_id` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "022127a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "prefix_url = [x.split('/')[0] + '//' + x.split('/')[2] for x in df['url']]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "d7d237a4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df['prefix_url'] = prefix_url"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "65408cba",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "77"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(prefix_url))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "379500a6",
+ "metadata": {},
+ "source": [
+ "There are 77 distinct domains but only 75 companies, so two companies appear to be associated with more than one domain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "66786ba5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Keep the first row seen for each domain, together with its company_id\n",
+ "first_seen = df.drop_duplicates(subset='prefix_url')\n",
+ "uniq_url = first_seen['prefix_url'].tolist()\n",
+ "company_id = first_seen['company_id'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "de8ff277",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(company_id) == len(uniq_url)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "c5466bbc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "temp_df = pd.DataFrame({'company_id_unique': company_id, 'unique_url': uniq_url})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "cb5893e2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "duplicate_df = temp_df[temp_df.duplicated(subset=['company_id_unique'])]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "f3c9d280",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>company_id_unique</th>\n",
+ "      <th>unique_url</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60</td>\n",
+ "      <td>https://www.bingotech.net</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>54</th>\n",
+ "      <td>a0dffac7-5b73-47bb-8a31-78440a1aef33</td>\n",
+ "      <td>https://4wheeltravels.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>56</th>\n",
+ "      <td>a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60</td>\n",
+ "      <td>https://bingotech.net</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>57</th>\n",
+ "      <td>a0dffac7-5b73-47bb-8a31-78440a1aef33</td>\n",
+ "      <td>https://sg2plcpnl0188.prod.sin2.secureserver.n...</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " company_id_unique \\\n",
+ "4 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n",
+ "54 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n",
+ "56 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n",
+ "57 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n",
+ "\n",
+ " unique_url \n",
+ "4 https://www.bingotech.net \n",
+ "54 https://4wheeltravels.com \n",
+ "56 https://bingotech.net \n",
+ "57 https://sg2plcpnl0188.prod.sin2.secureserver.n... "
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "temp_df[temp_df['company_id_unique'].isin(duplicate_df['company_id_unique'])]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ea31d37",
+ "metadata": {},
+ "source": [
+ "- Bingotech is listed under two domains: https://www.bingotech.net and https://bingotech.net\n",
+ "- 4wheeltravels is listed under two domains: https://4wheeltravels.com and https://sg2plcpnl0188.prod.sin2.secureserver.n..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea2849b5",
+ "metadata": {},
+ "source": [
+ "## 3. What is the most common chunk type?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "5a26687d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "chunk_type\n",
+ "header 549\n",
+ "main 545\n",
+ "head 530\n",
+ "footer 504\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df['chunk_type'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "75f39608",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAigAAAGdCAYAAAA44ojeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/H5lhTAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAhdElEQVR4nO3df1DUdeLH8Re/UWBBUBY5ES1LpTTSTNf8lmccZIyjJ9dV5xA6pGVIKaMZnalhd3Zel1mD2nSm3pyOl+dVkxmKlHajiIinY2pcOnXQ6cKVA4glP/f7R8PnblPLVXDf4PMxszPu5/Pez+f98cPqk93Pgo/L5XIJAADAIL7engAAAMD3ESgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjOPv7QlcidbWVp06dUphYWHy8fHx9nQAAMBlcLlcOnv2rGJjY+Xr+8OvkXTKQDl16pTi4uK8PQ0AAHAFKisr1adPnx8c0ykDJSwsTNJ3B2iz2bw8GwAAcDnq6uoUFxdn/T/+QzploLS9rWOz2QgUAAA6mcu5PIOLZAEAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBx/b0/ARP2eed/bU7huffFiqrenAAAwAK+gAAAA4xAoAADAOLzFg+sGb915D2/dAfAUr6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjMMPagPQ6fFD+LyHH8KHjsIrKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjONRoCxevFg+Pj5ut0GDBlnrz58/r6ysLEVFRSk0NFRpaWmqqqpy20ZFRYVSU1PVvXt3RUdHa968eWpubm6fowEAAF2Cv6cPuOWWW7Rz587/bsD/v5uYM2eO3n//fW3evFnh4eGaNWuWJk+erD179kiSWlpalJqaqpiYGO3du1enT5/WI488ooCAAP32t79th8MBAHQl/Z5539tTuG598WKqV/fvcaD4+/srJibmguW1tbVas2aNNm7cqHHjxkmS1q5dq8GDB2vfvn0aNWqUduzYoWPHjmnnzp2y2+1KTEzUkiVLNH/+fC1evFiBgYFXf0QAAKDT8/galM8++0yxsbG64YYbNGXKFFVUVEiSysrK1NTUpKSkJGvsoEGD1LdvXxUXF0uSiouLNWTIENntdmtMSkqK6urqdPTo0Uvus6GhQXV1dW43AADQdXkUKCNHjtS6detUUFCgVatW6fPPP9f//d//6ezZs3I6nQoMDFRERITbY+x2u5xOpyTJ6XS6xUnb+rZ1l7J06VKFh4dbt7i4OE+mDQAAOhmP3uIZP3689eehQ4dq5MiRio+P11tvvaVu3bq1++Ta5ObmKicnx7pfV1dHpAAA0IVd1ceMIyIidPPNN+vEiROKiYlRY2Ojampq3MZUVVVZ16zExMRc8KmetvsXu66lTVBQkGw2m9sNAAB0XVcVKPX19Tp58qR69+6t4cOHKyAgQEVFRdb68vJyVVRUyOFwSJIcDoeOHDmi6upqa0xhYaFsNpsSEhKuZioAAKAL8egtnrlz52rChAmKj4/XqVOntGjRIvn5+enhhx9WeHi4MjMzlZOTo8jISNlsNmVnZ8vhcGjUqFGSpOTkZCUkJCg9PV3Lli2T0+nUggULlJWVpaCgoA45QAAA0Pl4FChffvmlHn74YX399dfq1auXxowZo3379qlXr16SpOXLl8vX11dpaWlqaGhQSkqKVq5caT3ez89PW7du1cyZM+VwOBQSEqKMjAzl5eW171EBAIBOzaNA2bRp0w+uDw4OVn5+vvLz8y85Jj4+Xtu2bfNk
twAA4DrD7+IBAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMa5qkB58cUX5ePjo9mzZ1vLzp8/r6ysLEVFRSk0NFRpaWmqqqpye1xFRYVSU1PVvXt3RUdHa968eWpubr6aqQAAgC7kigOltLRUr7/+uoYOHeq2fM6cOXrvvfe0efNm7d69W6dOndLkyZOt9S0tLUpNTVVjY6P27t2r9evXa926dVq4cOGVHwUAAOhSrihQ6uvrNWXKFL3xxhvq0aOHtby2tlZr1qzRyy+/rHHjxmn48OFau3at9u7dq3379kmSduzYoWPHjunPf/6zEhMTNX78eC1ZskT5+flqbGxsn6MCAACd2hUFSlZWllJTU5WUlOS2vKysTE1NTW7LBw0apL59+6q4uFiSVFxcrCFDhshut1tjUlJSVFdXp6NHj150fw0NDaqrq3O7AQCArsvf0wds2rRJBw8eVGlp6QXrnE6nAgMDFRER4bbcbrfL6XRaY/43TtrWt627mKVLl+r555/3dKoAAKCT8ugVlMrKSj311FPasGGDgoODO2pOF8jNzVVtba11q6ysvGb7BgAA155HgVJWVqbq6moNGzZM/v7+8vf31+7du/Xqq6/K399fdrtdjY2NqqmpcXtcVVWVYmJiJEkxMTEXfKqn7X7bmO8LCgqSzWZzuwEAgK7Lo0C59957deTIER06dMi63XHHHZoyZYr154CAABUVFVmPKS8vV0VFhRwOhyTJ4XDoyJEjqq6utsYUFhbKZrMpISGhnQ4LAAB0Zh5dgxIWFqZbb73VbVlISIiioqKs5ZmZmcrJyVFkZKRsNpuys7PlcDg0atQoSVJycrISEhKUnp6uZcuWyel0asGCBcrKylJQUFA7HRYAAOjMPL5I9scsX75cvr6+SktLU0NDg1JSUrRy5UprvZ+fn7Zu3aqZM2fK4XAoJCREGRkZysvLa++pAACATuqqA2XXrl1u94ODg5Wfn6/8/PxLPiY+Pl7btm272l0DAIAuit/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACM41GgrFq1SkOHDpXNZpPNZpPD4dAHH3xgrT9//ryysrIUFRWl0NBQpaWlqaqqym0bFRUVSk1NVffu3RUdHa158+apubm5fY4GAAB0CR4FSp8+ffTiiy+qrKxMBw4c0Lhx4zRx4kQdPXpUkjRnzhy999572rx5s3bv3q1Tp05p8uTJ1uNbWlqUmpqqxsZG7d27V+vXr9e6deu0cOHC9j0qAADQ
qfl7MnjChAlu93/zm99o1apV2rdvn/r06aM1a9Zo48aNGjdunCRp7dq1Gjx4sPbt26dRo0Zpx44dOnbsmHbu3Cm73a7ExEQtWbJE8+fP1+LFixUYGNh+RwYAADqtK74GpaWlRZs2bdK5c+fkcDhUVlampqYmJSUlWWMGDRqkvn37qri4WJJUXFysIUOGyG63W2NSUlJUV1dnvQpzMQ0NDaqrq3O7AQCArsvjQDly5IhCQ0MVFBSkxx9/XG+//bYSEhLkdDoVGBioiIgIt/F2u11Op1OS5HQ63eKkbX3buktZunSpwsPDrVtcXJyn0wYAAJ2Ix4EycOBAHTp0SCUlJZo5c6YyMjJ07NixjpibJTc3V7W1tdatsrKyQ/cHAAC8y6NrUCQpMDBQAwYMkCQNHz5cpaWlWrFihR588EE1NjaqpqbG7VWUqqoqxcTESJJiYmK0f/9+t+21fcqnbczFBAUFKSgoyNOpAgCATuqqfw5Ka2urGhoaNHz4cAUEBKioqMhaV15eroqKCjkcDkmSw+HQkSNHVF1dbY0pLCyUzWZTQkLC1U4FAAB0ER69gpKbm6vx48erb9++Onv2rDZu3Khdu3Zp+/btCg8PV2ZmpnJychQZGSmbzabs7Gw5HA6NGjVKkpScnKyEhASlp6dr2bJlcjqdWrBggbKysniFBAAAWDwKlOrqaj3yyCM6ffq0wsPDNXToUG3fvl0/+9nPJEnLly+Xr6+v0tLS1NDQoJSUFK1cudJ6vJ+fn7Zu3aqZM2fK4XAoJCREGRkZysvLa9+jAgAAnZpHgbJmzZofXB8cHKz8/Hzl5+dfckx8fLy2bdvmyW4BAMB1ht/FAwAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACM41GgLF26VCNGjFBYWJiio6M1adIklZeXu405f/68srKyFBUVpdDQUKWlpamqqsptTEVFhVJTU9W9e3dFR0dr3rx5am5uvvqjAQAAXYJHgbJ7925lZWVp3759KiwsVFNTk5KTk3Xu3DlrzJw5c/Tee+9p8+bN2r17t06dOqXJkydb61taWpSamqrGxkbt3btX69ev17p167Rw4cL2OyoAANCp+XsyuKCgwO3+unXrFB0drbKyMt19992qra3VmjVrtHHjRo0bN06StHbtWg0ePFj79u3TqFGjtGPHDh07dkw7d+6U3W5XYmKilixZovnz52vx4sUKDAxsv6MDAACd0lVdg1JbWytJioyMlCSVlZWpqalJSUlJ1phBgwapb9++Ki4uliQVFxdryJAhstvt1piUlBTV1dXp6NGjF91PQ0OD6urq3G4AAKDruuJAaW1t1ezZs3XXXXfp1ltvlSQ5nU4FBgYqIiLCbazdbpfT6bTG/G+ctK1vW3cxS5cuVXh4uHWLi4u70mkDAIBO4IoDJSsrS5988ok2bdrUnvO5qNzcXNXW1lq3ysrKDt8nAADwHo+uQWkza9Ysbd26VR9//LH69OljLY+JiVFjY6NqamrcXkWpqqpSTEyMNWb//v1u22v7lE/bmO8LCgpSUFDQlUwVAAB0Qh69guJyuTRr1iy9/fbb+vDDD9W/f3+39cOHD1dAQICKioqsZeXl5aqoqJDD4ZAkORwOHTlyRNXV1daY
wsJC2Ww2JSQkXM2xAACALsKjV1CysrK0ceNGvfvuuwoLC7OuGQkPD1e3bt0UHh6uzMxM5eTkKDIyUjabTdnZ2XI4HBo1apQkKTk5WQkJCUpPT9eyZcvkdDq1YMECZWVl8SoJAACQ5GGgrFq1SpI0duxYt+Vr167V1KlTJUnLly+Xr6+v0tLS1NDQoJSUFK1cudIa6+fnp61bt2rmzJlyOBwKCQlRRkaG8vLyru5IAABAl+FRoLhcrh8dExwcrPz8fOXn519yTHx8vLZt2+bJrgEAwHWE38UDAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOAQKAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIzjcaB8/PHHmjBhgmJjY+Xj46N33nnHbb3L5dLChQvVu3dvdevWTUlJSfrss8/cxpw5c0ZTpkyRzWZTRESEMjMzVV9ff1UHAgAAug6PA+XcuXO67bbblJ+ff9H1y5Yt06uvvqrVq1erpKREISEhSklJ0fnz560xU6ZM0dGjR1VYWKitW7fq448/1owZM678KAAAQJfi7+kDxo8fr/Hjx190ncvl0iuvvKIFCxZo4sSJkqQ//elPstvteuedd/TQQw/p+PHjKigoUGlpqe644w5J0muvvab7779fL730kmJjY6/icAAAQFfQrtegfP7553I6nUpKSrKWhYeHa+TIkSouLpYkFRcXKyIiwooTSUpKSpKvr69KSkouut2GhgbV1dW53QAAQNfVroHidDolSXa73W253W631jmdTkVHR7ut9/f3V2RkpDXm+5YuXarw8HDrFhcX157TBgAAhukUn+LJzc1VbW2tdausrPT2lAAAQAdq10CJiYmRJFVVVbktr6qqstbFxMSourrabX1zc7POnDljjfm+oKAg2Ww2txsAAOi62jVQ+vfvr5iYGBUVFVnL6urqVFJSIofDIUlyOByqqalRWVmZNebDDz9Ua2urRo4c2Z7TAQAAnZTHn+Kpr6/XiRMnrPuff/65Dh06pMjISPXt21ezZ8/WCy+8oJtuukn9+/fXc889p9jYWE2aNEmSNHjwYN13332aPn26Vq9eraamJs2aNUsPPfQQn+ABAACSriBQDhw4oJ/+9KfW/ZycHElSRkaG1q1bp6efflrnzp3TjBkzVFNTozFjxqigoEDBwcHWYzZs2KBZs2bp3nvvla+vr9LS0vTqq6+2w+EAAICuwONAGTt2rFwu1yXX+/j4KC8vT3l5eZccExkZqY0bN3q6awAAcJ3oFJ/iAQAA1xcCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACAcQgUAABgHAIFAAAYh0ABAADGIVAAAIBxCBQAAGAcAgUAABiHQAEAAMYhUAAAgHEIFAAAYBwCBQAAGIdAAQAAxiFQAACA
cQgUAABgHK8GSn5+vvr166fg4GCNHDlS+/fv9+Z0AACAIbwWKH/5y1+Uk5OjRYsW6eDBg7rtttuUkpKi6upqb00JAAAYwmuB8vLLL2v69OmaNm2aEhIStHr1anXv3l1vvvmmt6YEAAAM4e+NnTY2NqqsrEy5ubnWMl9fXyUlJam4uPiC8Q0NDWpoaLDu19bWSpLq6uo6ZH6tDd90yHbx4zrqnEqcV2/qyPMqcW69iXPbdXXEuW3bpsvl+tGxXgmUr776Si0tLbLb7W7L7Xa7Pv300wvGL126VM8///wFy+Pi4jpsjvCO8Fe8PQN0BM5r18W57bo68tyePXtW4eHhPzjGK4HiqdzcXOXk5Fj3W1tbdebMGUVFRcnHx8eLMzNLXV2d4uLiVFlZKZvN5u3poB1xbrsmzmvXxbm9OJfLpbNnzyo2NvZHx3olUHr27Ck/Pz9VVVW5La+qqlJMTMwF44OCghQUFOS2LCIioiOn2KnZbDaeEF0U57Zr4rx2XZzbC/3YKydtvHKRbGBgoIYPH66ioiJrWWtrq4qKiuRwOLwxJQAAYBCvvcWTk5OjjIwM3XHHHbrzzjv1yiuv6Ny5c5o2bZq3pgQAAAzhtUB58MEH9Z///EcLFy6U0+lUYmKiCgoKLrhwFpcvKChIixYtuuDtMHR+nNuuifPadXFur56P63I+6wMAAHAN8bt4AACAcQgUAABgHAIFAAAYh0DxgrFjx2r27NnXfL/9+vXTK6+8cs33i44zdepUTZo0ydvTuK7xfO5aXC6XZsyYocjISPn4+OjQoUPentJ1q1P8JFkAF7dixYrL+p0WAC5PQUGB1q1bp127dumGG25Qz549r3qbY8eOVWJiIkHpIQIFl62xsVGBgYHengb+x+X+REYAl+fkyZPq3bu3Ro8e7e2pXOB6+zeYt3i8pLW1VU8//bQiIyMVExOjxYsXW+tqamr06KOPqlevXrLZbBo3bpwOHz5srT958qQmTpwou92u0NBQjRgxQjt37nTbfnV1tSZMmKBu3bqpf//+2rBhwwVz+LH9LF68WImJifrjH/+o/v37Kzg4uP3/Iq4jY8eOVXZ2tmbPnq0ePXrIbrfrjTfesH5AYVhYmAYMGKAPPvhAktTS0qLMzEz1799f3bp108CBA7VixQq3bX7/LZ6xY8fqySefvOTXFjqGCc9nXL2pU6cqOztbFRUV8vHxUb9+/dTQ0KAnn3xS0dHRCg4O1pgxY1RaWur2uN27d+vOO+9UUFCQevfurWeeeUbNzc3WNnfv3q0VK1bIx8dHPj4++uKLLyRJn3zyicaPH6/Q0FDZ7Xalp6frq6++srY7duxYzZo1S7Nnz1bPnj2VkpJyzf4uTECgeMn69esVEhKikpISLVu2THl5eSosLJQkPfDAA6qurtYHH3ygsrIyDRs2TPfee6/OnDkjSaqvr9f999+voqIi/eMf/9B9992nCRMmqKKiwtr+1KlTVVlZqY8++kh//etftXLlSlVXV7vN4cf2I0knTpzQli1b9Le//Y33YtvB+vXr1bNnT+3fv1/Z2dmaOXOmHnjgAY0ePVoHDx5UcnKy0tPT9c0336i1tVV9+vTR5s2bdezYMS1cuFDPPvus3nrrrR/dx6W+ttAxTHg+4+qtWLFCeXl56tOnj06fPq3S0lI9/fTT2rJli9avX6+DBw9qwIABSklJsc7fv//9b91///0aMWKEDh8+rFWrVmnNmjV64YUXrG06HA5Nnz5dp0+f1unTpxUXF6eamhqNGzdOt99+uw4cOKCCggJVVVXpl7/8pduc1q9fr8DAQO3Zs0erV6++5n8nXuXCNXfPPfe4xowZ47ZsxIgRrvnz57v+/ve/u2w2m+v8+fNu62+88UbX66+/fslt3nLLLa7XXnvN5XK5XOXl5S5Jrv3791vrjx8/7pLkWr58ucvlcl3WfhYtWuQKCAhwVVdXX/Gx4r++f96bm5tdISEhrvT0dGvZ6dOnXZJcxcXFF91G
VlaWKy0tzbqfkZHhmjhx4iX34XL992sLHcOE5zPaz/Lly13x8fEul8vlqq+vdwUEBLg2bNhgrW9sbHTFxsa6li1b5nK5XK5nn33WNXDgQFdra6s1Jj8/3xUaGupqaWlxuVzffY089dRTbvtZsmSJKzk52W1ZZWWlS5KrvLzcetztt9/e3ofYaXANipcMHTrU7X7v3r1VXV2tw4cPq76+XlFRUW7rv/32W508eVLSd99xLV68WO+//75Onz6t5uZmffvtt9Z3XMePH5e/v7+GDx9uPX7QoEFuvwH6cvYjSfHx8erVq1e7HDPcz7ufn5+ioqI0ZMgQa1nbr3po++44Pz9fb775pioqKvTtt9+qsbFRiYmJl70P6b9fW+g43n4+o2OcPHlSTU1Nuuuuu6xlAQEBuvPOO3X8+HFJ350fh8MhHx8fa8xdd92l+vp6ffnll+rbt+9Ft3348GF99NFHCg0Nveh+b775ZklyO+/XGwLFSwICAtzu+/j4qLW1VfX19erdu7d27dp1wWPa/kGaO3euCgsL9dJLL2nAgAHq1q2bfvGLX6ixsfGy9385+5GkkJCQy94mftzFzvv/Lmv7R661tVWbNm3S3Llz9Yc//EEOh0NhYWH6/e9/r5KSEo/30dra2k5HgIvx9vMZnU99fb0mTJig3/3udxes6927t/Xn6/nfYALFMMOGDZPT6ZS/v7/69et30TF79uzR1KlT9fOf/1zSd1/obRddSd99d9Xc3KyysjKNGDFCklReXq6amhqP9gPv2rNnj0aPHq0nnnjCWva/r27BfNfq+YyOceONN1rXf8THx0uSmpqaVFpaav3sm8GDB2vLli1yuVzWNxh79uxRWFiY+vTpI0kKDAxUS0uL27aHDRumLVu2qF+/fvL357/ii+EiWcMkJSXJ4XBo0qRJ2rFjh7744gvt3btXv/71r3XgwAFJ0k033WRdtHr48GH96le/cvsOeeDAgbrvvvv02GOPqaSkRGVlZXr00UfVrVs3j/YD77rpppt04MABbd++Xf/85z/13HPPXfDpAZjtWj2f0TFCQkI0c+ZMzZs3TwUFBTp27JimT5+ub775RpmZmZKkJ554QpWVlcrOztann36qd999V4sWLVJOTo58fb/7L7Zfv34qKSnRF198oa+++kqtra3KysrSmTNn9PDDD6u0tFQnT57U9u3bNW3atAti5npFoBjGx8dH27Zt0913361p06bp5ptv1kMPPaR//etf1vUJL7/8snr06KHRo0drwoQJSklJ0bBhw9y2s3btWsXGxuqee+7R5MmTNWPGDEVHR3u0H3jXY489psmTJ+vBBx/UyJEj9fXXX7u9mgLzXavnMzrOiy++qLS0NKWnp2vYsGE6ceKEtm/frh49ekiSfvKTn2jbtm3av3+/brvtNj3++OPKzMzUggULrG3MnTtXfn5+SkhIUK9evVRRUaHY2Fjt2bNHLS0tSk5O1pAhQzR79mxFRERYYXO983G5+DGUAADALGQaAAAwDoECAACMQ6AAAADjECgAAMA4BAoAADAOgQIAAIxDoAAAAOMQKAAAwDgECgAAMA6BAgAAjEOgAAAA4xAoAADAOP8PSfZ31rcUydgAAAAASUVORK5CYII=",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "counts = df['chunk_type'].value_counts()\n",
+ "plt.bar(counts.index, counts.values) \n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a92308fc",
+ "metadata": {},
+ "source": [
+ "**Header** is the most common chunk type."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08a0614f",
+ "metadata": {},
+ "source": [
+ "## 4. What is the distribution of chunk types by company?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "9cd5724a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "chunk_type_distribution = df.groupby('company_id')['chunk_type'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "3dfbc7ce",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>company_id</th>\n",
+ "      <th>chunk_type</th>\n",
+ "      <th></th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th rowspan=\"4\" valign=\"top\">041e9ac4-d7eb-499f-8cfb-fa95dca20cd5</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>head</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>header</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>main</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>057293c5-03c8-427d-9c78-d7cfee35e525</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>9</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>...</th>\n",
+ "      <th>...</th>\n",
+ "      <td>...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>f7958fe2-7f18-4da1-be80-1b6ebd68e427</th>\n",
+ "      <th>main</th>\n",
+ "      <td>1</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th rowspan=\"4\" valign=\"top\">f9d892e2-ded2-4ac3-bf2d-bc65c5c58131</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>head</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>header</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>main</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>268 rows × 1 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " count\n",
+ "company_id chunk_type \n",
+ "041e9ac4-d7eb-499f-8cfb-fa95dca20cd5 footer 11\n",
+ " head 11\n",
+ " header 11\n",
+ " main 11\n",
+ "057293c5-03c8-427d-9c78-d7cfee35e525 footer 9\n",
+ "... ...\n",
+ "f7958fe2-7f18-4da1-be80-1b6ebd68e427 main 1\n",
+ "f9d892e2-ded2-4ac3-bf2d-bc65c5c58131 footer 11\n",
+ " head 11\n",
+ " header 11\n",
+ " main 11\n",
+ "\n",
+ "[268 rows x 1 columns]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.DataFrame(chunk_type_distribution)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "558c6f6c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def create_company_dict(group):\n",
+ "    # Map each chunk type to its count within a single company's rows\n",
+ "    return group['chunk_type'].value_counts().to_dict()\n",
+ "\n",
+ "chunk_type_distribution_dict = {company_id: create_company_dict(group) for company_id, group in df.groupby('company_id')}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "9a4a4013",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'041e9ac4-d7eb-499f-8cfb-fa95dca20cd5': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '057293c5-03c8-427d-9c78-d7cfee35e525': {'head': 9,\n",
+ " 'header': 9,\n",
+ " 'footer': 9,\n",
+ " 'main': 9},\n",
+ " '0a2dd621-fc29-4d92-92af-a9f27d36b88a': {'main': 12,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'head': 11},\n",
+ " '1a944f93-a2ac-4352-9b65-a39c7addcb36': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '1b7e4fac-5467-4773-a330-e9ef080aa00a': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'header': 5},\n",
+ " '1b9af211-c118-45dc-a7a9-c36acd5c3606': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 10},\n",
+ " '1c454ddb-4be8-4c6a-922e-45c877efa542': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '1ccb0060-ed06-40d9-affd-b94b7e5041a1': {'head': 1, 'main': 1},\n",
+ " '1d793954-ad6e-4d64-84d6-965b6d7e6e28': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '1fd25493-2b64-4e45-9286-496d1a2899c1': {'header': 23,\n",
+ " 'main': 11,\n",
+ " 'footer': 10,\n",
+ " 'head': 10},\n",
+ " '21a77c0a-0dae-4d9d-8f16-722c4cb80fa4': {'main': 11,\n",
+ " 'head': 9,\n",
+ " 'header': 9,\n",
+ " 'footer': 9},\n",
+ " '2312afd0-a0b4-41f2-a3f5-20fcaae1cf1d': {'head': 4,\n",
+ " 'header': 4,\n",
+ " 'footer': 4,\n",
+ " 'main': 4},\n",
+ " '234793b5-f217-4692-bad8-6ac7210c4d99': {'head': 1, 'header': 1, 'main': 1},\n",
+ " '2445ce14-bb54-4765-89bf-1f828686cb16': {'main': 7,\n",
+ " 'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1},\n",
+ " '2646717c-561c-4a07-981c-04d60ae26c5d': {'head': 1, 'main': 1},\n",
+ " '27587b6e-5ba1-4c57-96f4-7fc16abac658': {'header': 7, 'head': 5, 'footer': 5},\n",
+ " '2abc40ce-3099-44fc-b5cd-a4abc80ad196': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'main': 11,\n",
+ " 'footer': 10},\n",
+ " '2b6f6bd0-fb3b-40b1-8728-6ee500e861a7': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '311dffd7-84bc-4cb6-8bb9-0a87e2ebe0b6': {'head': 6, 'footer': 6, 'main': 6},\n",
+ " '33203cd4-b2f7-4a60-a949-b53c844f1288': {'head': 1, 'main': 1},\n",
+ " '350207ab-6df5-4624-9e11-1422ffd3d6e7': {'head': 2,\n",
+ " 'footer': 2,\n",
+ " 'main': 2,\n",
+ " 'header': 1},\n",
+ " '38ac3499-f153-42f5-a07e-ff286bb3058e': {'head': 1, 'main': 1},\n",
+ " '3f2be1af-115e-49fd-981c-523d7281c78e': {'header': 2,\n",
+ " 'head': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '47a9fbaa-0c12-442e-b030-78d3ee7b15d3': {'main': 15,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4c1fde18-8a40-4ee7-9c3c-19152c7d1ff8': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4df58b56-4c19-4602-bdcc-a0f9bc0c1c81': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4e8e3cf3-79fb-4251-a2e9-970c1ee97068': {'main': 14,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 10},\n",
+ " '4faddb9c-68bf-4bbe-b471-07d9b07fb7fe': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'main': 10,\n",
+ " 'footer': 9},\n",
+ " '506068b9-0b01-4853-b981-850fc0300d34': {'head': 1, 'main': 1},\n",
+ " '54481572-095b-4314-a99c-54c8d7cb3058': {'head': 10,\n",
+ " 'main': 10,\n",
+ " 'header': 8,\n",
+ " 'footer': 6},\n",
+ " '550e7ff3-6393-42f8-92c7-cb273f022445': {'main': 8,\n",
+ " 'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6},\n",
+ " '5a767574-556b-439a-b1d8-89334fbdbaad': {'head': 9, 'header': 9, 'footer': 9},\n",
+ " '5c3a39f0-c5d6-4a24-85c3-0684b1228dce': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '5d23c4f4-1e1d-459b-847d-ea05d6f10056': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '649eac25-762d-4b81-98d2-e3a9a481c4ac': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 10},\n",
+ " '65cb7019-5546-4adf-8294-f03e5c35bf4d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '66d2d1bc-73f9-4df4-855d-12330a2cde05': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '69cdaf4f-298d-42db-abef-36c040489a58': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '6f1491c1-aff5-45cf-98aa-2e07c9394f78': {'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11,\n",
+ " 'header': 10},\n",
+ " '7227bcbe-e8b6-4285-8c73-4c9375d9df83': {'head': 7,\n",
+ " 'header': 7,\n",
+ " 'footer': 7,\n",
+ " 'main': 7},\n",
+ " '75c123ba-2e0a-4d3f-a7ec-f737c5f61c2e': {'head': 1, 'main': 1},\n",
+ " '77713668-9b82-4038-ac33-0b1a3c4d4715': {'main': 18,\n",
+ " 'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " '784ba666-e00d-459e-9417-66f8a59f84ec': {'head': 11, 'main': 11},\n",
+ " '7af13eb6-125a-4e01-974b-712ea1a9dea7': {'head': 4,\n",
+ " 'header': 4,\n",
+ " 'footer': 4,\n",
+ " 'main': 4},\n",
+ " '84210586-05e5-4324-9c12-2883697d9937': {'head': 1, 'main': 1},\n",
+ " '8beba93c-0fd6-460b-a531-b8475bbbe6e8': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '8db9b2e5-d6e6-4d29-ba56-347bb8507b08': {'head': 1, 'main': 1},\n",
+ " '9bf1ccd3-dd15-4fc2-8ee9-1c3ab1cc7a94': {'header': 10,\n",
+ " 'head': 5,\n",
+ " 'footer': 5},\n",
+ " '9c539408-9119-4540-8a04-68a0dc259d2d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '9c8bf109-c7f6-4db4-ac5b-15838f692628': {'header': 18,\n",
+ " 'head': 9,\n",
+ " 'footer': 9,\n",
+ " 'main': 9},\n",
+ " 'a07fb185-9f97-498c-bcda-a4c21fe27467': {'main': 13,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'head': 10},\n",
+ " 'a0dffac7-5b73-47bb-8a31-78440a1aef33': {'head': 2, 'main': 2},\n",
+ " 'a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60': {'head': 2,\n",
+ " 'footer': 2,\n",
+ " 'main': 2,\n",
+ " 'header': 1},\n",
+ " 'aab2261b-7065-460d-83de-99404fb50f65': {'main': 13,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " 'acab4569-6d20-4f45-a649-501d998e96fd': {'head': 11,\n",
+ " 'main': 11,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " 'ae11f430-4c8f-4d77-bb02-c2f1c392bd6d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " 'b244f8a0-cdee-42a4-988f-8d431d4d0794': {'head': 1, 'main': 1},\n",
+ " 'b2a1f0aa-0578-4d89-8be1-56d86414a1b4': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'b34a4aec-a2ab-46ca-abca-85b07e6646b2': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'b598d098-e179-4355-873b-a0119f658b24': {'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6,\n",
+ " 'main': 6},\n",
+ " 'ba955d6d-d1bd-4e92-834f-c45f820679b6': {'head': 5,\n",
+ " 'header': 5,\n",
+ " 'footer': 5,\n",
+ " 'main': 5},\n",
+ " 'bb0eef13-1271-4f12-992f-209aab01bac3': {'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6,\n",
+ " 'main': 6},\n",
+ " 'c4ef42c5-511c-4991-bce9-e04c739cbd22': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'c7d9376d-9eff-4ab0-8e35-230697588b93': {'head': 4,\n",
+ " 'main': 4,\n",
+ " 'header': 2,\n",
+ " 'footer': 2},\n",
+ " 'c84b92cc-28a5-4383-845d-237926a5f120': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'cd17faf2-60f5-4a27-b727-9d9337610077': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'main': 11,\n",
+ " 'footer': 10},\n",
+ " 'cd931e32-04fd-4265-8b9f-99943dbd77f4': {'header': 19,\n",
+ " 'head': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 1},\n",
+ " 'd09d7050-14d4-4bee-b7bc-709a046fa5c0': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'd90cd817-7db0-4394-a5ef-65df2bd96487': {'head': 5,\n",
+ " 'header': 5,\n",
+ " 'footer': 5,\n",
+ " 'main': 5},\n",
+ " 'ed5f1cb6-6ab1-4b61-9412-5c23e6f02cf0': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'edda975d-a01c-4e93-8245-68183fb19ced': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " 'f3004596-8695-4808-8c54-d80cb63c98d0': {'head': 1, 'main': 1},\n",
+ " 'f5b2f14b-1186-4f74-aa14-86ec73da913f': {'footer': 18, 'head': 9, 'main': 9},\n",
+ " 'f7958fe2-7f18-4da1-be80-1b6ebd68e427': {'head': 1, 'footer': 1, 'main': 1},\n",
+ " 'f9d892e2-ded2-4ac3-bf2d-bc65c5c58131': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11}}"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "chunk_type_distribution_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ff34159",
+ "metadata": {},
+ "source": [
+ "## General Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7816b4c8",
+ "metadata": {},
+ "source": [
+ "### Analyzing chunk_id\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "c9c57042",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2128"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['chunk_id']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "897874aa",
+ "metadata": {},
+ "source": [
+ "All values in `chunk_id` are unique, so the column can serve as the primary key for indexing and lookup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3d0b8ffb",
+ "metadata": {},
+ "source": [
+ "### Analyzing chunk_hash\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "e8233ca8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1337"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['chunk_hash']))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "e938bcd0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>company_id</th>\n",
+ "      <th>company_name</th>\n",
+ "      <th>url</th>\n",
+ "      <th>chunk_type</th>\n",
+ "      <th>chunk_hash</th>\n",
+ "      <th>chunk</th>\n",
+ "      <th>chunk_id</th>\n",
+ "      <th>prefix_url</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>60</th>\n",
+ "      <td>aab2261b-7065-460d-83de-99404fb50f65</td>\n",
+ "      <td>betahaus Sofia</td>\n",
+ "      <td>https://betahaus.bg/blog</td>\n",
+ "      <td>footer</td>\n",
+ "      <td>2f75ead3d8821183e12c322c1800afe60a175572ca3b3a...</td>\n",
+ "      <td>&lt;footer class=\"footer\"&gt;&lt;h4&gt;Връзка с нас&lt;/h4&gt;ул...</td>\n",
+ "      <td>af3a36a1-fb5a-44b1-b5ae-9b040ca5889c</td>\n",
+ "      <td>https://betahaus.bg</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>63</th>\n",
+ "      <td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td>\n",
+ "      <td>The Best Bees Company</td>\n",
+ "      <td>https://bestbees.com/blog/</td>\n",
+ "      <td>header</td>\n",
+ "      <td>2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f...</td>\n",
+ "      <td>&lt;header class=\"fl-builder-content fl-builder-c...</td>\n",
+ "      <td>293eeafb-09f9-4611-8d37-0e5f07dae564</td>\n",
+ "      <td>https://bestbees.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>64</th>\n",
+ "      <td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td>\n",
+ "      <td>The Best Bees Company</td>\n",
+ "      <td>https://bestbees.com/blog/</td>\n",
+ "      <td>footer</td>\n",
+ "      <td>c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17...</td>\n",
+ "      <td>&lt;footer class=\"fl-builder-content fl-builder-c...</td>\n",
+ "      <td>fe49b6bb-9a6d-4c3a-8b23-40c8071e3cd4</td>\n",
+ "      <td>https://bestbees.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>87</th>\n",
+ "      <td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td>\n",
+ "      <td>The Best Bees Company</td>\n",
+ "      <td>https://bestbees.com/get-started/</td>\n",
+ "      <td>header</td>\n",
+ "      <td>2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f...</td>\n",
+ "      <td>&lt;header class=\"fl-builder-content fl-builder-c...</td>\n",
+ "      <td>a6851616-8a65-4409-9a63-66a6eb2153cc</td>\n",
+ "      <td>https://bestbees.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>88</th>\n",
+ "      <td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td>\n",
+ "      <td>The Best Bees Company</td>\n",
+ "      <td>https://bestbees.com/get-started/</td>\n",
+ "      <td>footer</td>\n",
+ "      <td>c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17...</td>\n",
+ "      <td>&lt;footer class=\"fl-builder-content fl-builder-c...</td>\n",
+ "      <td>271d5729-d600-47f6-87e2-5b7aa065f712</td>\n",
+ "      <td>https://bestbees.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>...</th>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2111</th>\n",
+ "      <td>1fd25493-2b64-4e45-9286-496d1a2899c1</td>\n",
+ "      <td>American Express Travel</td>\n",
+ "      <td>https://www.americanexpress.com/en-us/support/...</td>\n",
+ "      <td>header</td>\n",
+ "      <td>994f9b66932ae87ff338704924562cf913f73c0f49ae26...</td>\n",
+ "      <td>&lt;ul&gt;&lt;li&gt;&lt;a href=\"https://www.americanexpress.c...</td>\n",
+ "      <td>ace4dac1-4401-4b5b-b61a-4b922d3a8362</td>\n",
+ "      <td>https://www.americanexpress.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2112</th>\n",
+ "      <td>1fd25493-2b64-4e45-9286-496d1a2899c1</td>\n",
+ "      <td>American Express Travel</td>\n",
+ "      <td>https://www.americanexpress.com/en-us/support/...</td>\n",
+ "      <td>footer</td>\n",
+ "      <td>7d4855c2c301923efef11a825dca16e1c6791d6cec96dc...</td>\n",
+ "      <td>&lt;footer class=\"axp-footer__footer__footer___32...</td>\n",
+ "      <td>4ea2b388-7505-4e20-b57e-75eaddb4d242</td>\n",
+ "      <td>https://www.americanexpress.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2120</th>\n",
+ "      <td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td>\n",
+ "      <td>Mobile Programming LLC.</td>\n",
+ "      <td>https://www.mobileprogramming.com/</td>\n",
+ "      <td>head</td>\n",
+ "      <td>023ecc39b08c78b255f966b4e7768ab00f491469e096f0...</td>\n",
+ "      <td>&lt;head&gt;&lt;title&gt;App Design &amp; Development | Mo...</td>\n",
+ "      <td>07c06cb2-1d10-46ac-84e3-b0eaa0279fb2</td>\n",
+ "      <td>https://www.mobileprogramming.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2121</th>\n",
+ "      <td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td>\n",
+ "      <td>Mobile Programming LLC.</td>\n",
+ "      <td>https://www.mobileprogramming.com/</td>\n",
+ "      <td>header</td>\n",
+ "      <td>4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4...</td>\n",
+ "      <td>&lt;header class=\"top-header\" id=\"myHeader\"&gt;&lt;nav&gt;...</td>\n",
+ "      <td>b9015037-7ef1-4bb1-8607-f63fefb06226</td>\n",
+ "      <td>https://www.mobileprogramming.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2122</th>\n",
+ "      <td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td>\n",
+ "      <td>Mobile Programming LLC.</td>\n",
+ "      <td>https://www.mobileprogramming.com/</td>\n",
+ "      <td>footer</td>\n",
+ "      <td>f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945...</td>\n",
+ "      <td>&lt;footer class=\"container-fluid pd-0 footer-bg\"...</td>\n",
+ "      <td>6afb8c04-87cc-45d7-a7f4-c11df0d88d54</td>\n",
+ "      <td>https://www.mobileprogramming.com</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>791 rows × 8 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " company_id company_name \\\n",
+ "60 aab2261b-7065-460d-83de-99404fb50f65 betahaus Sofia \n",
+ "63 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "64 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "87 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "88 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "... ... ... \n",
+ "2111 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n",
+ "2112 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n",
+ "2120 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "2121 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "2122 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "\n",
+ " url chunk_type \\\n",
+ "60 https://betahaus.bg/blog footer \n",
+ "63 https://bestbees.com/blog/ header \n",
+ "64 https://bestbees.com/blog/ footer \n",
+ "87 https://bestbees.com/get-started/ header \n",
+ "88 https://bestbees.com/get-started/ footer \n",
+ "... ... ... \n",
+ "2111 https://www.americanexpress.com/en-us/support/... header \n",
+ "2112 https://www.americanexpress.com/en-us/support/... footer \n",
+ "2120 https://www.mobileprogramming.com/ head \n",
+ "2121 https://www.mobileprogramming.com/ header \n",
+ "2122 https://www.mobileprogramming.com/ footer \n",
+ "\n",
+ " chunk_hash \\\n",
+ "60 2f75ead3d8821183e12c322c1800afe60a175572ca3b3a... \n",
+ "63 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n",
+ "64 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n",
+ "87 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n",
+ "88 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n",
+ "... ... \n",
+ "2111 994f9b66932ae87ff338704924562cf913f73c0f49ae26... \n",
+ "2112 7d4855c2c301923efef11a825dca16e1c6791d6cec96dc... \n",
+ "2120 023ecc39b08c78b255f966b4e7768ab00f491469e096f0... \n",
+ "2121 4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4... \n",
+ "2122 f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945... \n",
+ "\n",
+ " chunk \\\n",
+ "60