diff --git a/dhanush submission/README.md b/dhanush submission/README.md
new file mode 100644
index 0000000..7adebc1
--- /dev/null
+++ b/dhanush submission/README.md
@@ -0,0 +1,50 @@
+# Gradient Works Exercise
+
+## Part-1
+
+1. How many companies are in the dataset?
+There are 75 companies in the dataset.
+2. How many unique URLs are in the dataset?
+There are 530 unique URLs in the dataset. Extracting the scheme-and-host prefix of each URL yields 77 distinct domains, two more than the 75 companies, because two companies each appear under two different domains.
+
+
+3. What is the most common chunk type?
+The most common chunk type is `header` with 549 occurrences.
+
+
+
+4. What is the distribution of chunk types by company?
+Please refer to the Jupyter notebook in the `notebooks` folder for the distribution of chunk types by company.
+
+## Part-2 RAG
+
+### Architecture Diagram
+
+
+
+### Steps to run the code
+1. Create a `.env` file and add your OpenAI and Cohere API keys in this format:
+```
+OPENAI_API_KEY =
+COHERE_API_KEY =
+```
+2. Install the required libraries listed in `requirements.txt`.
+```
+pip install -r requirements.txt
+```
+3. Run `chunking.py` first; it converts the HTML content to text and saves the processed CSV file.
+4. Run `embedding.py` next to generate embeddings and store them as a NumPy file.
+5. The code is also exposed as an API using FastAPI. To start the API server, run the following command inside the `src` folder.
+```
+uvicorn main:app --reload --port 8080
+```
+This will start the API server at http://localhost:8080.
+6. Run `chat.py` next, which opens Streamlit in your browser so you can ask questions about the content of the provided CSV file.
+```
+streamlit run src/chat.py
+```
+### Demo
+
+
+> [!NOTE]
+> The code is also available as a Jupyter notebook in the `notebooks` folder.
diff --git a/dhanush submission/assets/Architecture_diagram.png b/dhanush submission/assets/Architecture_diagram.png
new file mode 100644
index 0000000..fe91cc7
Binary files /dev/null and b/dhanush submission/assets/Architecture_diagram.png differ
diff --git a/dhanush submission/assets/Qn_2.png b/dhanush submission/assets/Qn_2.png
new file mode 100644
index 0000000..823cb8c
Binary files /dev/null and b/dhanush submission/assets/Qn_2.png differ
diff --git a/dhanush submission/assets/Qn_3.png b/dhanush submission/assets/Qn_3.png
new file mode 100644
index 0000000..151d0f7
Binary files /dev/null and b/dhanush submission/assets/Qn_3.png differ
diff --git a/dhanush submission/assets/demo.png b/dhanush submission/assets/demo.png
new file mode 100644
index 0000000..988fc42
Binary files /dev/null and b/dhanush submission/assets/demo.png differ
diff --git a/dhanush submission/notebooks/Dhanush GW.ipynb b/dhanush submission/notebooks/Dhanush GW.ipynb
new file mode 100644
index 0000000..1d1388c
--- /dev/null
+++ b/dhanush submission/notebooks/Dhanush GW.ipynb
@@ -0,0 +1,2652 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "32b4fb60",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from matplotlib import pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "93a352f1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = pd.read_csv('../data/content.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "25e5952c",
+ "metadata": {},
+ "source": [
+ "# Tasks - Part 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "30a8c8bb",
+ "metadata": {},
+ "source": [
+ "Here are some questions that we'd like you to answer about the dataset:\n",
+ "\n",
+ "1. How many companies are in the dataset?\n",
+ "2. How many unique URLs are in the dataset?\n",
+ "3. What is the most common chunk type?\n",
+ "4. What is the distribution of chunk types by company?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a9b154c",
+ "metadata": {},
+ "source": [
+ "## 1. How many companies are in the dataset?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "13945546",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "75"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['company_id']))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "d8870337",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['company_id'])) == len(np.unique(df['company_name']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da4434ca",
+ "metadata": {},
+ "source": [
+ "The dataset contains 75 unique companies. `company_id` and `company_name` each have 75 unique values, which is consistent with a one-to-one correspondence between the two columns and with no duplicate names in `company_name`, assuming `company_id` is unique per company."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f35dd734",
+ "metadata": {},
+ "source": [
+ "## 2. How many unique URLs are in the dataset?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "1d455aac",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "530"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['url']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5cfa3e26",
+ "metadata": {},
+ "source": [
+ "There are 530 unique URLs in this dataset. Let's extract the prefix (scheme and host) of each URL and check whether the number of distinct prefixes matches the number of unique `company_id` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "022127a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "prefix_url = [x.split('/')[0] + '//' + x.split('/')[2] for x in df['url']]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "d7d237a4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df['prefix_url'] = prefix_url"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "65408cba",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "77"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(prefix_url))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "379500a6",
+ "metadata": {},
+ "source": [
+ "There are 77 URL prefixes for 75 companies, so two companies seem to appear under more than one prefix."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "66786ba5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Keep the first row seen for each domain prefix, together with its company_id\n",
+ "first_seen = df.drop_duplicates(subset='prefix_url')\n",
+ "uniq_url = first_seen['prefix_url'].tolist()\n",
+ "company_id = first_seen['company_id'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "de8ff277",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(company_id) == len(uniq_url)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "c5466bbc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "temp_df = pd.DataFrame({'company_id_unique': company_id, 'unique_url': uniq_url})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "cb5893e2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "duplicate_df = temp_df[temp_df.duplicated(subset=['company_id_unique'])]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "f3c9d280",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>company_id_unique</th>\n",
+ "      <th>unique_url</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60</td>\n",
+ "      <td>https://www.bingotech.net</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>54</th>\n",
+ "      <td>a0dffac7-5b73-47bb-8a31-78440a1aef33</td>\n",
+ "      <td>https://4wheeltravels.com</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>56</th>\n",
+ "      <td>a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60</td>\n",
+ "      <td>https://bingotech.net</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>57</th>\n",
+ "      <td>a0dffac7-5b73-47bb-8a31-78440a1aef33</td>\n",
+ "      <td>https://sg2plcpnl0188.prod.sin2.secureserver.n...</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " company_id_unique \\\n",
+ "4 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n",
+ "54 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n",
+ "56 a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60 \n",
+ "57 a0dffac7-5b73-47bb-8a31-78440a1aef33 \n",
+ "\n",
+ " unique_url \n",
+ "4 https://www.bingotech.net \n",
+ "54 https://4wheeltravels.com \n",
+ "56 https://bingotech.net \n",
+ "57 https://sg2plcpnl0188.prod.sin2.secureserver.n... "
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "temp_df[temp_df['company_id_unique'].isin(duplicate_df['company_id_unique'])]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ea31d37",
+ "metadata": {},
+ "source": [
+ "Bingotech duplicate domain names: https://www.bingotech.net and https://bingotech.net\n",
+ "\n",
+ "4wheeltravels duplicate domain names: https://4wheeltravels.com and https://sg2plcpnl0188.prod.sin2.secureserver.n..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea2849b5",
+ "metadata": {},
+ "source": [
+ "## 3. What is the most common chunk type?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "5a26687d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "chunk_type\n",
+ "header 549\n",
+ "main 545\n",
+ "head 530\n",
+ "footer 504\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df['chunk_type'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "75f39608",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "counts = df['chunk_type'].value_counts()\n",
+ "plt.bar(counts.index, counts.values) \n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a92308fc",
+ "metadata": {},
+ "source": [
+ "**Header** is the most common chunk type."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08a0614f",
+ "metadata": {},
+ "source": [
+ "## 4. What is the distribution of chunk types by company?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "9cd5724a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "chunk_type_distribution = df.groupby('company_id')['chunk_type'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "3dfbc7ce",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>company_id</th>\n",
+ "      <th>chunk_type</th>\n",
+ "      <th></th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th rowspan=\"4\" valign=\"top\">041e9ac4-d7eb-499f-8cfb-fa95dca20cd5</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>head</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>header</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>main</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>057293c5-03c8-427d-9c78-d7cfee35e525</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>9</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>...</th>\n",
+ "      <th>...</th>\n",
+ "      <td>...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>f7958fe2-7f18-4da1-be80-1b6ebd68e427</th>\n",
+ "      <th>main</th>\n",
+ "      <td>1</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th rowspan=\"4\" valign=\"top\">f9d892e2-ded2-4ac3-bf2d-bc65c5c58131</th>\n",
+ "      <th>footer</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>head</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>header</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>main</th>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>268 rows × 1 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " count\n",
+ "company_id chunk_type \n",
+ "041e9ac4-d7eb-499f-8cfb-fa95dca20cd5 footer 11\n",
+ " head 11\n",
+ " header 11\n",
+ " main 11\n",
+ "057293c5-03c8-427d-9c78-d7cfee35e525 footer 9\n",
+ "... ...\n",
+ "f7958fe2-7f18-4da1-be80-1b6ebd68e427 main 1\n",
+ "f9d892e2-ded2-4ac3-bf2d-bc65c5c58131 footer 11\n",
+ " head 11\n",
+ " header 11\n",
+ " main 11\n",
+ "\n",
+ "[268 rows x 1 columns]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.DataFrame(chunk_type_distribution)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "558c6f6c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def create_company_dict(group):\n",
+ "    # Map each chunk type to its number of occurrences for one company\n",
+ "    return group['chunk_type'].value_counts().to_dict()\n",
+ "\n",
+ "chunk_type_distribution_dict = {company_id: create_company_dict(group) for company_id, group in df.groupby('company_id')}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "9a4a4013",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'041e9ac4-d7eb-499f-8cfb-fa95dca20cd5': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '057293c5-03c8-427d-9c78-d7cfee35e525': {'head': 9,\n",
+ " 'header': 9,\n",
+ " 'footer': 9,\n",
+ " 'main': 9},\n",
+ " '0a2dd621-fc29-4d92-92af-a9f27d36b88a': {'main': 12,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'head': 11},\n",
+ " '1a944f93-a2ac-4352-9b65-a39c7addcb36': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '1b7e4fac-5467-4773-a330-e9ef080aa00a': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'header': 5},\n",
+ " '1b9af211-c118-45dc-a7a9-c36acd5c3606': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 10},\n",
+ " '1c454ddb-4be8-4c6a-922e-45c877efa542': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '1ccb0060-ed06-40d9-affd-b94b7e5041a1': {'head': 1, 'main': 1},\n",
+ " '1d793954-ad6e-4d64-84d6-965b6d7e6e28': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '1fd25493-2b64-4e45-9286-496d1a2899c1': {'header': 23,\n",
+ " 'main': 11,\n",
+ " 'footer': 10,\n",
+ " 'head': 10},\n",
+ " '21a77c0a-0dae-4d9d-8f16-722c4cb80fa4': {'main': 11,\n",
+ " 'head': 9,\n",
+ " 'header': 9,\n",
+ " 'footer': 9},\n",
+ " '2312afd0-a0b4-41f2-a3f5-20fcaae1cf1d': {'head': 4,\n",
+ " 'header': 4,\n",
+ " 'footer': 4,\n",
+ " 'main': 4},\n",
+ " '234793b5-f217-4692-bad8-6ac7210c4d99': {'head': 1, 'header': 1, 'main': 1},\n",
+ " '2445ce14-bb54-4765-89bf-1f828686cb16': {'main': 7,\n",
+ " 'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1},\n",
+ " '2646717c-561c-4a07-981c-04d60ae26c5d': {'head': 1, 'main': 1},\n",
+ " '27587b6e-5ba1-4c57-96f4-7fc16abac658': {'header': 7, 'head': 5, 'footer': 5},\n",
+ " '2abc40ce-3099-44fc-b5cd-a4abc80ad196': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'main': 11,\n",
+ " 'footer': 10},\n",
+ " '2b6f6bd0-fb3b-40b1-8728-6ee500e861a7': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '311dffd7-84bc-4cb6-8bb9-0a87e2ebe0b6': {'head': 6, 'footer': 6, 'main': 6},\n",
+ " '33203cd4-b2f7-4a60-a949-b53c844f1288': {'head': 1, 'main': 1},\n",
+ " '350207ab-6df5-4624-9e11-1422ffd3d6e7': {'head': 2,\n",
+ " 'footer': 2,\n",
+ " 'main': 2,\n",
+ " 'header': 1},\n",
+ " '38ac3499-f153-42f5-a07e-ff286bb3058e': {'head': 1, 'main': 1},\n",
+ " '3f2be1af-115e-49fd-981c-523d7281c78e': {'header': 2,\n",
+ " 'head': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '47a9fbaa-0c12-442e-b030-78d3ee7b15d3': {'main': 15,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4c1fde18-8a40-4ee7-9c3c-19152c7d1ff8': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4df58b56-4c19-4602-bdcc-a0f9bc0c1c81': {'main': 12,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '4e8e3cf3-79fb-4251-a2e9-970c1ee97068': {'main': 14,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 10},\n",
+ " '4faddb9c-68bf-4bbe-b471-07d9b07fb7fe': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'main': 10,\n",
+ " 'footer': 9},\n",
+ " '506068b9-0b01-4853-b981-850fc0300d34': {'head': 1, 'main': 1},\n",
+ " '54481572-095b-4314-a99c-54c8d7cb3058': {'head': 10,\n",
+ " 'main': 10,\n",
+ " 'header': 8,\n",
+ " 'footer': 6},\n",
+ " '550e7ff3-6393-42f8-92c7-cb273f022445': {'main': 8,\n",
+ " 'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6},\n",
+ " '5a767574-556b-439a-b1d8-89334fbdbaad': {'head': 9, 'header': 9, 'footer': 9},\n",
+ " '5c3a39f0-c5d6-4a24-85c3-0684b1228dce': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '5d23c4f4-1e1d-459b-847d-ea05d6f10056': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '649eac25-762d-4b81-98d2-e3a9a481c4ac': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 10},\n",
+ " '65cb7019-5546-4adf-8294-f03e5c35bf4d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '66d2d1bc-73f9-4df4-855d-12330a2cde05': {'main': 17,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " '69cdaf4f-298d-42db-abef-36c040489a58': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '6f1491c1-aff5-45cf-98aa-2e07c9394f78': {'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11,\n",
+ " 'header': 10},\n",
+ " '7227bcbe-e8b6-4285-8c73-4c9375d9df83': {'head': 7,\n",
+ " 'header': 7,\n",
+ " 'footer': 7,\n",
+ " 'main': 7},\n",
+ " '75c123ba-2e0a-4d3f-a7ec-f737c5f61c2e': {'head': 1, 'main': 1},\n",
+ " '77713668-9b82-4038-ac33-0b1a3c4d4715': {'main': 18,\n",
+ " 'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " '784ba666-e00d-459e-9417-66f8a59f84ec': {'head': 11, 'main': 11},\n",
+ " '7af13eb6-125a-4e01-974b-712ea1a9dea7': {'head': 4,\n",
+ " 'header': 4,\n",
+ " 'footer': 4,\n",
+ " 'main': 4},\n",
+ " '84210586-05e5-4324-9c12-2883697d9937': {'head': 1, 'main': 1},\n",
+ " '8beba93c-0fd6-460b-a531-b8475bbbe6e8': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " '8db9b2e5-d6e6-4d29-ba56-347bb8507b08': {'head': 1, 'main': 1},\n",
+ " '9bf1ccd3-dd15-4fc2-8ee9-1c3ab1cc7a94': {'header': 10,\n",
+ " 'head': 5,\n",
+ " 'footer': 5},\n",
+ " '9c539408-9119-4540-8a04-68a0dc259d2d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " '9c8bf109-c7f6-4db4-ac5b-15838f692628': {'header': 18,\n",
+ " 'head': 9,\n",
+ " 'footer': 9,\n",
+ " 'main': 9},\n",
+ " 'a07fb185-9f97-498c-bcda-a4c21fe27467': {'main': 13,\n",
+ " 'header': 10,\n",
+ " 'footer': 10,\n",
+ " 'head': 10},\n",
+ " 'a0dffac7-5b73-47bb-8a31-78440a1aef33': {'head': 2, 'main': 2},\n",
+ " 'a45e4bd7-0ea7-4c3f-ba6a-cdc5027f6c60': {'head': 2,\n",
+ " 'footer': 2,\n",
+ " 'main': 2,\n",
+ " 'header': 1},\n",
+ " 'aab2261b-7065-460d-83de-99404fb50f65': {'main': 13,\n",
+ " 'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11},\n",
+ " 'acab4569-6d20-4f45-a649-501d998e96fd': {'head': 11,\n",
+ " 'main': 11,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " 'ae11f430-4c8f-4d77-bb02-c2f1c392bd6d': {'head': 1,\n",
+ " 'header': 1,\n",
+ " 'footer': 1,\n",
+ " 'main': 1},\n",
+ " 'b244f8a0-cdee-42a4-988f-8d431d4d0794': {'head': 1, 'main': 1},\n",
+ " 'b2a1f0aa-0578-4d89-8be1-56d86414a1b4': {'header': 22,\n",
+ " 'head': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'b34a4aec-a2ab-46ca-abca-85b07e6646b2': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'b598d098-e179-4355-873b-a0119f658b24': {'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6,\n",
+ " 'main': 6},\n",
+ " 'ba955d6d-d1bd-4e92-834f-c45f820679b6': {'head': 5,\n",
+ " 'header': 5,\n",
+ " 'footer': 5,\n",
+ " 'main': 5},\n",
+ " 'bb0eef13-1271-4f12-992f-209aab01bac3': {'head': 6,\n",
+ " 'header': 6,\n",
+ " 'footer': 6,\n",
+ " 'main': 6},\n",
+ " 'c4ef42c5-511c-4991-bce9-e04c739cbd22': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'c7d9376d-9eff-4ab0-8e35-230697588b93': {'head': 4,\n",
+ " 'main': 4,\n",
+ " 'header': 2,\n",
+ " 'footer': 2},\n",
+ " 'c84b92cc-28a5-4383-845d-237926a5f120': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'cd17faf2-60f5-4a27-b727-9d9337610077': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'main': 11,\n",
+ " 'footer': 10},\n",
+ " 'cd931e32-04fd-4265-8b9f-99943dbd77f4': {'header': 19,\n",
+ " 'head': 10,\n",
+ " 'footer': 10,\n",
+ " 'main': 1},\n",
+ " 'd09d7050-14d4-4bee-b7bc-709a046fa5c0': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'd90cd817-7db0-4394-a5ef-65df2bd96487': {'head': 5,\n",
+ " 'header': 5,\n",
+ " 'footer': 5,\n",
+ " 'main': 5},\n",
+ " 'ed5f1cb6-6ab1-4b61-9412-5c23e6f02cf0': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11},\n",
+ " 'edda975d-a01c-4e93-8245-68183fb19ced': {'head': 10,\n",
+ " 'header': 10,\n",
+ " 'footer': 10},\n",
+ " 'f3004596-8695-4808-8c54-d80cb63c98d0': {'head': 1, 'main': 1},\n",
+ " 'f5b2f14b-1186-4f74-aa14-86ec73da913f': {'footer': 18, 'head': 9, 'main': 9},\n",
+ " 'f7958fe2-7f18-4da1-be80-1b6ebd68e427': {'head': 1, 'footer': 1, 'main': 1},\n",
+ " 'f9d892e2-ded2-4ac3-bf2d-bc65c5c58131': {'head': 11,\n",
+ " 'header': 11,\n",
+ " 'footer': 11,\n",
+ " 'main': 11}}"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "chunk_type_distribution_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ff34159",
+ "metadata": {},
+ "source": [
+ "## General Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7816b4c8",
+ "metadata": {},
+ "source": [
+ "### Analyzing chunk_id\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "c9c57042",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2128"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['chunk_id']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "897874aa",
+ "metadata": {},
+ "source": [
+ "All 2128 values in `chunk_id` are unique, which matches the total number of chunks counted above, so `chunk_id` can serve as the primary key for indexing and lookup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3d0b8ffb",
+ "metadata": {},
+ "source": [
+ "### Analyzing chunk_hash\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "e8233ca8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1337"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(np.unique(df['chunk_hash']))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "e938bcd0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>company_id</th><th>company_name</th><th>url</th><th>chunk_type</th><th>chunk_hash</th><th>chunk</th><th>chunk_id</th><th>prefix_url</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>60</th><td>aab2261b-7065-460d-83de-99404fb50f65</td><td>betahaus Sofia</td><td>https://betahaus.bg/blog</td><td>footer</td><td>2f75ead3d8821183e12c322c1800afe60a175572ca3b3a...</td><td>&lt;footer class=\"footer\"&gt;&lt;h4&gt;Връзка с нас&lt;/h4&gt;ул...</td><td>af3a36a1-fb5a-44b1-b5ae-9b040ca5889c</td><td>https://betahaus.bg</td></tr>\n",
+ "    <tr><th>63</th><td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td><td>The Best Bees Company</td><td>https://bestbees.com/blog/</td><td>header</td><td>2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f...</td><td>&lt;header class=\"fl-builder-content fl-builder-c...</td><td>293eeafb-09f9-4611-8d37-0e5f07dae564</td><td>https://bestbees.com</td></tr>\n",
+ "    <tr><th>64</th><td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td><td>The Best Bees Company</td><td>https://bestbees.com/blog/</td><td>footer</td><td>c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17...</td><td>&lt;footer class=\"fl-builder-content fl-builder-c...</td><td>fe49b6bb-9a6d-4c3a-8b23-40c8071e3cd4</td><td>https://bestbees.com</td></tr>\n",
+ "    <tr><th>87</th><td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td><td>The Best Bees Company</td><td>https://bestbees.com/get-started/</td><td>header</td><td>2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f...</td><td>&lt;header class=\"fl-builder-content fl-builder-c...</td><td>a6851616-8a65-4409-9a63-66a6eb2153cc</td><td>https://bestbees.com</td></tr>\n",
+ "    <tr><th>88</th><td>21a77c0a-0dae-4d9d-8f16-722c4cb80fa4</td><td>The Best Bees Company</td><td>https://bestbees.com/get-started/</td><td>footer</td><td>c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17...</td><td>&lt;footer class=\"fl-builder-content fl-builder-c...</td><td>271d5729-d600-47f6-87e2-5b7aa065f712</td><td>https://bestbees.com</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>2111</th><td>1fd25493-2b64-4e45-9286-496d1a2899c1</td><td>American Express Travel</td><td>https://www.americanexpress.com/en-us/support/...</td><td>header</td><td>994f9b66932ae87ff338704924562cf913f73c0f49ae26...</td><td>&lt;ul&gt;&lt;li&gt;&lt;a href=\"https://www.americanexpress.c...</td><td>ace4dac1-4401-4b5b-b61a-4b922d3a8362</td><td>https://www.americanexpress.com</td></tr>\n",
+ "    <tr><th>2112</th><td>1fd25493-2b64-4e45-9286-496d1a2899c1</td><td>American Express Travel</td><td>https://www.americanexpress.com/en-us/support/...</td><td>footer</td><td>7d4855c2c301923efef11a825dca16e1c6791d6cec96dc...</td><td>&lt;footer class=\"axp-footer__footer__footer___32...</td><td>4ea2b388-7505-4e20-b57e-75eaddb4d242</td><td>https://www.americanexpress.com</td></tr>\n",
+ "    <tr><th>2120</th><td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td><td>Mobile Programming LLC.</td><td>https://www.mobileprogramming.com/</td><td>head</td><td>023ecc39b08c78b255f966b4e7768ab00f491469e096f0...</td><td>&lt;head&gt;&lt;title&gt;App Design &amp; Development | Mo...</td><td>07c06cb2-1d10-46ac-84e3-b0eaa0279fb2</td><td>https://www.mobileprogramming.com</td></tr>\n",
+ "    <tr><th>2121</th><td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td><td>Mobile Programming LLC.</td><td>https://www.mobileprogramming.com/</td><td>header</td><td>4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4...</td><td>&lt;header class=\"top-header\" id=\"myHeader\"&gt;&lt;nav&gt;...</td><td>b9015037-7ef1-4bb1-8607-f63fefb06226</td><td>https://www.mobileprogramming.com</td></tr>\n",
+ "    <tr><th>2122</th><td>0a2dd621-fc29-4d92-92af-a9f27d36b88a</td><td>Mobile Programming LLC.</td><td>https://www.mobileprogramming.com/</td><td>footer</td><td>f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945...</td><td>&lt;footer class=\"container-fluid pd-0 footer-bg\"...</td><td>6afb8c04-87cc-45d7-a7f4-c11df0d88d54</td><td>https://www.mobileprogramming.com</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>791 rows × 8 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " company_id company_name \\\n",
+ "60 aab2261b-7065-460d-83de-99404fb50f65 betahaus Sofia \n",
+ "63 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "64 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "87 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "88 21a77c0a-0dae-4d9d-8f16-722c4cb80fa4 The Best Bees Company \n",
+ "... ... ... \n",
+ "2111 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n",
+ "2112 1fd25493-2b64-4e45-9286-496d1a2899c1 American Express Travel \n",
+ "2120 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "2121 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "2122 0a2dd621-fc29-4d92-92af-a9f27d36b88a Mobile Programming LLC. \n",
+ "\n",
+ " url chunk_type \\\n",
+ "60 https://betahaus.bg/blog footer \n",
+ "63 https://bestbees.com/blog/ header \n",
+ "64 https://bestbees.com/blog/ footer \n",
+ "87 https://bestbees.com/get-started/ header \n",
+ "88 https://bestbees.com/get-started/ footer \n",
+ "... ... ... \n",
+ "2111 https://www.americanexpress.com/en-us/support/... header \n",
+ "2112 https://www.americanexpress.com/en-us/support/... footer \n",
+ "2120 https://www.mobileprogramming.com/ head \n",
+ "2121 https://www.mobileprogramming.com/ header \n",
+ "2122 https://www.mobileprogramming.com/ footer \n",
+ "\n",
+ " chunk_hash \\\n",
+ "60 2f75ead3d8821183e12c322c1800afe60a175572ca3b3a... \n",
+ "63 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n",
+ "64 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n",
+ "87 2a1407969dc3931af069563f58b3ec89bfa300ac6c4b1f... \n",
+ "88 c04c6008e00dc3e53fdea0d3f8db942f8aff0bff155c17... \n",
+ "... ... \n",
+ "2111 994f9b66932ae87ff338704924562cf913f73c0f49ae26... \n",
+ "2112 7d4855c2c301923efef11a825dca16e1c6791d6cec96dc... \n",
+ "2120 023ecc39b08c78b255f966b4e7768ab00f491469e096f0... \n",
+ "2121 4e4d70173788abe5d7e7004f590d78dd9911e775c91aa4... \n",
+ "2122 f40bd3fed28b7926b308ed7e43d0c489df78b5813c5945... \n",
+ "\n",
+ " chunk \\\n",
+ "60