This project provides a suite of Python scripts to automate the extraction, transformation, and integration of documentation pages from a DeepWiki instance into an Astro/Starlight based technical documentation website. The pipeline handles content ingestion, page generation, link resolution, and several post-processing steps to ensure the documentation is clean and well-formatted.
The primary goal of this pipeline is to maintain an Astro/Starlight documentation site that mirrors content from a specified DeepWiki. It facilitates:
- Initial Migration: Transferring all relevant DeepWiki pages to the Astro/Starlight site.
- Content Updates: Pulling the latest changes from DeepWiki and updating the corresponding Astro/Starlight pages.
The pipeline is designed to:
- Scrape navigation and content from DeepWiki.
- Convert DeepWiki content structure into Starlight-compatible
.mdx
files. - Resolve internal links and source code links (pointing to GitHub).
- Automate the placement of Astro components for features like collapsible asides and source links.
- Perform cleanup tasks like removing redundant headers and fixing misplaced import statements.
Before running any scripts, ensure you have the following set up:
- Python: Python 3.7+ is recommended.
- Dependencies: Install necessary Python packages. Common libraries used are
selenium
andbeautifulsoup4
(though BeautifulSoup is optional inrun_cdi.py
). You can typically install these using pip:pip install selenium beautifulsoup4
- WebDriver:
- ChromeDriver: Download the ChromeDriver executable that matches your Chrome/Brave browser version. The path to this executable must be correctly set in
config.py
. - Brave Browser (Optional): If you intend to use Brave Browser, ensure its executable path is also correctly set in
config.py
. Otherwise, the scripts will default to Chrome/Chromium.
- ChromeDriver: Download the ChromeDriver executable that matches your Chrome/Brave browser version. The path to this executable must be correctly set in
- Project Structure: The scripts assume a certain project structure, particularly for
config.py
to correctly determineWORKSPACE_BASE
if derived automatically. - Astro/Starlight Project: An existing Astro project with Starlight integration set up. The
TARGET_DOCS_DIR
inconfig.py
should point to thesrc/content/docs/
directory (or equivalent) of this Astro project.
The config.py
file is central to the pipeline. It contains all necessary paths, URLs, and settings for the scripts. You must review and update this file before running the pipeline.
Key configuration variables include:
WORKSPACE_BASE
: The absolute path to your workspace. Can be set directly or derived ifconfig.py
is in the workspace root.BASE_DEEPWIKI_URL
: The base URL of the DeepWiki instance to scrape (e.g.,"https://deepwiki.com/tplr-ai/templar"
).ROOT_DEEPWIKI_PAGE_HREF
: The DeepWiki page HREF that will become the mainindex.mdx
for your documentation (e.g.,"/tplr-ai/templar/1-overview"
).GITHUB_REPO_URL
: The base URL of your GitHub repository for resolving source links (e.g.,"https://github.com/tplr-ai/templar"
).GITHUB_REF
: The branch, commit hash, or tag to use for GitHub source links (e.g.,"bb2fc2a9"
,"main"
).INGESTED_DATA_JSON_PATH
: Path where the intermediate JSON data (scraped from DeepWiki) will be saved. Defaults toingested_deepwiki_data.json
in the workspace base.TARGET_DOCS_DIR
: Crucial setting. The absolute path to your Astro project's content directory where.mdx
files will be generated (e.g.,"/Users/monkey/docs.tplr.ai/src/content/docs"
).CHROMEDRIVER_EXECUTABLE_PATH
: The absolute path to your ChromeDriver executable (e.g.,"/Users/monkey/chromedriver-mac-arm64/chromedriver"
).BRAVE_EXECUTABLE_PATH
(Optional): The absolute path to your Brave browser executable if you prefer it over Chrome. Set toNone
or an empty string to use default Chrome/Chromium.MIN_PAGE_LEN_HEURISTIC
: Minimum character length for a script payload to be considered a full page during content extraction.FILE_MAPPING_OVERRIDES
(Optional): Allows manual overrides for page slugs, categories, or titles if the automated generation isn't suitable for specific DeepWiki pages.
Ensure all paths are correct for your local environment.
The pipeline consists of several Python scripts that must be run in a specific order to update your Astro/Starlight documentation from DeepWiki.
Order of Execution:
run_cdi.py
(Comprehensive DeepWiki Ingestor)run_spbcp.py
(Starlight Page Builder & Component Placer)run_remove_redundant_h1s.py
(Redundant H1 Remover)run_fix_misplaced_imports.py
(Misplaced Imports Fixer)run_convert_internal_anchors.py
(Internal Anchor Converter)run_fix_source_links.py
(Source Link Fixer)
- Ensure all paths and URLs in
config.py
are correctly set up for your environment and the target DeepWiki/GitHub repository.
- Script:
run_cdi.py
- Purpose: This script uses Selenium to navigate the DeepWiki site, scrape the navigation structure (sitemap), and extract the Markdown content from each page. It also performs initial link processing and identifies Mermaid diagrams and "Relevant source files" sections.
- Output: Saves all extracted data into a structured JSON file specified by
INGESTED_DATA_JSON_PATH
(e.g.,ingested_deepwiki_data.json
). - How to run:
python run_cdi.py
- Script:
run_spbcp.py
- Purpose: Reads the
ingested_deepwiki_data.json
file generated in the previous step. It then constructs the.mdx
files for each documentation page, placing them into theTARGET_DOCS_DIR
. This script handles frontmatter generation, imports necessary Astro components (CollapsibleAside.astro
,SourceLink.astro
), replaces DeepWiki-specific link formats with Starlight/Astro compatible ones, and embeds components for source links. - Input:
ingested_deepwiki_data.json
- Output: Generates
.mdx
files in theTARGET_DOCS_DIR
. - How to run:
python run_spbcp.py
- Script:
run_remove_redundant_h1s.py
- Purpose: Starlight typically uses the frontmatter
title
to generate the main H1 header for a page. If the Markdown content itself starts with an H1 that is identical to the frontmatter title, this script removes that redundant H1 from the body of the.mdx
file. - Input:
.mdx
files inTARGET_DOCS_DIR
. - Output: Modifies
.mdx
files in place. - How to run:
python run_remove_redundant_h1s.py
- Script:
run_fix_misplaced_imports.py
- Purpose: Sometimes, import statements for Astro components might accidentally be placed within the frontmatter block by earlier scripts. This script identifies such misplaced imports and moves them to the correct position immediately after the frontmatter.
- Input:
.mdx
files inTARGET_DOCS_DIR
. - Output: Modifies
.mdx
files in place. - How to run:
python run_fix_misplaced_imports.py
- Script:
run_convert_internal_anchors.py
- Purpose: This script processes specific types of internal anchor links. If a link like
[Link Text](#some-anchor)
was intended to point to an anchor on a different page (where "Link Text" matches the title of that other page), this script converts it to the full Astro path like[Link Text](/path/to/other-page#some-anchor)
. It uses theingested_deepwiki_data.json
file to map titles to Astro paths. - Inputs:
.mdx
files inTARGET_DOCS_DIR
.ingested_deepwiki_data.json
(for title-to-path mapping).
- Output: Modifies
.mdx
files in place. - How to run:
python run_convert_internal_anchors.py
- Script:
run_fix_source_links.py
- Purpose: This script ensures that Markdown links intended to be source code references (e.g.,
[path/to/file.py:10-20]()
) are correctly converted into<SourceLink>
Astro components, pointing to the appropriate GitHub URL (constructed usingGITHUB_BLOB_URL_PREFIX
andGITHUB_REF
fromconfig.py
). It handles links that might have been missed or improperly formatted byrun_spbcp.py
. - Input:
.mdx
files inTARGET_DOCS_DIR
. - Output: Modifies
.mdx
files in place. - How to run:
python run_fix_source_links.py
After completing these steps, the .mdx
files in your src/content/docs/
directory should be updated with the latest content from DeepWiki, properly formatted and linked for your Astro/Starlight site.
To automate the entire update pipeline, a GitHub Actions workflow is provided.
Workflow File:
.github/workflows/manual-docs-update-apt-get.yml
Features:
- Manual Trigger: Can be run manually from the "Actions" tab of your GitHub repository.
- Customizable Inputs: Allows optional inputs for the commit message and the target branch when running the workflow.
- Automated Environment Setup:
- Sets up Python 3.10.
- Installs dependencies like Selenium and BeautifulSoup4.
- Installs Google Chrome and the correct ChromeDriver version strictly via
apt-get
within the runner environment.
- Dynamic Configuration: Generates a
config.py
file at runtime fromconfig.template.py
(expected atpublic/scripts/pipeline/config.template.py
), automatically populating the detected ChromeDriver path and the absolute path to the documentation directory (e.g.,src/content/docs
) within the repository. - Full Pipeline Execution: Runs all six Python pipeline scripts in the correct order.
- Automatic Commits & Push: If changes are detected in the documentation directory (e.g.,
src/content/docs
) or theingested_deepwiki_data.json
file, the workflow automatically commits these changes and pushes them to the specified (or current) branch.
Prerequisites for Using the Workflow:
config.template.py
: This file must exist, for example, atpublic/scripts/pipeline/config.template.py
. It needs to contain the placeholders__CHROMEDRIVER_PATH_PLACEHOLDER__
and__TARGET_DOCS_DIR_PLACEHOLDER__
. Other configurations within this template should be appropriate for the CI environment or managed via GitHub Secrets.- Workflow Configuration: The workflow YAML file itself contains an environment variable (
TARGET_DOCS_PATH_IN_REPO
) that specifies the relative path to your Starlight documentation directory (e.g.,src/content/docs
). Ensure this path is correct in the workflow file. - Repository Permissions: The workflow requires
contents: write
permission to commit and push changes. This is usually configured by default for actions running on the repository.
How to Use the Workflow:
- Navigate to the Actions tab in your GitHub repository.
- In the left sidebar, find and select the workflow named "Manual Docs Update from DeepWiki (apt-get)".
- Click the Run workflow button (usually on the right side).
- Optionally, you can customize the "Commit message for documentation updates" and "Branch to commit updates to".
- Click the green Run workflow button to start the process.
- You can monitor the progress and logs of the workflow run in the Actions tab.
Once the pipeline has generated or updated the .mdx
files in src/content/docs/
, you can use the standard Astro commands to build, preview, and develop your documentation site.
All commands are run from the root of your Astro project, from a terminal:
Command | Action |
---|---|
npm install |
Installs dependencies |
npm run dev |
Starts local dev server at localhost:4321 |
npm run build |
Build your production site to ./dist/ |
npm run preview |
Preview your build locally, before deploying |
npm run astro ... |
Run CLI commands like astro add , astro check |
npm run astro -- --help |
Get help using the Astro CLI |
A typical Astro + Starlight project includes:
.
βββ public/ # Static assets (favicons, etc.)
βββ src/
β βββ assets/ # Images, etc., embeddable in Markdown
β βββ components/ # Custom Astro components (e.g., CollapsibleAside.astro, SourceLink.astro)
β βββ content/
β β βββ docs/ # Your .mdx documentation files (managed by this pipeline)
β β βββ config.ts # Starlight content collection configuration
β βββ env.d.ts # TypeScript environment declarations
βββ astro.config.mjs # Astro configuration file
βββ package.json
βββ tsconfig.json
- .mdx files: Starlight looks for
.md
or.mdx
files in thesrc/content/docs/
directory. These are generated and updated by this pipeline. - Images: Can be added to
src/assets/
and embedded in Markdown with a relative link. - Custom Components: Ensure that any Astro components referenced by the pipeline (e.g.,
CollapsibleAside.astro
,SourceLink.astro
) exist in your Astro project, typically undersrc/components/
. The paths in the generated.mdx
files (e.g.,@components/CollapsibleAside.astro
) assume appropriate Astro path aliases are configured in yourtsconfig.json
orjsconfig.json
.
- Selenium Errors: Most Selenium errors stem from an incorrect ChromeDriver path or version mismatch with your browser. Double-check
CHROMEDRIVER_EXECUTABLE_PATH
inconfig.py
. - DeepWiki Structure Changes: If the HTML structure of DeepWiki changes significantly, the scraping logic in
run_cdi.py
(especially CSS selectors) might need adjustments. - Content Extraction Issues: The
MIN_PAGE_LEN_HEURISTIC
inconfig.py
helps filter out irrelevant script payloads. If valid content is missed or irrelevant snippets are included, this value might need tuning. - Link Resolution: The accuracy of resolved GitHub links depends heavily on the
GITHUB_REPO_URL
andGITHUB_REF
settings inconfig.py
. - File Paths in
config.py
: Use absolute paths forTARGET_DOCS_DIR
,CHROMEDRIVER_EXECUTABLE_PATH
, andBRAVE_EXECUTABLE_PATH
to avoid ambiguity. - Astro Path Aliases: Ensure that path aliases like
@components/
(used for importingCollapsibleAside.astro
andSourceLink.astro
) are correctly configured in your Astro project'stsconfig.json
orjsconfig.json
. Example:// tsconfig.json or jsconfig.json { "compilerOptions": { "baseUrl": ".", "paths": { "@components/*": ["src/components/*"], "@assets/*": ["src/assets/*"] // Add other aliases as needed } } }
Check out Starlightβs docs, read the Astro documentation, or jump into the Astro Discord server.