
allisonwang-db
Contributor

@allisonwang-db allisonwang-db commented Sep 22, 2025

What changes were proposed in this pull request?

This PR adds an initial script that generates an llms.txt file for the main Spark documentation website.
Note that the API Docs entries should point to their own llms.txt files once those are available.
Here is the llms.txt file currently generated by this script:

# Apache Spark

> Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Documentation home: https://spark.apache.org/docs/latest/

## Programming Guides

- [Quick Start](https://spark.apache.org/docs/latest/quick-start.html)
- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [Spark SQL, Datasets, and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
- [MLlib](https://spark.apache.org/docs/latest/ml-guide.html)
- [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html)
- [SparkR](https://spark.apache.org/docs/latest/sparkr.html)
- [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)
- [Spark SQL CLI](https://spark.apache.org/docs/latest/sql-distributed-sql-engine-spark-sql-cli.html)

## API Docs

- [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html)
- [Spark Scala API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html)
- [Spark Java API](https://spark.apache.org/docs/latest/api/java/index.html)
- [Spark R API](https://spark.apache.org/docs/latest/api/R/index.html)
- [Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)

## Deployment Guides

- [Cluster Overview](https://spark.apache.org/docs/latest/cluster-overview.html)
- [Submitting Applications](https://spark.apache.org/docs/latest/submitting-applications.html)
- [Standalone Deploy Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
- [YARN](https://spark.apache.org/docs/latest/running-on-yarn.html)
- [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)

## Other Documents

- [Configuration](https://spark.apache.org/docs/latest/configuration.html)
- [Monitoring](https://spark.apache.org/docs/latest/monitoring.html)
- [Web UI](https://spark.apache.org/docs/latest/web-ui.html)
- [Tuning Guide](https://spark.apache.org/docs/latest/tuning.html)
- [Job Scheduling](https://spark.apache.org/docs/latest/job-scheduling.html)
- [Security](https://spark.apache.org/docs/latest/security.html)
- [Hardware Provisioning](https://spark.apache.org/docs/latest/hardware-provisioning.html)
- [Cloud Infrastructures](https://spark.apache.org/docs/latest/cloud-integration.html)
- [Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html)
- [Building Spark](https://spark.apache.org/docs/latest/building-spark.html)

## External Resources

- [Apache Spark Home](https://spark.apache.org/)
- [Downloads](https://spark.apache.org/downloads.html)
- [GitHub Repository](https://github.com/apache/spark)
- [Issue Tracker (JIRA)](https://issues.apache.org/jira/projects/SPARK)
- [Mailing Lists](https://spark.apache.org/mailing-lists.html)
- [Community](https://spark.apache.org/community.html)
- [Contributing](https://spark.apache.org/contributing.html)
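
For illustration only, here is a minimal sketch of how a generator for a file shaped like the one above might look. It is not the script added in this PR: the names `BASE_URL`, `SECTIONS`, and `render_llms_txt` are hypothetical, the section table is a small hand-picked subset of the links listed above, and the choice of Python is an assumption.

```python
from typing import Dict, List, Tuple

# Assumption: every generated link hangs off the "latest" docs root shown above.
BASE_URL = "https://spark.apache.org/docs/latest"

# Hypothetical subset of the sections above: heading -> [(link title, relative path)].
SECTIONS: Dict[str, List[Tuple[str, str]]] = {
    "Programming Guides": [
        ("Quick Start", "quick-start.html"),
        ("RDD Programming Guide", "rdd-programming-guide.html"),
    ],
    "Deployment Guides": [
        ("Cluster Overview", "cluster-overview.html"),
        ("YARN", "running-on-yarn.html"),
    ],
}


def render_llms_txt(
    title: str,
    summary: str,
    sections: Dict[str, List[Tuple[str, str]]],
    base_url: str = BASE_URL,
) -> str:
    """Render a title, a blockquote summary, and one link list per section."""
    lines = [
        f"# {title}",
        "",
        f"> {summary}",
        "",
        f"Documentation home: {base_url}/",
        "",
    ]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, path in pages:
            lines.append(f"- [{name}]({base_url}/{path})")
        lines.append("")
    # Collapse trailing blank lines to a single final newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(render_llms_txt("Apache Spark", "A unified analytics engine.", SECTIONS), end="")
```

Keeping the section table as plain data makes it clear which parts of the output are fixed and which could instead be derived from the docs build (for example, the version segment of the URL).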

Why are the changes needed?

To improve the documentation.

Does this PR introduce any user-facing change?

No. This PR alone will not add the newly generated llms.txt files to the website.

How was this patch tested?

Manually, by running the script locally.

Was this patch authored or co-authored using generative AI tooling?

No

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM if it works

@cloud-fan
Contributor

hmm which part is dynamic? Can this be a static file?
