
WebGen-Bench

This repository contains the code for reproducing the results of the paper WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch.

The dataset is also available at 🤗 luzimu/WebGen-Bench and on Kaggle (WebGen-Bench). However, you do not need to download the data from these sources to run this repo, as it is already placed under the data directory.

| Data | HF Links |
| ---- | -------- |
| WebGen-Bench | 🤗 luzimu/WebGen-Bench |
| WebGen-Bench_train_data | 🤗 luzimu/WebGen-Bench_train_data |
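
If you prefer to pull the dataset from the Hub programmatically, a minimal sketch using the datasets library (split names follow the dataset card; the print is just for inspection):

from datasets import load_dataset  # pip install datasets

# download the benchmark from the Hugging Face Hub
ds = load_dataset("luzimu/WebGen-Bench")
print(ds)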

The model weights can be downloaded from:

| Models | HF Links |
| ------ | -------- |
| WebGen-LM-7B | 🤗 luzimu/WebGen-LM-7B |
| WebGen-LM-14B | 🤗 luzimu/WebGen-LM-14B |
| WebGen-LM-32B | 🤗 luzimu/WebGen-LM-32B |
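
If you want to fetch a checkpoint programmatically instead of through the browser, one option is huggingface_hub (the local_dir below is an arbitrary choice):

from huggingface_hub import snapshot_download  # pip install huggingface_hub

# download the full WebGen-LM-7B model repository into a local directory
snapshot_download(repo_id="luzimu/WebGen-LM-7B", local_dir="models/WebGen-LM-7B")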

(The code under src was executed on a Windows 11 system; it should also run on Linux with minor adjustments. The code under src-remote was executed on a Linux server. This README often uses deepseek/deepseek-chat-v3-0324:free as an example model; you can replace it with other models.)

The experiment outputs are placed in outputs.zip, which includes the outputs of the LLM-based agents tested in the paper.

If you wish to deploy Qwen2.5-VL-32B-Instruct yourself for UI agent testing, or you wish to reproduce the training of WebGen-LM, you can download the necessary models using src-remote\download\download.py.

Testing Proprietary and Open-Source Models

Installation

First, install Node.js following the Node.js Download Page.

Then, install pm2:

npm install -g pm2

To install WebVoyager:

conda create -p env\webvoyager python=3.10 -y
conda activate env\webvoyager
cd webvoyager
pip install -r requirements.txt
pip install numpy
# we encountered a minor conflict, which we patched by commenting out proxies=proxies, in "env\webvoyager\lib\site-packages\openai\_base_client.py", line 738

Testing Bolt.diy

Installing and Starting Bolt.diy

First, install our forked version of Bolt.diy. Minor adjustments have been made to it in order to support automatic evaluation.

git clone https://github.com/mnluzimu/bolt.diy-Fork.git
cd bolt.diy-Fork
npm install -g pnpm
pnpm install

Before starting the service, copy the .env.example file and rename it to .env.local. Then configure the API keys and base URLs of the LLM provider of your choice in .env.local. (For example, create an OpenRouter API key at OpenRouter and paste it into OPEN_ROUTER_API_KEY.)

Also, remember to configure VITE_GITHUB_ACCESS_TOKEN with your GitHub access token, which will be used for importing templates.
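
A minimal .env.local might look like the following (the values are placeholders; substitute your own keys):

OPEN_ROUTER_API_KEY=sk-or-v1-your-key-here
VITE_GITHUB_ACCESS_TOKEN=ghp_your-token-here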

Then run:

pnpm run dev

This will output something like:


> bolt@0.0.7 dev D:\research\bolt\opensource\bolt.diy-Fork
> node pre-start.cjs  && remix vite:dev


★═══════════════════════════════════════★
          B O L T . D I Y
         ⚡️  Welcome  ⚡️
★═══════════════════════════════════════★

📍 Current Version Tag: v"0.0.7"
📍 Current Commit Version: "08d88c1"
  Please wait until the URL appears here
★═══════════════════════════════════════★
 warn  Data fetching is changing to a single fetch in React Router v7
┃ You can use the `v3_singleFetch` future flag to opt-in early.
┃ -> https://remix.run/docs/en/2.13.1/start/future-flags#v3_singleFetch
┗
  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose
  ➜  press h + enter to show help

You can get the URL of the Bolt.diy service after Local: (in this case, http://localhost:5173/).

Starting Automatic Testing

You can start automatic testing of Bolt.diy by running the following command, for example:

python src\automatic_bolt_diy\eval_bolt_diy.py ^
    --jsonl_path data\test.jsonl ^
    --url http://localhost:5173/ ^
    --provider OpenRouter ^
    --desired_model deepseek/deepseek-chat-v3-0324:free

This example command outputs the results under downloads\OpenRouter\deepseek-chat-v3-0324_free_test, including .json and .zip files for each test sample. The testing can sometimes be aborted due to connection issues, so it is recommended to use src\automatic_bolt_diy\loop.bat, replacing the command inside the loop to achieve automatic restarting; see the sketch below.
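
A minimal sketch of such a restart loop in batch, assuming the same arguments as the example above (adapt the command to your own run):

@echo off
:loop
REM rerun the evaluation whenever it exits, so transient connection failures do not end the run
python src\automatic_bolt_diy\eval_bolt_diy.py --jsonl_path data\test.jsonl --url http://localhost:5173/ --provider OpenRouter --desired_model deepseek/deepseek-chat-v3-0324:free
goto loop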

Evaluating Generated Websites with a UI Agent

You can deploy Qwen2.5-VL-32B-Instruct on a server with four GPUs using src-remote/deploy/deploy_qwenvl_32b.sh, after installing the dependencies following the commands in src-remote/deploy/install.sh.

Then we use the WebVoyager UI agent to perform test case operations and assess the outcome. Assuming you have installed the env\webvoyager conda environment previously, you can run src\ui_test_bolt\run_ui_eval_with_answer.bat, for example:

src\ui_test_bolt\run_ui_eval_with_answer.bat downloads\OpenRouter\deepseek-chat-v3-0324_free_test

This example command would output the UI agent testing results under downloads\OpenRouter\deepseek-chat-v3-0324_free_test\extracted\results.

Computing the Accuracy

Then you can compute the accuracy, as well as other statistics such as the yes rate, partial rate, and no rate, using src\ui_test_bolt\compute_acc.py. For example:

python src\ui_test_bolt\compute_acc.py downloads\OpenRouter\deepseek-chat-v3-0324_free_test

This example would print the results to the terminal, as well as record them in downloads\OpenRouter\deepseek-chat-v3-0324_free_test\extracted\table.md.
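
For intuition, these rates are simply the shares of test cases with each verdict. A minimal sketch, assuming each test case ends in a single YES / PARTIAL / NO verdict (the verdict format here is an assumption, not the actual output schema of compute_acc.py):

from collections import Counter

def rates(verdicts):
    # verdicts: one "YES" / "PARTIAL" / "NO" string per test case
    counts = Counter(verdicts)
    return {k: counts.get(k, 0) / len(verdicts) for k in ("YES", "PARTIAL", "NO")}

print(rates(["YES", "PARTIAL", "NO", "YES"]))  # {'YES': 0.5, 'PARTIAL': 0.25, 'NO': 0.25}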

Evaluating Appearance Score

Generate the appearance score of the websites using:

python src\grade_appearance_bolt_diy\eval_appearance.py downloads\OpenRouter\deepseek-chat-v3-0324_free_test -t data\test.jsonl

This would generate the screenshot and result.json file under downloads\OpenRouter\deepseek-chat-v3-0324_free_test\extracted\000007\shots (and similarly for the other samples). Then compute the average appearance score using:

python src\grade_appearance_bolt_diy\compute_grade.py downloads\OpenRouter\deepseek-chat-v3-0324_free_test
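
Conceptually, this averages the per-sample grades from the result.json files. A rough sketch (the glob pattern and the "grade" key are assumptions about the result.json layout, not the actual schema):

import glob, json

grades = []
for path in glob.glob(r"downloads\OpenRouter\deepseek-chat-v3-0324_free_test\extracted\*\shots\result.json"):
    with open(path, encoding="utf-8") as f:
        grades.append(json.load(f)["grade"])  # hypothetical key name

print(sum(grades) / len(grades))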

Testing OpenHands

You can test OpenHands using our forked repo OpenHands-WebGen-Fork. You should configure it based on the OpenHands README, then run:

docker pull docker.all-hands.dev/all-hands-ai/runtime:0.25-nikolaik
cd OpenHands-WebGen-Fork
python src/test_webgen-bench/test_webgen_bench.py

Testing Aider

You can test Aider using our forked repo Aider-WebGen-Fork. You should configure it based on the Aider README, then run:

cd .\working_dirs
python ..\src\batch_generate.py

Training WebGen-LM

Data Deduplication and Decontamination

This part is not necessary for reproducing training. It documents our data deduplication and decontamination process, which was conducted to ensure that the training set is not contaminated by the test set. The files for deduplication and decontamination are placed under src-remote\process_train\deduplicate.

pip install sentence-transformers scikit-learn editdistance
python src-remote/process_train/deduplicate/rule_deduplication.py
python src-remote/process_train/deduplicate/decontamination_ngram.py
python src-remote/process_train/deduplicate/test_decontamination_semantic.py --test_file data/test.jsonl --train_file data/train_processed/train_decontaminated_ngram5.jsonl --sim_threshold 0.55 --output_file data/train_processed/train_decontaminated_ngram5_semantic.jsonl --contaminated_file data/train_processed/train_contaminated_ngram5_semantic.jsonl
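
For intuition, the semantic decontamination step works roughly as follows: embed the train and test instructions with a sentence-transformers model and drop any training sample whose maximum cosine similarity to a test sample exceeds the threshold (0.55 in the command above). A minimal sketch; the model choice and example texts are placeholders, not necessarily what the script uses:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
train_texts = ["Build a todo list app", "Create a chess website"]  # stand-ins for train instructions
test_texts = ["Develop a to-do list web application"]              # stand-ins for test instructions

train_emb = model.encode(train_texts, convert_to_tensor=True)
test_emb = model.encode(test_texts, convert_to_tensor=True)

sims = util.cos_sim(train_emb, test_emb)  # (n_train, n_test) cosine similarities
kept = [t for t, row in zip(train_texts, sims) if row.max().item() <= 0.55]
print(kept)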

Generating Training Data

(If you do not need to generate your own data, you can skip this section and directly use the data under data/train_data.)

Data Generation

First, generate training data using:

python src/automatic_bolt_diy/eval_bolt_diy.py ^
    --jsonl_path data/train.jsonl ^
    --url http://localhost:5173/ ^
    --provider OpenRouter ^
    --desired_model deepseek/deepseek-chat-v3-0324:free

Remember to replace http://localhost:5173/ with the actual URL of your Bolt.diy service. This will output data under downloads\OpenRouter\deepseek-chat-v3-0324_free_train.

Data Filtering

Then, filter the data by generating the appearance score of each website using:

python src\grade_appearance_bolt_diy\eval_appearance.py downloads\OpenRouter\deepseek-chat-v3-0324_free_train -t data\train.jsonl

This would generate the screenshot and result.json file under downloads\OpenRouter\deepseek-chat-v3-0324_free_train\extracted\000007\shots. Then extract the filtered files using:

python src\grade_appearance_bolt_diy\filter_based_on_result.py downloads\OpenRouter\deepseek-chat-v3-0324_free_train

This would copy the filtered files under downloads\OpenRouter\deepseek-chat-v3-0324_free_train\deepseek-chat-v3-0324_free_train_filtered.
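
The filtering logic is essentially: keep a sample if its appearance grade clears a threshold, then copy its files over. A minimal sketch, assuming a per-sample result.json with a numeric grade (the threshold, key name, and layout are assumptions, not the script's actual parameters):

import glob, json, os, shutil

THRESHOLD = 3  # hypothetical cut-off
src_root = r"downloads\OpenRouter\deepseek-chat-v3-0324_free_train\extracted"
dst_root = r"downloads\OpenRouter\deepseek-chat-v3-0324_free_train\deepseek-chat-v3-0324_free_train_filtered"

for result in glob.glob(os.path.join(src_root, "*", "shots", "result.json")):
    with open(result, encoding="utf-8") as f:
        if json.load(f)["grade"] >= THRESHOLD:  # hypothetical key name
            sample_dir = os.path.dirname(os.path.dirname(result))
            shutil.copytree(sample_dir, os.path.join(dst_root, os.path.basename(sample_dir)), dirs_exist_ok=True)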

Converting to Training File Format

(We uploaded downloads\OpenRouter\deepseek-chat-v3-0324_free_train\deepseek-chat-v3-0324_free_train_filtered to a Linux server and executed the following files there.)

Convert the filtered files into the training format by running:

python src-remote/process_train/process_for_train/get_train.py 
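
The output is a JSONL training file, one record per line in a chat-style messages format; a sketch of the general idea (the exact schema is defined by get_train.py; the field names and contents here are illustrative):

import json

# hypothetical example record: an instruction paired with the generation trajectory
record = {
    "messages": [
        {"role": "user", "content": "Create a personal blog website ..."},
        {"role": "assistant", "content": "<filtered Bolt.diy trajectory / generated code>"},
    ]
}

with open("data/train_data/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")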

Finetuning

Installation

Create a conda environment:

conda create -p env/trainenv python=3.10
conda activate env/trainenv

First, install PyTorch from the PyTorch official website based on your CUDA version. Then install the other dependencies:

pip install -r requirements.txt
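
For the PyTorch step, a typical install command looks like the following (cu121 is just an example; pick the index URL matching your CUDA version from the PyTorch website):

pip install torch --index-url https://download.pytorch.org/whl/cu121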

Training

The training scripts are under src-remote/train. You likely need to modify the files based on your own cluster before running:

bash src-remote/train/train_WebGen-LM-7B.sh
bash src-remote/train/train_WebGen-LM-14B.sh
bash src-remote/train/train_WebGen-LM-32B.sh
bash src-remote/train/train_Qwen2_5-Coder-32B-Instruct_ablation_150samples.sh
bash src-remote/train/train_Qwen2_5-Coder-32B-Instruct_ablation_300samples.sh

Evaluation

This is the same as Testing Bolt.diy. The only difference is that you should host the trained model yourself using vLLM, as in src-remote/deploy/deploy_coder.sh:

vllm serve models/Qwen2_5-Coder-32B-Instruct_app-bench_train_batch13_filtered_decontaminated_new \
    --dtype auto \
    --host 0.0.0.0 \
    --port 8000 \
    --pipeline-parallel-size 1 \
    --tensor-parallel-size 4 \
    --cpu-offload-gb 0
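
Once the server is up, you can sanity-check the OpenAI-compatible endpoint before wiring it into Bolt.diy (replace IP_ADDRESS with your server's address):

curl http://IP_ADDRESS:8000/v1/models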

You should also configure the .env.local file in bolt.diy-Fork by setting OPENAI_LIKE_API_BASE_URL to http://IP_ADDRESS:PORT/v1. Then you can start inference by running:

python src\automatic_bolt_diy\eval_bolt_diy.py ^
    --jsonl_path data\test.jsonl ^
    --url http://localhost:5173/ ^
    --provider OpenAILike ^
    --desired_model Qwen2_5-Coder-32B-Instruct_app-bench_train_batch13_filtered_decontaminated_new

Everything after this point is the same as in Testing Bolt.diy.

Citation

If you find our project useful, please cite:

@misc{lu2025webgenbenchevaluatingllmsgenerating,
      title={WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch}, 
      author={Zimu Lu and Yunqiao Yang and Houxing Ren and Haotian Hou and Han Xiao and Ke Wang and Weikang Shi and Aojun Zhou and Mingjie Zhan and Hongsheng Li},
      year={2025},
      eprint={2505.03733},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03733}, 
}
