Research
3 July 2025

DOJO-INTERFACE-CODER-7B: User Interface Generation Model Trained On High Quality Synthetic Data Curated Using Distributed Human Feedback

We are thrilled to release DOJO-INTERFACE-CODER-7B, a first-of-its-kind Large Language Model (LLM) specialized in generating complex, interactive, and visually appealing frontend interfaces.


DOJO-INTERFACE-CODER-7B is trained on high-quality synthetic data generated by state-of-the-art AI models. Data quality is further ensured using code verifiers, LLM-as-judge filtering, and distributed human feedback.

Leveraging Dojo's distributed human feedback infrastructure, we curated two datasets:

  • Dojo-SFT: A comprehensive dataset for supervised fine-tuning (SFT), filtered using LLM-as-judge.
  • Dojo-DPO: A preference dataset for Direct Preference Optimization (DPO), curated using human feedback scores to align the model's outputs with human aesthetic and functional preferences. A sketch of both record formats follows this list.
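For clarity, the two record formats look roughly as follows. This is a minimal sketch; the field names are illustrative rather than the published schema.

from dataclasses import dataclass

@dataclass
class DojoSFTExample:
    """One supervised fine-tuning pair: an instruction and a complete frontend implementation."""
    instruction: str
    completion: str  # full HTML/CSS/JavaScript output

@dataclass
class DojoDPOExample:
    """One preference triplet used for Direct Preference Optimization."""
    instruction: str
    chosen: str    # completion with the higher human feedback score
    rejected: str  # completion with the lower human feedback score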

Our development process followed a two-stage post-training methodology. We began with the powerful Qwen2.5-Coder-7B-Instruct as our base model. This foundation was then elevated through a supervised fine-tuning phase with Dojo-SFT, resulting in DOJO-INTERFACE-CODER-7B-SFT, followed by a direct preference optimization stage using Dojo-DPO. This produced the final, highly specialized DOJO-INTERFACE-CODER-7B.

DOJO-INTERFACE-CODER-7B is capable of generating functional and visually appealing frontend interfaces, far exceeding the interface generation capabilities of its base model. Beyond its primary use case, the model also generalizes well to benchmarks outside its training domain, such as MMLU, GSM8K, and HumanEval.

Dojo Network

Dojo Network is an open, distributed crowdsourcing platform for human-generated datasets. It leverages Bittensor's blockchain-based incentives to reward participants for providing high-quality preference data. Validators are also rewarded for verifying that submissions adhere to the standards defined by the validation mechanism. This allows for the creation of high-quality, human-aligned datasets that can be used to train and align AI models.

Training Objective and Summary

The primary objective is to train a 7B-parameter language model that is highly capable of generating complex frontend interfaces that are functional, visually appealing, and strongly adherent to instructions, using high-quality data grounded in human feedback.

Currently, models of similar size perform poorly on interface generation tasks. Qwen2.5-Coder-7B-Instruct, while strong in other coding tasks, struggles with generating complex interfaces in two respects. First, the model has difficulty following complex interface generation instructions and producing working JavaScript code, resulting in poor interactivity and overall satisfaction with its output. A more critical limitation, however, is its tendency to generate only a skeleton of the HTML, CSS, and especially JavaScript code, leaving users to implement the rest.

Our SFT stage steers the model towards generating complete code instead of a skeleton, and doing so in a consistent, highly structured fashion. To a lesser extent, the SFT model is also better at instruction following. Nevertheless, there was still significant room for improvement in instruction following, aesthetics, and the functionality of interactive elements.

Our DPO stage further improves the interface generation capabilities of our SFT model. Using a DPO dataset grounded in Dojo's distributed human feedback, we trained a model that addresses the aforementioned issues of aesthetics and interactivity. Most importantly, training on human feedback data means that the model learns to generate outputs that closely align with what end users want.
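Concretely, the configuration in the Appendix (pref_loss: sigmoid, pref_beta: 0.1) corresponds to the standard DPO objective, shown here for reference with \pi_\theta the policy being trained and \pi_{\mathrm{ref}} the frozen SFT model:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right], \qquad \beta = 0.1,
\]

where (x, y_w, y_l) are the <instruction, chosen, rejected> triplets from Dojo-DPO and \sigma is the logistic sigmoid.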

A slightly surprising auxiliary benefit of the DPO stage is its ability to recover the general capabilities lost to catastrophic forgetting during the SFT stage. Thus, DPO not only improves interface generation performance, but also performance in other domains such as Python programming (HumanEval, MBPP), grade-school math (GSM8K), and general language understanding (MMLU).

Dataset

Our models are trained entirely on high-quality synthetic data generated using Claude 3.5 Sonnet. Below is an outline of how we curate our SFT and DPO datasets through our synthetic data generation and distributed human feedback pipeline:

  1. Given an instruction, we first generate the highest-quality completion. We then repeatedly prompt the model to generate a new completion that is slightly worse than the previous one, collecting four completions of varying quality per instruction.
  2. For each instruction, we gather the top two completions (based on synthetic ground truth) and create separate <instruction, completion> pairs. We do this for all instructions to create our SFT dataset (Dojo-SFT).
  3. Leveraging the Dojo Network, we distribute instructions and their corresponding completions to a distributed network of miners (essentially labellers), who score the quality of the frontend interface generated from each completion on a Likert scale from 1 to 10.
  4. Using the scores as a proxy for preference ranking, we select the top two completions by human feedback ground truth score to create <instruction, chosen, rejected> triplets, where the chosen response is the higher-scoring completion of the two. This yields our DPO dataset (Dojo-DPO). A code sketch of this selection logic follows the list.
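As a concrete illustration of steps 2-4, the sketch below assembles the two datasets from scored completions. It is a minimal sketch: the field names (instruction, completions, synthetic_rank, human_score) are illustrative and not the actual schema of our internal pipeline.

# Minimal sketch of steps 2-4: assembling Dojo-SFT pairs and Dojo-DPO triplets.
# Field names are illustrative, not the actual internal schema.

def build_sft_pairs(records):
    """Step 2: keep the top two completions per instruction, ranked by synthetic ground truth."""
    pairs = []
    for rec in records:
        # Lower synthetic_rank = higher quality (rank 0 is the "highest quality" completion).
        top_two = sorted(rec["completions"], key=lambda c: c["synthetic_rank"])[:2]
        for comp in top_two:
            pairs.append({"instruction": rec["instruction"], "completion": comp["code"]})
    return pairs

def build_dpo_triplets(records):
    """Steps 3-4: rank completions by human Likert scores (1-10) from Dojo miners."""
    triplets = []
    for rec in records:
        # Higher human_score = better interface, as judged by the distributed labellers.
        by_score = sorted(rec["completions"], key=lambda c: c["human_score"], reverse=True)
        chosen, rejected = by_score[0], by_score[1]
        triplets.append({
            "instruction": rec["instruction"],
            "chosen": chosen["code"],
            "rejected": rejected["code"],
        })
    return triplets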

To mitigate catastrophic forgetting during the SFT stage, we interleaved Dojo-SFT with an OpenHermes 2.5 subset at a 1:1 ratio (a mixing sketch follows the list below). A summary of the three datasets is as follows:

  1. Dojo-SFT: 25,000 rows, curated using synthetic ground truth
  2. Dojo-DPO: 12,500 rows, curated using human feedback ground truth
  3. OpenHermes 2.5 subset: 25,000 rows, with all JavaScript-related rows filtered out so that improvements in generating interactive JavaScript elements can be attributed to our Dojo datasets
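A minimal sketch of the mixing step, assuming the datasets are available as Hugging Face datasets objects; the JavaScript filter shown here is a simple keyword heuristic and an assumption rather than our exact filtering rule, and the Dojo-SFT file path is hypothetical.

# Minimal sketch of the 1:1 SFT data mix, assuming Hugging Face `datasets` objects.
# The keyword-based JavaScript filter is an illustrative heuristic, not our exact rule.
from datasets import load_dataset, concatenate_datasets

dojo_sft = load_dataset("json", data_files="dojo_sft.jsonl", split="train")  # hypothetical path
openhermes = load_dataset("teknium/OpenHermes-2.5", split="train")

def not_javascript(example):
    # Assumes the ShareGPT-style `conversations` field used by OpenHermes 2.5.
    text = (example["conversations"][0]["value"] + example["conversations"][-1]["value"]).lower()
    return "javascript" not in text and "js" not in text.split()

openhermes_no_js = openhermes.filter(not_javascript)

# Match the Dojo-SFT size for a 1:1 ratio, then shuffle the combined mixture.
openhermes_subset = openhermes_no_js.shuffle(seed=42).select(range(len(dojo_sft)))
mixed = concatenate_datasets([dojo_sft, openhermes_subset]).shuffle(seed=42)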

Training Details

For all training runs, we used Distributed Data Parallel (DDP) on a single 4xA100 node, with the LLaMA-Factory framework.

Hyperparameters were selected based on a mix of prior knowledge of scaling laws and empirical results. For compute efficiency, we used Low-Rank Adaptation (LoRA) for both the SFT and DPO stages, with a LoRA rank of 16 for SFT and 256 for DPO.
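For illustration, the two adapter configurations correspond roughly to the following peft setup. Training itself was run through LLaMA-Factory (see the Appendix); the SFT lora_alpha shown here is an assumed 2x-rank convention, since it is not specified in the Appendix, and target_modules="all-linear" is used as a rough stand-in for lora_target: all.

# Rough peft equivalents of the two LoRA setups (training was run via LLaMA-Factory).
from peft import LoraConfig

# SFT adapter: rank 16. lora_alpha is not given in the Appendix; 2x rank is an assumed convention.
sft_lora = LoraConfig(
    r=16,
    lora_alpha=32,                # assumption: not specified in the Appendix
    target_modules="all-linear",  # mirrors `lora_target: all`
    task_type="CAUSAL_LM",
)

# DPO adapter: rank 256 with alpha 512, matching the Appendix config.
dpo_lora = LoraConfig(
    r=256,
    lora_alpha=512,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)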

For more information on hyperparameters, please refer to the Appendix section.

Performance (Interface Generation)

We compare the performance on interface generation tasks across three models: the Base model (Qwen2.5 Coder 7B Instruct), the SFT model, and the DPO model.

In summary, SFT eliminates the Base model's intrinsic refusal to generate complete code and yields a more consistent structure in its outputs. Much of the learning with regard to generating high-quality interfaces, however, takes place in the DPO stage.

Base vs SFT

  • The Base model exhibits an intrinsic refusal to generate the full code, often leaving comments instructing users to implement parts of the code themselves. This refusal persists regardless of the prompting technique used.
  • The SFT model eliminates the issue entirely: all of its outputs are complete and correctly structured. As a result, it performed better in our interface evaluation.
  • However, the majority of the improvement can be attributed to solving the Base model's intrinsic refusal problem. In terms of interactivity and visual appeal, there was still considerable room for improvement at this stage.

DPO vs SFT

  • Trained on a preference dataset curated from Dojo's human feedback data, the DPO model further improves on the SFT model's interface generation capability.
  • We observed the greatest improvement in instruction following, with significant gains in visuals and interactivity as well.

To test the models' interface generation capabilities, we evaluated them on 40 diverse, complex, and out-of-distribution interface generation tasks. Based on the results, there were almost twice as many tasks where the DPO model outperformed the SFT model (18) as the other way around (10).

Outcome               Number of tasks
DPO is better         18
SFT is better         10
Equal performance     12
Total                 40

Performance (General)

On general benchmarks, the SFT model's performance degraded relative to Base on most benchmarks (despite interleaving with the OpenHermes 2.5 subset to mitigate catastrophic forgetting). DPO was able to reverse this collapse: on most benchmarks it maintained performance, and in some instances it even outperformed Base. This empirical phenomenon is consistent with recent research on how SFT memorizes while RLHF generalizes; what was surprising, however, was that DPO generalized despite the preference dataset containing only samples specific to interface generation tasks.

Benchmark        Base     SFT               DPO
HumanEval        0.829    0.713 (-13.97%)   0.829 (0.00%)
HumanEval Plus   0.768    0.634 (-17.46%)   0.768 (0.00%)
MBPP             0.696    0.638 (-8.33%)    0.700 (+0.57%)
MBPP Plus        0.791    0.733 (-7.36%)    0.796 (+0.67%)
GSM8K            0.717    0.733 (+2.11%)    0.721 (+0.42%)
MMLU             0.644    0.641 (-0.44%)    0.645 (+0.15%)
HellaSwag        0.566    0.558 (-1.46%)    0.567 (+0.09%)

(Percentage changes are relative to the Base model.)

Conclusion and Future Works

DOJO-INTERFACE-CODER-7B represents a step forward in code generation, especially for building complex, interactive frontend interfaces. By combining high-quality synthetic data with distributed human feedback through the Dojo Network, we successfully trained a 7B-parameter model that not only generates complete, visually appealing, and functional code but also generalizes well across unrelated benchmarks such as HumanEval, GSM8K, and MMLU.

Our two-stage post-training process, supervised fine-tuning with Dojo-SFT followed by preference optimization with Dojo-DPO, demonstrates the effectiveness of structured, human-aligned training. Notably, even though the DPO dataset was domain-specific, it helped recover general capabilities lost during SFT, underscoring the broader potential of preference-based training.

Looking ahead, we plan to expand the Dojo feedback pipeline to support iterative and richer human-in-the-loop workflows, improving agentic reasoning and output refinement. We also intend to extend this framework beyond code, introducing support for additional modalities, to further push the boundaries of human-aligned model capabilities.

Appendix

Training details for SFT

### model
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16

### dataset
dataset: sft_dojo_openhermes2-5-no-js_1-1
template: qwen
cutoff_len: 32768
overwrite_cache: true
preprocessing_num_workers: 1
preprocessing_batch_size: 1000

### output
logging_steps: 20
save_steps: 2000
save_total_limit: 5
plot_loss: true
overwrite_output_dir: true
report_to: wandb

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
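# effective batch size = 2 (per device) x 2 (grad. accumulation) x 4 GPUs (DDP) = 16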
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 1800000000
dataloader_drop_last: true
adam_beta1: 0.9
adam_beta2: 0.95
weight_decay: 0.1
load_best_model_at_end: true

### eval
val_size: 0.05
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 250

Training details for DPO

### model
model_name_or_path: <path to SFT model>
### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 256
lora_alpha: 512
pref_beta: 0.1
pref_loss: sigmoid

### dataset
dataset: dpo_dojo_12500_hfgt-1-2
template: qwen
cutoff_len: 32768
overwrite_cache: true
preprocessing_num_workers: 1
preprocessing_batch_size: 1000

### output
logging_steps: 20
save_steps: 500
save_total_limit: 5
plot_loss: true
overwrite_output_dir: true
report_to: wandb

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
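# effective batch size = 1 (per device) x 2 (grad. accumulation) x 4 GPUs (DDP) = 8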
learning_rate: 1.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 1800000000
dataloader_drop_last: true
adam_beta1: 0.9
adam_beta2: 0.95
weight_decay: 0.1
load_best_model_at_end: true

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 250