We are thrilled to release DOJO-INTERFACE-CODER-7B, a first-of-its-kind Large Language Model (LLM) specialized in generating complex, interactive, and visually appealing frontend interfaces.

DOJO-INTERFACE-CODER-7B is trained on high-quality synthetic data generated by state-of-the-art AI models. Data quality is further ensured using code verifiers, LLM-as-judge, and distributed human feedback.
Leveraging Dojo's distributed human feedback infrastructure, we curated two datasets:
- Dojo-SFT: A comprehensive dataset for supervised fine-tuning (SFT), filtered using LLM-as-judge.
- Dojo-DPO: A preference dataset for Direct Preference Optimization (DPO), curated using human feedback scores to align the model's output with human aesthetic and functional preferences.
Our development process followed a two-stage post-training methodology. We began with the powerful Qwen2.5-Coder-7B-Instruct as our base model. This foundation was then elevated through a supervised fine-tuning phase with Dojo-SFT, resulting in DOJO-INTERFACE-CODER-7B-SFT, followed by a direct preference optimization stage using Dojo-DPO. This produced the final, highly specialized DOJO-INTERFACE-CODER-7B.
DOJO-INTERFACE-CODER-7B is capable of generating functional and visually appealing frontend interfaces, far exceeding the interface generation capabilities of its base model. Beyond its primary use case, the model demonstrates remarkable generalization on benchmarks outside its specialty, such as MMLU, GSM8K, and HumanEval.
Dojo Network
Dojo Network is an open, distributed crowdsourcing platform for human-generated datasets. It leverages Bittensor's blockchain-based incentives to reward participants for providing high-quality preference data. Validators are also rewarded for verifying that submissions adhere to the standards defined by the validation mechanism. This allows for the creation of high-quality, human-aligned datasets that can be used to train and align AI models.
Training Objective and Summary
The primary objective is to train a 7B parameter language model that is highly capable of generating complex frontend interfaces that are functional, visually appealing, and closely adherent to instructions, using high-quality data grounded in human feedback.
Currently, models of similar size perform poorly on interface generation tasks. Qwen2.5-Coder-7B-Instruct, while strong in other coding tasks, struggles with generating complex interfaces in two respects. First, the model has difficulty following complex interface generation instructions and generating JavaScript code that works, resulting in poor interactivity and overall satisfaction with its output. A more critical limitation, however, is its tendency to generate only a skeleton of the HTML, CSS, and especially JavaScript code, leaving users to implement the rest.
Our SFT stage steers the model towards generating the complete code instead of a skeleton, and doing so in a consistent, highly structured fashion. To a lesser extent, the SFT model is also better at instruction following. Nevertheless, there was still considerable room for improvement in instruction following, aesthetics, and the functionality of interactive elements.
Our DPO stage further improves the interface generation capabilities of our SFT model. Using a DPO dataset grounded by Dojo’s distributed human feedback, we trained a model that addresses the aforementioned issues of aesthetics and interactivity. Most importantly, training on human feedback data means that the model learns how to generate outputs that closely align with what the end users want.
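For reference, this stage optimizes the standard Direct Preference Optimization objective (Rafailov et al., 2023):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $x$ is the instruction, $y_w$ and $y_l$ are the chosen and rejected completions from Dojo-DPO, $\pi_{\mathrm{ref}}$ is the frozen SFT model, and $\beta$ controls how far the policy may drift from the reference.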
A slightly surprising auxiliary benefit of the DPO stage is its ability to recover the general capabilities lost to catastrophic forgetting during the SFT stage. Thus, DPO not only improves interface generation specifically, but also restores performance in other domains such as Python programming (HumanEval, MBPP), grade school math (GSM8K), and general language understanding (MMLU).
Dataset
Our models are trained entirely on high-quality synthetic data generated using Claude 3.5 Sonnet. Below is an outline of how we curate our SFT and DPO datasets through our synthetic data generation and distributed human feedback pipeline (a code sketch of the pipeline follows the list):
- Given an instruction, we first generate what we treat as the highest-quality completion. We then repeatedly prompt the model to generate a new completion that is slightly worse than the previous one. For each instruction, we collect four completions of varying quality.
- For each instruction, we gather the top two completions (based on synthetic ground truth) and create separate <instruction, completion> pairs. We do this for all instructions to create our SFT dataset (Dojo-SFT).
- Leveraging the Dojo Network, we distribute instructions and their corresponding completions to a distributed network of miners (essentially labelers), who score the quality of the frontend interface generated from each completion on a 1-10 Likert scale.
- Using the scores as a proxy for preference ranking, we select the top two completions in terms of human feedback ground truth scores to create <instruction, chosen, rejected> triplets, where the chosen response is the higher scoring completion of the two. This leads to our DPO dataset (Dojo-DPO).
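A minimal sketch of this curation pipeline, per instruction, is shown below. The generation and scoring functions are hypothetical stand-ins (for our Claude 3.5 Sonnet prompting loop and the Dojo Network scoring API, respectively), so treat this as an illustration of the data flow rather than our production code:

```python
# Illustrative sketch of the Dojo-SFT / Dojo-DPO curation pipeline.
# `generate_completion` and `collect_miner_scores` are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Completion:
    code: str
    synthetic_rank: int                      # 0 = highest quality by construction
    human_score: Optional[float] = None      # mean 1-10 Likert score, filled in later

def curate(instruction: str,
           generate_completion,              # (instruction, previous_code) -> code string
           collect_miner_scores):            # (instruction, list_of_codes) -> list of mean scores
    # Step 1: generate four completions of decreasing quality via repeated prompting.
    completions, previous = [], None
    for rank in range(4):
        code = generate_completion(instruction, previous)
        completions.append(Completion(code=code, synthetic_rank=rank))
        previous = code

    # Step 2: SFT pairs from the top two completions by synthetic ground truth.
    sft_rows = [{"instruction": instruction, "completion": c.code}
                for c in sorted(completions, key=lambda c: c.synthetic_rank)[:2]]

    # Step 3: miners score every completion on a 1-10 Likert scale.
    scores = collect_miner_scores(instruction, [c.code for c in completions])
    for c, s in zip(completions, scores):
        c.human_score = s

    # Step 4: DPO triplet from the top two human-scored completions.
    best, runner_up = sorted(completions, key=lambda c: c.human_score, reverse=True)[:2]
    dpo_row = {"instruction": instruction, "chosen": best.code, "rejected": runner_up.code}
    return sft_rows, dpo_row
```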
To mitigate catastrophic forgetting in the SFT stage, we interleaved Dojo-SFT with an OpenHermes 2.5 subset at a 1:1 ratio. A summary of the three datasets is as follows:
- Dojo-SFT: 25,000 rows, curated using synthetic ground truth
- Dojo-DPO: 12,500 rows, curated using human feedback ground truth
- OpenHermes 2.5 subset: 25,000 rows, with all JavaScript-related rows filtered out so that improvements in generating interactive JavaScript elements can be attributed to our Dojo datasets (see the sketch after this list)
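As a rough illustration of this mixing step (the dataset paths and the keyword filter below are assumptions, not our exact preprocessing), the interleaving could be done with the Hugging Face datasets library:

```python
# Sketch of the SFT data mixing step; paths and the JavaScript filter are illustrative.
from datasets import load_dataset, interleave_datasets

dojo_sft = load_dataset("json", data_files="dojo_sft.jsonl", split="train")   # ~25k rows
openhermes = load_dataset("teknium/OpenHermes-2.5", split="train")

def not_javascript(example):
    # Crude keyword heuristic over the serialized example (an assumed proxy
    # for "JavaScript-related"), used to drop such rows from OpenHermes 2.5.
    text = str(example).lower()
    return "javascript" not in text and " js " not in text

openhermes = (openhermes.filter(not_javascript)
                        .shuffle(seed=42)
                        .select(range(len(dojo_sft))))   # match Dojo-SFT size

# Round-robin interleaving gives the 1:1 Dojo-SFT : OpenHermes 2.5 mixture.
mixed_sft = interleave_datasets([dojo_sft, openhermes])
```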
Training Details
For all training runs, we used Distributed Data Parallel on a single node with 4x A100 GPUs, using the LLaMA-Factory framework.
Hyperparameters were selected based on a mix of prior knowledge of scaling laws and empirical results. For compute efficiency, we used Low-Rank Adaptation (LoRA) for both the SFT and DPO stages, with a LoRA rank of 16 for SFT and 256 for DPO.
For more information on hyperparameters, please refer to the Appendix section.
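Although our runs were configured through LLaMA-Factory, the LoRA ranks translate roughly to the following Hugging Face peft configuration; the target modules, alpha, and dropout values here are illustrative assumptions rather than our exact settings:

```python
# Rough peft equivalents of the per-stage LoRA settings; alpha, dropout, and
# target modules are illustrative assumptions, not our exact configuration.
from peft import LoraConfig

sft_lora = LoraConfig(
    r=16,                 # LoRA rank used in the SFT stage
    lora_alpha=32,        # assumed scaling factor
    lora_dropout=0.05,    # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

dpo_lora = LoraConfig(
    r=256,                # larger rank used in the DPO stage
    lora_alpha=512,       # assumed
    lora_dropout=0.05,    # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```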
Performance (Interface Generation)
We compare the performance on interface generation tasks across three models: the Base model (Qwen2.5 Coder 7B Instruct), the SFT model, and the DPO model.
In summary, SFT eliminates the Base model's intrinsic refusal to generate the complete code and yields a more consistent structure in its outputs. Much of the learning with regard to generating high-quality interfaces, however, takes place in the DPO stage.
Base vs SFT
- The Base model intrinsically refuses to generate the full code, often leaving comments instructing users to implement parts of the code themselves. This refusal persists regardless of prompting technique.
- The SFT model eliminates this issue entirely: all of its outputs are complete and correctly structured. As a result, it yielded better performance in our interface evaluation.
- However, the majority of the improvement can be attributed to solving the Base model's intrinsic refusal problem. In terms of interactivity and visual appeal, there was still considerable room for improvement at this stage.
DPO vs SFT
- Trained on a preference dataset curated from Dojo's human feedback data, the DPO model further improves on the SFT model's interface generation capability.
- We observed the greatest improvement in instruction following, with significant gains in visuals and interactivity as well.
To test the models’ interface generation capabilities, we evaluated them on 40 diverse, complex, and out-of-distribution interface generation tasks. Based on the results, there were almost twice as many tasks where the DPO model outperformed the SFT model as the other way around.
Performance (General)
On general benchmarks, the SFT model's performance degraded relative to Base on most benchmarks (despite interleaving with OpenHermes 2.5 to mitigate catastrophic forgetting). DPO was able to reverse this decline: on most benchmarks it maintained Base-level performance and in some instances even outperformed Base. This empirical phenomenon is consistent with recent research suggesting that SFT memorizes while RLHF generalizes; what was surprising, however, was that DPO generalized despite the preference dataset containing only samples specific to interface generation tasks.
Conclusion and Future Works
DOJO-INTERFACE-CODER-7B represents a step forward in code generation, especially for building complex, interactive frontend interfaces. By combining high-quality synthetic data with distributed human feedback through the Dojo Network, we successfully trained a 7B parameter model that not only generates complete, visually appealing, and functional code but also generalizes well across unrelated benchmarks such as HumanEval, GSM8K, and MMLU.
Our two-stage post-training process, involving supervised fine-tuning with Dojo-SFT followed by preference optimization with Dojo-DPO, demonstrates the effectiveness of structured, human-aligned training. Notably, even though the DPO dataset was domain-specific, it helped recover general capabilities lost during SFT, underscoring the broader potential of preference-based training.
Looking ahead, we plan to expand the Dojo feedback pipeline to support iterative and richer human-in-the-loop workflows, improving agentic reasoning and output refinement. We also intend to extend this framework beyond code, introducing support for additional modalities to further push the boundaries of human-aligned model capabilities.
Links
- Dojo-Synthetic-SFT Dataset
- Dojo-HumanFeedback-DPO Dataset
- DOJO-INTERFACE-CODER-7B (DPO)
- DOJO-INTERFACE-CODER-7B-SFT