No Job Selected
Create a new training job or select one from the sidebar
Job Name
Optimization
Number of times the learning rate restarts from max to min.
Minimum LR as a fraction of the initial LR (e.g. 0.1 = decays to 10%).
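As a rough illustration of how the two settings above interact, the sketch below computes a cosine schedule with hard restarts and a minimum-LR floor. The function name `cosine_restart_lr` and its parameters are hypothetical, warmup is omitted, and the trainer's actual scheduler may differ.

```python
import math

def cosine_restart_lr(step, total_steps, base_lr, num_cycles=1, min_lr_ratio=0.0):
    """Return the LR at `step` for a cosine schedule with hard restarts.

    num_cycles   -- how many times the LR sweeps from max down to min
    min_lr_ratio -- floor as a fraction of base_lr (0.1 = decays to 10%)
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    if progress >= 1.0:
        # End of training: sit at the floor instead of restarting.
        return base_lr * min_lr_ratio
    # Position inside the current cycle, in [0, 1).
    cycle_progress = (progress * num_cycles) % 1.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * cycle_progress))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```

With `num_cycles=2`, the LR falls toward the floor during the first half of training, jumps back to `base_lr` at the midpoint, and decays again.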
Separates weight penalty from gradient scaling to prevent
overfitting when using Weight Decay.
Training Schedule
Performance
Keeps workers alive between epochs (less startup lag).
Trades compute for VRAM savings. Recommended on.
Requires flash-attn package.
Uses dynamo inductor backend. Slower first step, faster training after.
Loads model to VRAM to save system RAM.
Moves transformer blocks to RAM to save VRAM. 0=Off.
Offloads activations to CPU during backward pass. Requires Gradient Checkpointing.
Caching
Pre-encode images as .safetensors files for faster training.
Pre-encode captions. Required if text encoder is frozen.
Global Dataset Settings
Max resolution(s) for bucketing. Comma-separate for multi-resolution caching (e.g. 512, 1024).
Comma-separate to assign varying batch sizes per resolution.
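A minimal sketch of how the comma-separated resolution and batch-size fields could be paired up. The helper names are hypothetical, and the rule that a single batch size applies to every resolution is an assumption, not confirmed behavior.

```python
def parse_int_list(text):
    """Parse a comma-separated string such as "512, 1024" into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]

def batch_size_per_resolution(resolutions_text, batch_sizes_text):
    """Map each resolution to its batch size."""
    resolutions = parse_int_list(resolutions_text)
    batch_sizes = parse_int_list(batch_sizes_text)
    if len(batch_sizes) == 1:
        # Assumption: one value is shared by all resolutions.
        batch_sizes = batch_sizes * len(resolutions)
    if len(batch_sizes) != len(resolutions):
        raise ValueError("need one batch size, or one per resolution")
    return dict(zip(resolutions, batch_sizes))
```

For example, `batch_size_per_resolution("512, 1024", "4, 2")` pairs 512 with a batch size of 4 and 1024 with a batch size of 2.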
Train at each resolution sequentially (low → high) instead of mixing them. Requires at least 2 resolutions above. Works with both epochs and max steps.
Each fraction is the portion of total steps for that resolution. Must sum to 1.0.
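The fraction rule above can be sketched as a conversion from fractions to step ranges. `progressive_schedule` is a hypothetical helper; the rounding behavior and the choice to give the last resolution any leftover steps are assumptions.

```python
def progressive_schedule(resolutions, fractions, total_steps):
    """Return [(resolution, start_step, end_step), ...] ordered low to high."""
    if len(resolutions) < 2:
        raise ValueError("sequential training needs at least 2 resolutions")
    if len(resolutions) != len(fractions):
        raise ValueError("need exactly one fraction per resolution")
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1.0")

    ordered = sorted(zip(resolutions, fractions))  # lowest resolution first
    schedule = []
    start = 0
    cumulative = 0.0
    for index, (resolution, fraction) in enumerate(ordered):
        cumulative += fraction
        is_last = index == len(ordered) - 1
        # The last phase absorbs rounding drift so every step is covered.
        end = total_steps if is_last else round(cumulative * total_steps)
        schedule.append((resolution, start, end))
        start = end
    return schedule
```

Each tuple gives a half-open step range, so with 1000 steps and fractions 0.25 and 0.75 the lower resolution covers steps 0 to 249 and the higher one covers the rest.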
Use image alpha channel as loss mask. Images without alpha train normally.
Bucketing
Groups images by aspect ratio for efficient training.
Images smaller than the bucket resolution will not be upscaled, saving VRAM and disk
space.
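To make the bucketing idea concrete, here is a simplified sketch of aspect-ratio bucket assignment with an optional no-upscale rule. The function names, the 64-pixel step, the minimum side of 256 and the cap on the longer side are all illustrative assumptions; the trainer's real bucket table may be built differently.

```python
import math

def make_buckets(max_res, step=64, min_side=256):
    """Enumerate (width, height) buckets whose area stays within max_res squared."""
    max_area = max_res * max_res
    max_side = 2 * max_res  # assumed cap on the longer side
    buckets = set()
    for w in range(min_side, max_side + 1, step):
        # Largest height (multiple of `step`) that keeps the area in budget.
        h = min(max_side, (max_area // w) // step * step)
        if h >= min_side:
            buckets.add((w, h))
            buckets.add((h, w))
    return sorted(buckets)

def assign_bucket(width, height, buckets, no_upscale=False, step=64):
    """Pick the bucket with the closest aspect ratio (compared in log space)."""
    target = math.log(width / height)
    best = min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))
    if no_upscale and (width < best[0] or height < best[1]):
        # Keep the image near its own size, rounded down to the step grid.
        return (max(step, width // step * step), max(step, height // step * step))
    return best
```

In this sketch a 300x300 image with `no_upscale=True` lands in a 256x256 bucket rather than being enlarged to 512x512, which is where the VRAM and disk savings come from.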
Keeps DataLoader order within resolution buckets.
Dataset Folders
LoRA trains lightweight adapter weights. Full Finetune trains the entire DiT model.
LoRA Configuration
Higher = more capacity, more VRAM.
Scaling factor. Usually same as dim.
Freeze text encoder. Recommended for most LoRA training.
Dropout rate (0-1). Randomly zeroes LoRA neurons to reduce overfitting. 0 = off.
Space-separated key=value pairs passed to the network module.
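A small sketch of how a space-separated `key=value` string might be turned into a dictionary before it is handed to the network module. `parse_network_args` is a hypothetical helper, and the keys in the example are placeholders rather than arguments this trainer is known to accept.

```python
def parse_network_args(text):
    """Parse "key1=value1 key2=value2" into a dict of strings."""
    args = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        args[key] = value  # values stay strings; the module decides how to cast
    return args
```

For example, `parse_network_args("conv_dim=4 conv_alpha=1")` yields a two-entry dictionary with string values.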
Initialize from an existing LoRA, typically to continue fine-tuning it. Usually not needed.
Full Finetune Options
Keep the LLM adapter weights frozen during training. Recommended: the adapter is pre-trained, and retraining it risks degrading text understanding and can cause DDP graph errors.
Resume Training
Automatically resumes from the most recent saved state in the output folder.
Resume training state (optimizer, scheduler, step count). Leave blank when auto-resume is enabled.
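One way the two resume options could fit together is sketched below. The `*-state` directory naming, the use of modification time to pick the newest state, and both function names are assumptions made for illustration only.

```python
from pathlib import Path

def find_latest_state(output_dir, pattern="*-state"):
    """Return the most recently modified state directory, or None if absent."""
    candidates = [p for p in Path(output_dir).glob(pattern) if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)

def resolve_resume_path(output_dir, auto_resume, manual_path=""):
    """Auto-resume takes priority; otherwise use the manual path if given."""
    if auto_resume:
        return find_latest_state(output_dir)
    return Path(manual_path) if manual_path else None
```

Under these assumptions, leaving the manual path blank while auto-resume is on simply means the newest saved state wins, and a fresh run starts when none exists.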
In-Training Sampling
Test Generation
Requires flash-attn.
Better Multi-GPU support.
Loading GPUs...
Sample Prompts
No sample prompts yet. Click "+ Add Prompt" to add one.
Hardware Allocation
Multi-GPU Optimization
DDP replicates the full model on each GPU and syncs gradients. FSDP1/FSDP2 shard parameters to reduce VRAM (FSDP2 is the newer, simpler API).
Windows-only custom backend replacing NCCL for native multi-GPU. Auto-detected in TP/SP mode. Incompatible with Torch Compile.
Reduces gradient memory overhead by eliminating a copy per step. Recommended for DDP training.
Allows DDP to overlap communication and computation more aggressively. Recommended when model structure does not change between steps.
Select a strategy to see details.
For more information, please refer to the official PyTorch docs.
Moves parameters to Main System RAM when not in use. Slows down training but drastically reduces VRAM requirements.
Frees gathered parameters after the forward pass. Highly recommended for maximizing VRAM savings.
Uses FSDP's native activation checkpointing. Can be used alongside standard gradient checkpointing for maximum VRAM savings.
Only rank 0 loads the model from disk; other ranks receive weights via broadcast. Reduces peak system RAM by ~50% during startup. Automatically enables Sync Module States (required by Accelerate).
Overlaps parameter all-gather with gradient computation in the backward pass. BACKWARD_PRE gives the best throughput by fetching the next layer while computing current gradients. Costs a small VRAM peak.
Pre-fetches the next layer's parameters during the forward pass to overlap communication with computation. Safe for DiT/UNet (static graphs). Speeds up forward pass at a small VRAM cost.
Required for LoRA training. Keeps original parameter references so FSDP handles mixed frozen/trainable parameters (frozen UNet + LoRA hooks) correctly. Disable only if you know the entire model has uniform requires_grad.
Prevents too many simultaneous all-gather ops from piling up, reducing CUDA malloc retries and potential OOM spikes. Recommended on.
Determines how FSDP clusters parameters into sharded units. TRANSFORMER_BASED_WRAP is recommended for large models.
Only shards modules with at least this many parameters. Default is 100M (1e8).
The exact class name of the transformer block. Required for TRANSFORMER_BASED_WRAP.
Linux only. This is not available on Windows.
When enabled (default), frees unsharded parameters after each forward pass and re-all-gathers them in backward. Maximizes VRAM savings. Disable to keep parameters unsharded between forward and backward; this saves one backward all-gather at the cost of higher VRAM.
Moves sharded parameters and gradients to CPU when not in use. Drastically reduces VRAM at the cost of H2D/D2H copy overhead per step.
Frees intermediate activations during forward and recomputes them in backward. Reduces VRAM at the cost of extra compute. Combine with Reshard After Forward for maximum VRAM savings.
Only rank 0 loads the model checkpoint; other ranks receive weights via broadcast. Reduces peak system RAM by ~50% during startup.
Determines how FSDP2 groups parameters into sharded units. TRANSFORMER_BASED_WRAP is recommended for large DiT/UNet models: each transformer block is all-gathered and reduce-scattered independently, enabling communication/computation overlap.
Only shards modules with at least this many parameters. Default is 100M (1e8).
Exact class name of the transformer block to wrap per-layer. Required for TRANSFORMER_BASED_WRAP.
FSDP2 vs FSDP1: FSDP2 uses torch.distributed.fsdp.fully_shard (PyTorch ≥ 2.4). Parameters are always exposed as originals (no FlatParameter), making it compatible with LoRA. Backward prefetch, forward prefetch, and limit-all-gathers are handled automatically. Sharding strategy is replaced by the single Reshard After Forward toggle.
Higher stage saves more VRAM but can increase communication overhead.
Primarily useful for ZeRO-Offload configurations that support this mode.
DeepSpeed note: This mode uses Accelerate DeepSpeed launch options plus existing training arguments.
Number of GPUs for tensor parallelism. Must equal the number of selected GPUs above.
Use Gloo on native Windows. Use NCCL from WSL/Linux. Auto lets the TP/SP script choose.
Always enabled for this mode. Spatial tokens are split across GPUs alongside weight sharding.
Debug option. Leaves attention q/k/v projections unfused instead of using the packed QKV/KV TP path.
TP/SP notes: All standard training configs (LR, batch size, optimizer, network dim, etc.) work. Sample generation during training is disabled. Backend is selected above. Torch Compile is incompatible.
Diagnostics
Prints per-step timing breakdown (forward, backward, communication, optimizer, Python overhead) to the training log.
Also prints per-microbatch forward/backward times within each step (only available with step profiling enabled).
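A per-step timing breakdown of this kind can be approximated with a small context-manager profiler. `StepProfiler` and its phase names are hypothetical and do not reflect the trainer's internal implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepProfiler:
    """Accumulates wall-clock time per named phase within one training step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so repeated phases (e.g. microbatches) add up.
            self.totals[name] += time.perf_counter() - start

    def report(self, step):
        """Format the breakdown for the log and reset for the next step."""
        total = sum(self.totals.values())
        parts = ", ".join(f"{name}={secs * 1000:.1f}ms"
                          for name, secs in self.totals.items())
        line = f"step {step}: {parts}, total={total * 1000:.1f}ms"
        self.totals.clear()
        return line
```

Usage would look like `with profiler.phase("forward"): ...` around each stage. Because CUDA kernels run asynchronously, GPU phases need a device synchronization before the timer stops for the numbers to be meaningful.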
Generated Samples
No sample images yet. They will appear here during training.
Training Console
Waiting for training to start...
TensorBoard
Not running
Click "Launch" to start TensorBoard and view training metrics.
Logs directory: <job>/logs/
Job Maintenance
Open the job's directory in file explorer.
Logging
Delete all TensorBoard event files for this job.
Danger Zone
Revert all settings to template defaults.