Anima LoRA Training

No Job Selected

Create a new training job or select one from the sidebar

Job Name

Optimization

Learning Rate

Text Encoder LR

Optimizer

LR Scheduler

LR Warmup Steps

Weight Decay

Seed

Restart Cycles Number of times the learning rate restarts from max to min.

Min LR Ratio Minimum LR as a fraction of the initial LR (e.g. 0.1 = decays to 10%).
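
As a rough illustration of how Restart Cycles and Min LR Ratio interact, the sketch below computes a cosine-with-restarts multiplier for the initial learning rate. The function name lr_multiplier, the equal-length cycles, and the omission of warmup are assumptions made for illustration; the scheduler the trainer actually uses may behave differently.

```python
import math


def lr_multiplier(step, total_steps, restart_cycles=1, min_lr_ratio=0.0):
    """Multiplier applied to the initial LR at a given step (hypothetical helper)."""
    if step >= total_steps:
        return min_lr_ratio
    progress = step / total_steps                    # overall progress in [0, 1)
    cycle_pos = (progress * restart_cycles) % 1.0    # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * cycle_pos))  # decays from 1 to 0 per cycle
    # Rescale so each cycle decays from 1.0 down to min_lr_ratio.
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine
```

With 100 steps, 2 restart cycles, and a Min LR Ratio of 0.1, this multiplier starts at 1.0, falls to about 0.55 at step 25, and jumps back to 1.0 at step 50 when the second cycle begins.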

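Decouple Weight Decay Applies weight decay directly to the weights instead of adding it to the gradient (AdamW-style), so the penalty is not rescaled by the optimizer's adaptive gradient scaling. Only has an effect when Weight Decay is non-zero.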
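
To make the difference concrete, here is a minimal sketch of a single Adam update on one scalar weight, with the decay either folded into the gradient or applied separately. The function adam_step and its default values are hypothetical and only cover the first optimizer step (zero initial moments); it is not the trainer's optimizer code.

```python
import math


def adam_step(w, grad, lr=0.1, weight_decay=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight, starting from zero moments."""
    if not decoupled:
        grad = grad + weight_decay * w       # coupled (L2): decay is folded into the gradient
    m = (1 - beta1) * grad                   # first moment after one step
    v = (1 - beta2) * grad * grad            # second moment after one step
    m_hat = m / (1 - beta1)                  # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w       # decoupled: decay applied directly to the weight
    return w_new
```

For w = 10 and grad = 1, the coupled variant returns about 9.9 because the decay term is normalized away together with the gradient, while the decoupled variant returns about 9.8 because the decay acts on the weight itself.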

Training Schedule

Duration Unit

Epochs / Steps

Max Epochs

Save Every N Epochs

Max Steps

Save Every N Steps

Output Name

Save Format

Performance

Mixed Precision

Save Precision

DataLoader Workers

Persistent DataLoader Workers Keeps workers alive between epochs (less startup lag).

Gradient Checkpointing Trades compute for VRAM savings. Recommended on.

Flash Attention Requires flash-attn package.

Torch Compile Uses the TorchDynamo/Inductor backend. The first step is slower while the model compiles; later steps run faster.

Low RAM Optimization Loads the model directly into VRAM to reduce system RAM usage.

Blocks to Swap Moves transformer blocks to RAM to save VRAM. 0=Off.

Activation Offload Offloads activations to CPU during backward pass. Requires Gradient Checkpointing.

Caching

Cache Latents to Disk Pre-encodes images into latents stored as .safetensors files for faster training.

VAE Batch Size

Cache Text Encoder Outputs to Disk Pre-encodes captions. Required if the text encoder is frozen.

Timestep Sample Method

Flow Shift

Global Dataset Settings

Resolution(s) Max resolution(s) for bucketing. Comma-separate for multi-resolution caching (e.g. 512, 1024).

Batch Size(s) Comma-separate to assign varying batch sizes per resolution.
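
One plausible way to read the two comma-separated fields together is sketched below. The helper names and the rule that a single batch size applies to every resolution are assumptions for illustration, not the application's documented behavior.

```python
def parse_int_list(text):
    """Split a comma-separated field such as "512, 1024" into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


def pair_resolutions_with_batch_sizes(resolutions, batch_sizes):
    """Map each resolution to its batch size (hypothetical helper)."""
    res = parse_int_list(resolutions)
    sizes = parse_int_list(batch_sizes)
    if len(sizes) == 1:
        sizes = sizes * len(res)   # assumed: one value is shared by all resolutions
    if len(sizes) != len(res):
        raise ValueError("expected one batch size, or one per resolution")
    return dict(zip(res, sizes))
```

For example, resolutions "512, 1024" with batch sizes "8, 2" would give batch size 8 at 512 and batch size 2 at 1024.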

Progressive Resolution Schedule Train at each resolution sequentially (low → high) instead of mixing them. Requires at least 2 resolutions above. Works with both epochs and max steps.

Each fraction is the portion of total steps for that resolution. Must sum to 1.0.
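
The fraction rule can be sketched as a lookup from the current step to the active resolution. The function name resolution_for_step and the rounding of stage boundaries are assumptions; resolutions are expected in low-to-high order with one fraction each.

```python
def resolution_for_step(step, total_steps, resolutions, fractions):
    """Return the resolution trained at `step` (hypothetical helper)."""
    if len(resolutions) < 2 or len(resolutions) != len(fractions):
        raise ValueError("need at least two resolutions and one fraction per resolution")
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1.0")
    cumulative = 0.0
    for resolution, fraction in zip(resolutions, fractions):
        cumulative += fraction
        # Each stage ends once its cumulative share of the total steps is used up.
        if step < round(cumulative * total_steps):
            return resolution
    return resolutions[-1]
```

With 1000 total steps, resolutions 512 and 1024, and fractions 0.25 and 0.75, steps 0 to 249 train at 512 and steps 250 to 999 train at 1024.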

Gradient Accumulation

Caption Extension

Alpha Mask Use image alpha channel as loss mask. Images without alpha train normally.
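
A loss mask of this kind can be pictured as a per-element weight on the squared error. The sketch below uses flat Python lists and divides by the total element count; both the helper name masked_mse and that normalization are assumptions, since the trainer's actual masking code is not shown here.

```python
def masked_mse(pred, target, mask=None):
    """Mean squared error with each element weighted by a mask value in [0, 1]."""
    if mask is None:
        mask = [1.0] * len(pred)   # no alpha channel: every element counts fully
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return weighted / len(pred)    # assumed: normalize by element count, not mask sum
```

Elements whose mask value is 0 (fully transparent pixels) contribute nothing to the loss, and omitting the mask reproduces the ordinary mean squared error.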

Bucketing

Enable Aspect Ratio Bucketing Groups images by aspect ratio for efficient training.

Do Not Upscale Images Images smaller than the bucket resolution will not be upscaled, saving VRAM and disk space.

Min Bucket Resolution

Max Bucket Resolution

Bucket Resolution Steps

Disable Bucket Shuffle Keeps DataLoader order within resolution buckets.
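
The bucketing settings above can be illustrated with a small sketch that enumerates bucket sizes and assigns an image to the bucket with the closest aspect ratio. The function names and the exact enumeration rule are assumptions modeled on common aspect-ratio bucketing schemes; the trainer's own bucket list may differ.

```python
def make_buckets(resolution=1024, min_size=256, max_size=2048, step=64):
    """Enumerate (width, height) buckets whose area stays within resolution squared."""
    max_area = resolution * resolution
    buckets = set()
    width = min_size
    while width <= max_size:
        # Largest height (a multiple of `step`) that keeps the area within budget.
        height = min(max_size, (max_area // width) // step * step)
        if height >= min_size:
            buckets.add((width, height))
            buckets.add((height, width))   # also add the transposed orientation
        width += step
    return sorted(buckets)


def pick_bucket(image_w, image_h, buckets):
    """Choose the bucket whose aspect ratio is closest to the image's."""
    aspect = image_w / image_h
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))
```

With a resolution of 512, a 256 to 1024 bucket range, and steps of 64, a square image lands in the (512, 512) bucket and a 1920x1080 image lands in (640, 384).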

Dataset Folders

Training Type LoRA trains lightweight adapter weights. Full Finetune trains the entire DiT model.

LoRA Configuration

Network Module

Network Dim (Rank) Higher = more capacity, more VRAM.

Network Alpha Scaling factor. Usually same as dim.
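
In standard LoRA the learned update is scaled by alpha divided by the rank, which is why alpha equal to dim gives a scale of 1. The sketch below shows this with plain nested lists; the helper name lora_delta is hypothetical, and the chosen Network Module may apply additional scaling.

```python
def lora_delta(down, up, alpha):
    """Weight update (alpha / dim) * up @ down.

    down has shape [dim][in_features]; up has shape [out_features][dim].
    """
    dim = len(down)
    scale = alpha / dim
    in_features = len(down[0])
    return [
        [scale * sum(up_row[k] * down[k][j] for k in range(dim))
         for j in range(in_features)]
        for up_row in up
    ]
```

Halving alpha while keeping dim fixed halves every entry of the update, which has a similar effect to lowering the learning rate for the adapter.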

Train UNet Only Freeze text encoder. Recommended for most LoRA training.

Network Dropout Dropout rate (0-1). Randomly zeroes LoRA neurons to reduce overfitting. 0 = off.

Network Args Space-separated key=value pairs passed to the network module.
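
The space-separated key=value format can be parsed along these lines. The helper name parse_network_args and the example keys are placeholders, not options known to be supported by any particular network module.

```python
def parse_network_args(text):
    """Turn "key_a=1 key_b=x" into {"key_a": "1", "key_b": "x"}."""
    args = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        args[key] = value   # values stay strings; the network module interprets them
    return args
```

Because pairs are split on whitespace, a value containing spaces would need a different encoding.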

Network Weights (LoRA checkpoint) Initialize from an existing LoRA checkpoint, for example to continue fine-tuning it. Usually not needed.

Full Finetune Options

Freeze LLM Adapter Keep the LLM adapter weights frozen during training. Recommended: the adapter is pre-trained, and retraining it risks degrading text understanding and can cause DDP graph errors.

Resume Training

Auto-resume from last saved state Automatically resumes from the most recent saved state in the output folder.

Resume State Folder Resume training state (optimizer, scheduler, step count). Leave blank when auto-resume is enabled.
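
Auto-resume amounts to picking the newest state folder in the output directory. A minimal sketch follows, assuming state folders are subdirectories whose names end in "-state" and that "most recent" means latest modification time; both the suffix and the helper name find_latest_state are assumptions, not confirmed details of this application.

```python
from pathlib import Path


def find_latest_state(output_dir, suffix="-state"):
    """Return the most recently modified state folder, or None if there is none."""
    candidates = [
        p for p in Path(output_dir).iterdir()
        if p.is_dir() and p.name.endswith(suffix)   # assumed naming convention
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

A None result corresponds to starting a fresh run, while setting Resume State Folder by hand bypasses this lookup entirely.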

In-Training Sampling

Enable Sampling

Sample Every N Epochs

Sample Every N Steps

Test Generation

Base Model (No LoRA)

LoRA Strength

Flash Attention Requires flash-attn.

Sage Attention Better Multi-GPU support.

GPU Selection

Multi-GPU Mode

Keep Model Loaded

Sample Prompts

Negative Prompt

Steps

Scale

Seed

No sample prompts yet. Click "+ Add Prompt" to add one.

Hardware Allocation

GPU Selection

Multi-GPU Optimization

Parallelism Mode DDP replicates the full model on each GPU and syncs gradients. FSDP1/FSDP2 shard parameters to reduce VRAM (FSDP2 is the newer, simpler API).

Use CUDA Direct Backend Windows-only custom backend replacing NCCL for native multi-GPU. Auto-detected in TP/SP mode. Incompatible with Torch Compile.

DDP Options

Gradient as Bucket View Reduces gradient memory overhead by eliminating a copy per step. Recommended for DDP training.

Static Graph Allows DDP to overlap communication and computation more aggressively. Recommended when model structure does not change between steps.

Sharding Strategy

Select a strategy to see details.

For more information, please refer to the official PyTorch docs.

CPU Offloading Moves parameters to main system RAM when not in use. Slows down training but drastically reduces VRAM requirements.

Enable Resharding After Forward Frees gathered parameters after the forward pass. Highly recommended for maximizing VRAM savings.

Enable FSDP Activation Checkpointing Uses FSDP's native activation checkpointing. Can be used alongside standard gradient checkpointing for maximum VRAM savings.

CPU RAM Efficient Loading Only rank 0 loads the model from disk; other ranks receive weights via broadcast. Reduces peak system RAM by ~50% during startup. Automatically enables Sync Module States (required by accelerate).

Backward Prefetch Overlaps parameter all-gather with gradient computation in the backward pass. BACKWARD_PRE gives the best throughput by fetching the next layer while computing current gradients. Costs a small VRAM peak.

Forward Prefetch Pre-fetches the next layer's parameters during the forward pass to overlap communication with computation. Safe for DiT/UNet (static graphs). Speeds up forward pass at a small VRAM cost.

Use Original Parameters Required for LoRA training. Keeps original parameter references so FSDP handles mixed frozen/trainable parameters (frozen UNet + LoRA hooks) correctly. Disable only if you know the entire model has uniform requires_grad.

Limit All-Gathers Prevents too many simultaneous all-gather ops from piling up, reducing CUDA malloc retries and potential OOM spikes. Recommended on.

Auto Wrap Policy Determines how FSDP clusters parameters into sharded units. TRANSFORMER_BASED_WRAP is recommended for large models.

Parameters Threshold Only shards modules with at least this many parameters. Default is 100M (1e8).

Transformer Layer to Wrap The exact class name of the transformer block. Required for TRANSFORMER_BASED_WRAP.

Linux only. This is not available on Windows.

Reshard After Forward When enabled (default), frees unsharded parameters after each forward pass and re-all-gathers them in backward. Maximizes VRAM savings. Disable to keep parameters unsharded between forward and backward — saves one backward all-gather at the cost of higher VRAM.

CPU Offloading Moves sharded parameters and gradients to CPU when not in use. Drastically reduces VRAM at the cost of H2D/D2H copy overhead per step.

Activation Checkpointing Frees intermediate activations during forward and recomputes them in backward. Reduces VRAM at the cost of extra compute. Combine with Reshard After Forward for maximum VRAM savings.

CPU RAM Efficient Loading Only rank 0 loads the model checkpoint; other ranks receive weights via broadcast. Reduces peak system RAM by ~50% during startup.

Auto Wrap Policy Determines how FSDP2 groups parameters into sharded units. TRANSFORMER_BASED_WRAP is recommended for large DiT/UNet models: each transformer block is all-gathered and reduce-scattered independently, enabling communication/computation overlap.

Parameters Threshold Only shards modules with at least this many parameters. Default is 100M (1e8).

Transformer Layer to Wrap Exact class name of the transformer block to wrap per-layer. Required for TRANSFORMER_BASED_WRAP.

FSDP2 vs FSDP1: FSDP2 uses torch.distributed.fsdp.fully_shard (PyTorch ≥ 2.4). Parameters are always exposed as originals (no FlatParameter), making it compatible with LoRA. Backward prefetch, forward prefetch, and limit-all-gathers are handled automatically. Sharding strategy is replaced by the single Reshard After Forward toggle.

DeepSpeed Options

ZeRO Stage Higher stage saves more VRAM but can increase communication overhead.

Optimizer Offload Device

Optimizer NVMe Path

Parameter Offload Device

Parameter NVMe Path

Enable ZeRO-3 Init

Save 16-bit Model with ZeRO-3

FP16 Master Weights and Gradients Primarily useful for ZeRO-Offload configurations that support this mode.

DeepSpeed note: This mode uses Accelerate DeepSpeed launch options plus existing training arguments.

TP/SP Options

TP Degree Number of GPUs for tensor parallelism. Must equal the number of selected GPUs above.

Backend Use Gloo on native Windows. Use NCCL from WSL/Linux. Auto lets the TP/SP script choose.

Sequence Parallel (SP) Always enabled for this mode. Spatial tokens are split across GPUs alongside weight sharding.

Disable Fused QKV Debug option. Leaves attention q/k/v projections unfused instead of using the packed QKV/KV TP path.

TP/SP notes: All standard training configs (LR, batch size, optimizer, network dim, etc.) work. Sample generation during training is disabled. Backend is selected above. Torch Compile is incompatible.

Diagnostics

Enable Step Profiling Prints per-step timing breakdown (forward, backward, communication, optimizer, Python overhead) to the training log.

Track Microbatches Also prints per-microbatch forward/backward times within each step (only available with Step Profiling enabled).
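
A per-step timing breakdown of this kind can be approximated with a small context-manager timer. The class StepProfiler and the phase names below are hypothetical and are not the profiler built into the trainer.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepProfiler:
    """Accumulates wall-clock time per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raises.
            self.totals[name] += time.perf_counter() - start

    def report(self):
        return " | ".join(
            f"{name}: {seconds * 1000:.1f} ms" for name, seconds in self.totals.items()
        )
```

Wrapping each part of the loop in "with profiler.phase('forward'):" and so on yields one line per step. On a GPU, kernels launch asynchronously, so wall-clock timings like these are only meaningful if the device is synchronized before each measurement.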

Generated Samples

No sample images yet. They will appear here during training.

Training Console

Waiting for training to start...

TensorBoard

Not running

Click "Launch" to start TensorBoard and view training metrics.

Logs directory: <job>/logs/

Job Maintenance

Open the job's directory in file explorer.

Logging

Delete all TensorBoard event files for this job.

Danger Zone

Revert all settings to template defaults.

Create New Training Job

Clone Job

Confirm

⚙️ Global Settings