Model Types
Compare SAEHD, AMP, Quick96, and XSeg models to choose the right architecture for your project.
Overview
Recaster supports four model architectures from DeepFaceLab, each designed for different use cases. Choosing the right model is one of the most important decisions in your face replacement workflow, as it determines the balance between quality, speed, and resource requirements.
Model Comparison
| Feature | SAEHD | AMP | Quick96 | XSeg |
|---|---|---|---|---|
| Quality | Excellent | Very Good | Fair | N/A (masks) |
| Training Speed | Slow | Fast | Very Fast | Fast |
| Min VRAM | 4 GB | 4 GB | 2 GB | 2 GB |
| Recommended VRAM | 8-12 GB | 6-8 GB | 4 GB | 4 GB |
| Resolution | 64-640px | 64-640px | 96px (fixed) | Match face size |
| Configurable Arch | Yes | Yes | No | Limited |
| Typical Iterations | 100K-500K | 50K-200K | 20K-100K | 10K-50K |
| Best For | Final output | Quick projects | Testing | Custom masks |
SAEHD
SAEHD (Styled AutoEncoder High Definition) is the flagship model architecture in DeepFaceLab. It offers the most configuration options and produces the highest quality results, making it the preferred choice for professional and production work.
Strengths
- Highest quality output -- Produces the most detailed and accurate face replacements when trained sufficiently.
- Fully configurable -- Resolution, encoder/decoder dimensions, mask type, and learning rate can all be tuned for your specific needs.
- Resolution flexibility -- Supports resolutions from 64px up to 640px. Higher resolutions preserve more facial detail.
- Multiple mask types -- Supports different mask architectures for the blending boundary.
Key Configuration Options
Resolution
The face resolution for training. Common values: 128, 192, 224, 256, 320. Higher values produce sharper faces but require more VRAM and longer training.
Encoder Dimensions
The width of the encoder network. Larger values (64, 128, 256) allow the model to capture more facial detail but use more VRAM. Start with 64 and increase if you have VRAM headroom.
Decoder Dimensions
The width of the decoder network. Should generally match or be larger than the encoder dimensions. Common values: 64, 128, 256.
Batch Size
Number of face pairs processed per iteration. Larger batches give smoother, more stable gradient updates but require more VRAM. Start with 4-8 and reduce if you run out of memory.
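These options constrain one another: resolution drives VRAM use, and the decoder width should keep up with the encoder. The sketch below illustrates those rules of thumb as a small validation helper; the class name, fields, and checks are hypothetical and do not correspond to Recaster's actual configuration API.

```python
# Hypothetical sketch: sanity-check an SAEHD-style configuration.
# All names and constraints here are illustrative, not Recaster's real API.
from dataclasses import dataclass

@dataclass
class SAEHDConfig:
    resolution: int = 128    # training face resolution in px (64-640)
    encoder_dims: int = 64   # encoder network width
    decoder_dims: int = 64   # decoder width; matching or exceeding encoder is typical
    batch_size: int = 4      # face pairs per iteration

    def validate(self) -> list:
        problems = []
        if not 64 <= self.resolution <= 640:
            problems.append("resolution must be in 64-640px")
        elif self.resolution % 16 != 0:
            # common values (128, 192, 224, 256, 320) are multiples of 16
            problems.append("resolution is usually a multiple of 16")
        if self.decoder_dims < self.encoder_dims:
            problems.append("decoder dims should match or exceed encoder dims")
        return problems

cfg = SAEHDConfig(resolution=224, encoder_dims=64, decoder_dims=128, batch_size=8)
print(cfg.validate())  # -> []
```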
AMP
AMP (Adaptive Multi-Purpose) is designed for faster training while maintaining good quality. It uses a simplified architecture that converges more quickly than SAEHD, making it ideal for projects where time is a factor.
Strengths
- Fast convergence -- Produces usable results in roughly half the iterations needed for SAEHD at comparable settings.
- Good quality -- While not matching SAEHD at maximum settings, AMP produces very good results that are suitable for most applications.
- Lower VRAM usage -- The simplified architecture uses less GPU memory than SAEHD at the same resolution.
- Adaptive features -- The model adapts to the characteristics of your specific dataset during training.
When to Use AMP
- You need results faster than SAEHD can deliver.
- You have limited GPU VRAM (4-6 GB).
- The project does not require maximum possible quality.
- You are working on multiple face swap projects concurrently.
Quick96
Quick96 is a fixed-configuration model that trains at 96px resolution. It is specifically designed for rapid testing and prototyping, allowing you to verify your dataset quality and face alignment before committing to a longer training run.
Characteristics
- Fixed 96px resolution -- No configuration needed. The resolution is locked for maximum training speed.
- Minimal VRAM -- Runs on GPUs with as little as 2 GB of VRAM, making it accessible on older or lower-end hardware.
- Fastest training -- Produces visible results in 20-30 minutes on a modern GPU.
- Lower quality -- The 96px resolution limits detail. Fine facial features may be blurry in the output. Not suitable for production use.
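These characteristics make Quick96 most useful as a cheap gate before committing GPU time to a long run. A minimal sketch of that decision, assuming you judge the Quick96 preview by eye (the function and its arguments are illustrative, not part of Recaster):

```python
# Hypothetical sketch: Quick96 as a cheap validation gate before a long
# training run. In practice you inspect the Quick96 preview window yourself.
def next_step(preview_acceptable: bool, need_max_quality: bool) -> str:
    """Decide what to do after a short (20-30 min) Quick96 run."""
    if not preview_acceptable:
        return "fix dataset"  # re-check extraction and face alignment first
    return "train SAEHD" if need_max_quality else "train AMP"

print(next_step(preview_acceptable=True, need_max_quality=True))   # train SAEHD
print(next_step(preview_acceptable=False, need_max_quality=True))  # fix dataset
```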
XSeg
XSeg (eXtended Segmentation) is not a face swap model -- it is a segmentation model that learns to generate high-quality face masks for your specific dataset. Once trained, XSeg can automatically create masks tailored to your source and destination faces.
Purpose
Unlike BiSeNet (which is pre-trained on a general face dataset), XSeg is trained on your specific faces and learns the unique characteristics that should be included in or excluded from the mask. This produces more accurate masks for challenging situations like:
- Faces with consistent occlusions (microphones, glasses).
- Unusual face angles or extreme expressions.
- Complex hair patterns that need consistent handling.
- Scenes with varying lighting conditions.
XSeg Training Workflow
1. Label sample faces -- Manually paint accurate masks on 30-50 representative face images from your dataset using the Face Editor.
2. Train XSeg -- Select XSeg as the model type and train for 10K-50K iterations. The model learns from your manually labeled examples.
3. Apply XSeg masks -- Run the trained XSeg model on the entire dataset to generate masks for all faces automatically.
4. Review results -- Check the generated masks in the Face Browser using the XSeg overlay. Fix any remaining issues manually.
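The four steps above can be sketched as a pipeline. None of these function names exist in Recaster -- the real steps are performed through the UI -- but the data flow (a few manual labels in, a full set of masks out) is the same:

```python
# Illustrative stand-ins for the four XSeg workflow steps; the function
# names are hypothetical, not part of Recaster.

def label_samples(dataset, n=40):
    """Step 1: manually mask ~30-50 representative faces."""
    return [(face, f"manual_mask_{i}") for i, face in enumerate(dataset[:n])]

def train_xseg(labeled, iterations=20_000):
    """Step 2: train the segmentation model on the labeled pairs."""
    return {"examples": len(labeled), "iterations": iterations}

def apply_xseg(model, dataset):
    """Step 3: generate a mask for every face in the dataset."""
    return {face: f"xseg_mask({face})" for face in dataset}

dataset = [f"face_{i:04d}" for i in range(500)]
model = train_xseg(label_samples(dataset))
masks = apply_xseg(model, dataset)
# Step 4: review -- here we just confirm every face received a mask.
print(len(masks) == len(dataset))  # True
```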
Choosing a Model
Use this decision guide to select the right model for your project:
"I need the best possible quality"
Use SAEHD at 224px or higher resolution with large encoder/decoder dimensions. Plan for 200K-500K iterations and 8+ GB of VRAM.
"I need good results quickly"
Use AMP at 192-224px. You will get good quality in roughly half the time of SAEHD. Good for projects with time constraints.
"I want to test my dataset first"
Use Quick96 for rapid validation. If the Quick96 result looks promising, switch to SAEHD or AMP for the final output.
"I need better masks for a difficult scene"
Train XSeg first on labeled examples, then apply the trained masks to your dataset before running SAEHD or AMP training.
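The guide above condenses into a small lookup. The helper below is an illustrative assumption, not a Recaster feature; the VRAM threshold mirrors the figures quoted above:

```python
# Hypothetical decision helper condensing the guide above.
def pick_model(goal: str, vram_gb: float) -> str:
    if goal == "masks":
        return "XSeg"       # segmentation model, not a face swap model
    if goal == "test":
        return "Quick96"    # rapid dataset validation
    if goal == "best_quality" and vram_gb >= 8:
        return "SAEHD"      # highest quality, needs 8+ GB VRAM
    return "AMP"            # fast, good quality, runs in 4-6 GB

print(pick_model("best_quality", 12))  # SAEHD
print(pick_model("fast", 6))           # AMP
```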
Common Configuration Options
These configuration parameters are shared across model types (where applicable):
| Parameter | Description | Recommended |
|---|---|---|
| Resolution | Training face resolution in pixels | 192-256px for quality, 96-128px for speed |
| Batch Size | Faces per training iteration | 4-8 (lower if out-of-memory) |
| Learning Rate | How aggressively the model learns | Default values recommended |
| Face Type | Face extraction coverage | whole_face for most projects |
| Masked Training | Train only within mask boundary | Yes, when masks are clean |
| Random Warping | Data augmentation for robustness | Yes, for the first 50K iterations |
| GAN Training | Adversarial training for sharper output | Enable after 100K iterations |
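The last two rows describe a schedule rather than a fixed value: random warping is typically switched off after the early phase, and GAN training is switched on late. A minimal sketch of that schedule (iteration thresholds from the table; the `gan_power` value of 0.1 is an arbitrary example, and the function itself is illustrative, not a Recaster API):

```python
# Hypothetical sketch of the staged options from the table above:
# random warping on for the first ~50K iterations, GAN enabled after ~100K.
def stage_options(iteration: int) -> dict:
    return {
        "random_warp": iteration < 50_000,
        "gan_power": 0.1 if iteration >= 100_000 else 0.0,  # 0.1 is an example value
    }

print(stage_options(10_000))   # {'random_warp': True, 'gan_power': 0.0}
print(stage_options(150_000))  # {'random_warp': False, 'gan_power': 0.1}
```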