Model Types

Compare SAEHD, AMP, Quick96, and XSeg models to choose the right architecture for your project.

Overview

Recaster supports four model architectures from DeepFaceLab, each designed for different use cases. Choosing the right model is one of the most important decisions in your face replacement workflow, as it determines the balance between quality, speed, and resource requirements.

Model Comparison

| Feature | SAEHD | AMP | Quick96 | XSeg |
| --- | --- | --- | --- | --- |
| Quality | Excellent | Very Good | Fair | N/A (masks) |
| Training Speed | Slow | Fast | Very Fast | Fast |
| Min VRAM | 4 GB | 4 GB | 2 GB | 2 GB |
| Recommended VRAM | 8-12 GB | 6-8 GB | 4 GB | 4 GB |
| Resolution | 64-640px | 64-640px | 96px (fixed) | Match face size |
| Configurable Arch | Yes | Yes | No | Limited |
| Typical Iterations | 100K-500K | 50K-200K | 20K-100K | 10K-50K |
| Best For | Final output | Quick projects | Testing | Custom masks |

SAEHD

SAEHD (Styled AutoEncoder High Definition) is the flagship model architecture in DeepFaceLab. It offers the most configuration options and produces the highest quality results, making it the preferred choice for professional and production work.

Strengths

  • Highest quality output -- Produces the most detailed and accurate face replacements when trained sufficiently.
  • Fully configurable -- Resolution, encoder/decoder dimensions, mask type, and learning rate can all be tuned for your specific needs.
  • Resolution flexibility -- Supports resolutions from 64px up to 640px. Higher resolutions preserve more facial detail.
  • Multiple mask types -- Supports different mask architectures for the blending boundary.

Key Configuration Options

Resolution

The face resolution for training. Common values: 128, 192, 224, 256, 320. Higher values produce sharper faces but require more VRAM and longer training.

Encoder Dimensions

The width of the encoder network. Common values are 64, 128, and 256; larger values allow the model to capture more facial detail but use more VRAM. Start with 64 and increase if you have VRAM headroom.

Decoder Dimensions

The width of the decoder network. Should generally match or be larger than the encoder dimensions. Common values: 64, 128, 256.

Batch Size

Number of face pairs processed per iteration. Higher batch sizes lead to smoother training but require more VRAM. Start with 4-8 and adjust based on available memory.

SAEHD Recommended Settings

For a good balance of quality and speed on an 8GB GPU, start with: Resolution 224, Encoder 64, Decoder 64, Batch Size 6. Train for 150K-300K iterations. Increase dimensions if you have 12GB+ VRAM.
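The starting point above can be sketched as a configuration dictionary. This is an illustrative sketch only: the key names (`resolution`, `encoder_dims`, and so on) are placeholders, not DeepFaceLab's actual option names, and `scale_for_vram` is a hypothetical helper expressing the VRAM guidance from this section.

```python
# Hypothetical sketch of the recommended SAEHD starting point for an 8 GB GPU.
# Key names are illustrative, not DeepFaceLab's actual option names.
saehd_8gb_config = {
    "resolution": 224,     # face resolution in px; higher needs more VRAM
    "encoder_dims": 64,    # encoder width; raise with 12GB+ VRAM headroom
    "decoder_dims": 64,    # decoder width; match or exceed encoder_dims
    "batch_size": 6,       # lower this first if you hit out-of-memory errors
    "target_iterations": (150_000, 300_000),
}

def scale_for_vram(config, vram_gb):
    """Crude heuristic: raise dimensions with extra VRAM, shrink batch without it."""
    cfg = dict(config)
    if vram_gb >= 12:
        cfg["encoder_dims"] = 128
        cfg["decoder_dims"] = 128
    elif vram_gb < 8:
        cfg["batch_size"] = 4
    return cfg
```

The same pattern (start conservative, scale dimensions only when memory allows) applies to AMP as well.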

AMP

AMP (Adaptive Multi-Purpose) is designed for faster training while maintaining good quality. It uses a simplified architecture that converges more quickly than SAEHD, making it ideal for projects where time is a factor.

Strengths

  • Fast convergence -- Produces usable results in roughly half the iterations needed for SAEHD at comparable settings.
  • Good quality -- While not matching SAEHD at maximum settings, AMP produces very good results that are suitable for most applications.
  • Lower VRAM usage -- The simplified architecture uses less GPU memory than SAEHD at the same resolution.
  • Adaptive features -- The model adapts to the characteristics of your specific dataset during training.

When to Use AMP

  • You need results faster than SAEHD can deliver.
  • You have limited GPU VRAM (4-6 GB).
  • The project does not require maximum possible quality.
  • You are working on multiple face swap projects concurrently.

AMP vs SAEHD

Think of AMP as the "80/20 rule" model -- it achieves 80% of SAEHD's quality in about 50% of the training time. For many use cases, this trade-off is worthwhile.
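The trade-off can be made concrete with rough arithmetic. The iteration counts below follow the comparison table on this page; the per-iteration time is a made-up placeholder, not a benchmark.

```python
# Rough wall-clock comparison. The 0.5 s/iteration figure is an
# illustrative placeholder, not a measured benchmark.
def training_hours(iterations, seconds_per_iter):
    return iterations * seconds_per_iter / 3600

saehd_hours = training_hours(300_000, 0.5)  # mid-range SAEHD run
amp_hours = training_hours(150_000, 0.5)    # AMP: roughly half the iterations
print(f"SAEHD ~{saehd_hours:.0f} h, AMP ~{amp_hours:.0f} h")
```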

Quick96

Quick96 is a fixed-configuration model that trains at 96px resolution. It is specifically designed for rapid testing and prototyping, allowing you to verify your dataset quality and face alignment before committing to a longer training run.

Characteristics

  • Fixed 96px resolution -- No configuration needed. The resolution is locked for maximum training speed.
  • Minimal VRAM -- Runs on GPUs with as little as 2 GB of VRAM, making it accessible on older or lower-end hardware.
  • Fastest training -- Produces visible results in 20-30 minutes on a modern GPU.
  • Lower quality -- The 96px resolution limits detail. Fine facial features may be blurry in the output. Not suitable for production use.

Quick96 as a Validation Tool

Use Quick96 to validate your dataset before starting a long SAEHD run. If the Quick96 output looks wrong -- wrong face shape, color mismatch, or artifacts -- there is likely a problem with your extracted faces or masks that should be fixed before investing hours in SAEHD training.
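The review step above amounts to a short pass/fail checklist. The sketch below is a hypothetical helper (not part of Recaster or DeepFaceLab) whose check names mirror the failure modes listed in this section.

```python
# Hypothetical checklist for reviewing a Quick96 validation run before
# committing to a long SAEHD run. Check names mirror the issues above.
QUICK96_CHECKS = {
    "face_shape_matches": "output face shape resembles the destination",
    "colors_consistent": "no strong color mismatch at the blend boundary",
    "no_artifacts": "no flicker, ghosting, or visible mask edges",
}

def quick96_passes(results):
    """results maps each check name to True/False from a manual review."""
    return all(results.get(check, False) for check in QUICK96_CHECKS)
```

If any check fails, fix the extracted faces or masks first; a problem visible at 96px will not disappear at 224px.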

XSeg

XSeg (eXtended Segmentation) is not a face swap model -- it is a segmentation model that learns to generate high-quality face masks for your specific dataset. Once trained, XSeg can automatically create masks tailored to your source and destination faces.

Purpose

Unlike BiSeNet (which is pre-trained on a general face dataset), XSeg is trained on your specific faces and learns the unique characteristics that should be included in or excluded from the mask. This produces more accurate masks for challenging situations like:

  • Faces with consistent occlusions (microphones, glasses).
  • Unusual face angles or extreme expressions.
  • Complex hair patterns that need consistent handling.
  • Scenes with varying lighting conditions.

XSeg Training Workflow

  1. Label sample faces -- Manually paint accurate masks on 30-50 representative face images from your dataset using the Face Editor.
  2. Train XSeg -- Select XSeg as the model type and train for 10K-50K iterations. The model learns from your manually labeled examples.
  3. Apply XSeg masks -- Run the trained XSeg model on the entire dataset to generate masks for all faces automatically.
  4. Review results -- Check the generated masks in the Face Browser using the XSeg overlay. Fix any remaining issues manually.

XSeg Requires Manual Labels

XSeg cannot train from scratch -- it needs a set of manually labeled example masks. Invest time in creating 30-50 high-quality label masks that cover the range of face angles, expressions, and lighting conditions in your dataset.
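One way to think about the labeling step is as coverage over your dataset's conditions. The sketch below is a hypothetical helper, not a Recaster feature; the category names are examples, and the readiness threshold simply encodes the 30-50 label guideline above.

```python
# Hypothetical coverage check for manually labeled XSeg samples.
# Category names are illustrative; tag them however fits your dataset.
from collections import Counter

CATEGORIES = ("frontal", "profile", "occluded", "extreme_expression")

def label_coverage(labeled_faces, categories=CATEGORIES):
    """labeled_faces: iterable of (filename, category) pairs from manual tagging."""
    counts = Counter(cat for _, cat in labeled_faces)
    missing = [c for c in categories if counts[c] == 0]
    total = sum(counts.values())
    # Encodes the 30-50 label guideline plus coverage of every condition.
    ready = 30 <= total <= 50 and not missing
    return {"total": total, "missing": missing, "ready": ready}
```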

Choosing a Model

Use this decision guide to select the right model for your project:

"I need the best possible quality"

Use SAEHD at 224px or higher resolution with large encoder/decoder dimensions. Plan for 200K-500K iterations and 8+ GB of VRAM.

"I need good results quickly"

Use AMP at 192-224px. You will get good quality in roughly half the time of SAEHD. Good for projects with time constraints.

"I want to test my dataset first"

Use Quick96 for rapid validation. If the Quick96 result looks promising, switch to SAEHD or AMP for the final output.

"I need better masks for a difficult scene"

Train XSeg first on labeled examples, then apply the trained masks to your dataset before running SAEHD or AMP training.
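The decision guide above can be summarized as a simple chooser function. This is a sketch of the guide's logic, not a Recaster API; the goal strings and the 8 GB threshold come from the recommendations on this page.

```python
# Sketch of the decision guide as a chooser function.
def choose_model(goal, vram_gb=8.0):
    """Map a project goal to a model type; thresholds follow the guide above."""
    if goal == "best_quality":
        return "SAEHD" if vram_gb >= 8 else "AMP"  # SAEHD wants 8+ GB of VRAM
    if goal == "fast_results":
        return "AMP"
    if goal == "test_dataset":
        return "Quick96"
    if goal == "better_masks":
        return "XSeg"
    raise ValueError(f"unknown goal: {goal}")
```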

Common Configuration Options

These configuration parameters are shared across model types (where applicable):

| Parameter | Description | Recommended |
| --- | --- | --- |
| Resolution | Training face resolution in pixels | 192-256px for quality, 96-128px for speed |
| Batch Size | Faces per training iteration | 4-8 (lower if out-of-memory) |
| Learning Rate | How aggressively the model learns | Default values recommended |
| Face Type | Face extraction coverage | whole_face for most projects |
| Masked Training | Train only within mask boundary | Yes, when masks are clean |
| Random Warping | Data augmentation for robustness | Yes, for the first 50K iterations |
| GAN Training | Adversarial training for sharper output | Enable after 100K iterations |
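Two of these parameters change as training progresses: random warping is recommended only for the first 50K iterations, and GAN training only after 100K. A minimal sketch of that schedule, with illustrative key names rather than DeepFaceLab's actual option names:

```python
# Illustrative defaults drawn from the table above; key names are
# placeholders, not DeepFaceLab's exact option names.
default_config = {
    "resolution": 224,          # 192-256px for quality, 96-128px for speed
    "batch_size": 6,            # lower if out-of-memory
    "face_type": "whole_face",  # recommended for most projects
    "masked_training": True,    # only when masks are clean
    "random_warp": True,        # recommended for the first 50K iterations
    "gan_training": False,      # enable after 100K iterations
}

def stage_config(config, iteration):
    """Adjust stage-dependent options per the table's guidance."""
    cfg = dict(config)
    cfg["random_warp"] = iteration < 50_000
    cfg["gan_training"] = iteration >= 100_000
    return cfg
```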