Photorealism Roadmap V3
Research-grounded path: 29 dB → 35+ dB. Every decision is backed by at least two paper citations, and every dB estimate references an ablation study.
Roadmap: ROADMAP_V3.md | Status: Phase 2 (auto-detected from fleet: PSNR 37.99 dB, LPIPS 0.0464)
Overall Progress: 1/6 phases · 25%
Best PSNR: 37.99 dB
Best LPIPS: 0.0464
Best SSIM: 0.8808
Updated: 2026-04-05T15:27:46Z
Quality Ladder
| Milestone | PSNR | SSIM | LPIPS | What It Looks Like |
|---|---|---|---|---|
| Current | 29.2 dB | ~0.88 | ~0.06 | Good face shape, soft details, weak expression |
| Phase 1 | 31-32 dB | 0.92 | 0.04 | Sharp skin, accurate expressions, visible pores |
| Phase 2 | 33-34 dB | 0.95 | 0.03 | Photorealistic face, correct specular/wrinkles |
| Phase 3 | 35+ dB | 0.97 | 0.02 | Indistinguishable from mirror at arm's length |
PHASE 1: Expression Appearance + Optimization
29 → 31-32 dB · 1-2 weeks · Completed
- Freeze base Gaussian optimization for first 5K steps of expression conditioning (NPGA pattern)
- Expression MLP learning rate: 4e-5 (NPGA) to 5e-5 (our proven DeformMLP rate)
- Add Laplacian smoothness regularization on per-Gaussian features (NPGA: prevents overfitting)
- Oversample rare expressions in training data (MonoGaussianAvatar: prevents ignoring extreme expressions)
- Start LPIPS at step 0 but with weight 0.0
- Ramp exponentially from step 50K to weight 0.1 by step 100K
- Rationale: early LPIPS fights Gaussian optimization; deferred LPIPS refines appearance after geometry converges
- **Citation:** TeGA (SIGGRAPH 2025) defers VGG loss to 50% of training with exponential ramp to 0.1
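The deferred-LPIPS schedule above can be sketched as a plain weight function (the 50K/100K step counts and the 0.1 target come from the bullets; the function name and the epsilon floor for the exponential ramp are illustrative assumptions):

```python
def lpips_weight(step: int, ramp_start: int = 50_000,
                 ramp_end: int = 100_000, target: float = 0.1) -> float:
    """LPIPS loss weight: 0 before ramp_start, exponential ramp to
    `target` by ramp_end, constant afterwards."""
    if step < ramp_start:
        return 0.0
    if step >= ramp_end:
        return target
    # Exponential interpolation from a small floor up to `target`,
    # so early ramp steps contribute almost nothing.
    eps = 1e-4
    t = (step - ramp_start) / (ramp_end - ramp_start)
    return eps * (target / eps) ** t
```

Multiplying the LPIPS term by this weight from step 0 keeps the loss graph defined throughout training while leaving geometry optimization undisturbed until 50K.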
- Replace simple Sobel L1 with patch-variance gradient matching
- **Citation:** "Gradient Variance Loss for Structure-Enhanced Image Super-Resolution" (2022) — GV loss outperforms TV loss in all configurations, +0.5 dB over L2-only on DIV2K
- Weight: λ=0.1 (same as current edge loss)
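A minimal numpy sketch of the gradient-variance idea: compare per-patch variances of Sobel gradients between prediction and ground truth. The patch size and the unweighted sum over x/y gradients are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def sobel(img: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Horizontal and vertical Sobel responses for a 2D grayscale image."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return gx, gy

def patch_variance(g: np.ndarray, p: int = 8) -> np.ndarray:
    """Variance of a gradient map over non-overlapping p×p patches."""
    h, w = g.shape
    g = g[: h - h % p, : w - w % p]
    return g.reshape(h // p, p, w // p, p).var(axis=(1, 3))

def gv_loss(pred: np.ndarray, gt: np.ndarray, p: int = 8) -> float:
    """L2 distance between per-patch gradient variances of pred and gt."""
    loss = 0.0
    for a, b in zip(sobel(pred), sobel(gt)):
        loss += ((patch_variance(a, p) - patch_variance(b, p)) ** 2).mean()
    return loss
```

Unlike a plain Sobel L1, matching patch-level variance penalizes over-smoothed regions even when their mean gradient happens to match the target.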
- **GO to Phase 2:** PSNR ≥ 31.0 dB avg20, AND visual improvement in expression tracking (wrinkle detail, mouth interior)
- **ITERATE Phase 1:** PSNR 30-31 dB — tune expression conditioning, try regional approach
- **REASSESS:** PSNR < 30 dB — expression conditioning may need different architecture
PHASE 2: Multi-Session Data + Decoder Upgrade
31-32 → 33-34 dB · 1-2 weeks · In Progress
- More expression diversity → better expression conditioning generalization
- More viewpoint variation (natural head movement) → better novel-view synthesis
- More lighting variation across frames → more robust appearance model
- More data = more gradient signal = better optimization landscape
- FLAME-fit both sessions (session 1: already done, session 2: needs processing)
- Retractor-filter session 2 (apply same allowlist guard from session 1)
- Joint training on combined dataset with session-aware frame sampling
- Expected clean frames: 751 (session 1) + estimated 5,000-15,000 (session 2, pending filtering) = ~5,750-15,750 total
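One way to realize session-aware sampling, sketched under the assumption of an even 50/50 session split (the actual ratio is a tuning knob): pick a session uniformly first, then a frame within it, so the 751 session-1 frames are not drowned out by the much larger session 2.

```python
import random

def sample_batch(sessions: dict[str, list[int]], batch_size: int,
                 rng: random.Random) -> list[tuple[str, int]]:
    """Draw a batch with equal probability per session, then uniformly
    within the chosen session (session-balanced sampling)."""
    names = sorted(sessions)
    batch = []
    for _ in range(batch_size):
        s = rng.choice(names)                       # session-balanced
        batch.append((s, rng.choice(sessions[s])))  # uniform within session
    return batch

rng = random.Random(0)
sessions = {"session1": list(range(751)), "session2": list(range(12_000))}
batch = sample_batch(sessions, 8, rng)
```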
- HeadGAP's 6-layer CNN showed the strongest ablation gain: **+0.86 dB PSNR** (3DV 2025)
- Cost: ~18K additional params, ~0.3ms latency
- Our 5-layer (2+2+1) design benefits from one more conv in Block 3
- Broadcast FLAME expression params as spatial features, concat with neural features before first conv
- Or: FiLM conditioning — modulate intermediate conv features with expression-dependent scale/bias
- **Citation 1:** FlashAvatar (CVPR 2024) — expression broadcast concat
- **Citation 2:** Gaussian Head Avatar (CVPR 2024) — expression-conditioned super-resolution decoder
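The FiLM option above can be sketched in a few lines of numpy: a linear head maps FLAME expression parameters to per-channel scale and bias for an intermediate feature map. The channel/expression dimensions and random weights are purely illustrative:

```python
import numpy as np

def film(features: np.ndarray, expr: np.ndarray,
         w_scale: np.ndarray, w_bias: np.ndarray) -> np.ndarray:
    """FiLM: modulate conv features (C, H, W) with an expression-dependent
    per-channel scale and bias predicted from FLAME params (E,)."""
    gamma = 1.0 + w_scale @ expr      # (C,) scale, identity at zero expr
    beta = w_bias @ expr              # (C,) bias
    return features * gamma[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(0)
C, E = 64, 100                        # feature channels, FLAME expr dims
feats = rng.standard_normal((C, 32, 32))
expr = rng.standard_normal(E)
out = film(feats, expr,
           w_scale=0.01 * rng.standard_normal((C, E)),
           w_bias=0.01 * rng.standard_normal((C, E)))
```

The `1.0 +` on the scale keeps the modulation near identity at initialization, so the decoder starts from its unconditioned behavior.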
- NPGA does NOT condition decoder on view direction — but they have 15 cameras
- With 2 cameras, view conditioning may help generalize between views
- Test: add 27-dim positional-encoded view direction (47K extra params)
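The 27-dim figure is consistent with the raw 3-vector plus 4 sin/cos frequency bands (3 + 3·2·4 = 27); a sketch of that encoding, assuming standard NeRF-style octave frequencies:

```python
import numpy as np

def encode_view(d: np.ndarray, n_freqs: int = 4) -> np.ndarray:
    """Positional-encode a view direction: raw xyz plus sin/cos at
    n_freqs octaves → 3 + 3*2*n_freqs dims (27 for n_freqs=4)."""
    d = d / np.linalg.norm(d)
    out = [d]
    for k in range(n_freqs):
        out.append(np.sin(2.0 ** k * np.pi * d))
        out.append(np.cos(2.0 ** k * np.pi * d))
    return np.concatenate(out)

pe = encode_view(np.array([0.0, 0.0, 1.0]))   # shape (27,)
```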
- **GO to Phase 3:** PSNR ≥ 33 dB avg20, AND visual: sharp skin texture, correct wrinkles, clean teeth/lips
- **ITERATE Phase 2:** PSNR 31-33 dB — tune multi-session sampling, try different decoder configs
- **PUSH HARDER:** If multi-session data shows strong gains, invest in additional capture sessions (more expressions, lighting conditions)
PHASE 3: Super-Resolution + Final Quality Push
33-34 dB → 35+ dB at 4K · 1-2 weeks · Upcoming
- Parameters: 0.4-0.7M
- Inference: 1-3ms on RTX 5080 (TensorRT FP16)
- VRAM: <500MB additional
- Training: Joint with Gaussian pipeline OR separate on face crops
- Approach: Render → SR

| Pipeline | Render | SR | Total Latency | Quality |
|---|---|---|---|---|
| 960x540 + 4x SR | ~2-3ms | ~1-2ms | **3-5ms** | Good |
| 1920x1080 + 2x SR | ~8-12ms | ~0.5-1ms | **8.5-13ms** | Better |
| 1440x810 + ~2.7x SR | ~4-6ms | ~1ms | **5-7ms** | Balanced |
- 3DGS Blendshapes hit 39.6 dB self-reenactment (SIGGRAPH 2024) — the representation CAN get there
- GaussianStyle hit 34.43 dB from mono with just a StyleGAN decoder (2024) — decoder quality alone approaches 35
- We'll have: better data (stereo, multi-session) + better decoder (CNN with expression FiLM) + better losses (full frequency stack) + better expression conditioning (per-patch) + stereo depth. Each is proven. The combination is novel.
- **GO to Phase 4:** PSNR ≥ 35 dB at training resolution, AND 4K output at <10ms, AND visual quality passes arm's-length mirror test
- **ITERATE:** If 33-34 dB, the visual quality with adversarial + 4K SR may already be sufficient — visual verification is ground truth
- **INVESTIGATE:** If stuck at 33 dB, research additional capture sessions (different lighting, more extreme expressions) or pseudo-view augmentation (GAF-style, +3.4 dB for novel views)
PHASE 4: Novel View Synthesis for Mirror Effect
1-2 weeks · Upcoming
- Camera tracks face position and pose (already have: FLAME fitting)
- Compute virtual mirror viewpoint (reflected camera across display plane)
- Render Gaussians from mirror viewpoint (gsplat supports arbitrary viewmat)
- Gaze correction so eyes "meet" in the mirror
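Computing the virtual mirror viewpoint is a Householder reflection across the display plane. A sketch, assuming the plane is given by a unit normal `n` and a point `p` on it (only the position reflection is shown; the view matrix's rotation must be reflected the same way before being passed to the renderer):

```python
import numpy as np

def mirror_camera(cam_pos: np.ndarray, n: np.ndarray,
                  p: np.ndarray) -> np.ndarray:
    """Reflect a camera position across the display plane defined by
    unit normal `n` and a point `p` on the plane."""
    n = n / np.linalg.norm(n)
    d = np.dot(cam_pos - p, n)        # signed distance to the plane
    return cam_pos - 2.0 * d * n      # mirrored viewpoint

# Display plane z = 0 with normal +z: a camera 0.5m in front of the
# screen maps to a virtual camera 0.5m behind it.
pos = np.array([0.1, 0.0, 0.5])
mirrored = mirror_camera(pos, np.array([0.0, 0.0, 1.0]), np.zeros(3))
```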
- **SHIP:** Mirror viewpoint render quality matches training-view quality (no visible artifacts)
- **ITERATE:** If novel-view artifacts are visible, consider GAF-style view augmentation or increase camera count
PHASE 5: Per-Patient Fast Fitting
2-4 weeks · Upcoming
- 2-3 minute guided expression session (talk, smile, neutral, exaggerated)
- Dual 4K@60fps cameras capture ~3600-5400 frames
- Filter for quality → 500-1000 clean frames
- FLAME fitting in parallel during capture
PHASE 6: Reflect Face Model
Ongoing after Phase 5 · Upcoming
Ruled out by ablation and prior-work evidence:
- **Total Variation loss** — proven inferior to GV loss in all configurations
- **DISTS replacing LPIPS** — no evidence of improvement for face avatars
- **FFL alpha > 1.0** — degrades quality per ablation (FID rises to 102.3)
- **Adversarial loss from step 0** — destabilizes per-subject optimization
- **L1-only or MSE-only** — every 30+ dB method uses multi-term loss
Super-Resolution Decoder Candidates
| Method | Approach | Latency | Params |
|---|---|---|---|
| URPNet (AIS 2024) | Reparam CNN, 4x (540→4K) | 1.04ms (RTX 4090) | 0.62M |
| RepTCN (AIS 2024) | Reparam CNN, 4x (540→4K) | 1.0ms (RTX 4090) | 0.69M |
| EMSR (NTIRE 2025) | ConvLora + distill, 4x | <10ms | 0.131M |
| SPAN (2024) | Param-free attention, 4x | 7.08ms | ~0.5M |
| Gaussian Head Avatar | Bilinear + CNN, 4x (512→2048) | ~15ms total | 3.1M |
| NSRD (CVPR 2024) | Radiance demodulation, 4x | 12.41ms (TRT FP16) | 1.61M |
| SqueezeMe (SIGGRAPH 2025) | Distilled linear, UV SR | 0.45ms (Quest 3) | ~60K |
Key Research References
| Paper | Venue | Key Contribution | PSNR | Cameras |
|---|---|---|---|---|
| NPGA (Kirschstein et al.) | SIGGRAPH Asia 2024 | Per-Gaussian latent features + CNN, multi-view | 37.68 NVS | 15 |
| ScaffoldAvatar (Disney) | SIGGRAPH 2025 | Per-patch expression MLPs | 37.03 self | 16 |
| GaussianAvatars (Qian et al.) | CVPR 2024 | Triangle binding, position/scale reg | 31.60 NVS | 16 |
| Gaussian Head Avatar (Xu et al.) | CVPR 2024 | 32-dim features + U-Net SR decoder | ~28 self | 16 |
| FlashAvatar (Xiang et al.) | CVPR 2024 | UV Gaussians, monocular, fast | 32.33 self | 1 |
| GeoAvatar | ICCV 2025 | Adaptive rigid/flexible densification | 32.70 self | 1 |
| RGBAvatar | CVPR 2025 | MLP→K=20 blendshapes, 80s fitting | 33.89 self | 1 |
| 3DGS Blendshapes | SIGGRAPH 2024 | Linear blendshape basis | 33-39.6 self | 1-16 |
| MeGA | CVPR 2025 | Hybrid mesh-Gaussian, UV decoders | 34.11 NVS | 16 |
| GaussianStyle (Abdal et al.) | 2024 | StyleGAN decoder | 34.43 self | 1 |
| TexAvatars | 2024 | Neural texture Gaussians | 35.15 NVS | 16 |
| TeGA | SIGGRAPH 2025 | UV-space U-Net, 4M Gaussians | 24.4 (4K res) | 13 |
| MonoGaussianAvatar | SIGGRAPH 2024 | Monocular Gaussian deformation | 27-32.5 self | 1 |
| HeadGAP | 3DV 2025 | Few-shot Gaussian priors + CNN | 22.87 self | few |