Photorealism Roadmap V3
Research-grounded path: 29 dB → 35+ dB. Every decision is backed by at least two paper citations, and every dB estimate references an ablation study.
Roadmap: ROADMAP_V3.md | Status: Phase 2 (auto-detected from fleet: PSNR 37.99 dB, LPIPS 0.0464)
Overall Progress: 1/6 phases · 25%
Best PSNR: 37.99 dB
Best LPIPS: 0.0464
Best SSIM: 0.8808
Updated: 2026-04-05T15:27:46Z
Quality Ladder
| Milestone | PSNR | SSIM | LPIPS | What It Looks Like |
|---|---|---|---|---|
| Current | 29.2 dB | ~0.88 | ~0.06 | Good face shape, soft details, weak expression |
| Phase 1 | 31-32 dB | 0.92 | 0.04 | Sharp skin, accurate expressions, visible pores |
| Phase 2 | 33-34 dB | 0.95 | 0.03 | Photorealistic face, correct specular/wrinkles |
| Phase 3 | 35+ dB | 0.97 | 0.02 | Indistinguishable from mirror at arm's length |
PHASE 1: Expression Appearance + Optimization
29 → 31-32 dB · 1-2 weeks · Completed
- Freeze base Gaussian optimization for first 5K steps of expression conditioning (NPGA pattern)
- Expression MLP learning rate: 4e-5 (NPGA) to 5e-5 (our proven DeformMLP rate)
- Add Laplacian smoothness regularization on per-Gaussian features (NPGA: prevents overfitting)
- Oversample rare expressions in training data (MonoGaussianAvatar: prevents ignoring extreme expressions)
- Start LPIPS at step 0 but with weight 0.0
- Ramp exponentially from step 50K to weight 0.1 by step 100K
- Rationale: early LPIPS fights Gaussian optimization; deferred LPIPS refines appearance after geometry converges
- **Citation:** TeGA (SIGGRAPH 2025) defers VGG loss to 50% of training with exponential ramp to 0.1
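The deferred-LPIPS schedule above can be sketched as a plain weight function (the 50K/100K step counts and the 0.1 target come from the bullets; the function name and the epsilon floor for the exponential ramp are illustrative assumptions):

```python
def lpips_weight(step: int, ramp_start: int = 50_000,
                 ramp_end: int = 100_000, target: float = 0.1) -> float:
    """LPIPS loss weight: 0 before ramp_start, exponential ramp to
    `target` by ramp_end, constant afterwards."""
    if step < ramp_start:
        return 0.0
    if step >= ramp_end:
        return target
    # Exponential interpolation from a small floor up to `target`,
    # so early ramp steps contribute almost nothing.
    eps = 1e-4
    t = (step - ramp_start) / (ramp_end - ramp_start)
    return eps * (target / eps) ** t
```

Multiplying the LPIPS term by this weight from step 0 keeps the loss graph defined throughout training while leaving geometry optimization undisturbed until 50K.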
- Replace simple Sobel L1 with patch-variance gradient matching
- **Citation:** "Gradient Variance Loss for Structure-Enhanced Image Super-Resolution" (2022) — GV loss outperforms TV loss in all configurations, +0.5 dB over L2-only on DIV2K
- Weight: λ=0.1 (same as current edge loss)
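A minimal numpy sketch of the gradient-variance idea: compare per-patch variances of Sobel gradients between prediction and ground truth. The patch size and the unweighted sum over x/y gradients are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def sobel(img: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Horizontal and vertical Sobel responses for a 2D grayscale image."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return gx, gy

def patch_variance(g: np.ndarray, p: int = 8) -> np.ndarray:
    """Variance of a gradient map over non-overlapping p×p patches."""
    h, w = g.shape
    g = g[: h - h % p, : w - w % p]
    return g.reshape(h // p, p, w // p, p).var(axis=(1, 3))

def gv_loss(pred: np.ndarray, gt: np.ndarray, p: int = 8) -> float:
    """L2 distance between per-patch gradient variances of pred and gt."""
    loss = 0.0
    for a, b in zip(sobel(pred), sobel(gt)):
        loss += ((patch_variance(a, p) - patch_variance(b, p)) ** 2).mean()
    return loss
```

Unlike a plain Sobel L1, matching patch-level variance penalizes over-smoothed regions even when their mean gradient happens to match the target.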
- **GO to Phase 2:** PSNR ≥ 31.0 dB avg20, AND visual improvement in expression tracking (wrinkle detail, mouth interior)
- **ITERATE Phase 1:** PSNR 30-31 dB — tune expression conditioning, try regional approach
- **REASSESS:** PSNR < 30 dB — expression conditioning may need different architecture
PHASE 2: Multi-Session Data + Decoder Upgrade
31-32 → 33-34 dB · 1-2 weeks · In Progress
- More expression diversity → better expression conditioning generalization
- More viewpoint variation (natural head movement) → better novel-view synthesis
- More lighting variation across frames → more robust appearance model
- More data = more gradient signal = better optimization landscape
- FLAME-fit both sessions (session 1: already done, session 2: needs processing)
- Retractor-filter session 2 (apply same allowlist guard from session 1)
- Joint training on combined dataset with session-aware frame sampling
- Expected clean frames: 751 (session 1) + estimated 5,000-15,000 (session 2, pending filtering) = ~5,750-15,750 total
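One way to realize session-aware sampling, sketched under the assumption of an even 50/50 session split (the actual ratio is a tuning knob): pick a session uniformly first, then a frame within it, so the 751 session-1 frames are not drowned out by the much larger session 2.

```python
import random

def sample_batch(sessions: dict[str, list[int]], batch_size: int,
                 rng: random.Random) -> list[tuple[str, int]]:
    """Draw a batch with equal probability per session, then uniformly
    within the chosen session (session-balanced sampling)."""
    names = sorted(sessions)
    batch = []
    for _ in range(batch_size):
        s = rng.choice(names)                       # session-balanced
        batch.append((s, rng.choice(sessions[s])))  # uniform within session
    return batch

rng = random.Random(0)
sessions = {"session1": list(range(751)), "session2": list(range(12_000))}
batch = sample_batch(sessions, 8, rng)
```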
- HeadGAP's 6-layer CNN showed the strongest ablation gain: **+0.86 dB PSNR** (3DV 2025)
- Cost: ~18K additional params, ~0.3ms latency
- Our 5-layer (2+2+1) design benefits from one more conv in Block 3
- Broadcast FLAME expression params as spatial features, concat with neural features before first conv
- Or: FiLM conditioning — modulate intermediate conv features with expression-dependent scale/bias
- **Citation 1:** FlashAvatar (CVPR 2024) — expression broadcast concat
- **Citation 2:** Gaussian Head Avatar (CVPR 2024) — expression-conditioned super-resolution decoder
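The FiLM option above can be sketched in a few lines of numpy: a linear head maps FLAME expression parameters to per-channel scale and bias for an intermediate feature map. The channel/expression dimensions and random weights are purely illustrative:

```python
import numpy as np

def film(features: np.ndarray, expr: np.ndarray,
         w_scale: np.ndarray, w_bias: np.ndarray) -> np.ndarray:
    """FiLM: modulate conv features (C, H, W) with an expression-dependent
    per-channel scale and bias predicted from FLAME params (E,)."""
    gamma = 1.0 + w_scale @ expr      # (C,) scale, identity at zero expr
    beta = w_bias @ expr              # (C,) bias
    return features * gamma[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(0)
C, E = 64, 100                        # feature channels, FLAME expr dims
feats = rng.standard_normal((C, 32, 32))
expr = rng.standard_normal(E)
out = film(feats, expr,
           w_scale=0.01 * rng.standard_normal((C, E)),
           w_bias=0.01 * rng.standard_normal((C, E)))
```

The `1.0 +` on the scale keeps the modulation near identity at initialization, so the decoder starts from its unconditioned behavior.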
- NPGA does NOT condition decoder on view direction — but they have 15 cameras
- With 2 cameras, view conditioning may help generalize between views
- Test: add 27-dim positional-encoded view direction (47K extra params)
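The 27-dim figure is consistent with the raw 3-vector plus 4 sin/cos frequency bands (3 + 3·2·4 = 27); a sketch of that encoding, assuming standard NeRF-style octave frequencies:

```python
import numpy as np

def encode_view(d: np.ndarray, n_freqs: int = 4) -> np.ndarray:
    """Positional-encode a view direction: raw xyz plus sin/cos at
    n_freqs octaves → 3 + 3*2*n_freqs dims (27 for n_freqs=4)."""
    d = d / np.linalg.norm(d)
    out = [d]
    for k in range(n_freqs):
        out.append(np.sin(2.0 ** k * np.pi * d))
        out.append(np.cos(2.0 ** k * np.pi * d))
    return np.concatenate(out)

pe = encode_view(np.array([0.0, 0.0, 1.0]))   # shape (27,)
```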
- **GO to Phase 3:** PSNR ≥ 33 dB avg20, AND visual: sharp skin texture, correct wrinkles, clean teeth/lips
- **ITERATE Phase 2:** PSNR 31-33 dB — tune multi-session sampling, try different decoder configs
- **PUSH HARDER:** If multi-session data shows strong gains, invest in additional capture sessions (more expressions, lighting conditions)
PHASE 3: Super-Resolution + Final Quality Push
33-34 dB → 35+ dB at 4K · 1-2 weeks · Upcoming
- Parameters: 0.4-0.7M
- Inference: 1-3ms on RTX 5080 (TensorRT FP16)
- VRAM: <500MB additional
- Training: Joint with Gaussian pipeline OR separate on face crops
- Approach: Render → SR

| Pipeline | Render | SR | Total Latency | Quality |
|---|---|---|---|---|
| 960x540 + 4x SR | ~2-3ms | ~1-2ms | **3-5ms** | Good |
| 1920x1080 + 2x SR | ~8-12ms | ~0.5-1ms | **8.5-13ms** | Better |
| 1440x810 + ~2.7x SR | ~4-6ms | ~1ms | **5-7ms** | Balanced |
- 3DGS Blendshapes hit 39.6 dB self-reenactment (SIGGRAPH 2024) — the representation CAN get there
- GaussianStyle hit 34.43 dB from mono with just a StyleGAN decoder (2024) — decoder quality alone approaches 35
- We'll have: better data (stereo, multi-session) + better decoder (CNN with expression FiLM) + better losses (full frequency stack) + better expression conditioning (per-patch) + stereo depth. Each is proven. The combination is novel.
- **GO to Phase 4:** PSNR ≥ 35 dB at training resolution, AND 4K output at <10ms, AND visual quality passes arm's-length mirror test
- **ITERATE:** If 33-34 dB, the visual quality with adversarial + 4K SR may already be sufficient — visual verification is ground truth
- **INVESTIGATE:** If stuck at 33 dB, research additional capture sessions (different lighting, more extreme expressions) or pseudo-view augmentation (GAF-style, +3.4 dB for novel views)
PHASE 4: Novel View Synthesis for Mirror Effect
1-2 weeks · Upcoming
- Camera tracks face position and pose (already have: FLAME fitting)
- Compute virtual mirror viewpoint (reflected camera across display plane)
- Render Gaussians from mirror viewpoint (gsplat supports arbitrary viewmat)
- Gaze correction so eyes "meet" in the mirror
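Computing the virtual mirror viewpoint is a Householder reflection across the display plane. A sketch, assuming the plane is given by a unit normal `n` and a point `p` on it (only the position reflection is shown; the view matrix's rotation must be reflected the same way before being passed to the renderer):

```python
import numpy as np

def mirror_camera(cam_pos: np.ndarray, n: np.ndarray,
                  p: np.ndarray) -> np.ndarray:
    """Reflect a camera position across the display plane defined by
    unit normal `n` and a point `p` on the plane."""
    n = n / np.linalg.norm(n)
    d = np.dot(cam_pos - p, n)        # signed distance to the plane
    return cam_pos - 2.0 * d * n      # mirrored viewpoint

# Display plane z = 0 with normal +z: a camera 0.5m in front of the
# screen maps to a virtual camera 0.5m behind it.
pos = np.array([0.1, 0.0, 0.5])
mirrored = mirror_camera(pos, np.array([0.0, 0.0, 1.0]), np.zeros(3))
```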
- **SHIP:** Mirror viewpoint render quality matches training-view quality (no visible artifacts)
- **ITERATE:** If novel-view artifacts are visible, consider GAF-style view augmentation or increase camera count
PHASE 5: Per-Patient Fast Fitting
2-4 weeks · Upcoming
- 2-3 minute guided expression session (talk, smile, neutral, exaggerated)
- Dual 4K@60fps cameras capture ~3600-5400 frames
- Filter for quality → 500-1000 clean frames
- FLAME fitting in parallel during capture
PHASE 6: Reflect Face Model
Ongoing after Phase 5 · Upcoming
Ruled out by ablation and prior-work evidence:
- **Total Variation loss** — proven inferior to GV loss in all configurations
- **DISTS replacing LPIPS** — no evidence of improvement for face avatars
- **FFL alpha > 1.0** — degrades quality per ablation (FID rises to 102.3)
- **Adversarial loss from step 0** — destabilizes per-subject optimization
- **L1-only or MSE-only** — every 30+ dB method uses multi-term loss
Super-Resolution Decoder Candidates
| Method | Approach | Latency | Params |
|---|---|---|---|
| URPNet (AIS 2024) | Reparam CNN, 4x (540→4K) | 1.04ms (RTX 4090) | 0.62M |
| RepTCN (AIS 2024) | Reparam CNN, 4x (540→4K) | 1.0ms (RTX 4090) | 0.69M |
| EMSR (NTIRE 2025) | ConvLora + distill, 4x | <10ms | 0.131M |
| SPAN (2024) | Param-free attention, 4x | 7.08ms | ~0.5M |
| Gaussian Head Avatar | Bilinear + CNN, 4x (512→2048) | ~15ms total | 3.1M |
| NSRD (CVPR 2024) | Radiance demodulation, 4x | 12.41ms (TRT FP16) | 1.61M |
| SqueezeMe (SIGGRAPH 2025) | Distilled linear, UV SR | 0.45ms (Quest 3) | ~60K |
Key Research References
| Paper | Venue | Key Contribution | PSNR | Cameras |
|---|---|---|---|---|
| NPGA (Kirschstein et al.) | SIGGRAPH Asia 2024 | Per-Gaussian latent features + CNN, multi-view | 37.68 NVS | 15 |
| ScaffoldAvatar (Disney) | SIGGRAPH 2025 | Per-patch expression MLPs | 37.03 self | 16 |
| GaussianAvatars (Qian et al.) | CVPR 2024 | Triangle binding, position/scale reg | 31.60 NVS | 16 |
| Gaussian Head Avatar (Xu et al.) | CVPR 2024 | 32-dim features + U-Net SR decoder | ~28 self | 16 |
| FlashAvatar (Xiang et al.) | CVPR 2024 | UV Gaussians, monocular, fast | 32.33 self | 1 |
| GeoAvatar | ICCV 2025 | Adaptive rigid/flexible densification | 32.70 self | 1 |
| RGBAvatar | CVPR 2025 | MLP→K=20 blendshapes, 80s fitting | 33.89 self | 1 |
| 3DGS Blendshapes | SIGGRAPH 2024 | Linear blendshape basis | 33-39.6 self | 1-16 |
| MeGA | CVPR 2025 | Hybrid mesh-Gaussian, UV decoders | 34.11 NVS | 16 |
| GaussianStyle (Abdal et al.) | 2024 | StyleGAN decoder | 34.43 self | 1 |
| TexAvatars | 2024 | Neural texture Gaussians | 35.15 NVS | 16 |
| TeGA | SIGGRAPH 2025 | UV-space U-Net, 4M Gaussians | 24.4 (4K res) | 13 |
| MonoGaussianAvatar | SIGGRAPH 2024 | Monocular Gaussian deformation | 27-32.5 self | 1 |
| HeadGAP | 3DV 2025 | Few-shot Gaussian priors + CNN | 22.87 self | few |