ReflectIris

Photorealism Roadmap V3

Research-grounded path: 29 dB → 35+ dB. Every decision has 2+ paper citations. Every dB estimate references an ablation study.

Roadmap: ROADMAP_V3.md | Status: Phase 2 (auto-detected from fleet: PSNR 37.99 dB, LPIPS 0.0464)

Overall progress: 1/6 phases · 25%
Best PSNR: 37.99 dB
Best LPIPS: 0.0464
Best SSIM: 0.8808
Updated: 2026-04-05T15:27:46Z

Quality Ladder

| Milestone | PSNR | SSIM | LPIPS | What It Looks Like |
|---|---|---|---|---|
| Current | 29.2 dB | ~0.88 | ~0.06 | Good face shape, soft details, weak expression |
| Phase 1 | 31-32 dB | 0.92 | 0.04 | Sharp skin, accurate expressions, visible pores |
| Phase 2 | 33-34 dB | 0.95 | 0.03 | Photorealistic face, correct specular/wrinkles |
| Phase 3 | 35+ dB | 0.97 | 0.02 | Indistinguishable from mirror at arm's length |
PHASE 1: Expression Appearance + Optimization
29 → 31-32 dB · 1-2 weeks · Completed
  • Freeze base Gaussian optimization for first 5K steps of expression conditioning (NPGA pattern)
  • Expression MLP learning rate: 4e-5 (NPGA) to 5e-5 (our proven DeformMLP rate)
  • Add Laplacian smoothness regularization on per-Gaussian features (NPGA: prevents overfitting)
  • Oversample rare expressions in training data (MonoGaussianAvatar: prevents ignoring extreme expressions)
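Of the tricks above, the Laplacian smoothness regularizer on per-Gaussian features can be sketched as below. This is a minimal illustration, not the NPGA implementation: neighbors come from a brute-force kNN over Gaussian positions, and `k` is an assumed hyperparameter.

```python
# Sketch: kNN-graph Laplacian smoothness on per-Gaussian latent features.
# Penalizes each Gaussian's feature deviating from its neighbors' mean,
# which discourages high-frequency overfitting in the feature field.
import torch

def knn_indices(positions: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Indices of the k nearest Gaussians for each Gaussian (brute force)."""
    d = torch.cdist(positions, positions)    # (N, N) pairwise distances
    d.fill_diagonal_(float("inf"))           # exclude self-matches
    return d.topk(k, largest=False).indices  # (N, k) neighbor indices

def laplacian_smoothness(features: torch.Tensor, nbrs: torch.Tensor) -> torch.Tensor:
    """Mean squared difference between each feature and its neighbors' average."""
    neighbor_mean = features[nbrs].mean(dim=1)  # (N, F)
    return ((features - neighbor_mean) ** 2).mean()
```

The neighbor graph can be rebuilt only occasionally (e.g. after densification), since Gaussian positions move slowly relative to the feature field.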
  • Start LPIPS at step 0 but with weight 0.0
  • Ramp exponentially from step 50K to weight 0.1 by step 100K
  • Rationale: early LPIPS fights Gaussian optimization; deferred LPIPS refines appearance after geometry converges
  • **Citation:** TeGA (SIGGRAPH 2025) defers VGG loss to 50% of training with exponential ramp to 0.1
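The deferred-LPIPS schedule above can be sketched as a small weight function. The exponential (geometric) ramp shape and the `w_min` floor are assumptions; the source only fixes the endpoints (weight 0 before step 50K, 0.1 at step 100K).

```python
# Sketch of a TeGA-style deferred LPIPS weight: zero until `start`,
# then a geometric ramp reaching `w_max` at `end`.
def lpips_weight(step: int, start: int = 50_000, end: int = 100_000,
                 w_max: float = 0.1, w_min: float = 1e-4) -> float:
    if step < start:
        return 0.0
    if step >= end:
        return w_max
    t = (step - start) / (end - start)   # 0..1 progress through the ramp
    return w_min * (w_max / w_min) ** t  # geometric interpolation
```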
  • Replace simple Sobel L1 with patch-variance gradient matching
  • **Citation:** "Gradient Variance Loss for Structure-Enhanced Image Super-Resolution" (2022) — GV loss outperforms TV loss in all configurations, +0.5 dB over L2-only on DIV2K
  • Weight: λ=0.1 (same as current edge loss)
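The patch-variance gradient matching above can be sketched as follows. This is an illustration of the GV-loss idea rather than the paper's exact code: Sobel gradient maps are split into n x n patches and the per-patch gradient variances of prediction and target are matched with L1. The patch size `n=8` and grayscale input are assumptions; the λ=0.1 weight from the bullet would multiply this term.

```python
# Sketch: gradient-variance (GV) loss via Sobel maps and per-patch variance.
import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def _patch_var(g: torch.Tensor, n: int) -> torch.Tensor:
    patches = F.unfold(g, kernel_size=n, stride=n)  # (B, n*n, num_patches)
    return patches.var(dim=1, unbiased=False)       # variance per patch

def gv_loss(pred: torch.Tensor, target: torch.Tensor, n: int = 8) -> torch.Tensor:
    """pred/target: (B, 1, H, W) grayscale; H and W divisible by n."""
    loss = 0.0
    for k in (_SOBEL_X, _SOBEL_Y):
        gp = F.conv2d(pred, k, padding=1)
        gt = F.conv2d(target, k, padding=1)
        loss = loss + (_patch_var(gp, n) - _patch_var(gt, n)).abs().mean()
    return loss
```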
  • **GO to Phase 2:** PSNR ≥ 31.0 dB avg20, AND visual improvement in expression tracking (wrinkle detail, mouth interior)
  • **ITERATE Phase 1:** PSNR 30-31 dB — tune expression conditioning, try regional approach
  • **REASSESS:** PSNR < 30 dB — expression conditioning may need different architecture
PHASE 2: Multi-Session Data + Decoder Upgrade
31-32 → 33-34 dB · 1-2 weeks · In Progress
  • More expression diversity → better expression conditioning generalization
  • More viewpoint variation (natural head movement) → better novel-view synthesis
  • More lighting variation across frames → more robust appearance model
  • More data = more gradient signal = better optimization landscape
  • FLAME-fit both sessions (session 1: already done, session 2: needs processing)
  • Retractor-filter session 2 (apply same allowlist guard from session 1)
  • Joint training on combined dataset with session-aware frame sampling
  • Expected clean frames: 751 (session 1) + estimated 5000-15000 (session 2, pending filtering) = 6000-16000 total
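One way the session-aware frame sampling above could work is to weight each frame inversely by its session size, so a 751-frame session and a much larger one are drawn with equal probability. This is a sketch of an assumed scheme, not the pipeline's actual sampler.

```python
# Sketch: session-balanced frame sampling weights. Each session's weights
# sum to 1.0, so sessions are drawn uniformly regardless of frame count.
import random

def session_balanced_weights(session_ids: list[int]) -> list[float]:
    counts: dict[int, int] = {}
    for s in session_ids:
        counts[s] = counts.get(s, 0) + 1
    return [1.0 / counts[s] for s in session_ids]

# Usage: feed the weights to random.choices (or a WeightedRandomSampler).
session_ids = [1] * 751 + [2] * 8000
weights = session_balanced_weights(session_ids)
batch = random.choices(range(len(session_ids)), weights=weights, k=16)
```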
  • HeadGAP's 6-layer CNN showed the strongest ablation gain: **+0.86 dB PSNR** (3DV 2025)
  • Cost: ~18K additional params, ~0.3ms latency
  • Our 5-layer (2+2+1) design benefits from one more conv in Block 3
  • Broadcast FLAME expression params as spatial features, concat with neural features before first conv
  • Or: FiLM conditioning — modulate intermediate conv features with expression-dependent scale/bias
  • **Citation 1:** FlashAvatar (CVPR 2024) — expression broadcast concat
  • **Citation 2:** Gaussian Head Avatar (CVPR 2024) — expression-conditioned super-resolution decoder
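The FiLM option above can be sketched as a small module that maps FLAME expression parameters to per-channel scale/bias for an intermediate conv feature map. Layer sizes are illustrative, not the proven decoder config; the zero-init (so FiLM starts as an identity) is an assumed stabilizer.

```python
# Sketch: FiLM conditioning of decoder conv features on FLAME expression.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, expr_dim: int = 100, channels: int = 64):
        super().__init__()
        self.mlp = nn.Linear(expr_dim, 2 * channels)  # -> (gamma, beta)
        nn.init.zeros_(self.mlp.weight)  # zero-init: gamma = beta = 0,
        nn.init.zeros_(self.mlp.bias)    # so the module starts as identity

    def forward(self, feat: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.mlp(expr).chunk(2, dim=-1)  # (B, C) each
        gamma = gamma[..., None, None]                 # broadcast to (B, C, 1, 1)
        beta = beta[..., None, None]
        return (1 + gamma) * feat + beta
```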
  • NPGA does NOT condition decoder on view direction — but they have 15 cameras
  • With 2 cameras, view conditioning may help generalize between views
  • Test: add 27-dim positional-encoded view direction (47K extra params)
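The 27-dim encoding above is consistent with a standard NeRF-style positional encoding of the 3-vector at 4 octaves (3 + 3·2·4 = 27); the exact frequency layout below is an assumption.

```python
# Sketch: positional encoding of a view direction to 27 dims
# (raw xyz plus sin/cos at 4 frequency octaves).
import torch

def encode_view_dir(d: torch.Tensor, n_freqs: int = 4) -> torch.Tensor:
    """d: (..., 3) unit view directions -> (..., 3 + 6*n_freqs) features."""
    feats = [d]
    for i in range(n_freqs):
        feats.append(torch.sin((2.0 ** i) * torch.pi * d))
        feats.append(torch.cos((2.0 ** i) * torch.pi * d))
    return torch.cat(feats, dim=-1)
```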
  • **GO to Phase 3:** PSNR ≥ 33 dB avg20, AND visual: sharp skin texture, correct wrinkles, clean teeth/lips
  • **ITERATE Phase 2:** PSNR 31-33 dB — tune multi-session sampling, try different decoder configs
  • **PUSH HARDER:** If multi-session data shows strong gains, invest in additional capture sessions (more expressions, lighting conditions)
PHASE 3: Super-Resolution + Final Quality Push
33-34 dB → 35+ dB at 4K · 1-2 weeks · Upcoming
  • Parameters: 0.4-0.7M
  • Inference: 1-3ms on RTX 5080 (TensorRT FP16)
  • VRAM: <500MB additional
  • Training: Joint with Gaussian pipeline OR separate on face crops
  • Latency options (render + SR):

| Approach | Render | SR | Total Latency | Quality |
|---|---|---|---|---|
| 960x540 + 4x SR | ~2-3ms | ~1-2ms | **3-5ms** | Good |
| 1920x1080 + 2x SR | ~8-12ms | ~0.5-1ms | **8.5-13ms** | Better |
| 1440x810 + ~2.7x SR | ~4-6ms | ~1ms | **5-7ms** | Balanced |
  • 3DGS Blendshapes hit 39.6 dB self-reenactment (SIGGRAPH 2024) — the representation CAN get there
  • GaussianStyle hit 34.43 dB from mono with just a StyleGAN decoder (2024) — decoder quality alone approaches 35
  • We'll have: better data (stereo, multi-session) + better decoder (CNN with expression FiLM) + better losses (full frequency stack) + better expression conditioning (per-patch) + stereo depth. Each is proven. The combination is novel.
  • **GO to Phase 4:** PSNR ≥ 35 dB at training resolution, AND 4K output at <10ms, AND visual quality passes arm's-length mirror test
  • **ITERATE:** If 33-34 dB, the visual quality with adversarial + 4K SR may already be sufficient — visual verification is ground truth
  • **INVESTIGATE:** If stuck at 33 dB, research additional capture sessions (different lighting, more extreme expressions) or pseudo-view augmentation (GAF-style, +3.4 dB for novel views)
PHASE 4: Novel View Synthesis for Mirror Effect
1-2 weeks · Upcoming
  • Camera tracks face position and pose (already have: FLAME fitting)
  • Compute virtual mirror viewpoint (reflected camera across display plane)
  • Render Gaussians from mirror viewpoint (gsplat supports arbitrary viewmat)
  • Gaze correction so eyes "meet" in the mirror
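The mirror-viewpoint step above is a plane reflection: reflect the tracked camera across the display plane with a Householder matrix R = I − 2nnᵀ. A minimal sketch, assuming the plane is given by a unit normal `n` and a point `p` on the display:

```python
# Sketch: virtual mirror camera via Householder reflection across a plane.
import numpy as np

def mirror_camera(cam_pos: np.ndarray, cam_fwd: np.ndarray,
                  n: np.ndarray, p: np.ndarray):
    """Reflect camera position and forward axis across the plane (n, p)."""
    R = np.eye(3) - 2.0 * np.outer(n, n)  # Householder reflection matrix
    pos = p + R @ (cam_pos - p)           # reflect position about the plane
    fwd = R @ cam_fwd                     # reflect the view direction
    return pos, fwd
```

Note that a reflection flips handedness: when assembling the full viewmat for the renderer, the up/right axes must be reflected too, and the flipped triangle winding (if any culling is active) handled accordingly.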
  • **SHIP:** Mirror viewpoint render quality matches training-view quality (no visible artifacts)
  • **ITERATE:** If novel-view artifacts are visible, consider GAF-style view augmentation or increase camera count
PHASE 5: Per-Patient Fast Fitting
2-4 weeks · Upcoming
  • 2-3 minute guided expression session (talk, smile, neutral, exaggerated)
  • Dual 4K@60fps cameras capture ~3600-5400 frames
  • Filter for quality → 500-1000 clean frames
  • FLAME fitting in parallel during capture
PHASE 6: Reflect Face Model
Ongoing after Phase 5 · Upcoming
Ruled out (ablation evidence against each):
  • **Total Variation loss** — proven inferior to GV loss in all configurations
  • **DISTS replacing LPIPS** — no evidence of improvement for face avatars
  • **FFL alpha > 1.0** — degrades quality per ablation (FID rises to 102.3)
  • **Adversarial loss from step 0** — destabilizes per-subject optimization
  • **L1-only or MSE-only** — every 30+ dB method uses multi-term loss
Super-resolution decoder candidates:

| Model | Method | Scale | Latency | Params |
|---|---|---|---|---|
| URPNet (AIS 2024) | Reparam CNN | 4x (540→4K) | 1.04ms (RTX 4090) | 0.62M |
| RepTCN (AIS 2024) | Reparam CNN | 4x (540→4K) | 1.0ms (RTX 4090) | 0.69M |
| EMSR (NTIRE 2025) | ConvLora + distill | 4x | <10ms | 0.131M |
| SPAN (2024) | Param-free attention | 4x | 7.08ms | ~0.5M |
| Gaussian Head Avatar | Bilinear + CNN | 4x (512→2048) | ~15ms total | 3.1M |
| NSRD (CVPR 2024) | Radiance demodulation | 4x | 12.41ms (TRT FP16) | 1.61M |
| SqueezeMe (SIGGRAPH 2025) | Distilled linear | UV SR | 0.45ms (Quest 3) | ~60K |

Key Research References

| Paper | Venue | Key Contribution | PSNR | Cams |
|---|---|---|---|---|
| NPGA (Kirschstein et al.) | SIGGRAPH Asia 2024 | Per-Gaussian latent features + CNN, multi-view | 37.68 NVS | 15 |
| ScaffoldAvatar (Disney) | SIGGRAPH 2025 | Per-patch expression MLPs | 37.03 self | 16 |
| GaussianAvatars (Qian et al.) | CVPR 2024 | Triangle binding, position/scale reg | 31.60 NVS | 16 |
| Gaussian Head Avatar (Xu et al.) | CVPR 2024 | 32-dim features + U-Net SR decoder | ~28 self | 16 |
| FlashAvatar (Xiang et al.) | CVPR 2024 | UV Gaussians, monocular, fast | 32.33 self | 1 |
| GeoAvatar | ICCV 2025 | Adaptive rigid/flexible densification | 32.70 self | 1 |
| RGBAvatar | CVPR 2025 | MLP→K=20 blendshapes, 80s fitting | 33.89 self | 1 |
| 3DGS Blendshapes | SIGGRAPH 2024 | Linear blendshape basis | 33-39.6 self | 1-16 |
| MeGA | CVPR 2025 | Hybrid mesh-Gaussian, UV decoders | 34.11 NVS | 16 |
| GaussianStyle (Abdal et al.) | 2024 | StyleGAN decoder | 34.43 self | 1 |
| TexAvatars | 2024 | Neural texture Gaussians | 35.15 NVS | 16 |
| TeGA | SIGGRAPH 2025 | UV-space U-Net, 4M Gaussians | 24.4 (4K res) | 13 |
| MonoGaussianAvatar | SIGGRAPH 2024 | Monocular Gaussian deformation | 27-32.5 self | 1 |
| HeadGAP | 3DV 2025 | Few-shot Gaussian priors + CNN | 22.87 self | few |