Network Structure
If we extract features from each frame of the driving video and feed them to the HMControlModule, we can generate a video, but flickering may occur between frames. Introducing the AnimateDiff motion module improves the continuity of the generated video, albeit at the expense of fidelity. To address this, we further fine-tuned the AnimateDiff module, ultimately improving both continuity and fidelity.
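As a minimal sketch of this frame-by-frame conditioning, assuming a HelloMeme-style pipeline object; the helper names (`extract_control_features`, `control_features`, `reference_image`) are illustrative placeholders, not the project's actual API:

```python
import imageio.v3 as iio

def animate_from_driving_video(pipeline, reference_image, driving_video_path):
    # Decode the driving video into RGB frames.
    driving_frames = list(iio.imiter(driving_video_path))

    # Hypothetical helper: per-frame head pose / expression features that
    # serve as the HMControlModule input conditions.
    conditions = [pipeline.extract_control_features(f) for f in driving_frames]

    # Hypothetical call: the reference image supplies identity via the
    # HMReferenceModule; generating all frames in one pass lets the
    # fine-tuned motion module smooth the sequence instead of flickering.
    return pipeline(reference_image=reference_image,
                    control_features=conditions,
                    num_inference_steps=25)
```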
The input conditions for the HMControlModule can also be generated by a head model rigged with the ARKit Face Blendshapes, so ARKit blendshape values can be used to control the generated facial expressions.
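For illustration, expression control through ARKit blendshapes might look like the sketch below; the rigged head renderer and the pipeline arguments are hypothetical placeholders, and only the blendshape key names follow Apple's ARKit naming:

```python
def expression_frame(pipeline, head_model, reference_image, head_pose):
    # ARKit blendshape values in [0, 1]; keys follow Apple's
    # ARFaceAnchor.BlendShapeLocation naming.
    blendshapes = {
        "jawOpen": 0.6,
        "mouthSmileLeft": 0.8,
        "mouthSmileRight": 0.4,
        "eyeBlinkLeft": 1.0,
    }
    # Hypothetical: pose the rigged head model and rasterize it into the
    # condition image consumed by the HMControlModule.
    condition_image = head_model.render(blendshapes, head_pose=head_pose)
    # The rendered condition stands in for features extracted from a driving
    # frame, so expressions can be authored directly instead of captured.
    return pipeline(reference_image=reference_image,
                    control_image=condition_image,
                    num_inference_steps=25)
```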
Our framework is a hot-swappable adapter built on SD1.5 that does not compromise the generalization capability of the underlying T2I model. Consequently, any stylization model developed on the SD1.5 foundation can be seamlessly integrated with our solution.
An unexpected benefit is that, due to the Fidelity-Rich Conditions introduced by the HMReferenceModule, we can achieve high-fidelity results with fewer sampling steps.
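As a sketch of how the adapter could sit on top of a stylized base model, the snippet below uses the real diffusers `StableDiffusionPipeline.from_pretrained` loader, while `adapter_loader` and the final pipeline call signature are hypothetical placeholders for however the HelloMeme modules are actually attached:

```python
import torch
from diffusers import StableDiffusionPipeline

def build_stylized_pipeline(sd15_checkpoint, adapter_loader):
    # Any SD1.5-derived stylization checkpoint can serve as the base T2I
    # model; its own weights are loaded untouched.
    base = StableDiffusionPipeline.from_pretrained(
        sd15_checkpoint, torch_dtype=torch.float16
    ).to("cuda")
    # `adapter_loader` is a placeholder for however the hot-swappable
    # HelloMeme modules (HMReferenceModule, HMControlModule, motion module)
    # are bolted onto the base UNet.
    return adapter_loader(base)

def animate(pipeline, reference_image, conditions):
    # The fidelity-rich reference conditions injected by the HMReferenceModule
    # allow a lower step count than plain SD1.5 sampling would need.
    return pipeline(reference_image=reference_image,
                    control_features=conditions,
                    num_inference_steps=15)
```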
Citation

@misc{zhang2024hellomemeintegratingspatialknitting,
  title={HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models},
  author={Shengkai Zhang and Nianhong Jiao and Tian Li and Chaojie Yang and Chenhui Xue and Boya Niu and Jun Gao},
  year={2024},
  eprint={2410.22901},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.22901},
}