HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

HelloVision | HelloGroup Inc.

Method

Network Structure

Image Generation

Motion Module

If we extract features from each frame of the driving video and feed them to the HMControlModule, we can already generate a video, but flickering may occur between frames. Introducing the AnimateDiff motion module improves the temporal continuity of the generated video, albeit at the expense of fidelity. We therefore fine-tuned the AnimateDiff module further, ultimately improving both continuity and fidelity.
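
A minimal sketch of the two generation modes described above; the module and helper names (HMPipeline, HMControlModule, HMReferenceModule, load_video, load_image, enable_motion_module) and the checkpoint id are assumptions for illustration and may not match the released code:

# Hypothetical sketch of per-frame vs. motion-module generation.
from hellomeme import HMPipeline, HMControlModule, HMReferenceModule  # assumed import path

pipe = HMPipeline.from_pretrained("songkey/hellomeme")   # assumed checkpoint id
control = HMControlModule()        # encodes per-frame head pose / expression features
reference = HMReferenceModule()    # injects fidelity-rich features from the reference image

driving_frames = load_video("driving.mp4")   # assumed helper
ref_image = load_image("reference.png")      # assumed helper

# Mode 1: per-frame generation. Each frame is denoised independently,
# so there is no temporal constraint and the result may flicker.
frames_naive = [
    pipe(reference=reference(ref_image), control=control(frame))
    for frame in driving_frames
]

# Mode 2: with the fine-tuned AnimateDiff motion module enabled, frames are
# denoised jointly and temporal attention smooths inter-frame changes.
pipe.enable_motion_module("animatediff-finetuned")   # assumed toggle
frames_smooth = pipe(
    reference=reference(ref_image),
    control=[control(frame) for frame in driving_frames],
)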

Expression Editing

The input conditions for the HMControlModule can be generated by a head model rigged with ARKit Face Blendshapes, so ARKit blendshape values can be used to control the facial expressions in the generated frames.
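
Continuing the hypothetical sketch above, expression editing only changes where the control condition comes from: instead of features extracted from a driving video, the condition is rendered from a head model driven by ARKit blendshape values. ARKIT_BLENDSHAPE_NAMES and render_head_condition below are assumed helpers, not part of the released code:

# Hypothetical: drive HMControlModule from ARKit blendshape values.
blendshapes = {name: 0.0 for name in ARKIT_BLENDSHAPE_NAMES}   # the 52 ARKit face blendshapes
blendshapes["jawOpen"] = 0.6        # open the mouth
blendshapes["browInnerUp"] = 0.8    # raise the inner brows

condition = render_head_condition(blendshapes)   # render the rigged head model to a condition map
image = pipe(reference=reference(ref_image), control=control(condition))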

With SD1.5-Based LoRA or Checkpoint

Our framework is a hot-swappable adapter built on SD1.5 that does not compromise the generalization capability of the base T2I model. Consequently, any stylization model developed on the SD1.5 foundation can be seamlessly integrated with our solution.
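
As an illustration of this hot-swappable design, the backbone can be any SD1.5-family checkpoint or LoRA loaded through the usual diffusers workflow; only the final attach_hellomeme_adapters call is a hypothetical placeholder for how our adapters would be attached:

import torch
from diffusers import StableDiffusionPipeline

# Any SD1.5-family checkpoint can serve as the backbone; this id is one example.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Community style LoRAs trained on SD1.5 load as usual.
pipe.load_lora_weights("path/to/sd15_style_lora")   # placeholder path

# Hypothetical: attach the HelloMeme adapters on top of the stylized backbone.
# pipe = attach_hellomeme_adapters(pipe)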

With LCM

An unexpected benefit is that, due to the Fidelity-Rich Conditions introduced by the HMReferenceModule, we can achieve high-fidelity results with fewer sampling steps.
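
For example, the standard LCM-LoRA recipe for SD1.5 drops the step count to 4-8; the attach_hellomeme_adapters call remains a hypothetical placeholder as above:

import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")   # official LCM-LoRA for SD1.5

# pipe = attach_hellomeme_adapters(pipe)   # hypothetical, as above

# With the fidelity-rich reference conditions, 4-8 steps can already preserve
# identity and detail; a guidance_scale around 1.0 is typical for LCM.
image = pipe("a portrait photo", num_inference_steps=4, guidance_scale=1.0).images[0]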

Comparison with Other Methods

BibTeX

@misc{zhang2024hellomemeintegratingspatialknitting,
title={HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models}, 
author={Shengkai Zhang and Nianhong Jiao and Tian Li and Chaojie Yang and Chenhui Xue and Boya Niu and Jun Gao},
year={2024},
eprint={2410.22901},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.22901}, 
}