Network Structure
If we extract features from each frame of the driving video and feed them to the HMControlModule, we can generate a video, but flickering may occur between frames. Introducing the AnimateDiff motion module improves the continuity of the generated video, albeit at the expense of fidelity. To address this, we further fine-tuned the AnimateDiff module, ultimately improving both continuity and fidelity.
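As a minimal sketch of this frame-by-frame conditioning, assuming a HelloMeme-style pipeline object; the helper names (`extract_control_features`, `control_features`, `reference_image`) are illustrative placeholders, not the project's actual API:

```python
import imageio.v3 as iio

def animate_from_driving_video(pipeline, reference_image, driving_video_path):
    # Decode the driving video into RGB frames.
    driving_frames = list(iio.imiter(driving_video_path))

    # Hypothetical helper: per-frame head pose / expression features that
    # serve as the HMControlModule input conditions.
    conditions = [pipeline.extract_control_features(f) for f in driving_frames]

    # Hypothetical call: the reference image supplies identity via the
    # HMReferenceModule; generating all frames in one pass lets the
    # fine-tuned motion module smooth the sequence instead of flickering.
    return pipeline(reference_image=reference_image,
                    control_features=conditions,
                    num_inference_steps=25)
```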
The input conditions for the HMControlModule can also be generated by a head model rigged with the ARKit Face Blendshapes, so ARKit blendshape values can be used to control the generated facial expressions.
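For illustration, expression control through ARKit blendshapes might look like the sketch below; the rigged head renderer and the pipeline arguments are hypothetical placeholders, and only the blendshape key names follow Apple's ARKit naming:

```python
def expression_frame(pipeline, head_model, reference_image, head_pose):
    # ARKit blendshape values in [0, 1]; keys follow Apple's
    # ARFaceAnchor.BlendShapeLocation naming.
    blendshapes = {
        "jawOpen": 0.6,
        "mouthSmileLeft": 0.8,
        "mouthSmileRight": 0.4,
        "eyeBlinkLeft": 1.0,
    }
    # Hypothetical: pose the rigged head model and rasterize it into the
    # condition image consumed by the HMControlModule.
    condition_image = head_model.render(blendshapes, head_pose=head_pose)
    # The rendered condition stands in for features extracted from a driving
    # frame, so expressions can be authored directly instead of captured.
    return pipeline(reference_image=reference_image,
                    control_image=condition_image,
                    num_inference_steps=25)
```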
Our framework is a hot-swappable adapter built on SD1.5 that does not compromise the generalization capability of the underlying T2I model. Consequently, any stylization model developed on the SD1.5 foundation can be seamlessly integrated with our solution.
An unexpected benefit is that, due to the Fidelity-Rich Conditions introduced by the HMReferenceModule, we can achieve high-fidelity results with fewer sampling steps.
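As a sketch of how the adapter could sit on top of a stylized base model, the snippet below uses the real diffusers `StableDiffusionPipeline.from_pretrained` loader, while `adapter_loader` and the final pipeline call signature are hypothetical placeholders for however the HelloMeme modules are actually attached:

```python
import torch
from diffusers import StableDiffusionPipeline

def build_stylized_pipeline(sd15_checkpoint, adapter_loader):
    # Any SD1.5-derived stylization checkpoint can serve as the base T2I
    # model; its own weights are loaded untouched.
    base = StableDiffusionPipeline.from_pretrained(
        sd15_checkpoint, torch_dtype=torch.float16
    ).to("cuda")
    # `adapter_loader` is a placeholder for however the hot-swappable
    # HelloMeme modules (HMReferenceModule, HMControlModule, motion module)
    # are bolted onto the base UNet.
    return adapter_loader(base)

def animate(pipeline, reference_image, conditions):
    # The fidelity-rich reference conditions injected by the HMReferenceModule
    # allow a lower step count than plain SD1.5 sampling would need.
    return pipeline(reference_image=reference_image,
                    control_features=conditions,
                    num_inference_steps=15)
```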
Citation

@misc{zhang2024hellomemeintegratingspatialknitting,
  title={HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models},
  author={Shengkai Zhang and Nianhong Jiao and Tian Li and Chaojie Yang and Chenhui Xue and Boya Niu and Jun Gao},
  year={2024},
  eprint={2410.22901},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.22901},
}