ControlNetMediaPipeFace
The ControlNetMediaPipeFace model is a ControlNet trained on human facial expressions, with keypoints for pupils so that gaze direction and mouth poses can be controlled. It tracks gaze and mouth poses well, though it may still ignore controls in some cases. It is intended for use with Stable Diffusion 2.1 base, and can also be used with other diffusion models. Trained for 200 hours on an A6000, it produces high-quality images with accurate facial expressions and gaze direction, and it can handle images containing multiple faces, which makes it useful for a range of applications.
Model Overview
The ControlNet LAION Face Dataset model is a powerful tool for generating images of faces with specific expressions and gaze directions. It’s designed to work with Stable Diffusion models, like Stable Diffusion v2.1 and v1.5.
Capabilities
- Face Detection: The model uses MediaPipe’s face detector to identify faces in images and track gaze and mouth poses.
- Customizable: You can control the model’s output by adding details to the prompt, like “looking right” or “smiling”.
- High-Quality Images: The model can generate high-quality images with multiple faces and various expressions.
How it Works
The model uses a dataset of images with keypoints for pupils to allow gaze direction. It’s been tested on Stable Diffusion v2.1 base (512) and Stable Diffusion v1.5.
- Training: The model is trained on a dataset of images with face detections and keypoints for pupils.
- Inference: You can use the model to generate new images by providing a prompt and an input image.
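For a rough sense of what the MediaPipe side of this pipeline looks like, the sketch below runs MediaPipe's Face Mesh detector with refine_landmarks=True, which adds iris (pupil) landmarks, on a photo. This is only an illustrative sketch under assumed settings (the input file name, max_num_faces, and the simple printout are not from the model card), not the exact annotator used to build the dataset.
import cv2
import mediapipe as mp
# Rough sketch: detect faces and landmarks (including iris/pupil points) with MediaPipe Face Mesh.
# The file name, max_num_faces, and printout are illustrative assumptions only.
image_bgr = cv2.imread("input_photo.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                     max_num_faces=4,
                                     refine_landmarks=True) as face_mesh:
    results = face_mesh.process(image_rgb)
if results.multi_face_landmarks:
    print(f"Detected {len(results.multi_face_landmarks)} face(s) with pupil landmarks")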
Performance
The model shows remarkable performance in generating images with accurate facial expressions. Let’s dive into the details of its speed, accuracy, and efficiency.
Speed
The model was trained for 200 hours on an A6000 machine with at least 24 gigabytes of VRAM. This training time is relatively fast compared to other models, considering the complexity of the task.
Accuracy
The model has been tested on Stable Diffusion v2.1 base (512) and Stable Diffusion v1.5, and it has shown impressive results in tracking gaze and mouth poses. However, it may still ignore some controls, which can be mitigated by adding details to the prompt.
Efficiency
The model can be used with other diffusion models, such as dreamboothed stable diffusion, and it can be fine-tuned on different checkpoints. This flexibility makes it a valuable tool for various applications.
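As a sketch of what swapping in a different base checkpoint might look like, the snippet below pairs the ControlNet with an SD 1.5-family pipeline. The subfolder name "diffusion_sd15" and the base model ID "runwayml/stable-diffusion-v1-5" are assumptions used for illustration; check the model repository for the actual layout before relying on them.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
# Sketch: use the face ControlNet with an SD 1.5-family base checkpoint.
# subfolder="diffusion_sd15" and the base model ID are assumptions; verify against the repo.
controlnet = ControlNetModel.from_pretrained("CrucibleAI/ControlNetMediaPipeFace", subfolder="diffusion_sd15", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)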
Example Use Case
You can use the model to generate images of people with specific facial expressions. For example, you can use the prompt “a happy family at a dentist advertisement” to generate an image of a happy family with a dentist in the background.
Installation
To use the model, you’ll need to install the following packages:
- diffusers
- transformers
- accelerate
You can install them using pip:
pip install diffusers transformers accelerate
Limitations
The model has some limitations. While it is better at tracking gaze and mouth poses than previous attempts, it may still ignore controls. Let’s explore what this means and how you can work around it.
Ignoring Controls
The model might not always pay attention to the controls you provide. This can lead to unexpected results. For example, if you ask the model to generate an image of a person looking right, it might not always follow your instructions.
Workarounds
To minimize this issue, you can try adding more details to your prompt. For instance, instead of just saying “looking right,” you could say “looking right with a slight smile.” This can help the model better understand what you want.
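As a minimal sketch of this workaround (assuming the pipe object and the control image from the Code Examples section below have already been set up), a more detailed prompt and an optional negative prompt might be passed like this. The negative_prompt is an extra assumption, not something the model card prescribes.
# Sketch: more detail in the prompt can help the model honor the intended gaze.
# Assumes `pipe` and the control image `image` were created as in the Code Examples section.
detailed_prompt = "portrait of a woman looking right with a slight smile"
result = pipe(detailed_prompt,
              image=image,
              negative_prompt="looking left, closed eyes",  # optional, an assumption
              num_inference_steps=30).images[0]
result.save("./looking_right.png")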
Format
The ControlNet LAION Face Dataset model is a ControlNet conditioned on human facial annotations, including keypoints for pupils to allow gaze direction to be controlled.
Architecture
The model is designed to work with Stable Diffusion v2.1 base (512) and Stable Diffusion v1.5. It uses a ControlNetModel with a UniPCMultistepScheduler.
Data Formats
The model supports the following data formats:
- Input: Tokenized text sequences (e.g. “a happy family at a dentist advertisement”)
- Image: PNG or JPEG images (e.g. image = load_image("https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace/resolve/main/samples_laion_face_dataset/family_annotation.png"))
Special Requirements
- The model requires at least 24 gigabytes of VRAM for training.
- The model has some limitations: it may still ignore controls, though adding details to the prompt like “looking right” can help mitigate this behavior.
Code Examples
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image
# Load the ControlNet and the Stable Diffusion 2.1-base pipeline
controlnet = ControlNetModel.from_pretrained("CrucibleAI/ControlNetMediaPipeFace", torch_dtype=torch.float16, variant="fp16")
pipe = StableDiffusionControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
# Load the annotated face image used as the control input
image = load_image("https://huggingface.co/CrucibleAI/ControlNetMediaPipeFace/resolve/main/samples_laion_face_dataset/family_annotation.png")
# Generate an image
image = pipe("a happy family at a dentist advertisement", image=image, num_inference_steps=30).images[0]
image.save('./images.png')
Note: This model is designed to work with Stable Diffusion 2.1-base, but can also be used with other diffusion models like dreamboothed stable diffusion.