https://lumiere-video.github.io/

Google Research

LUMIERE

A Space-Time Diffusion Model for Realistic Video Generation

Text-to-Video


* Each video is generated from the input prompt listed below.

Aerial Around Young Hiker Man Standing on Mountain Peak Summit At
Sunrise
Aurora Borealis Green Loop Winter Mountain Ridges Northern Lights
Astronaut on the planet Mars making a detour around his base
A dog driving a car on a suburban street wearing funny sunglasses
Back view on young woman dressed in a bright yellow jacket walk in
outdoor forest
Golden retriever puppy running in the park. Autumn. Beautiful leaves
on the ground.
Chocolate syrup pouring on vanilla ice cream
Blooming cherry tree in the garden beautiful sun light
Sailboat sailing on a sunny day in a mountain lake
A young couple walking in a heavy rain
A panda eating bamboo on a rock
A musk ox grazing on beautiful wildflowers
A knight riding on a horse through the countryside
Flying through a temple in ruins
Camera moving through dry grass at autumn morning
Aerial view of colorful fireworks exploding in the night sky
US flag waving on massive sunrise clouds
Beer pouring into glass
Funny cute pug dog feeling good listening to music with big
headphones and swinging head
Sunset timelapse at the beach
Yellow flowers swing in the wind
Jack Russell terrier snowboarding. GoPro shot.
A red lamborghini aventador coming around a bend in a mountain road
Bright underwater world orange fish
Confident teddy bear surfer rides the wave in the tropics
Toy poodle dog rides a penny board outdoors
Chocolate muffin video clip
A cute mouse typing on a keyboard
360 camera shot of a sushi roll in a restaurant
Colorful fish swimming underwater
Panda play ukelele at home
Silhouette of a wolf against a twilight sky

Image-to-Video


* Each video is generated from an input image and the prompt listed below.

A sad cat in a striped navy blue shirt
A teddy bear dancing in the snow
A spooky skeleton
A turtle swimming
Flying through an intense battle between pirate ships in a stormy
ocean
An escaped panda eating popcorn in the park
A bee holding a jar of honey
A monkey drinking coffee while working on his laptop
Campfire at night in a snowy forest with starry sky in the background
Bigfoot walking through the woods
A factory robot assembling intricate electronic components with
precision
A teddy bear running in New York city
A panda eating bamboo on a rock
A giraffe eating grass
A relaxed ocean waves video
A car driving on the beach
Sailboat sailing on a sunny day in a mountain lake
A panda bear driving a car
A happy elephant wearing a birthday hat walking under the sea
A snowflake falls from the sky
Melting pistachio ice cream dripping down the cone
A teddy bear skating in Times Square
A fluffy baby sloth with an orange knitted hat trying to figure out a
laptop
A cat playing the piano
A girl winking and smiling
A man smiling and waving
Soldiers raising the United States flag on a windy day.
Zooming through a nebula with many twinkling stars
A timelapse oil painting of a starry night with clouds moving
A big ocean wave
A woman looking tired and yawning
Ancient pharaoh smiling and shaking his head

Stylized Generation

Using a single reference image, Lumiere can generate videos in the
target style by utilizing fine-tuned text-to-image model weights.
* Each video is generated from the prompt listed below.

A family of ducks swimming in a pond
A butterfly fluttering from flower to flower
A colorful parrot showing off its vibrant feathers
An owl perched on a branch
A koala munching on eucalyptus leaves
A cute bunny nibbling on a carrot
A squirrel gathering acorns
A fox frolicking in the forest

Style reference image

"Sticker"

An owl perched on a branch
A flower blooming
A dolphin leaping out of the water
A swan swimming in a lake
A chubby panda munching on bamboo shoots
A drifting dandelion seed in the breeze
A lion with a majestic mane
An elephant trumpeting joyfully

Style reference image

"3D Melting Gold"

A bear twirling with delight
A cute bunny nibbling on a carrot
A dolphin leaping out of the water
A sunflower blooming
A hot air balloon inflating and taking off
A wise owl perched on a tree branch
A group of penguins waddling in the snow
A family of ducks swimming in a pond

Style reference image

"Flat cartoon"

A bear dancing
A colorful parrot showing off its vibrant feathers
An owl perched on a branch
A bunny hopping in a meadow
A butterfly fluttering from flower to flower
A family of ducks swimming in a pond
A dolphin leaping out of the water
A graceful swan gliding across a serene lake

Style reference image

"3D Rendering"

An owl perched on a branch
A lion with a majestic mane
A koala munching on eucalyptus leaves
A fox frolicking in the forest
A squirrel gathering acorns
A majestic horse
A bear
A monkey

Style reference image

"Line drawing"

A colorful parrot showing off its vibrant feathers
A bear dancing
A butterfly fluttering from flower to flower
A horse galloping
A dragon
A penguin dancing
A giraffe
A lion with a majestic mane roaring

Style reference image

"Glowing"

A raccoon dancing
A cat drinking milk from a bowl
A chubby panda munching on bamboo shoots
A horse galloping across a field
A girl with a beanie dancing
A family of ducks swimming in a pond
A dog walking
A dolphin leaping out of the water

Style reference image

"Watercolor painting"

Introduction

We introduce Lumiere -- a text-to-video diffusion model designed for
synthesizing videos that portray realistic, diverse and coherent
motion -- a pivotal challenge in video synthesis. To this end, we
introduce a Space-Time U-Net architecture that generates the entire
temporal duration of the video at once, through a single pass in the
model. This is in contrast to existing video models which synthesize
distant keyframes followed by temporal super-resolution -- an
approach that inherently makes global temporal consistency difficult
to achieve. By deploying both spatial and (importantly) temporal
down- and up-sampling and leveraging a pre-trained text-to-image
diffusion model, our model learns to directly generate a
full-frame-rate, low-resolution video by processing it in multiple
space-time scales. We demonstrate state-of-the-art text-to-video
generation results, and show that our design easily facilitates a
wide range of content creation tasks and video editing applications,
including image-to-video, video inpainting, and stylized generation.

[architectu]
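
The space-time down- and up-sampling described above can be illustrated with a small shape trace. This is only a sketch under stated assumptions: the clip size (16 frames at 64x64), the number of levels, the factor of 2, the keyframe stride, and the function names `space_time_shapes` and `keyframe_indices` are all hypothetical and are not taken from the paper. It shows how a U-shaped network that halves frames, height and width together processes the whole clip at several space-time scales, whereas a keyframe-first pipeline initially covers only a sparse subset of frames.

```python
# Illustrative sketch only: every size and name here is an assumption,
# not a value or API from the Lumiere paper.

def space_time_shapes(frames, height, width, levels, factor=2):
    """Trace (frames, height, width) through a U-shaped network that
    down-samples and then up-samples in both space and time."""
    if levels < 0 or factor < 1:
        raise ValueError("levels must be >= 0 and factor must be >= 1")
    t, h, w = frames, height, width
    shapes = [(t, h, w)]
    # Contracting path: shrink time and space together at every level.
    for _ in range(levels):
        if t % factor or h % factor or w % factor:
            raise ValueError("dimensions must be divisible by the factor")
        t, h, w = t // factor, h // factor, w // factor
        shapes.append((t, h, w))
    # Expanding path: mirror the contracting path back to the input size.
    for _ in range(levels):
        t, h, w = t * factor, h * factor, w * factor
        shapes.append((t, h, w))
    return shapes


def keyframe_indices(frames, stride):
    """Frames a keyframe-first pipeline would synthesize before any
    temporal super-resolution fills in the gaps."""
    return list(range(0, frames, stride))


shapes = space_time_shapes(frames=16, height=64, width=64, levels=2)
keyframes = keyframe_indices(frames=16, stride=4)
print(shapes)
print(keyframes)
```

In this trace the first and last shapes are identical, so the full clip enters and leaves the network in one pass, while the keyframe list covers only every fourth frame.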

Video Stylization

With Lumiere, off-the-shelf text-based image editing methods can be
used for consistent video editing.

Source Video
"Made of wooden blocks"
"Origami folded paper art"
"Made of colorful toy bricks"
"Made of flowers"
Source Video
"Made of stacked wooden blocks"
"Origami folded paper art"
"Made of colorful toy bricks"
"Made of flowers"
Source Video
"Made of stacked wooden blocks"
"Origami folded paper art"
"Made of colorful toy bricks"
"Made of flowers"
Source Video
"Made of stacked wooden blocks"
"Origami folded paper art"
"Made of colorful toy bricks"
"Made of flowers"

Cinemagraphs

The Lumiere model is able to animate the content of an image within a
specific user-provided region.

Input Image + Mask
Output Video
Input Image + Mask
Output Video
[butterfly_]
[fire_with_]
[lake_with_]
[train_with]
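
The input/output contract of a cinemagraph (an image, a user-provided mask, and a video that moves only inside the mask) can be sketched as a simple compositing step. The page does not describe how Lumiere itself uses the mask, so the code below is not the model's method. It is a hypothetical post-hoc composite in plain Python, with frames represented as nested lists, and the name `composite_cinemagraph` is an assumption made for illustration.

```python
# Illustrative sketch only: a plain mask composite, not Lumiere's method.

def composite_cinemagraph(image, mask, generated_frames):
    """Keep the static image outside the mask and take animated pixels
    from each generated frame inside the mask.

    image:            H x W nested list of pixel values
    mask:             H x W nested list, 1 = animate, 0 = keep static
    generated_frames: list of H x W nested lists (one per output frame)
    """
    output = []
    for frame in generated_frames:
        composed = [
            [gen if m else px for px, m, gen in zip(img_row, mask_row, gen_row)]
            for img_row, mask_row, gen_row in zip(image, mask, frame)
        ]
        output.append(composed)
    return output


image = [[1, 1], [1, 1]]
mask = [[0, 1], [0, 0]]            # only the top-right pixel may move
generated = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
video = composite_cinemagraph(image, mask, generated)
print(video)
```

Only the masked pixel changes from frame to frame. Every other pixel stays equal to the input image, which is the defining property of a cinemagraph.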

Video Inpainting

Source Masked Video
Output Video
Source Masked Video
Output Video
Source Video
"wearing a gold strapless gown"
"wearing a striped strapless dress"
"wearing a purple strapless dress"
"wearing a black strapless gown"
Source Video
"wearing a crown"
"wearing sunglasses"
"wearing a red scarf"
"wearing a purple tie"
Source Video
"wearing a bath robe"
"wearing a party hat"
"Standing on a stool"
"wearing rain boots"

Authors

 
Omer Bar-Tal^*,1,2
Hila Chefer^*,1,3
Omer Tov^*,1
Charles Herrmann^+,1
Roni Paiss^+,1
Shiran Zada^+,1
Ariel Ephrat^+,1
Junhwa Hur^+,1
Yuanzhen Li^1
Tomer Michaeli^1,4
Oliver Wang^1
Deqing Sun^1
Tali Dekel^1,2
Inbar Mosseri^+,1

^1Google Research
^2Weizmann Institute
^3Tel-Aviv University
^4Technion

(*) Equal first co-author, (+) Core technical contribution
Work was done while O. Bar-Tal and H. Chefer were interns at Google.

Acknowledgements

We would like to thank Miki Rubinstein, Amit Raj, Ronny Votel, Orly
Liba, Bryan Seybold, David Ross, Guanghui Liu, Dan Goldman, Hartwig
Adam, Xuhui Jia, Xiuye Gu, Rachel Hornung, Oran Lang, Jess Gallegos,
William T. Freeman and David Salesin for their collaboration, helpful
discussions, feedback and support.
We thank the owners of the images and videos used in our experiments
for sharing their valuable assets.


* References:
Mona Lisa, public domain.
Pillars of Creation, public domain.
Raising the Flag on Iwo Jima, public domain.
Mask of Tutankhamun, CC BY-SA 3.0.
Girl with a Pearl Earring, public domain.
Isaac Newton, public domain.
Starry Night, public domain.
The Great Wave off Kanagawa, public domain.