Google's Lumiere is a new artificial intelligence system for generating realistic, coherent videos from text prompts or images.
At its core, Lumiere generates high-quality, realistic videos directly from text descriptions, a significant step forward compared to previous text-to-video models.
Architecture:
- It uses a Space-Time U-Net (STUNet): instead of creating a video frame by frame, Lumiere generates the entire clip in a single pass, downsampling and upsampling the signal in both space and time so that spatial image features and temporal motion are handled together. This results in smooth, realistic videos.
- The model reuses the spatial downsampling/upsampling modules of a pretrained Imagen text-to-image diffusion model, "inflating" Imagen with added temporal downsampling and upsampling modules so it can handle video.
- It has both convolution-based blocks and temporal attention blocks to process videos across multiple space-time scales.
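To make this concrete, here is a minimal, hypothetical sketch (in PyTorch, not the released code) of the kind of factorized space-time block such a network stacks: a per-frame spatial convolution, a newly added temporal convolution, and temporal downsampling that shrinks the clip along the time axis at deeper levels.

```python
# Minimal sketch (not the released architecture): a factorized space-time block,
# assuming a video tensor shaped (batch, channels, frames, height, width).
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Per-frame spatial convolution (the role played by the pretrained image layers).
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Newly added temporal convolution that mixes information across frames.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.spatial(x))   # process each frame spatially
        x = self.act(self.temporal(x))  # exchange information along time
        return x

# Temporal downsampling halves the number of frames, so deeper U-Net levels
# see a coarser space-time representation of the whole clip.
downsample_time = nn.AvgPool3d(kernel_size=(2, 1, 1))

video = torch.randn(1, 64, 16, 32, 32)        # 16 frames of 32x32 features
coarse = downsample_time(SpaceTimeBlock(64)(video))
print(coarse.shape)                            # torch.Size([1, 64, 8, 32, 32])
```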
Functioning:
- Lumiere generates the full video clip in one pass, rather than first generating distant keyframes and then interpolating between them as other models do. This allows more coherent motion.
- It builds on an Imagen model pretrained for text-to-image generation. Only the newly added temporal parameters are trained, while Imagen's weights are kept fixed.
- Leveraging image diffusion models: Lumiere adapts existing, highly capable diffusion-based image generation models to the video domain. This lets it create sharp, high-fidelity video frames.
- For high-resolution videos, it applies spatial super-resolution over overlapping temporal windows and blends them with MultiDiffusion, which prevents inconsistencies across window boundaries.
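The sketch below illustrates the MultiDiffusion-style blending idea in isolation; the window size, stride, and function names are assumptions for the example rather than Lumiere's actual settings. Each temporal window is processed independently, and frames covered by several windows are averaged so no seams appear at window boundaries.

```python
# Illustrative blending of overlapping temporal windows (shapes are arbitrary).
import torch

def blend_windows(frames: int, window: int, stride: int, denoise_window) -> torch.Tensor:
    out = torch.zeros(frames, 3, 128, 128)   # accumulated result
    weight = torch.zeros(frames, 1, 1, 1)    # how many windows cover each frame
    for start in range(0, frames - window + 1, stride):
        sr_chunk = denoise_window(start, start + window)   # (window, 3, 128, 128)
        out[start:start + window] += sr_chunk
        weight[start:start + window] += 1.0
    return out / weight.clamp(min=1.0)       # average where windows overlap

# Dummy stand-in for the per-window super-resolution step in this sketch.
fake_sr = lambda s, e: torch.randn(e - s, 3, 128, 128)
video = blend_windows(frames=80, window=16, stride=8, denoise_window=fake_sr)
print(video.shape)  # torch.Size([80, 3, 128, 128])
```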
Capabilities:
- Text-to-video: It can generate short video clips based on text prompts. For example, if you type "astronaut on Mars", it will generate a video of an astronaut walking on Mars.
- Image-to-video: It can take a still image and animate elements of it into a short video clip. For example, it can take an image of a bear and generate a video of the bear walking.
- Stylization: It can take a reference image and match the style, creating videos with a specific artistic style.
- Cinemagraphs: It can animate only certain chosen portions of an image, leaving the rest static.
- Video inpainting: It can fill in missing parts of a video, guessing at what should occur based on context.
Performance:
- Temporal consistency: one major challenge in synthetic video is making motion look natural throughout the clip. Lumiere addresses this directly through temporal downsampling/upsampling and a training procedure that enforces coherence over time. Its videos show smooth, realistic motion and transformations, with objects retaining consistent forms.
- Lumiere achieves state-of-the-art video generation quality as assessed by both automated metrics and human evaluation.
- It generates 5-second, 16 fps videos (80 frames) with coherent motion, either from text prompts or conditioned on images.
- State-of-the-art quality: in Google's tests, it performed better than other models such as ImagenVideo and Pika on text alignment, motion quality, and user preference.
Human Evaluations
The paper showed through comparative user studies that Lumiere significantly outperforms other state-of-the-art video generation models such as ImagenVideo, Pika Labs, ZeroScope, and Runway Gen-2 on key quality metrics:
User studies are conducted on Amazon Mechanical Turk by showing participants pairs of videos - one generated by Lumiere and one by a baseline model.
- The evaluations assess both text-to-video and image-to-video generation capabilities.
- For text-to-video, participants are asked "Which video better matches the text prompt?" and "Which has better quality and motion?".
- For image-to-video, the question focuses only on video quality since text is not a factor.
- Lumiere achieved higher "video quality" scores, which consider factors like resolution, coherence, and realism. Its videos simply looked more natural and high-fidelity.
- Lumiere outperforms ImagenVideo and Pika Labs in preference scores despite their high per-frame quality, because their outputs tend to have very limited motion.
- It surpasses ZeroScope and AnimateDiff, which show more motion but produce noticeable artifacts. Their shorter durations (roughly 2 to 3.6 s versus Lumiere's 5 s) likely contribute.
- Lumiere also beats Gen-2 and Stable Video Diffusion in the image-to-video assessments with users favoring the quality and motion of its generated videos.
- Lumiere was preferred by users for both text-to-video and image-to-video generation. Whether starting from just a text description or from an existing image, Lumiere's videos were judged to be higher quality.
- Lumiere exceeded the other models in "text alignment" scores. This metric measures how well the generated video actually matches the input text description rather than exhibiting visual artifacts or deviations. Lumiere's videos stayed truer to the source text.
Human evaluations consistently show Lumiere produces better quality and motion than other text-to-video models such as ImagenVideo, Pika, and ZeroScope.
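For readers who want to see how such pairwise votes reduce to the reported preference percentages, here is a toy calculation; the vote counts are invented for illustration and are not figures from the paper.

```python
# Toy summary of a two-alternative forced-choice study: given raw A/B votes
# per baseline, compute the percentage of raters who preferred Lumiere.
from collections import Counter

def preference_rate(votes: list[str]) -> float:
    """votes contains 'lumiere' or 'baseline' for each rater's choice."""
    counts = Counter(votes)
    return 100.0 * counts["lumiere"] / len(votes)

sample_votes = ["lumiere"] * 72 + ["baseline"] * 28   # hypothetical 100 raters
print(f"Preferred Lumiere: {preference_rate(sample_votes):.1f}%")  # 72.0%
```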
Stylized Video Generation
One of the most visually striking capabilities Lumiere demonstrates beyond realistic video generation is stylized video generation: the ability to produce videos that match different artistic styles.
Lumiere adapts "StyleDrop", a method previously published by Google Research for stylized image generation, to the video domain. This lets it produce animated clips that mimic specified art genres.
During video generation, Lumiere uses a reference style image: rather than analyzing the image at generation time, it relies on image-model weights fine-tuned on that style (see "How it Works" below), so the output video inherits the style's colors, strokes, and textures.
Some examples that showcase artistic style transfer videos from Lumiere:
- An ocean wave video rendered to match the aesthetics of famous painter Claude Monet's impressionist artworks. The generated clip contains familiar soft brush strokes and color blending.
- A video of a dancing cat transferred to emulate Van Gogh's iconic swirly Starry Night style. The cat seamlessly takes on a dreamy animated quality with swirling backgrounds.
- Flowers blooming in the style of an anime cartoon. Clean lines, exaggerated features, and animated flashes give the video a distinctly Japanese animation look.
The stylization works by using a reference image exemplifying the desired art genre and propagating that style into the video generation process. The approach is borrowed from StyleDrop, showing Google extending its image-generation research into video.
Under the hood, the stylization relies on interpolating model weights rather than classical texture synthesis or feature-transform style transfer; Lumiere adapts the StyleDrop recipe so the results stay coherent across video frames.
While realistic video mimicking our tangible world has many applications on its own, the ability to also inject different art aesthetics vastly expands the creative possibilities through AI video generation. The synthesized videos make otherwise impossible scenes come to life stylishly.
How it Works
- StyleDrop is a text-to-image method that can generate images matching various styles, like watercolor or oil painting, based on an example style image.
- It has multiple fine-tuned models each customized for a particular artistic style by training on data matching that style.
- Lumiere utilizes this by interpolating between the fine-tuned weights of a StyleDrop model for a particular style and the original weights of the pretrained text-to-image backbone used inside Lumiere.
- The interpolation coefficient α controls the degree of stylization. Values between 0.5 and 1.0 work well empirically.
- Using these interpolated weights in Lumiere's model thus stylizes the generated videos to match styles like watercolor painting, line drawing etc. based on the reference style image.
- Interestingly, some styles like line drawing also translate to unique motion priors with content looking like it's being sketched across frames.
Via weight interpolation inspired by StyleDrop models, Lumiere can perform artistic stylized video generation that matches different styles provided as example images. The global coherence from generating full videos directly translates well into the stylized domain too.
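The core of this trick is plain linear interpolation between two sets of weights. The sketch below shows the idea on a toy state_dict; the parameter names and models are stand-ins, not Lumiere's or StyleDrop's actual code.

```python
# Minimal sketch of weight interpolation between a base model and a
# style-fine-tuned model, assuming both state_dicts share the same keys.
import torch

def interpolate_weights(base_state: dict, style_state: dict, alpha: float) -> dict:
    """Linearly blend fine-tuned style weights with the original weights."""
    return {
        name: alpha * style_state[name] + (1.0 - alpha) * base_state[name]
        for name in base_state
    }

# Tiny stand-in "models" for the sketch.
base = {"proj.weight": torch.zeros(4, 4)}
style = {"proj.weight": torch.ones(4, 4)}
blended = interpolate_weights(base, style, alpha=0.7)   # alpha in [0.5, 1.0]
print(blended["proj.weight"][0, 0])                      # tensor(0.7000)
```

Higher alpha pushes the output closer to the fine-tuned style weights, which is why it acts as a stylization knob.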
Video Inpainting
One nifty feature that Lumiere demonstrates beyond fundamental video generation is video inpainting: the ability to take an incomplete video clip with missing sections and fill in the gaps with realistic imagery to complete the scene.
Some examples that show off the video inpainting capabilities:
- A video of chocolate syrup being poured onto ice cream, but the clip only shows the start and end states. Lumiere is able to plausibly generate the middle phase of syrup fluidly falling onto the dessert.
- A scene of a woman walking down stairs, but the middle section of steps is erased. The model convincingly fills it in with smooth footsteps down the stairwell.
- A laborer working on construction beams, but part of his arm swings is edited out. Lumiere completes the repetitive motions in a matching style.
The system is able to infer plausible motions that fit the trajectories, physics, and style of what happens before and after the removed section. This could, for example, help editors salvage useful parts of damaged old footage while fixing glaring omissions.
On the engineering side, the inpainting does not rely on separate consistency networks or physics simulators: the diffusion model is conditioned on the masked video and its mask, and because Lumiere generates the entire clip at once, plausible liquids, object interactions, and human motion emerge from its learned video prior to stitch the severed sections together believably.
Lumiere demonstrates high-quality video inpainting capabilities to fill in missing or masked sections in an input video in a coherent manner:
- Lumiere performs inpainting by conditioning the model on:
(a) Input video with masked region to fill
(b) A corresponding binary mask indicating the region to fill
- The model learns to animate the masked region based on context from the unmasked areas, producing realistic, seamless results.
- Being able to generate the full video directly rather than via temporal super-resolution helps ensure coherence in inpainting.
- For example, if a video requires depicting a beach background behind a person, the model maintains continuity in elements like water waves and people walking across frames.
- Inpainting can also enable applications like object removal, insertion, and general video editing by conditioning the generation on edited input videos.
In summary, Lumiere demonstrates high-fidelity video inpainting to fill in masked areas of input videos with realistic animations in a globally coherent manner across frames. This expands its applicability for downstream editing and post-production applications.
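As a rough illustration of the conditioning signal described above, the following sketch builds a masked-video-plus-binary-mask pair; the tensor shapes and masking convention are assumptions for the example, not the model's actual interface.

```python
# Schematic inpainting conditioning: the input video with the target region
# zeroed out, plus a binary mask marking where new content must be generated.
import torch

frames, channels, height, width = 16, 3, 64, 64
video = torch.rand(frames, channels, height, width)

# 1 = region to fill, 0 = region kept from the input video.
mask = torch.zeros(frames, 1, height, width)
mask[:, :, 16:48, 16:48] = 1.0                 # a square region to fill in every frame

masked_video = video * (1.0 - mask)            # erase the region to be inpainted
conditioning = torch.cat([masked_video, mask], dim=1)   # (16, 4, 64, 64)
print(conditioning.shape)

# The generator is trained so unmasked pixels match the input while masked
# pixels are filled with new, temporally coherent content.
```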
Cinemagraph Creation
Another creative video manipulation capability showcased by Lumiere is cinemagraph creation. Cinemagraphs are mostly still photographs in which only small, repeated motions are animated. Common examples are waving flags or flowing water integrated into an otherwise static scene.
Cinemagraphs involve identifying a region of interest in a still image, like a face, and animating motion just within that region, for example blinking eyes.
- The surrounding areas remain static to give the effect of a photo coming to life partially.
- Lumiere can generate cinemagraphs from an input image by conditioning the video generation on:
(a) Input image duplicated across all video frames
(b) A mask where only the area to animate is unmasked
- This encourages the model to:
(a) Replicate the input image in the first frame
(b) Animate only the unmasked region in subsequent frames
(c) Freeze the masked background to match the first frame
- An example is animating the wings of a butterfly in an input photo of it sitting on a flower. The flower and background remain static while only the wings flap.
- The conditioning provides control over the region to animate, allowing easy cinemagraph creation.
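A rough sketch of that cinemagraph conditioning is shown below; the tensor layout and mask convention are illustrative assumptions rather than the paper's exact interface.

```python
# Sketch of cinemagraph conditioning: the still image is repeated across all
# frames, and only the user-chosen region is left free for the model to animate.
import torch

frames, channels, height, width = 16, 3, 64, 64
still_image = torch.rand(channels, height, width)

# Duplicate the input image across every frame of the conditioning video.
condition_video = still_image.unsqueeze(0).repeat(frames, 1, 1, 1)

# 1 = keep frozen to match the input image, 0 = region the model may animate
# (e.g. the butterfly's wings in the example above).
keep_mask = torch.ones(frames, 1, height, width)
keep_mask[:, :, 8:24, 40:60] = 0.0

conditioning = torch.cat([condition_video, keep_mask], dim=1)
print(conditioning.shape)   # torch.Size([16, 4, 64, 64])
```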
Lumiere demonstrates the ability to automatically create cinemagraphs from still images by selecting the region the user wants animated. For example:
- Taking a backdrop cityscape image, then specifying a window section to be animated. Lumiere generates a vivid fire animation in the window seamlessly integrated into the static urban setting.
- Freezing a scene of climbers mid-mountain ascent, with only an embedded region of fluttering flags waved by wind. This composites dynamic motion into an otherwise frozen moment.
- Animating splashing water from a fountain statue in a scene of people standing in a plaza. This brings limited activity to an otherwise static vignette.
The tool makes it easy to introduce customized motion, creating a hybrid between photo and video for visually arresting effects. On the engineering side, it involves spatially isolating image regions, then applying generation only to the selected areas while keeping the rest frozen.
While a niche application today, tools like Lumiere's cinemagraph creation could enable new mixed media visual expression. Combining still snapshot moments as backdrops with AI-generated motions in the foreground could further enhance visual storytelling. It demonstrates the expanding creativity emanating from modern generative video research.
Image-to-Video Generation
In addition to generating video purely from text prompts, Lumiere also demonstrates strong capabilities at image-to-video generation - producing video clips extrapolating from input seed images. This expands the flexibility of the system.
Some examples that showcase Lumiere's image-to-video model:
- An input image of a toy bear on a beach expanded into a short clip of the bear running playfully through sand and splashing in the waves.
- A single photograph of elephants turned into a dynamic vignette of them happily bathing and squirting water, with temporal consistency.
- Turning a snapshot of firefighters putting out a blaze into a plausible video segment showing fluid motions of the people and flickering fire movement.
- Converting a Reykjavik cityscape into a clip of the Northern Lights flickering over the city.
The image-to-video generation works by conditioning the model on the input image as the first frame of the clip, together with a mask marking the remaining frames as content to be generated.
The diffusion model then extrapolates a plausible temporal progression from that starting frame, leveraging training on large video datasets to simulate natural motion.
Because the full clip is generated in one pass rather than frame by frame, the resulting animation stays consistent with the source image and free of obvious seams.
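A schematic of this first-frame conditioning is sketched below; the shapes and mask convention are assumptions chosen for illustration.

```python
# Illustrative first-frame conditioning for image-to-video: the input image
# occupies frame 0, the remaining frames are blank, and a mask marks which
# frames must be generated rather than copied.
import torch

frames, channels, height, width = 16, 3, 64, 64
seed_image = torch.rand(channels, height, width)

condition_video = torch.zeros(frames, channels, height, width)
condition_video[0] = seed_image                     # only the first frame is given

generate_mask = torch.ones(frames, 1, height, width)
generate_mask[0] = 0.0                              # frame 0 is fixed, the rest are generated

conditioning = torch.cat([condition_video, generate_mask], dim=1)
print(conditioning.shape)   # torch.Size([16, 4, 64, 64])
```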
While text-based video generation without any source assets fully unleashes imagination, image-to-video remains highly valuable. It allows creators to build off existing works, salvaging individual images into living scenes. The research showcases remarkable progress in AI-driven video manipulation capabilities.
Lumiere Commercial or Not?
One major outstanding question is whether Google will actually productize and release Lumiere commercially given the highly competitive generative AI landscape:
On one hand, Lumiere massively pushes state-of-the-art boundaries in controllable video generation - enabling new tools for creators as well as entertainment applications. Releasing it as a service could yield financial upside and cement Google's lead.
However, there may also be hesitancy to reveal its secret sauce. Some considerations:
- OpenAI initially guarded access to DALL-E 2 despite high demand, to retain its competitive advantage. Google may want to prevent rivals from replicating the technical breakthroughs underlying Lumiere.
- Quality video generation could provide leverage for Google Cloud if integrated with its suite of services. Offering Lumiere's capabilities could attract more customers from competitors.
- Rivals like Runway ML have shown market appetite by directly monetizing synthetic video tools. Google may still be determining its go-to-market strategy and positioning.
Additionally, while Lumiere's outputs are highly impressive, the system likely still requires product-level refinements for mass deployment:
- Model robustness and safety mechanisms need hardening before public release
- Industrial-grade stability, accessibility, and developer tooling are required
So, in summary: while interest in commercial applications of Lumiere seems immense, Google itself may still be evaluating the competitive dynamics around releasing it. The company has a mixed history of productizing versus open-sourcing its generative AI projects. We may gain more clarity this year if Lumiere's capabilities start to emerge in Google Cloud or other tools.