VASA: Microsoft AI generates "talking face" from photo and voice recording

Microsoft has developed a video generator that creates a video from a photo and an audio recording of speech, but there are no plans to release it to the public.

(Image: Microsoft Asia)

Apr 19, 2024 at 9:34 am CEST


By Martin Holland

This article was originally published in German and has been automatically translated.

A research team at Microsoft has developed an AI tool that generates strikingly realistic video clips from a single photo and a voice recording, making the person in the photo appear to speak. The framework is called VASA, a name that refers to the "visual affective skills" of the generated avatars; the first version is dubbed VASA-1. Beyond precisely synchronizing lip movements with the audio, the tool can also simulate a wide range of facial expressions and natural head movements. VASA can already process audio of any length and seamlessly generate talking-face videos on a PC with an Nvidia RTX 4090.


Three female faces speaking in perfect sync

(Source: Microsoft Asia)

On a project page, Microsoft Asia staff have compiled a whole series of examples demonstrating the tool's capabilities: numerous square videos of different faces expressively reciting different texts. The team states that all portraits are virtual, AI-generated images of people who do not exist; the only exception is an animation of Leonardo da Vinci's Mona Lisa. Some videos show side by side how the same generated face delivers a text with different emotions. Another example shows three female faces speaking a text in perfect synchronization.


A slightly different Mona Lisa

(Source: Microsoft Asia)

No release of the tool planned

The aim of the research project, the team writes, is to develop a technique for animating photorealistic avatars in real time. However, they acknowledge that the technology could be misused to impersonate real people, although they remain convinced that its potential benefits justify the research. Because of this risk of misuse, there are currently no plans to publish an online demo, provide developer access, or release a product based on the technology. That will only happen once the team can be sure the technology will be used responsibly. For now, the goal is simply to present the research.


Real-time demonstration of VASA-1

(Source: Microsoft Asia)


(mho)
