This is a new multimodal AI model open-sourced by Meta AI. It supports conversion between any of six modalities, including images, text, and audio. For example, given an audio clip of a train, it can automatically produce a photo, a video, and a text description of a train.
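The text above does not say how the conversion between modalities works. One common way to get this behavior is to map every modality into a shared embedding space and then look up the nearest item in the target modality. The sketch below illustrates that idea only; it is an assumption, not the model's actual code. The vectors, file names, and the helper `best_match_per_modality` are hypothetical and hand-made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings. In a real system, per-modality encoders would
# produce these; here they are made-up numbers in a shared 3-d space.
AUDIO_QUERY = [0.9, 0.1, 0.0]  # stands in for an audio clip of a train

CANDIDATES = {
    ("image", "train.jpg"): [0.8, 0.2, 0.1],
    ("image", "dog.jpg"): [0.1, 0.9, 0.2],
    ("text", "a train passing by"): [0.85, 0.05, 0.1],
    ("text", "a dog barking"): [0.0, 0.95, 0.1],
}


def best_match_per_modality(query, candidates):
    """Return, for each target modality, the candidate closest to the query."""
    best = {}
    for (modality, name), vec in candidates.items():
        score = cosine(query, vec)
        if modality not in best or score > best[modality][1]:
            best[modality] = (name, score)
    return {modality: name for modality, (name, _) in best.items()}


if __name__ == "__main__":
    print(best_match_per_modality(AUDIO_QUERY, CANDIDATES))
```

With these toy vectors, the audio query lands closest to the train image and the train caption, which mirrors the audio-to-photo and audio-to-text example in the paragraph above.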