To an AI model, a picture is data, sound and music are data, and so is traditional spoken or written language. That data is translatable, interchangeable, and, most importantly, linkable and actionable. This means that video, music, sound, movement, and images can interact through a common language.