Building upon our prior work on audio-only CSD, this repository presents a multimodal approach that incorporates visual information to enhance performance. This new model expands the capabilities of ...
Smart cities deploy sensors such as microphones and RGB cameras to collect data for improving the safety and comfort of citizens. As data annotation is expensive, self-supervised methods such ...
Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing, and reading process of human ...
1 Neuropsychology Lab, Department of Psychology, Carl von Ossietzky University of Oldenburg, Oldenburg, Germany 2 Department of Medical Physics and Acoustics, Carl von Ossietzky University of ...