I have just decided on my research topic!

It is “Sound Source Separation based on Audio-Visual Multimodal Learning”.

Suppose we are in the audience at an orchestra concert: we can hear the violin and the flute clearly, but the piano is hard to hear. At the same time, we can see the performers playing all three instruments. From what we see, we can tell that three instruments are being played and that the piano is simply quiet. In this way, visual information can help us identify which sounds are present and separate them.

This task involves implementing the following:

  • Instance-level detection of the objects in the video
  • Classification of the sounds those objects make in the audio
  • Joint learning of an audio-visual model that separates the sound sources
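To make the third step a bit more concrete, here is a minimal sketch of one common way to do it: predict a mask over the mixture spectrogram for each detected object. The design is my assumption, loosely in the spirit of the Zhao et al. paper cited below, and not a fixed plan for this research. The function `separate` and the inputs `audio_feats` and `visual_feats` are hypothetical names; random numbers stand in for the outputs of the audio and visual networks, and plain Python lists stand in for real spectrogram arrays.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def separate(mixture_mag, audio_feats, visual_feats):
    """Return one masked spectrogram per detected object.

    mixture_mag:  F x T magnitude spectrogram of the mixture
    audio_feats:  K x F x T feature maps (hypothetical audio network output)
    visual_feats: N x K feature vectors, one per detected object (hypothetical)
    """
    n_freq, n_time = len(mixture_mag), len(mixture_mag[0])
    separated = []
    for v in visual_feats:
        source = [
            [
                # Mask = sigmoid of the visually weighted sum of audio channels,
                # applied to the mixture magnitude at this time-frequency bin.
                sigmoid(sum(v_k * a_k[f][t] for v_k, a_k in zip(v, audio_feats)))
                * mixture_mag[f][t]
                for t in range(n_time)
            ]
            for f in range(n_freq)
        ]
        separated.append(source)
    return separated


random.seed(0)
F, T, K, N = 8, 5, 4, 3  # toy sizes; N = 3 objects (e.g. violin, flute, piano)
mixture_mag = [[abs(random.gauss(0, 1)) for _ in range(T)] for _ in range(F)]
audio_feats = [
    [[random.gauss(0, 1) for _ in range(T)] for _ in range(F)] for _ in range(K)
]
visual_feats = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]

separated = separate(mixture_mag, audio_feats, visual_feats)
print(len(separated), len(separated[0]), len(separated[0][0]))
```

Each mask value lies between 0 and 1, so every separated spectrogram is an attenuated copy of the mixture. In a real system the two feature inputs would come from trained networks, and the masks would be learned jointly from video and audio.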

The following papers include figures that illustrate this task:

Michelsanti, Daniel, et al. "An overview of deep-learning-based audio-visual speech enhancement and separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 1368-1396.

Zhao, Hang, et al. "The sound of pixels." Proceedings of the European conference on computer vision (ECCV). 2018.

I hope my research goes well~ :)
