MuLaN

Connects music audio to unconstrained natural language music descriptions
A joint audio-text embedding model trained on weakly-associated, free-form text annotations
Demonstrates zero-shot functionalities

1. Introduction

Music-text joint embeddings are useful for transfer learning, cross-modal retrieval, automatic captioning, and more
Compared to images, audio has far fewer captions available
The goal is a model that links musical concepts to music audio
Text prompts can elicit the desired genre, artist, mood, and structure
Leverages the text metadata attached to 44 million internet music videos

2. Related Work

2. 1. Audio Representation Learning

Previously, the Audio Spectrogram Transformer (AST), pretrained on ImageNet and AudioSet, was the state-of-the-art tagging model
Discriminative training: assign high similarity to segments from the same recording
Generative models: intermediate embeddings are useful as representations
This work instead uses text that is only weakly associated with the music

2. 2. Cross-modal Contrastive Learning

Earlier work sought generalized embeddings using Word2Vec
AudioSet and ESC-50 data were used
This work mines a much broader range of data and builds a Transformer-based audio encoder and a contextual language encoder

2. 3. Music Text Joint Embedding Models

Semantics in music has typically been framed as multi-label classification
MuLaP adopts early fusion, which limits its transfer learning applications
This work provides a natural language interface for arbitrary music audio

3. Proposed Approach

3. 1. Learning Framework

Audio embedding network: $f: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^d$
F-channel log mel spectrogram
T-frame context windows
Text embedding network: $g: \mathcal{A}^{n} \rightarrow \mathbb{R}^d$
Null-padded text token sequence of length n
Mini-batch of size B: $\{(\mathbf{x}^{(i)}, \mathbf{t}^{(i)})\}_{i=1}^B$
$\mathbf{x}^{(i)} \in \mathbb{R}^{F \times T}$
$\mathbf{t}^{(i)} \in \mathcal{A}^n$
Trained with a contrastive multi-view coding loss function, similar to InfoNCE and the NT-Xent loss:
$$\sum_{i=1}^B -\log \left[\frac{h\left[f\left(\mathbf{x}^{(i)}\right), g\left(\mathbf{t}^{(i)}\right)\right]}{\sum_{j \neq i} h\left[f\left(\mathbf{x}^{(i)}\right), g\left(\mathbf{t}^{(j)}\right)\right]+h\left[f\left(\mathbf{x}^{(j)}\right), g\left(\mathbf{t}^{(i)}\right)\right]}\right]$$
where $h[\mathbf{a}, \mathbf{b}] = \exp(\mathbf{a}^{T}\mathbf{b}/\tau)$
Temperature hyperparameter $\tau$
The loss pushes $h$ toward a large positive value for target audio-text pairs
and toward a value close to zero for all non-target pairs
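
A minimal PyTorch sketch of this loss (an illustration, not the paper's code; the temperature value 0.07 and batch size below are arbitrary placeholders):

```python
import torch

def mulan_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Contrastive multi-view coding loss over a batch of paired embeddings.

    audio_emb, text_emb: (B, d) L2-normalized f(x) and g(t); tau is the temperature.
    """
    # h[a, b] = exp(a^T b / tau) for every audio-text pair in the batch
    sim = torch.exp(audio_emb @ text_emb.T / tau)   # (B, B), sim[i, j] = h[f(x_i), g(t_j)]
    pos = sim.diag()                                # h[f(x_i), g(t_i)]
    row_neg = sim.sum(dim=1) - pos                  # sum_{j != i} h[f(x_i), g(t_j)]
    col_neg = sim.sum(dim=0) - pos                  # sum_{j != i} h[f(x_j), g(t_i)]
    return (-torch.log(pos / (row_neg + col_neg))).sum()

# toy usage: random normalized embeddings with B=8, d=128
B, d = 8, 128
f_x = torch.nn.functional.normalize(torch.randn(B, d), dim=-1)
g_t = torch.nn.functional.normalize(torch.randn(B, d), dim=-1)
print(mulan_contrastive_loss(f_x, g_t))
```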

3. 2. Audio Embedding Network

ResNet-50 architecture
(F=64) x (T=1000) spectrogram patches
d=128 units
Audio Spectrogram Transformer (AST)
12 Transformer blocks
(F=128) x (T=1000) log mel spectrogram context windows
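
A rough sketch of the ResNet-50 audio tower, assuming a torchvision ResNet-50 whose stem is adapted to single-channel spectrogram input and whose classifier head is replaced by a projection to d=128 (these adaptation details are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AudioTower(nn.Module):
    """ResNet-50 over (F=64) x (T=1000) log mel spectrogram windows -> 128-d embedding."""

    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = resnet50(weights=None)
        # adapt the stem to a single-channel spectrogram "image" (assumption)
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # replace the classifier head with a projection into the joint space
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, spec):                      # spec: (batch, 1, F=64, T=1000)
        emb = self.backbone(spec)                 # (batch, 128)
        return nn.functional.normalize(emb, dim=-1)

# example: a batch of 2 spectrogram context windows
x = torch.randn(2, 1, 64, 1000)
print(AudioTower()(x).shape)                      # torch.Size([2, 128])
```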

3. 3. Text Embedding Network

BERT with 12 Transformer blocks
String to n=512 tokens
Shared audio-text embedding space of d=128
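
A sketch of the text tower along the same lines, assuming a HuggingFace `bert-base-uncased` backbone (the notes only specify a 12-block BERT) with the [CLS] state projected into the shared 128-d space:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TextTower(nn.Module):
    """BERT-base (12 blocks) over null-padded token sequences of length n=512 -> 128-d embedding."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] representation
        return nn.functional.normalize(self.proj(cls), dim=-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["upbeat synth pop with a driving beat"],
                  padding="max_length", max_length=512,
                  truncation=True, return_tensors="pt")
print(TextTower()(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([1, 128])
```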

3. 4. Training Dataset Mining

From the soundtrack of each music video, a 30-second clip is extracted starting at the 30-second mark
44 million 30-second clips in total
Short-form: video titles and tags
Long-form: text including video descriptions and comments
Playlists: playlist titles
A BERT classifier checks whether each text is related to music, and this filter is used to clean the data
AudioSet (ASET) data is also used
Batches are constructed with a 2:2:1:1 ratio of SF:LF:PL:ASET
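
A toy sketch of the 2:2:1:1 mini-batch composition over the four annotation sources; the source pools and batch size here are hypothetical placeholders:

```python
import random

# hypothetical pools of (audio_clip, text) pairs, one per annotation source
SOURCES = {"SF": [], "LF": [], "PL": [], "ASET": []}
MIX = {"SF": 2, "LF": 2, "PL": 1, "ASET": 1}   # 2:2:1:1 ratio

def sample_batch(batch_size=96):
    """Draw a mini-batch whose composition follows the SF:LF:PL:ASET ratio."""
    total = sum(MIX.values())
    batch = []
    for name, weight in MIX.items():
        k = batch_size * weight // total
        if SOURCES[name]:
            batch.extend(random.choices(SOURCES[name], k=k))
    random.shuffle(batch)
    return batch
```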

4. Experiments

4. 1. Evaluation Tasks

4. 1. 1. Zero-shot Music Tagging

(i) Using a contextual text encoder to learn the prediction space
(ii) Using cross-modal contrastive learning to transfer language semantics to the audio representation
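
A sketch of the zero-shot protocol: tag names are embedded with the text encoder and scored against the audio embedding by cosine similarity (random embeddings stand in for the actual towers here):

```python
import torch

def zero_shot_tag_scores(audio_emb, tag_embs):
    """Rank candidate tags for one clip by cosine similarity in the joint space.

    audio_emb: (d,) embedding f(x) of the clip.
    tag_embs:  (num_tags, d) text embeddings g(tag) of the tag vocabulary.
    Both are assumed L2-normalized, so the dot product is the cosine similarity.
    """
    return tag_embs @ audio_emb                     # (num_tags,) scores

# toy usage with random, normalized embeddings
d, tags = 128, ["rock", "jazz", "choral"]
audio_emb = torch.nn.functional.normalize(torch.randn(d), dim=-1)
tag_embs = torch.nn.functional.normalize(torch.randn(len(tags), d), dim=-1)
scores = zero_shot_tag_scores(audio_emb, tag_embs)
print(tags[scores.argmax().item()])                 # highest-scoring tag
```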

4. 1. 2. Transfer Learning with Linear Probes

Transfer learning to MagnaTagATune and AudioSet
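
A sketch of the linear probe protocol with scikit-learn, using a single-label logistic regression as a stand-in (the actual tagging tasks are multi-label; the arrays below are random placeholders for frozen MuLaN embeddings and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# placeholders: frozen 128-d audio embeddings and single-label targets
X_train, y_train = np.random.randn(500, 128), np.random.randint(0, 10, 500)
X_test,  y_test  = np.random.randn(100, 128), np.random.randint(0, 10, 100)

# the encoder stays frozen; only this linear classifier is fit
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```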

4. 1. 3. Music Retrieval From Text Queries

Playlist titles and descriptions contain richer information, so retrieval measures the cosine similarity between their embeddings and the music embeddings
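
A sketch of retrieval from a text query: the playlist-style query is embedded once and the clip corpus is ranked by cosine similarity (the corpus and query embeddings are random placeholders):

```python
import torch

def retrieve(query_emb, corpus_embs, k=5):
    """Return indices of the top-k clips for one text query, by cosine similarity."""
    scores = corpus_embs @ query_emb                # (num_clips,)
    return scores.topk(k).indices

# toy corpus of 1,000 clip embeddings and one playlist-style query embedding
corpus = torch.nn.functional.normalize(torch.randn(1000, 128), dim=-1)
query = torch.nn.functional.normalize(torch.randn(128), dim=-1)
print(retrieve(query, corpus))
```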

4. 1. 4. Text Triplet Classification

3 text strings of (anchor, pos, neg)
Considered correct if pos is closer than neg to anchor
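
A sketch of the triplet check, where `g` is any function mapping a string to a normalized embedding (the hash-seeded stand-in below is purely illustrative):

```python
import torch

def triplet_correct(g, anchor, pos, neg):
    """True if g(pos) is closer (higher cosine similarity) to g(anchor) than g(neg)."""
    a, p, n = g(anchor), g(pos), g(neg)
    return torch.dot(a, p) > torch.dot(a, n)

# toy stand-in for the text encoder: hash-seeded random unit vectors
def g(text, d=128):
    gen = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.nn.functional.normalize(torch.randn(d, generator=gen), dim=-1)

print(triplet_correct(g, "calm piano ballad", "soft solo piano", "aggressive metal"))
```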

4. 2. Results and Discussion

4. 2. 1. Music Tagging

4. 2. 2. Music Retrieval from Text Queries

Training is surprisingly robust to annotation noise, achieving similar performance using unfiltered training text

4. 2. 3. Text Triplet Classification
