• Link To Paper: https://arxiv.org/pdf/2208.12415.pdf
◦ Links music audio to unconstrained natural music descriptions
◦ Consists of a joint audio-text embedding model trained on weakly-associated free-form text annotations
◦ Demonstrates zero-shot functionalities
1. Introduction
• Music-text joint embeddings are useful for transfer learning, cross-modal retrieval, automatic captioning, etc.
• Compared to images, audio lacks caption data
• The paper develops a model that links musical concepts to music audio
• Text prompts can specify the appropriate genre, artist, mood, and structure
• Leverages the text accompanying 44 million music videos on the internet
2. Related Work
2. 1. Audio Representation Learning
• Previously, the Audio Spectrogram Transformer (AST), pretrained on ImageNet and AudioSet, was the state-of-the-art tagging model
• Discriminative training: assigns high similarity to views of the same recording
• Generative models: intermediate embeddings are meaningful as representations
• This work instead uses text that is weakly associated with the music
2. 2. Cross-modal Contrastive Learning
• Prior work tried to obtain generalized embeddings with Word2Vec
• AudioSet and ESC-50 data were used
• This work collects a much broader range of data and builds transformer-based audio and contextual language encoders
2. 3. Music Text Joint Embedding Models
• Prior work frames music semantics as multi-label classification
• MuLaP adopts early fusion, which limits its transfer learning applications
• This work can provide a natural language interface even for arbitrary music audio
3. Proposed Approach
3. 1. Learning Framework
• Audio embedding network:
◦ F-channel log mel spectrogram
◦ T-frame context windows
• Text embedding network:
◦ Null-padded text token sequence of length n
• Mini batch B of audio-text pairs
• Presents a contrastive multi-view coding loss function such as InfoNCE / NT-Xent loss (sketched below)
◦ Temperature hyperparameter
• Large positive value for target audio-text pairs
• Small value close to zero for all non-target pairs
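A minimal sketch of the batch-wise contrastive objective described above, in PyTorch; the symmetric cross-entropy formulation and the temperature value here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.1):
    """NT-Xent / InfoNCE-style loss over a mini batch B of audio-text pairs.

    audio_emb, text_emb: (B, d) embeddings from the two towers.
    The loss pushes target (diagonal) pairs toward high similarity and
    all non-target pairs toward low similarity.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) matrix of scaled cosine similarities; the diagonal holds target pairs.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```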
3. 2. Audio Embedding Network
• ResNet-50 architecture
◦ (F=64) x (T=1000) spectrogram patches
◦ d=128 units
• Audio Spectrogram Transformer (AST)
◦ 12 Transformer blocks
◦ (F=128) x (T=1000) log mel spectrogram context windows
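A rough sketch of how a ResNet-50 audio tower over single-channel log mel spectrograms could produce d=128 embeddings; the single-channel stem replacement, pooling, and linear projection are assumptions for illustration rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torchvision

class AudioTower(nn.Module):
    """ResNet-50 over log mel spectrograms -> d=128 embedding (sketch)."""

    def __init__(self, d=128):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # Spectrograms are single-channel, so replace the 3-channel RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()            # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, d)         # project into the shared space

    def forward(self, spec):                    # spec: (batch, 1, F=64, T=1000)
        return self.proj(self.backbone(spec))   # (batch, d)

emb = AudioTower()(torch.randn(2, 1, 64, 1000))  # -> shape (2, 128)
```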
3. 3. Text Embedding Network
• BERT with 12 Transformer blocks
• Input string tokenized to n=512 tokens
• Shared audio-text embedding space of d=128
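A sketch of the text tower along the lines above: BERT with 12 Transformer blocks, input padded/truncated to 512 tokens, and a linear projection into the shared d=128 space; the bert-base-uncased checkpoint and [CLS] pooling are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextTower(nn.Module):
    """BERT-base (12 Transformer blocks) -> d=128 embedding (sketch)."""

    def __init__(self, d=128, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.proj = nn.Linear(self.bert.config.hidden_size, d)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])  # [CLS] token -> shared space

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["upbeat synth pop with driving drums"],
                  padding="max_length", truncation=True, max_length=512,
                  return_tensors="pt")
emb = TextTower()(batch["input_ids"], batch["attention_mask"])  # -> shape (1, 128)
```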
3. 4. Training Dataset Mining
• From the soundtrack of each music video, extract a 30-second clip starting at the 30-second mark
• 44 million 30-second clips
• Short-form (SF): video titles and tags
• Long-form (LF): text including video descriptions and comments
• Playlists (PL): playlist titles
• A BERT classifier checks whether the text is related to music; this is used to cleanse the data
• AudioSet (ASET) data is also used
• Batches are constructed with a 2:2:1:1 ratio of SF:LF:PL:ASET
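An illustrative sketch of mixing a training batch with the 2:2:1:1 SF:LF:PL:ASET ratio; the pool structure, batch size, and function name are hypothetical.

```python
import random

def sample_batch(pools, batch_size, ratio=(("SF", 2), ("LF", 2), ("PL", 1), ("ASET", 1))):
    """Draw a training batch whose composition follows the 2:2:1:1 SF:LF:PL:ASET ratio.

    pools: dict mapping source name -> list of (audio_clip, text) pairs.
    """
    total = sum(weight for _, weight in ratio)
    batch = []
    for name, weight in ratio:
        n = batch_size * weight // total
        batch.extend(random.choices(pools[name], k=n))
    random.shuffle(batch)
    return batch

# Toy usage with placeholder pools (real pools would hold 30-second clips + text).
pools = {name: [(f"{name}_clip_{i}", f"{name}_text_{i}") for i in range(10)]
         for name in ("SF", "LF", "PL", "ASET")}
batch = sample_batch(pools, batch_size=12)   # -> 4 SF, 4 LF, 2 PL, 2 ASET examples
```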
4. Experiments
4. 1. Evaluation Tasks
4. 1. 1. Zero-shot Music Tagging
•
(i) The use of a contextual text encoder로부터 prediction space 학습
•
(ii) The use of cross-modal contrastive learning로부터 language semantics to an audio representation 학습
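A small sketch of zero-shot tagging under these ideas: each tag name is embedded with the text tower and scored against the audio embedding by cosine similarity, so no tag-specific classifier is trained; text_tower and tokenizer reuse the text-tower sketch above and are assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_tag_scores(audio_emb, tag_texts, text_tower, tokenizer):
    """Score one audio clip against an arbitrary tag vocabulary (sketch).

    audio_emb: (d,) audio embedding; tag_texts: list of tag name strings.
    """
    batch = tokenizer(tag_texts, padding=True, truncation=True, return_tensors="pt")
    tag_emb = text_tower(batch["input_ids"], batch["attention_mask"])   # (num_tags, d)
    scores = F.cosine_similarity(audio_emb.unsqueeze(0), tag_emb, dim=-1)
    return dict(zip(tag_texts, scores.tolist()))
```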
4. 1. 2. Transfer Learning with Linear Probes
• Transfer learning on MagnaTagATune and AudioSet
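An illustrative linear-probe setup: the audio embeddings stay frozen and only a linear classifier is trained on the downstream tags; the scikit-learn probe and the random placeholder data are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Placeholder frozen audio embeddings (d=128) and multi-hot tag labels,
# standing in for a downstream tagging set such as MagnaTagATune.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(500, 128)), rng.normal(size=(100, 128))
y_train = rng.integers(0, 2, size=(500, 10))

# Linear probe: the embedding network is fixed; only this classifier is fit.
probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, y_train)
tag_probs = probe.predict_proba(X_test)   # per-tag probabilities on the test split
```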
4. 1. 3. Music Retrieval From Text Queries
• Playlist titles and descriptions carry more information, so retrieval is evaluated by measuring cosine similarity between their embeddings and the music embeddings
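A small sketch of text-query retrieval: embed the playlist title/description with the text tower and rank pre-computed clip embeddings by cosine similarity; the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, clip_embs, clip_ids, k=5):
    """Rank music clips for a free-form text query by cosine similarity (sketch).

    query_emb: (d,) embedding of e.g. a playlist title/description.
    clip_embs: (num_clips, d) pre-computed audio embeddings.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), clip_embs, dim=-1)
    top = torch.topk(sims, k=min(k, len(clip_ids)))
    return [(clip_ids[i], sims[i].item()) for i in top.indices.tolist()]
```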
4. 1. 4. Text Triplet Classification
• 3 text strings of (anchor, pos, neg)
• Considered correct if pos is closer than neg to the anchor
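A minimal sketch of the triplet-correctness check, assuming cosine similarity in the shared embedding space is used as the closeness measure.

```python
import torch.nn.functional as F

def triplet_correct(anchor_emb, pos_emb, neg_emb):
    """Return True when the positive text embedding is closer to the anchor
    than the negative one (text triplet classification, sketch)."""
    sim_pos = F.cosine_similarity(anchor_emb, pos_emb, dim=-1)
    sim_neg = F.cosine_similarity(anchor_emb, neg_emb, dim=-1)
    return (sim_pos > sim_neg).item()
```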
4. 2. Results and Discussion
4. 2. 1. Music Tagging
4. 2. 2. Music Retrieval from Text Queries
• Training is surprisingly robust to annotation noise, achieving similar performance using unfiltered training text