KoCLIP - https://github.com/jaketae/koclip
1. GPT-assisted Visual Instruction Data Generation
• Specifically, in order to encode an image into its visual features to prompt a text-only GPT, we use two types of symbolic representations:
• (a) captions, (b) bounding boxes
• These symbolic encodings are used to guide GPT-4 (for each type, we first manually design a few examples; they are the only human annotations we have during data collection, and are used as seed examples in in-context learning to query GPT-4). See the prompting sketch below.
• This process yields 158K instruction-following samples.
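A minimal sketch of this prompting scheme, assuming the OpenAI Python client; the system message, seed examples, and helper names are illustrative, not the paper's exact prompts:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_MSG = (
    "You are an AI visual assistant. You receive textual descriptions of an "
    "image (captions and object bounding boxes) but never the image itself. "
    "Generate an instruction-following conversation about the image."
)

# A few manually written seed examples (the only human annotations),
# used as in-context demonstrations when querying GPT-4.
SEED_EXAMPLES = [
    {"role": "user", "content": "Captions:\na man rides a bike down a street\n"
                                "Boxes:\nperson [0.1, 0.2, 0.4, 0.9]\nbicycle [0.1, 0.5, 0.5, 1.0]"},
    {"role": "assistant", "content": "Question: What is the man doing?\n"
                                     "Answer: He is riding a bicycle down the street."},
]

def build_symbolic_context(captions, boxes):
    """Encode an image as text: its captions plus 'class [x1, y1, x2, y2]' boxes."""
    cap_str = "\n".join(captions)
    box_str = "\n".join(f"{cls} {coords}" for cls, coords in boxes)
    return f"Captions:\n{cap_str}\nBoxes:\n{box_str}"

def generate_instruction_data(captions, boxes):
    """Query a text-only GPT-4 with the symbolic image representation."""
    messages = [{"role": "system", "content": SYSTEM_MSG}]
    messages += SEED_EXAMPLES  # in-context seed examples
    messages.append({"role": "user", "content": build_symbolic_context(captions, boxes)})
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content
```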
2. Training
2.1. Pre-training
• LLM and visual encoder frozen
• Only the projection layer is trained
• Given an image, the model is trained only to produce a simple description of it
• 595K filtered image-text pairs from CC3M are converted, via the data-generation process above, into data whose answer is a brief description of the image (a training sketch follows this list)
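A minimal PyTorch sketch of this stage, assuming a CLIP-style vision encoder, a linear projection, and a HuggingFace causal LM; the module names and dimensions are assumptions, not LLaVA's actual code:

```python
import torch
import torch.nn as nn

class ProjectionPretrainer(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # The only trainable module in stage 1: maps visual features
        # into the LLM's word-embedding space.
        self.projection = nn.Linear(vision_dim, llm_dim)

        # Freeze both the visual encoder and the LLM.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            vis_feats = self.vision_encoder(pixel_values)  # assumed (B, N, vision_dim)
        vis_tokens = self.projection(vis_feats)            # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

# The optimizer sees only the projection parameters:
# optim = torch.optim.AdamW(model.projection.parameters(), lr=2e-3)
```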
2.2. Fine-tuning
• Visual encoder frozen
• Pre-trained LLM and projection layer are updated
• Trained on the 158K language-image instruction-following data generated above (a parameter-selection sketch follows)
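Under the same assumed module layout as the stage-1 sketch, stage-2 fine-tuning only changes which parameters receive gradients:

```python
import torch

def get_finetune_params(model):
    """Stage 2: freeze the vision encoder, update the LLM and projection."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    trainable = list(model.projection.parameters()) + list(model.llm.parameters())
    for p in trainable:
        p.requires_grad = True
    return trainable

# optim = torch.optim.AdamW(get_finetune_params(model), lr=2e-5)
```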
3. Model
• LM: Vicuna
• Input: an image and a language instruction
• Output: a language response
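A minimal inference sketch under the layout above, showing the input/output flow: projected visual tokens are prepended to the instruction embeddings and Vicuna generates the response. Tokenizer handling and generation settings are assumptions:

```python
import torch

@torch.no_grad()
def answer(model, tokenizer, pixel_values, instruction, max_new_tokens=128):
    # Encode the image and project into the LLM embedding space.
    vis_tokens = model.projection(model.vision_encoder(pixel_values))
    # Embed the language instruction.
    text_ids = tokenizer(instruction, return_tensors="pt").input_ids
    text_embeds = model.llm.get_input_embeddings()(text_ids)
    # Visual tokens first, then the instruction; generate the response.
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    out_ids = model.llm.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```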