LLaVA

1. GPT-assisted Visual Instruction Data Generation

Specifically, in order to encode an image into its visual features to prompt a text-only GPT, we use two types of symbolic representations:
(a) captions, (b) bounding boxes
These symbolic representations are encoded as text and used to guide GPT-4. (For each type, we first manually design a few examples. They are the only human annotations we have during data collection, and are used as seed examples in in-context-learning to query GPT-4.)
This process generates 158K language-image instruction-following samples (sketch below).
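A minimal sketch of how such a query to a text-only GPT-4 might be assembled from the symbolic representation. The prompt wording, the `seed_examples` content, and the helper names are hypothetical illustrations, not the authors' exact prompts.

```python
# Sketch: render an image's symbolic representation (captions + bounding boxes)
# as plain text and assemble a GPT-4 chat query seeded with a few manually
# written in-context examples (the only human annotations in the pipeline).

def image_to_symbolic(captions, boxes):
    """Captions and normalized bounding boxes rendered as text for a text-only model."""
    cap_text = "\n".join(captions)
    box_text = "\n".join(
        f"{label}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return f"Captions:\n{cap_text}\n\nBounding boxes:\n{box_text}"

# Hypothetical seed examples, used as few-shot demonstrations in the query.
seed_examples = [
    {
        "context": "Captions:\nA man riding a horse on a beach.\n\n"
                   "Bounding boxes:\nperson: [0.31, 0.12, 0.58, 0.79]\n"
                   "horse: [0.20, 0.35, 0.75, 0.95]",
        "response": "Question: What is the man doing?\nAnswer: He is riding a horse along a beach.",
    },
]

def build_messages(captions, boxes):
    system = ("You are an AI assistant. Given captions and bounding boxes that "
              "describe an image, generate instruction-following data about the image.")
    messages = [{"role": "system", "content": system}]
    for ex in seed_examples:                         # in-context seed examples
        messages.append({"role": "user", "content": ex["context"]})
        messages.append({"role": "assistant", "content": ex["response"]})
    messages.append({"role": "user", "content": image_to_symbolic(captions, boxes)})
    return messages
```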

2. Training

2.1. Pre-training
LLM and visual encoder frozen
Only the projection layer is trained
Given an image, the model only needs to produce a brief description of it
595K filtered image-text pairs from CC3M are converted, via the data generation process above, into brief-description answer data and used for this stage (sketch after this list)
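A minimal PyTorch sketch of the stage-1 setup, assuming placeholder module names: only the linear projection from visual features into the LLM embedding space receives gradients; the vision encoder and the LLM stay frozen.

```python
import torch
import torch.nn as nn

class LlavaStage1(nn.Module):
    """Stage-1 pre-training: train only the vision-to-LLM projection."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP ViT-L/14, frozen
        self.llm = llm                                    # e.g. Vicuna, frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # the only trainable part

        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def encode_image(self, pixel_values):
        # Frozen visual features, assumed shape (B, num_patches, vision_dim)
        with torch.no_grad():
            feats = self.vision_encoder(pixel_values)
        return self.projection(feats)                     # (B, num_patches, llm_dim)

# The optimizer only sees the projection weights:
# optimizer = torch.optim.AdamW(model.projection.parameters(), lr=2e-3)
```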
2.2. Fine-tuning
Visual encoder frozen
Pre-trained LLM and projection layer are both updated
Trained on the 158K language-image instruction-following data generated above (sketch after this list)
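A short sketch of the stage-2 parameter setup, reusing the `LlavaStage1`-style module assumed above: the LLM is now unfrozen alongside the projection, while the vision encoder stays frozen.

```python
import torch

def set_stage2_trainable(model):
    """Flip requires_grad for stage-2 fine-tuning; `model` follows the sketch above."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False           # vision encoder still frozen
    for p in model.llm.parameters():
        p.requires_grad = True            # LLM weights now updated
    for p in model.projection.parameters():
        p.requires_grad = True            # projection keeps training
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(set_stage2_trainable(model), lr=2e-5)
```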

3. Model

LLM: Vicuna
Input: image X_v (visual encoder features mapped by the projection layer into visual tokens H_v) + language instruction X_q
Output: language response X_a, generated auto-regressively by the LLM (sketch below)
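A minimal sketch of the forward pass implied above, with placeholder modules: projected visual tokens are prepended to the embedded instruction tokens, and the LLM (assumed to accept HuggingFace-style `inputs_embeds`) predicts the response over the combined sequence.

```python
import torch

def forward_pass(vision_encoder, projection, llm_embed, llm, pixel_values, instr_ids):
    # 1) image X_v -> frozen visual features, assumed (B, num_patches, vision_dim)
    with torch.no_grad():
        img_feats = vision_encoder(pixel_values)

    # 2) project into the LLM word-embedding space -> visual tokens H_v
    visual_tokens = projection(img_feats)              # (B, num_patches, llm_dim)

    # 3) embed the language instruction X_q -> H_q
    text_tokens = llm_embed(instr_ids)                 # (B, seq_len, llm_dim)

    # 4) concatenate and let the LLM predict the response X_a auto-regressively
    inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
    return llm(inputs_embeds=inputs_embeds)            # logits over next tokens
```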