KoCLIP - https://github.com/jaketae/koclip
1. GPT-assisted Visual Instruction Data Generation
• Specifically, in order to encode an image into its visual features to prompt a text-only GPT, we use two types of symbolic representations:
• (a) captions, (b) bounding boxes
• These symbolic encodings are used to guide GPT-4 (for each type, we first manually design a few examples; they are the only human annotations we have during data collection, and are used as seed examples in in-context learning to query GPT-4). See the prompting sketch below.
• This process yields 158K instruction-following samples.
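A minimal sketch of this prompting scheme, assuming the OpenAI Python client; the system message, seed examples, and helper names are illustrative, not the paper's exact prompts:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_MSG = (
    "You are an AI visual assistant. You receive textual descriptions of an "
    "image (captions and object bounding boxes) but never the image itself. "
    "Generate an instruction-following conversation about the image."
)

# A few manually written seed examples (the only human annotations),
# used as in-context demonstrations when querying GPT-4.
SEED_EXAMPLES = [
    {"role": "user", "content": "Captions:\na man rides a bike down a street\n"
                                "Boxes:\nperson [0.1, 0.2, 0.4, 0.9]\nbicycle [0.1, 0.5, 0.5, 1.0]"},
    {"role": "assistant", "content": "Question: What is the man doing?\n"
                                     "Answer: He is riding a bicycle down the street."},
]

def build_symbolic_context(captions, boxes):
    """Encode an image as text: its captions plus 'class [x1, y1, x2, y2]' boxes."""
    cap_str = "\n".join(captions)
    box_str = "\n".join(f"{cls} {coords}" for cls, coords in boxes)
    return f"Captions:\n{cap_str}\nBoxes:\n{box_str}"

def generate_instruction_data(captions, boxes):
    """Query a text-only GPT-4 with the symbolic image representation."""
    messages = [{"role": "system", "content": SYSTEM_MSG}]
    messages += SEED_EXAMPLES  # in-context seed examples
    messages.append({"role": "user", "content": build_symbolic_context(captions, boxes)})
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content
```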
2. Training
2.1. Pre-training
• LLM and visual encoder frozen
• Only the projection layer is trained
• Given an image, the model is trained only to produce a simple description of it
• 595K filtered image-text pairs from CC3M are converted, via the data-generation process above, into data whose answer is a brief description of the image (a training sketch follows this list)
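A minimal PyTorch sketch of this stage, assuming a CLIP-style vision encoder, a linear projection, and a HuggingFace causal LM; the module names and dimensions are assumptions, not LLaVA's actual code:

```python
import torch
import torch.nn as nn

class ProjectionPretrainer(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # The only trainable module in stage 1: maps visual features
        # into the LLM's word-embedding space.
        self.projection = nn.Linear(vision_dim, llm_dim)

        # Freeze both the visual encoder and the LLM.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            vis_feats = self.vision_encoder(pixel_values)  # assumed (B, N, vision_dim)
        vis_tokens = self.projection(vis_feats)            # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

# The optimizer sees only the projection parameters:
# optim = torch.optim.AdamW(model.projection.parameters(), lr=2e-3)
```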
2.2. Fine-tuning
• Visual encoder frozen
• Pre-trained LLM and projection layer are updated
• Trained on the 158K language-image instruction-following data generated above (a parameter-selection sketch follows)
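Under the same assumed module layout as the stage-1 sketch, stage-2 fine-tuning only changes which parameters receive gradients:

```python
import torch

def get_finetune_params(model):
    """Stage 2: freeze the vision encoder, update the LLM and projection."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    trainable = list(model.projection.parameters()) + list(model.llm.parameters())
    for p in trainable:
        p.requires_grad = True
    return trainable

# optim = torch.optim.AdamW(get_finetune_params(model), lr=2e-5)
```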
3. Model
• LM: Vicuna
• Input: an image and a language instruction
• Output: a language response
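A minimal inference sketch under the layout above, showing the input/output flow: projected visual tokens are prepended to the instruction embeddings and Vicuna generates the response. Tokenizer handling and generation settings are assumptions:

```python
import torch

@torch.no_grad()
def answer(model, tokenizer, pixel_values, instruction, max_new_tokens=128):
    # Encode the image and project into the LLM embedding space.
    vis_tokens = model.projection(model.vision_encoder(pixel_values))
    # Embed the language instruction.
    text_ids = tokenizer(instruction, return_tensors="pt").input_ids
    text_embeds = model.llm.get_input_embeddings()(text_ids)
    # Visual tokens first, then the instruction; generate the response.
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    out_ids = model.llm.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```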