Highlights

  • Instruction tuning improves the zero- and few-shot generalization abilities of Large Language Models (LLMs). The NLP community has successfully leveraged this idea to build LLMs like InstructGPT, FLAN-T5, and FLAN-PaLM, which can act as general-purpose assistants: the user provides task instructions and guides the model to a solution through conversation.
  • But we humans use more than just words. We use visuals to enrich our conversations, explain complex ideas, and understand the world. Recognizing this need, various language-augmented “foundation” vision models have emerged in the last two years. However, in these models, language was used primarily as a tool to describe the image content. The result was often a single large vision model with a fixed interface that wasn’t great at adapting to user preferences or supporting interactivity.
  • Using “visual” instruction tuning, researchers have built a model that operates in the language-image multimodal space. It rivals GPT-4, excelling in chat capabilities and scientific question answering. What’s even more impressive is that the training dataset, model weights, and code have all been open-sourced.
  • The authors of LLaVA thus resort to using the language-only GPT-4 to create instruction-following data.
  • For an image X_v and its associated caption X_c, it is natural to create a set of questions X_q with the intent to instruct the assistant to describe the image content. We prompt GPT-4 to curate such a list of questions (see details in Appendix). Therefore, a simple way to expand an image-text pair to its instruction-following version is: Human: X_q X_v → Assistant: X_c. Though cheap to construct, this simple expanded version lacks diversity and in-depth reasoning in both the instructions and responses.
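To make that “cheap to construct” expansion concrete, here is a minimal Python sketch of the naive pairing: a curated “describe the image” question (X_q) is attached to each image (X_v), with the existing caption (X_c) as the answer. The question list and field names below are illustrative assumptions, not the paper’s exact data or schema.

```python
import random

# Hypothetical stand-in for the GPT-4-curated "describe the image" questions
# referenced in the paper's appendix.
DESCRIBE_QUESTIONS = [
    "What is happening in this image?",
    "Describe the image in detail.",
    "Can you explain what you see in this picture?",
]

def expand_pair(image_path: str, caption: str) -> dict:
    """Turn one (image, caption) pair into a single-turn instruction example."""
    question = random.choice(DESCRIBE_QUESTIONS)
    return {
        "image": image_path,  # X_v
        "conversations": [
            {"from": "human", "value": question},     # X_q
            {"from": "assistant", "value": caption},  # X_c
        ],
    }

# Example usage
example = expand_pair("coco/000000123.jpg", "A dog chasing a frisbee in a park.")
print(example)
```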
  • To mitigate this issue, we leverage language-only GPT-4 or ChatGPT as the strong teacher (both accept only text as input), to create instruction-following data involving visual content. Specifically, in order to encode an image into its visual features to prompt a text-only GPT, we use two types of symbolic representations: (i) Captions typically describe the visual scene from various perspectives; (ii) Bounding boxes usually localize the objects in the scene, and each box encodes the object concept and its spatial location.
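The key trick is that the teacher model never sees pixels, only text. Below is a rough sketch of how an image might be serialized into those two symbolic representations before prompting a text-only GPT-4/ChatGPT. The function, field names, and prompt wording are illustrative assumptions; the actual prompt templates are given in the paper’s appendix.

```python
def build_symbolic_context(captions: list[str], boxes: list[dict]) -> str:
    """Serialize captions and bounding boxes into a text-only prompt context."""
    caption_block = "\n".join(captions)
    # Each box encodes an object label and its normalized [x1, y1, x2, y2] location.
    box_block = "\n".join(
        f"{b['label']}: [{b['x1']:.3f}, {b['y1']:.3f}, {b['x2']:.3f}, {b['y2']:.3f}]"
        for b in boxes
    )
    return (
        "Captions:\n" + caption_block + "\n\n"
        "Objects (bounding boxes):\n" + box_block + "\n\n"
        "Based only on the text above, generate a multi-turn conversation "
        "between a user and an assistant about this image."
    )

# Example usage: the resulting string would be sent to a text-only GPT-4/ChatGPT.
context = build_symbolic_context(
    captions=["A group of people standing around a food truck."],
    boxes=[
        {"label": "person", "x1": 0.12, "y1": 0.30, "x2": 0.25, "y2": 0.95},
        {"label": "truck", "x1": 0.40, "y1": 0.10, "x2": 0.98, "y2": 0.90},
    ],
)
print(context)
```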
  • Let’s look at the model next. LLaVA has three components:
    1. Visual Encoder: This converts the image into visual features. The encoder itself is a pre-trained CLIP ViT-L/14. 
    2. Language Model: A pre-trained Vicuna, known for its instruction-following capabilities, converts text into embeddings.
    3. Projection: This is a simple linear layer that “projects” the visual features from the encoder into the word embedding space (a minimal wiring sketch follows below).
       [Figure: The LLaVA architecture, from [2]]
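Here is a minimal PyTorch sketch of how these three components fit together. The vision encoder and language model are stand-ins supplied by the caller (in LLaVA they are a frozen CLIP ViT-L/14 and Vicuna), and the dimensions are only indicative; this is not the released implementation.

```python
import torch
import torch.nn as nn

VISION_DIM = 1024  # assumed CLIP ViT-L/14 patch-feature size
LLM_DIM = 4096     # assumed Vicuna hidden / word-embedding size

class LLaVASketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder               # image -> patch features
        self.projection = nn.Linear(VISION_DIM, LLM_DIM)   # the trainable "bridge"
        self.language_model = language_model               # embeddings -> text

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # 1. Encode the image into a sequence of visual features.
        visual_feats = self.vision_encoder(image)          # (B, N_patches, VISION_DIM)
        # 2. Project visual features into the word-embedding space.
        visual_tokens = self.projection(visual_feats)      # (B, N_patches, LLM_DIM)
        # 3. Prepend the visual tokens to the text embeddings and run the LLM.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs)
```

Keeping the bridge to a single linear layer keeps the number of newly trained parameters small, since the encoder and the language model both start from strong pre-trained checkpoints.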
  • Although LLaVA is trained on a smaller instruction-following dataset, it performs surprisingly well compared to GPT-4. For example, on two images that are out-of-domain for LLaVA, it still understands the scenes well and follows the questions, whereas BLIP-2 and OpenFlamingo focus on just describing the image.