A Gentle Introduction to Vector Databases

Highlights

  • As the internet grew and evolved, unstructured data (magazine articles, shared photos, short videos, etc.) became increasingly common. Unlike structured data, there is no easy way to store the contents of unstructured data within a relational database.
  • The increasing ubiquity of unstructured data has led to a steady rise in the use of machine learning models trained to understand such data.
  • Armed with this knowledge, it’s now clear what vector databases are used for: searching across images, video, text, audio, and other forms of unstructured data via their content rather than keywords or tags.
  • Now that we’ve seen the representational power of vector embeddings, let’s briefly discuss indexing the vectors. Like relational databases, vector databases need to be searchable in order to be truly useful: just storing the vector and its associated metadata is not enough. Finding the stored vectors closest to a query vector is called nearest neighbor search, or NN search for short, and it can be considered a subfield of machine learning and pattern recognition in its own right due to the sheer number of solutions proposed.
  • Vector search is generally split into two components: the similarity metric and the index. The similarity metric defines how the distance between two vectors is evaluated, while the index is a data structure that facilitates the search process.
  • Now that we understand the representational power of embedding vectors and have a good general overview of how vector search works, it’s time to put the two concepts together: welcome to the world of vector databases. A vector database is purpose-built to store, index, and query across embedding vectors generated by passing unstructured data through machine learning models.
  • When scaling to huge numbers of vector embeddings, searching across embedding vectors (even with indices) can be prohibitively expensive. Despite this, the best and most advanced vector databases will allow you to insert and search across millions or even billions of target vectors, in addition to specifying an indexing algorithm and similarity metric of your choosing.
  • When scaling to billions of embedding vectors and beyond, storage and compute quickly become unmanageable for a single machine. Sharding can solve this problem, but it requires splitting the indexes across multiple machines as well.
  • Yes, query and write speeds are important, even for vector databases. An increasingly common use case for vector databases is processing and indexing input data in real time.
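
The highlight on vector search describes two components: a similarity metric and an index. As a rough sketch of the simplest possible case, the code below pairs cosine similarity with an exhaustive "flat" search that scores every stored vector. The function names and the tiny three-dimensional vectors are made up for illustration; real embeddings come from a machine learning model and have hundreds or thousands of dimensions, and real indexes avoid scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Similarity metric: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Alternative metric: straight-line distance (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=2):
    """Exhaustive search: score every stored vector, keep the k most similar."""
    scored = [(cosine_similarity(query, v), key) for key, v in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy "embeddings", chosen by hand for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

result = nearest_neighbors([1.0, 0.0, 0.0], vectors, k=2)
print(result)
```

Swapping `cosine_similarity` for `euclidean_distance` (and sorting ascending) changes the metric without touching the search loop, which is the separation between metric and index that the highlight describes.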
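
The highlight on sharding notes that the index itself has to be split across machines. One way this could look, sketched here under simplifying assumptions, is a scatter-gather search: each shard holds part of the data, answers the query locally, and the partial results are merged into a global top k. The `Shard` and `ShardedIndex` classes are hypothetical, the shards live in one process rather than on separate machines, and each shard uses a plain exhaustive scan instead of a real index.

```python
import heapq
import math
import zlib


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Shard:
    """Stand-in for one machine holding a slice of the vectors."""

    def __init__(self):
        self.vectors = {}

    def insert(self, key, vector):
        self.vectors[key] = vector

    def search(self, query, k):
        # Local top k as (distance, key) pairs, closest first.
        candidates = ((distance(query, v), key) for key, v in self.vectors.items())
        return heapq.nsmallest(k, candidates)


class ShardedIndex:
    """Routes writes to one shard and fans queries out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 gives a stable hash across runs, unlike the built-in hash() for str.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def insert(self, key, vector):
        self._shard_for(key).insert(key, vector)

    def search(self, query, k):
        # Scatter: ask every shard for its local top k. Gather: merge them.
        partial = [hit for shard in self.shards for hit in shard.search(query, k)]
        return [key for _, key in heapq.nsmallest(k, partial)]


index = ShardedIndex(num_shards=3)
for i in range(10):
    index.insert(f"item-{i}", [float(i), float(i)])

top = index.search([2.2, 2.2], k=3)
print(top)
```

Because every shard returns its own k best candidates, the merged result matches what a single-machine exhaustive search would return; with approximate per-shard indexes that guarantee would no longer hold exactly.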