Feature Engineering for Personalized Search

Highlights

  • At the simplest level, search is the process of finding matches to a user-provided query string. While this base-level version of search is sometimes still utilized in simpler applications, for the most part, the results from queries that users perform are personalized to them, in order to surface documents that are more likely to be relevant to the user’s interests. (View Highlight)
  • Search engines have typically relied on an inverted index-based information retrieval system, which maps individual words or terms to the documents or records that contain them, and helps reduce the candidate set by several orders of magnitude. (View Highlight)
  • Elasticsearch stores data in an index and searches the inverted indices for query terms when a query is submitted. When a match for the query term is found, the corresponding document is returned. Breaking down the documents and identifying them by their key terms allows for faster retrieval of results. (View Highlight)
  • Semantic Retrieval: While there are many ways to perform semantic retrieval, TTSN with BERT is frequently used in combination with k-nearest-neighbor (KNN) indices when working with personalized search. (View Highlight)
  • For personalized search, a KNN index is used to model the user’s interests and preferences, based on the user’s past search queries and the search results that the user has clicked on (which are modeled by the TTSN). (View Highlight)
  • feature engineering for personalized search is more focused on designing features that can be used to rank and retrieve search results (documents) that are relevant to the user. (View Highlight)
  • search has a much more severe long tail than recommendations. (View Highlight)
  • NLP-Type Features – Features that understand the semantic content of the query. These might come from SOTA models (such as BERT), FastText (either directly fed or the dot product of the query), or document embeddings. (View Highlight)
  • Reputation Features – Features representing how reputable the document's source is. These might include the author of the document, the number of inbound links, the PageRank score, etc. (View Highlight)
  • latency is a big factor with search of any kind. (View Highlight)
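
The two retrieval ideas in the highlights above, an inverted index that narrows the candidate set and a user-interest vector that personalizes the ordering, can be sketched together in a few lines of Python. This is a minimal illustration, not how Elasticsearch or a production KNN index works: the function names, the toy documents, and the two-dimensional embeddings are all hypothetical, and a real system would get its vectors from a trained model and use an approximate nearest-neighbor index instead of a full sort.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidates(index, query):
    """Return the ids of documents that contain every query term."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def personalized_rank(candidate_ids, doc_vecs, user_vec):
    """Order candidates by similarity between the user and document vectors."""
    return sorted(candidate_ids, key=lambda d: dot(doc_vecs[d], user_vec), reverse=True)


# Toy corpus (hypothetical).
docs = {
    "d1": "python snake care guide",
    "d2": "python programming tutorial",
    "d3": "java programming tutorial",
}
# Hypothetical 2-d embeddings: axis 0 ~ "animals", axis 1 ~ "software".
doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 0.9],
    "d3": [0.0, 1.0],
}
# Assumed interest vector for a user who mostly clicks on software results.
user_vec = [0.2, 0.8]

index = build_inverted_index(docs)
hits = candidates(index, "python")        # term match: d1 and d2
ranked = personalized_rank(hits, doc_vecs, user_vec)
print(ranked)                             # ['d2', 'd1']
```

The same query returns the same two documents for every user; only the order changes with the user vector. Keeping the term lookup as a cheap first stage also fits the latency concern noted above, since the similarity scoring only runs over the reduced candidate set.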