Introduce Semantic Search in 10 minutes
Preface
Work from a while ago, I’ve opportunity to implement the semantic search system and it’s a topic from NLP. To give the conclusion of the project I done, so I take the time to write this article.
Keyword Search vs Semantic Search
- Keyword Search: Need to match the word exactly, if not then target would not be selected.
- Semantic Search: Find the target which contains the close semantic meaning, the target may not has the exactly same word but semantically closed.
Vector-Based Search
Vector-based search provide a intuitive way to perform semantic search. In order to index the documents we store in database, first the documents have to be tokenized into the keywords and the keywords in each documents would be treat as indexing vector. As we want to query for the documents, our query text would be first tokenized into the keywords vector and match the documents store in the database by the vector similarity. Here I will introduce three ways to achieve semantic search by indexing the vector.
TF-IDF
To introduce TF-IDF, first we separate TF and IDF individually
- TF (Term Frequency): Term frequency of our query text keywords appear in the target document.
- IDF (Inverse Document Frequency): The Inverse document frequency measure for the keyword frequency come across from all documents in the database.
After understanding the separated meaning, we can combine to see it together. Every time we search the documents by the query text, the query will be converted to the tokenized keyword vector and compute the similarity score with the document vector and take both TF and IDF into the consideration. The computed score would be reward by the TF and penalized by the IDF.
BM25
After knowing the TF-IDF, we can think further about how to compute the score with the way more reflect on the semantic
?
TF-IDF reflect the score linearly according to terms of the frequency on the keywords frequency appears to the document, but is it true if “A” article has 8 times “Fruit” has more semantic mean on fruit rather than “B” article which has 4 times “Fruit”?
So, here is why BM25 come out. As the frequency of the keywords increase, the score which keyword got according to the
target document would become more irrelative to the frequency. In short, BM25 decouple the semantic from the keyword frequency
in some extent and it is also the default fulltext search
scoring algorithm in Elasticsearch
.
Sentence Embedding with SBERT
The previous two algorithm convert the sentence to the tokenized keywords as the vector, but if we choose the SBERT way,
the sentence would be first convert the tokenized keywords and then the keywords will be convert from multiple sparse vectors
to one dense vector. The process from sentence to dense vector done by SBERT we called Sentence Embedding
, each dimension
of the dense vector takes the context meaning of each keyword into the consideration by self-attention
mechanism from the
Transformer Encoder
. The embedding model is pre-trained by a large dataset in order to give the dense vector more semantic
meaning.
After performing the sentence embedding, we put the query text vector and target vector together to compute the similarity score by cosine similarity. This way has better performance on semantic comparison but also cost more computing resource to embedding sentence and additional space to load the embedding model to the memory and disk storage to store the dense vectors.
SBERT System Design Example
Semantic Search Service: The service to accept client request for CRUD purpose, after receive the query text, which would go to sentence embedding service to get the dense vector. After getting the dense vector, we can interact with the vector database.
Sentence Embedding Service: The web server host for the embedding model, it is recommend to decouple the model hosting from semantic search service.
Vector Database: Store the dense vectors which can perform similarity computation with queried dense vector.