A community of volunteers has been combing through questions about AGI Safety from multiple sources, including YouTube video comments and a Discord server for the stampy.ai project. Questions marked as high interest are then answered and evaluated within the community. Given these distributed origins, we realized the database likely contains quite a number of duplicate, or at least semantically very similar, questions. The concern, of course, is that comparing every question in the database against every other is a time-consuming and resource-heavy operation, the dreaded O(n²).
After a bit of research, I found that Sentence-BERT (SBERT), a modification of BERT, is optimized for generating accurate and useful sentence-level embeddings. It uses a siamese network architecture with a triplet loss function to derive embeddings that can be compared efficiently using cosine similarity. This reduces the time to find the most similar pair among 10,000 sentences from 65 hours with BERT or RoBERTa down to about 5 seconds, without sacrificing accuracy!
Here are the slides from the presentation I gave on the original paper, which goes into depth on the training architecture and the various methods & datasets used for evaluation. It turns out that natural language inference datasets (MNLI & SNLI), which are fairly large and labeled for entailment, serve as helpful training sets for fine-tuning on semantic similarity tasks. Training further on the smaller, task-specific semantic textual similarity (STS-b) dataset improved accuracy even more.
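For a concrete sense of that fine-tuning step, here's a minimal sketch using the sentence-transformers training API (the framework is introduced below); the two labeled pairs are made-up stand-ins for the thousands of human-scored pairs in STS-b:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# made-up STS-style pairs; labels are human similarity ratings scaled to [0, 1]
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A girl is carrying a baby.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# the cosine similarity of each pair's embeddings is regressed toward its label
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)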
Within the sentence-transformers framework, there's now an ever-growing number of pretrained model checkpoints ranked by size, speed, and other performance metrics. A model can be initialized by passing it a checkpoint, which identifies a combination of the architecture plus the specific trained weights.
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# choose from list of pretrained models at sbert.net/docs/pretrained_models.html
checkpoint = "paraphrases-multi-qa-mpn"
model = SentenceTransformer(checkpoint)
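With a model loaded, each sentence can be encoded into a fixed-size vector, and two vectors can be compared with cosine similarity; this is the fast comparison that the SBERT paper's speedup rests on. A minimal sketch with two made-up questions:
from sentence_transformers import util

# each sentence is encoded into a fixed-size embedding vector
embeddings = model.encode(["What is AGI?", "Is AI dangerous?"], convert_to_tensor=True)
print(embeddings.shape)  # (2 sentences, embedding dimension)

# cosine similarity near 1.0 means near-duplicates, near 0.0 means unrelated
print(util.cos_sim(embeddings[0], embeddings[1]).item())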
Since our goal was to identify the most similar pairs of questions, we tried the checkpoints below, which ranked highest on semantic search leaderboards (a quick way to compare them on your own data is sketched after the list).
- multi-qa-mpnet-base-dot-v1: trained on 315M questions from StackExchange, Yahoo Answers, and Google & Bing search. Scored highest on semantic similarity benchmarks.
- distilbert-base-nli-stsb-quora-ranking: trained on 500K Quora duplicate questions.
- all-MiniLM-L6-v2: a general-purpose model trained on 1B+ pairs. Considered fast and good, though not the highest quality.
- paraphrases-multi-qa-mpn: gave us the best results on our dataset, as determined by expert human evaluation.
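Before committing to one checkpoint, it can help to sanity-check a few of them on a pair you already know to be near-duplicates. Here's a minimal sketch; the question pair is made up, so swap in examples from your own dataset:
from sentence_transformers import SentenceTransformer, util

# hypothetical near-duplicate pair for a quick sanity check
pair = ["What is AGI?", "What do you mean by artificial general intelligence?"]

for checkpoint in ["multi-qa-mpnet-base-dot-v1",
                   "distilbert-base-nli-stsb-quora-ranking",
                   "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(checkpoint)
    # encode both questions and compare their embeddings
    embeddings = model.encode(pair, convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{checkpoint}: {score:.2f}")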
The super-handy paraphrase_mining utility returns a list of tuples sorted by descending similarity score, each holding the score and the indices of two sentences from the original input list. A score of 1.0 means the two sentences are semantically identical, while a score of 0.0 means they are semantically unrelated.
from sentence_transformers import util

# df is a pandas DataFrame with one question per row in its 'text' column
# a single list of sentences - can scale to 10,000+ sentences
sentences = df['text'].values.tolist()

# mine the whole list for the most similar pairs
paraphrases = util.paraphrase_mining(model, sentences)

# print the 100 highest-scoring pairs
for paraphrase in paraphrases[0:100]:
    score, i, j = paraphrase
    print(f'{sentences[i]}\n{sentences[j]}\nscore: {score:.2f}\n')
Here’s a sample output of the top-scoring duplicates from the first time we ran paraphrase_mining over our list of questions. They mostly seem pretty reasonable. We still kept a human in the loop to decide which version of each pair to keep, or whether similar-sounding questions that were in fact semantically distinct should be reworded.
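To keep that review manageable, one option is to export only the pairs above a score cutoff. A minimal sketch, reusing the sentences and paraphrases variables from above; the 0.8 threshold is a made-up starting point to tune against human judgments:
import pandas as pd

THRESHOLD = 0.8  # hypothetical cutoff, tune on your own data

# keep only the high-scoring pairs and hand them to reviewers as a CSV
candidates = [(score, sentences[i], sentences[j])
              for score, i, j in paraphrases if score >= THRESHOLD]
pd.DataFrame(candidates, columns=['score', 'question_a', 'question_b']) \
    .to_csv('duplicate_candidates.csv', index=False)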
The complete, functional code is in this notebook. As you will see, the usage is straightforward for any list of sentences from your own dataset. Pick a model and try it out!