A community of volunteers has been combing through questions about AGI Safety from multiple sources, including YouTube video comments and a Discord server for the stampy.ai project. Questions marked as high interest are then answered and evaluated within the community. Given these distributed origins, we realized the database likely contains quite a number of duplicate, or at least semantically very similar, questions. The concern, of course, is that comparing every question in the database against every other is a time-consuming and resource-heavy operation, the dreaded O(n²).
After a bit of research, I found that Sentence-BERT (SBERT), a modification of BERT, is optimized for generating accurate and useful sentence-level embeddings. It uses a siamese network architecture with a triplet loss function to derive embeddings that can be compared efficiently using cosine similarity. This reduces the time to find the most similar pair among 10,000 sentences from 65 hours with BERT or RoBERTa down to about 5 seconds, without sacrificing accuracy!
Here are the slides from the presentation I gave on the original paper, which goes into depth on the training architecture and the various methods & datasets used for evaluation. It turns out that natural language inference datasets (MNLI & SNLI), which are fairly large and labeled for entailment, serve as helpful training sets for fine-tuning on semantic similarity tasks. Training further on the smaller, task-specific semantic textual similarity (STS-b) dataset improved accuracy even more.
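For a concrete sense of that fine-tuning step, here's a minimal sketch using the sentence-transformers training API (the framework is introduced below); the two labeled pairs are made-up stand-ins for the thousands of human-scored pairs in STS-b:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# made-up STS-style pairs; labels are human similarity ratings scaled to [0, 1]
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A girl is carrying a baby.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# the cosine similarity of each pair's embeddings is regressed toward its label
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)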
Within the sentence-transformers framework, there's now an ever-growing number of pretrained model checkpoints ranked by size, speed, and other performance metrics. A model can be initialized by passing it a checkpoint, which identifies a combination of the architecture plus the specific trained weights.
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# choose from list of pretrained models at sbert.net/docs/pretrained_models.html
checkpoint = "paraphrases-multi-qa-mpn"
model = SentenceTransformer(checkpoint)
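With a model loaded, each sentence can be encoded into a fixed-size vector, and two vectors can be compared with cosine similarity; this is the fast comparison that the SBERT paper's speedup rests on. A minimal sketch with two made-up questions:
from sentence_transformers import util

# each sentence is encoded into a fixed-size embedding vector
embeddings = model.encode(["What is AGI?", "Is AI dangerous?"], convert_to_tensor=True)
print(embeddings.shape)  # (2 sentences, embedding dimension)

# cosine similarity near 1.0 means near-duplicates, near 0.0 means unrelated
print(util.cos_sim(embeddings[0], embeddings[1]).item())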
Since our goal was to identify the most similar pairs of questions, we tried the checkpoints below, which ranked highest on semantic search leaderboards (a quick way to compare them on your own data is sketched after the list).
- multi-qa-mpnet-base-dot-v1: trained on 315M questions from StackExchange, Yahoo Answers, and Google & Bing search. Scored highest on semantic similarity benchmarks.
- distilbert-base-nli-stsb-quora-ranking: trained on 500K Quora duplicate questions.
- all-MiniLM-L6-v2: a general-purpose model trained on 1B+ pairs. Considered fast and good, though not the highest quality.
- paraphrases-multi-qa-mpn: gave us the best results on our dataset, as determined by expert human evaluation.
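Before committing to one checkpoint, it can help to sanity-check a few of them on a pair you already know to be near-duplicates. Here's a minimal sketch; the question pair is made up, so swap in examples from your own dataset:
from sentence_transformers import SentenceTransformer, util

# hypothetical near-duplicate pair for a quick sanity check
pair = ["What is AGI?", "What do you mean by artificial general intelligence?"]

for checkpoint in ["multi-qa-mpnet-base-dot-v1",
                   "distilbert-base-nli-stsb-quora-ranking",
                   "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(checkpoint)
    # encode both questions and compare their embeddings
    embeddings = model.encode(pair, convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{checkpoint}: {score:.2f}")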
The super-handy paraphrase_mining utility returns a list of tuples sorted by descending similarity score, each holding the score and the indices of two sentences from the original input list. A score of 1.0 means the two sentences are semantically identical, while a score of 0.0 means they are semantically unrelated.
from sentence_transformers import util

# df is a pandas DataFrame with one question per row in its 'text' column
# a single list of sentences - can scale to 10,000+ sentences
sentences = df['text'].values.tolist()

# mine the whole list for the most similar pairs
paraphrases = util.paraphrase_mining(model, sentences)

# print the 100 highest-scoring pairs
for paraphrase in paraphrases[0:100]:
    score, i, j = paraphrase
    print(f'{sentences[i]}\n{sentences[j]}\nscore: {score:.2f}\n')
Here’s a sample output of the top-scoring duplicates from the first time we ran paraphrase_mining over our list of questions. They mostly seem pretty reasonable. We still kept a human in the loop to decide which version of each pair to keep, or whether similar-sounding questions that were in fact semantically distinct should be reworded.
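To keep that review manageable, one option is to export only the pairs above a score cutoff. A minimal sketch, reusing the sentences and paraphrases variables from above; the 0.8 threshold is a made-up starting point to tune against human judgments:
import pandas as pd

THRESHOLD = 0.8  # hypothetical cutoff, tune on your own data

# keep only the high-scoring pairs and hand them to reviewers as a CSV
candidates = [(score, sentences[i], sentences[j])
              for score, i, j in paraphrases if score >= THRESHOLD]
pd.DataFrame(candidates, columns=['score', 'question_a', 'question_b']) \
    .to_csv('duplicate_candidates.csv', index=False)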
The complete, functional code is in this notebook. As you will see, the usage is straightforward for any list of sentences from your own dataset. Pick a model and try it out!