I’ve been working on a website for stampy.ai, a community-generated collection of questions and answers about AGI Safety. At the moment, the database contains several hundred questions, but the goal is to eventually grow to thousands or more.
Since we had access to OpenAI’s very powerful GPT-3, the original plan was to use its semantic search capabilities. Instead of only matching keywords, semantic search lets users ask a question and find already-answered questions that are semantically similar, i.e. essentially mean the same thing. To accomplish that, questions are converted into sentence embeddings, vectors which can then be compared using cosine similarity. If the vectors are already normalized, cosine similarity reduces to a dot product, so scoring a query against every stored question becomes a quick matrix multiplication. Due to the cost and the large files associated with embedding all of our questions, we realized we’d need to limit how often we called OpenAI’s API and only submit a request once the user had finished typing their question.
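For intuition, here is what that scoring boils down to in plain JavaScript; the tiny three-dimensional vectors are made up purely for illustration, since real embeddings have hundreds of dimensions:
// cosine similarity of two L2-normalized vectors is just their dot product
const dotProduct = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0)
// toy normalized "embeddings" for a query and a stored question
const queryVector = [0.6, 0.8, 0.0]
const questionVector = [0.0, 1.0, 0.0]
console.log(dotProduct(queryVector, questionVector)) // 0.8 — closer to 1 means more similar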
Having used features like Google search, we’re familiar with its ability to dynamically update the list of related search options as the user types. Since we’re not Google, I hadn’t dreamt that we could do the same. However, in my free time, I’m quite a tutorial junkie. I ran across a tutorial for creating your own Teachable Machine, along with a number of other demos using tensorflow.js (TFJS) that run AI predictions right in the browser. I suddenly realized that if it’s possible to train a model to identify objects from a desktop video camera in the browser in realtime, we must be able to run our semantic search in the browser too, dynamically updating results as the user types in the search bar.
To keep things simple, I started with a pre-trained model that was already optimized for web use. I found a promising model called the Universal Sentence Encoder (USE), which embeds each sentence as a 512-dimension vector, in contrast to GPT-3’s 12,288 dimensions. As a result, the embeddings file for USE was only around 1.5MB, compared to over 150MB for GPT-3’s embeddings. That was for OpenAI’s largest Davinci model; going with a lighter model would lessen the gap, but we’d still have expense issues. We obviously sacrificed some accuracy, but the responsiveness really enhanced the user experience. To use the TFJS and USE libraries, we first imported the following scripts within the HTML:
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs" type="text/javascript"/>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/universal-sentence-encoder"/>
To begin, make sure both libraries and the model have been loaded. Then, you’re ready to create sentence embeddings for your questions. USE stores the embeddings as a 2D tensor with 512 dimensions, i.e. 512 separate values per sentence. Think of an embedding as a numerical representation of a sentence (in our case, a question), where each dimension represents some feature or meaningful information about the sentence.
// load TensorFlow's Universal Sentence Encoder model
const langModel = await use.load()
// embed the list of all questions in the database for later search
const questions = ["What is your first question?", "Ask another question?"]
const allEncodings = (await langModel.embed(questions)).arraySync()
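As a quick sanity check (not part of the original snippet), arraySync() converts the tensor into a plain nested array with one 512-value row per question:
console.log(allEncodings.length) // 2 — one row per question
console.log(allEncodings[0].length) // 512 — one value per dimension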
We had a few hundred questions, so on my desktop, embedding all the questions only took a few seconds. But on laptops and other machines, the initial embedding of all the questions took several minutes, which was intolerable. So I saved the precomputed embeddings, and at runtime it’s just a matter of loading them. Soon, things were working smoothly, even on mobile phones.
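The save/load step can be as simple as serializing the arraySync() output to JSON. Here’s a rough sketch; the file name and URL are placeholders rather than the actual Stampy setup:
// build step (run once, e.g. in Node): save the precomputed encodings as JSON
// fs.writeFileSync('questionEncodings.json', JSON.stringify({ questions, allEncodings }))

// in the browser: fetch the precomputed encodings instead of calling langModel.embed()
const response = await fetch('/questionEncodings.json')
const { questions, allEncodings } = await response.json()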
Now we’re ready to process the query when a user asks a new question. First, we create an embedding for the new question, then find the nearest embeddings within our existing list of all questions.
const runSemanticSearch = async (searchQuery) => {
  const encoding = await langModel.embed(searchQuery)
  // since the vectors are normalized, the cosine similarity numerator is just the dot product,
  // computed against all question encodings at once with a matrix multiplication
  const scores = tf.matMul(encoding, allEncodings, false, true).dataSync()
  // TensorFlow.js requires explicit memory management to avoid memory leaks
  encoding.dispose()
  // sort by score, then return the top 5 results
  const questionsScored = questions.map((question, index) => ({ question, score: scores[index] }))
  questionsScored.sort((a, b) => b.score - a.score)
  const searchResults = questionsScored.slice(0, 5)
  return searchResults
}
The USE also has a QnA model for question-answering. I tried embedding all of the database answers as well, to see if it would help with the search. Unfortunately, I didn’t see a marked improvement, so I’ll need to play around with that a bit more.
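For reference, the experiment roughly followed the QnA usage from the library’s README; the loadQnA() call and the queries/responses input shape are my reading of that documentation, and the example strings are placeholders:
// load the question-answering variant of the Universal Sentence Encoder
const qnaModel = await use.loadQnA()
// embed the user's query together with candidate answers from the database
const embeddings = qnaModel.embed({
  queries: ['What is AI alignment?'],
  responses: ['Alignment research aims to make AI systems pursue the goals we intend.'],
})
// score each answer by the dot product of the query and response embeddings
const queryEmbedding = embeddings['queryEmbedding'].arraySync()[0]
const answerScores = embeddings['responseEmbedding']
  .arraySync()
  .map((responseEmbedding) =>
    responseEmbedding.reduce((sum, value, i) => sum + value * queryEmbedding[i], 0)
  )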
As we were getting ready to launch the website, there were concerns about GUI-blocking issues. The semantic search was being called with every character typed, which impacted GUI rendering and caused jitteriness and lag, obviously a yucky user experience.
After a bit of research, I discovered this to be a known issue and a major concern within the tensorflow.js community. Although the JavaScript calls are marked as asynchronous, JavaScript isn’t truly multi-threaded. Fortunately, the solution using Web Workers isn’t too complicated. Essentially, we put all of our computationally intensive TensorFlow code in a separate file which runs in another thread, independently of the main GUI-rendering JavaScript thread. The main thread and the separate TensorFlow thread communicate by passing messages using the postMessage method and responding to messages via the onmessage handler.
In the main thread, we first create a Worker. When the user makes a search request, we call postMessage to dispatch the query to the Worker. Then, an event handler listens for onmessage responses from the Worker to get the search results.
// create a Web Worker and add event listener for search results
if (self.Worker && !tfWorkerRef.current) {
  tfWorkerRef.current = new Worker('/tfWorker.js')
  tfWorkerRef.current.addEventListener('message', handleWorker)
} else {
  console.debug('Sorry! No Web Worker support.')
}
// postMessage to call semantic search in Worker thread passing user's search query
console.debug('postMessage to tfWorker:', searchQuery)
tfWorkerRef.current?.postMessage(searchQuery)
// listen for onmessage response from Worker thread with search results
const handleWorker = (event) => {
  const { data } = event
  console.debug('onmessage from tfWorker:', data)
  if (data.searchResults) {
    setSearchResults(data.searchResults)
  }
}
The tfWorker.js file must be stored in a public folder and written in plain JavaScript, not TypeScript. It likewise needs an onmessage event handler to listen for messages from the main thread, run the semantic search, then return the search results by calling postMessage.
// listening for message from main thread to call semantic search
self.onmessage = (e) => {
  runSemanticSearch(e.data)
}
const runSemanticSearch = async (searchQuery) => {
  ...
  // instead of returning searchResults, send a postMessage back to the main thread
  self.postMessage({ searchResults })
}
That’s the general setup. In the meantime, feel free to play around with the prototype demo or take a peek at the code. There’s some extra code to load and save the embeddings. You can also see the deployed Stampy website and its full implementation using TypeScript, React, Remix, and Cloudflare. Although we ended up not using GPT-3 for the website’s semantic search, the Stampy chatbot still uses its impressive and entertaining generative text capabilities.
I’ve been studying machine learning for a while now, but it’s been mostly theory and isolated exercises. It was refreshing to finally get back to some hands-on work, applying the technology in a practical setting. I was lucky to work on this project with smart people who set up the fundamental framework and gave me the freedom to explore on my own. Eventually, I’d love to train my own model and tweak some other parameters to see if we can get better results. So many possibilities for continued tinkering and learning!