Caching OpenAI Embeddings API Calls In-memory

I recently posted about a job posting search engine I prototyped that used OpenAI’s Embeddings API.

As I tested this out in a Google Colab notebook, I guessed that the same text would always result in the same embedding. I compared embeddings of the same text and found that they were indeed identical. I also tried adding a space or changing a word and saw that it resulted in a different embedding.
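
The check was as simple as embedding the same string a couple of times and comparing the results (a sketch, using the get_embedding helper from the openai package that I use below; the strings are just examples):

from openai.embeddings_utils import get_embedding

a = get_embedding('Ruby on Rails engineer', engine='text-embedding-ada-002')
b = get_embedding('Ruby on Rails engineer', engine='text-embedding-ada-002')
c = get_embedding('Ruby on Rails engineer ', engine='text-embedding-ada-002')  # extra space

print(a == b)  # True: identical text gave an identical embedding
print(a == c)  # False: even whitespace changed the result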

I started by saving the embeddings in the dataframe. This worked, but I would have to call the API again if I wanted the same embedding later (which happened a couple of times, as my code was not robust enough on the first few runs). I also wanted previously requested search queries to return faster.

Since I was going to embed many job postings and might run the notebook multiple times, I wanted to cache the results to save a little money and increase the speed of future runs. This was helpful when I was iterating on the code and running over many postings, since some of the postings caused my code to error.

One solution to this is to store the embeddings in a database, perhaps a vector database. This would be more persistent, and would be a more production-friendly approach. For the time being, I decided to keep things simple and just cache the results in memory until I saw that the overall approach would work.

After some research, I found that there are some decorators that can be used to cache the results of a function. In Python 3.9+, the functools module has a @cache decorator. However, I was using Python 3.8. The docs note that this is equivalent to using the lru_cache decorator with maxsize=None, so I tried that instead and it seemed to work.

# Python < 3.9 version
from functools import lru_cache
from openai.embeddings_utils import get_embedding

@lru_cache(maxsize=None)
def cached_get_embedding(string: str, engine: str):
  # print first 50 characters of string
  print(f'Hitting OpenAI embedding endpoint with "{string[0:50]}..."')
  return get_embedding(string, engine=engine)

# Python >= 3.9 version
from functools import cache
from openai.embeddings_utils import get_embedding

@cache
def cached_get_embedding(string: str, engine: str):
  # print first 50 characters of string
  print(f'Hitting OpenAI embedding endpoint with "{string[0:50]}..."')
  return get_embedding(string, engine=engine)

Then you can replace any get_embedding calls with cached_get_embedding calls. The first time you call the function, it will print and hit the API. Subsequent calls will return the cached result and not print anything.
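
For example:

cached_get_embedding('Remote Ruby on Rails engineer', engine='text-embedding-ada-002')  # prints and hits the API
cached_get_embedding('Remote Ruby on Rails engineer', engine='text-embedding-ada-002')  # cached: no print, no API call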

Another way of doing this would be to wrap OpenAI’s get_embedding inside your own function, also called get_embedding, that applies the cache decorator or looks up the result in a database. Then you don’t need to change any other code in your project and you still get the benefits of caching. (This has a slightly higher chance of being surprising or confusing, though.)
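
A minimal sketch of that shadowing approach:

from functools import lru_cache
from openai.embeddings_utils import get_embedding as openai_get_embedding

@lru_cache(maxsize=None)
def get_embedding(string: str, engine: str):
  # same name as OpenAI's helper, so existing call sites get caching for free
  return openai_get_embedding(string, engine=engine)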

Since the embeddings seemed whitespace-sensitive, you may also want to normalize leading/trailing/inner whitespace before calling the API to reduce cache misses, if that whitespace would not be meaningful for your case.
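
For example, assuming inner whitespace carries no meaning for your text:

import re

def normalize_whitespace(string: str) -> str:
  # collapse runs of whitespace and strip the ends to reduce cache misses
  return re.sub(r'\s+', ' ', string).strip()

cached_get_embedding(normalize_whitespace('  Remote   Rails engineer '), engine='text-embedding-ada-002')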

Overall this worked well for my use case. I wanted to share it since it seemed like an elegant, Pythonic way of caching API calls.

Creating A Job Posting Search Engine Using OpenAI Embeddings

I recently worked on a job posting search engine and wanted to share how I approached it and some findings.

Motivation

I had a data set of job postings and wanted to provide a way to find jobs using natural language queries. So a user could say something like “job posting for remote Ruby on Rails engineer at a startup that values diversity” and the search engine would return relevant job postings.

This would enable the user to search for jobs without having to know what filters to use. For example, if you wanted to search for remote jobs, typically you would have to check the “remote” box. But if you could just say “remote” in your query, that would be much easier. Also, you could query for more abstract terms like “has good work/life balance” or some of the attributes that something like { key: values } would give.

Approach

We could potentially use something like Elasticsearch or create our own job search engine with rules, but I wanted to see how well embeddings would work. These models are typically trained on internet-scale data, so they might capture some nuances of job postings that would be difficult for us to model.

When you embed a string of text, you get a vector that represents the meaning of the text. You can then compare the embeddings of two strings to see how similar they are. So my approach was to first get embeddings for a set of job postings. This could be done once per posting. Then, when a user enters a query, I would embed the user’s query and find the job posting vectors that were closest using cosine similarity.
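
Cosine similarity is just the dot product of the two vectors divided by the product of their norms. A minimal version with NumPy (the openai package’s embeddings_utils module also provides a cosine_similarity helper):

import numpy as np

def cosine_similarity(a, b):
  # closer to 1.0 means the two texts are more similar in meaning
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))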

One nice thing about ordering by similarity is that the most relevant job posting should be first, and then other similar job postings would be next. This matches how other search engines work.

OpenAI recently came out with the text-embedding-ada-002 embedding engine, which is significantly cheaper and better performing than previous versions. Notably, the maximum token length was also increased to 8191 tokens, which meant we could embed whole job postings. So I decided to use it for creating the embeddings.

The job postings data set that I had included some additional data, like company name. I wanted to embed that as well, so we could use that information when comparing to the user’s query:

# truncate to 8000 characters since more is not likely to yield signal and makes it less likely we'll run into token length issues
# could also do this by using tiktoken and truncating to 8191 tokens for that engine
df['for_embedding'] = df \
  .apply(lambda x: f"Job posting\nCompany: {x['company_name']}\nTitle: {x['title']}\nBody: {x['body'].strip()}"[:8000],
         axis=1)
df['embedding'] = df['for_embedding'].apply(lambda x: cached_get_embedding(x, engine='text-embedding-ada-002'))
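
With the embeddings in place, the query side only takes a few lines. Here’s a sketch (search_postings is a hypothetical helper; the column names match the code above, and cosine_similarity is the function shown earlier):

def search_postings(df, query: str, n: int = 5):
  # embed the query with the same engine, then rank postings by similarity
  query_embedding = cached_get_embedding(query, engine='text-embedding-ada-002')
  similarities = df['embedding'].apply(lambda e: cosine_similarity(e, query_embedding))
  return df.assign(similarity=similarities).sort_values('similarity', ascending=False).head(n)

results = search_postings(df, 'job posting for remote Ruby on Rails engineer at a startup that values diversity')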

Results

For my example query at the beginning of the post (“job posting for remote Ruby on Rails engineer at a startup that values diversity”), the search engine returned the following job posting body as the top result (emphasis mine):

… We are a fast-paced, user-first, technology company that’s passionate about building responsibly. We believe the future of work is a regenerative corporate environment where giving and receiving is in balance. When we build we don’t just think about maximizing profit, we believe you can be wildly profitable while also being socially and environmentally conscious. Our fully-remote team is comprised of 13 awesome people (and quickly growing!) in New York, Texas, and North Carolina. We are committed to developing diverse teams. Our current team is 35% POC and 60% women, and we continuously strive to add more diversity on our team.

Job Requirements and Responsibilities:

  • Strong front end experience and familiarity working in a Rails system
  • Design, build and test end-to-end features using Rails

Candidate Qualifications:

  • Familiarity with our stack: Rails and Angular sitting on top of Heroku using Postgres, Elasticsearch, Redis, and a variety of AWS services.
  • You have startup experience and you enjoy working in small teams

What You Get:

  • Fully remote role, so you can work from home
  • Stock Options

Pretty great fit! (Here’s a link to it, in case you’re interested!)

Some other interesting queries I ran:

“job posting for software engineer at consultancy in Washington State”

The first result was a job posting for a consultant in Bellevue, which is in Washington State. The posting didn’t mention Washington State specifically anywhere. This is a good example of something that would be hard to do with traditional document search, but works well with embeddings trained on internet data. There must be some signal in the embeddings that captures the fact that Bellevue is located in Washington State.

“job posting for software engineer at <company name>”

The top results for this were indeed job postings for that company. This reinforces the decision to embed some metadata about the job posting.

“remote machine learning and product engineer”

One useful result had “You’d work on product-oriented research for generative natural language detection, and tackle cutting-edge deep learning and NLP problems with an emphasis on classification and adversarial methods.” Seems interesting!

Queries around eligibility (visa, citizenship, etc.)

These seemed to work OK. It was sometimes hard to tell whether the search was actually filtering on eligibility or whether a posting merely mentioned it. It was also sometimes hard to tell which country the citizenship requirement referred to.

Asking for specific salary ranges

This didn’t seem to consistently work well. Many postings didn’t list salary information. Also, it would sometimes get confused by other compensation numbers or revenue numbers (“$10M ARR”).

Overall

Overall, this was a fun project and I was impressed with the results. It only cost me a few dollars to create the embeddings, and the search engine was pretty fast. Also, it only took a couple of hours thanks to using an off-the-shelf embedding engine.

Using Tampermonkey to Clean Up Websites

I’ve written a few Tampermonkey userscripts to improve websites that I regularly use, and I wanted to share some patterns that I have found useful.

Generally I iterate on the scripts in the Tampermonkey Chrome Extension editor, and then push to GitHub for versioning and backup.

Example

A good example is the script to clean up my preferred weather site (Weather Underground). The script removes ads, as well as removing a sidebar that takes up room but doesn’t add much value.

Before:

Before the Tampermonkey userscript

After:

After the Tampermonkey userscript

Setup

Most of the time in these scripts, I’m finding DOM elements to hide or manipulate. I could use plain element selectors, but I typically import jQuery to make this easier and more powerful:

// @require      https://code.jquery.com/jquery-3.6.0.min.js

Apparently $ as an alias for jQuery doesn’t automatically work in Tampermonkey, so add it:

const $ = window.$;

The Tampermonkey template for scripts uses an IIFE (Immediately Invoked Function Expression) to avoid polluting the global namespace. I like to add a use strict directive to avoid some simple JavaScript mistakes and log out that the script is running to make debugging a little easier.

(function() {
  'use strict'; // the directive must come first in the function to take effect
  console.log('in wunderground script');
  ...

Hiding page elements

Almost every script I make has a hideStuff() function. As the name implies, it hides elements on the page. Usually this is going to be for elements that I don’t want or need, or for ads that aren’t blocked by my ad blocker.

function hideStuff() {
  // use whole screen for hourly forecast table
  $('.has-sidebar').removeClass('has-sidebar');
  $('.region-sidebar').hide();

  // hide ads
  $('ad-wx-ws').hide();
  $('ad-wx-mid-300-var').hide();
  $('ad-wx-mid-leader').hide();

  // bottom ad content
  $('lib-video-promo').hide();
  $('lib-cat-six-latest-article').hide();
}

I usually call it in a setInterval. This helps handle cases where the page takes a while to load, or in case elements are loaded asynchronously. This could also work well for single-page apps where the page doesn’t reload.

setInterval(hideStuff, 250);

Sometimes if the page loads quickly I’ll put a couple of setTimeouts with small timeouts at the beginning and then a longer setInterval. It doesn’t really cost much either way, so I usually play around with the timing until it works well.
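
That might look something like this (the timings are just a starting point to tune):

setTimeout(hideStuff, 50);    // catch elements that render right away
setTimeout(hideStuff, 500);   // catch slightly later loads
setInterval(hideStuff, 2000); // keep up with content inserted asynchronously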

Keyboard shortcuts

I enjoy using keyboard shortcuts to zip around, but many sites don’t have them. In some more advanced scripts, I’ll add key handlers for custom keyboard shortcuts.

For example, here I’ve added shortcuts for the next and previous day, and to switch between the hourly and 10-day forecasts:

$("body").keypress(function(e) {
  if (e.key === '>' || e.key === '.') {
    $('button[aria-label="Next Day"]').click();
  } else if (e.key === '<' || e.key === ',') {
    $('button[aria-label="Previous Day"]').click();
  } else if (e.key === 'd') {
    $('a span:contains("10-Day")').click();
  } else if (e.key === 'h') {
    $('a span:contains("Hourly")').click();
  }
});

This could break if the page structure changes, but most pages don’t change that often. If they do, I’ll just update the script. Overall I feel like this is pretty easy to read.

My Shortcut.com script has a more involved example of this for adding labels and creating stories, including overriding some existing keybindings. For Feedbin, I implemented a way to scroll stories down half a page (only when the keyboard focus is in the “story” pane).

Conclusion

Overall I think this approach works well to make some of my favorite sites more usable.

It would be great to be able to automatically sync Tampermonkey and the GitHub repo. Has anyone seen an approach that works well for this?

Using a Redlock Mutex to Avoid Duplicate Requests

I somewhat recently ran into an issue where our system was incorrectly creating duplicate records. Here’s a writeup of how we found and fixed it.

Creating duplicate records

After reading through the request logs, I saw that we were receiving intermittent duplicate requests from a third-party vendor (an applicant tracking system) for certain webhook events. We already had a check to see if records existed in the database before creating them, but this check didn’t seem to prevent the problem. Looking more closely, I saw that the duplicate requests were coming in very short succession (< 100 milliseconds apart) and were potentially processed by different processes, so the simple check could not reliably catch them.

In effect, we were seeing the following race condition:

t1: receive request 1 (create record 123)
t2: receive request 2 (create record 123)
t3: process request 1: does record 123 already exist? no
t4: process request 2: does record 123 already exist? no  <-- race condition
t5: process request 1: create record 123
t6: process request 2: create record 123  <-- duplicate record created

We could not determine whether this webhook behavior was due to a customer misconfiguration or some bug in the applicant tracking system’s webhook notifications. But it was impacting our customers so we needed to find a way to handle it.

Fixing the problem

I decided that using a mutex would be a good way to handle this. This way we could reliably lock between processes.

I found Redlock, a distributed lock algorithm that uses Redis as a data store for mutex locks. Since we’re already using Redis and Ruby in our system, I decided to use the redlock-rb library.
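
Setting up the client is simple (a sketch; the Redis URL is whatever your app already uses):

require 'redlock'

redlock_client = Redlock::Client.new(['redis://localhost:6379'])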

The basic algorithm would be:

# lock on a unique identifier from the request; 5_000 is the lock TTL in ms
redlock_client.lock('unique_id_from_request', 5_000) do |lock|
  if lock
    # we could successfully acquire the lock, so
    # process the request...
  else
    # we couldn't acquire the lock, so
    # assume that this is a duplicate request and drop it
    return
  end
end

When we receive a request, we try to acquire a lock using a unique identifier from the request. If we can’t, another process has seen the same request recently, so we assume this is a duplicate and drop it. If we can, we process the request, and the lock is released once processing is done.

I made this change and deployed it, and it seemed to successfully reduce the number of duplicate requests!

We ended up seeing this issue with other applicant tracking systems as well, so we implemented this in their webhook handlers too.

Side quest

I will often look through the issues and pull requests of a new project before adopting it to see how active the project is and whether there are any known issues with it. As I read through the Redlock issues list, I found an issue where the lock would potentially not be acquired if there was a problem with the Redis connection.

Thinking about it, this would be a problem for us because it could lead to requests being dropped if our Redis connection had issues: we would think that another process already had the lock, when in fact we were seeing a different kind of failure.

This seemed rare enough and recoverable enough that I thought continuing to use the mutex was worth the risk, but I wanted to see if I could fix the issue.

I responded to the issue with a specific case that illustrated the problem and asked if the maintainers would be open to a pull request to fix it. I got some positive feedback, then dove into the code and submitted a pull request.

The pull request took a little while to merge, due to the review process and probably because it changed the behavior of the library: instead of returning false when a connection error occurred, we would now raise the connection exception. It’s possible that someone was relying on the previous behavior, but it seemed more correct to raise an error for an exceptional case than to return the same value as a failed lock acquisition. The change was approved, merged, and released in version 1.3.1 of the library. I then updated our code to use the new version (we had previously been pointing at my fork, since the change seemed correct and we wanted to test it out more).

Conclusion

Overall, I thought this was a good approach. I first made sure to understand the underlying cause of the problem, and then I found a solution that would work for us and fixed a small issue that could potentially cause data loss. The maintainers of the library were very accommodating and communicative throughout the process.

Using iTerm Automatic Profile Switching to Make Fewer Mistakes In Production

Today I will tell you some stories of how I made mistakes in our production environment, and how I am trying to help prevent future mistakes using iTerm.

Mistakes were made

At work we are mid-journey to having more automation around our deployments, provisioning, backups, monitoring, and so forth. But at the moment, some things are typically done manually. Within recent memory, I was SSHed into our QA (staging) box and for some reason wanted to rename the database. A few minutes later, someone came down and said “production’s down!”1 (Production is the end-user-visible environment, the one thing that we don’t want to be down.) I was thinking, “hmm, we haven’t changed anything recentl… wait, was I actually on the QA box?” Sure enough, what I had renamed was the production database in the production environment! A minute later service was restored, but this was the most daytime downtime we’d had this quarter (a handful of minutes).

As part of our postmortem on this issue, we identified that switching my terminal profile whenever I thought I would be in a production-like environment would be useful. For example, if I am going to be SSHing into a QA box, I might create a new profile that has a different background color. This would help disambiguate the two environments.

The other day after hours, I was switching back and forth between QA and production SSH environments to try to debug a problem on the QA side. I again thought that I had SSHed into the QA environment but I didn’t read my SSH command well enough when cycling between those environments (using Ctrl+r in the terminal will give you previous commands2). I turned off the production load balancer. Fortunately it was after hours, so I could easily revert it, but I needed a better solution.

Enough is enough

There are two problems with the profile switching approach: I need to remember to switch profiles when I am SSHing, and I need to be SSHing into the right environment for the given profile. These are error-prone enough that I don’t think the manual profile switching approach is workable long-term. Again, in a perfect world, we would have everything already automated and some way of making all of our changes through well-tested or peer-reviewed means. But there has to be a stopgap solution.

I had read a bit about automatic profile switching in iTerm after the database rename debacle. This iTerm feature provides the ability to know when we have changed servers and change the profile accordingly. At first, it seems to require shell integration, which means that you curl a script to each of your boxes to be able to use it. This seemed both potentially insecure and cumbersome as we add more servers to our environment, so I didn’t want to use it.

Triggers and automatic profile switching

Digging a bit deeper, it seems that you can also use triggers and automatic profile switching to mostly accomplish the same thing. There are two components we can work with to make this happen.

The first is a trigger. Triggers look at your terminal output and run actions when the output matches a given regular expression. There are a variety of interesting actions you can take based on a trigger, but we’ll use them to set the internal iTerm variables for username and hostname. Basically, iTerm keeps track of these internally, and you can use them to switch your profile automatically when they change.

When the iTerm hostname or username changes, you can use each profile’s automatic profile switching settings to say when that profile should be used. If we change to a production host, then we should activate the production iTerm profile. And of course, when we exit out of that, we’d like to return to the default profile.

An example setup

Here’s a high-level view of what we want to do. When we recognize something that means we are on:

  • QA box, we switch to the QA profile (dark blue background)
  • production box, we switch to the production profile (dark red background)
  • localhost, we switch to the default profile (black background)

I set up the following profiles, with rationale:

Default

Triggers
  1. Set the iTerm username and host for either QA or production when we see an SSH prompt. The regex below would match a prompt like username@host-name directory_name $. If that were the prompt, this trigger would set username to username and host to host-name (\1 and \2 pull back the first and second match groups of the regex). Typically you’d have qa-web or prod-web or something like that as your hostname; those are the hostnames that the QA and production profiles will match on (see below.)

    • Regular Expression: ^(\w+)@([\w\-]+):.*\$
    • Action: “Report User & Host”
    • Parameters: \1@\2
    • Instant: yes (explanation in its own section below)
  2. Set the iTerm host to a QA host when we recognize that it is a QA Rails prompt:

    • Regular Expression: ^Loading qa environment \(Rails [\d\.]+\)$
    • Action: “Report User & Host”
    • Parameters: @some-qa-host
    • Instant: not needed
  3. Set the iTerm host to a production host when we recognize that it is a production Rails prompt. Similar to the previous trigger, but substitute production for instances of QA.

Automatic Profile Switching

Automatically switch to this profile when the hostname changes to our local host (hydration, in the case of my computer3).

QA / Production

These are basically identical to each other, except for the automatic profile switching hostname. I copied these from the default profile and then changed the background color and name. The specific colors you use are not important as long as you can clearly differentiate the colors between environments and the production color strikes some sort of fear into you when you see it.

Trigger

When you see my special local prompt character (♦), set the iTerm host to the local machine name (hydration), since we want to switch back to the default profile at that point.

  • Regular Expression: ♦
  • Action: “Report User & Host”
  • Parameters: @hydration
  • Instant: yes (explanation in its own section below)

Note: Having some sort of special local prompt is important to being able to use this approach. My guess is that you have customized your local prompt in some way so that you can either see the hostname in it or have some characters or patterns that are not typically encountered.
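
If you haven’t, something as simple as this in your shell config would do (the ♦ is arbitrary; any character you wouldn’t normally see in output works):

# in ~/.zshrc or ~/.bashrc: end the local prompt with an unusual character
export PS1='♦ '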

Automatic Profile Switching

Automatically switch to this profile when the iTerm hostname changes to the environment that we want. We would use qa-web for the QA profile, or prod-web for the production profile.

Testing

I usually work slowly and try to get one environment working first, and then try to get switching back to my default environment after that. You’ll know when you have things hooked up correctly when the colors change.

At first I was testing by actually SSHing into the boxes, but this was a bit slower than needed. Since iTerm does this matching based on looking at your terminal output, you can just echo a test string and you should be able to see the profile change (or flash for a little bit if you have switching back to the default profile configured.)

Instant or not?

“Instant” in the trigger definition refers to whether iTerm will wait for a newline before checking the output or not. Generally if something is in an interactive prompt, you probably want instant. If you don’t have instant enabled, then your profile won’t change until the second time the prompt is loaded because a newline won’t be provided until you press return/enter to finish inputting your command. I’d imagine that using instant is slightly slower since it constantly looks at the output, so I’d recommend not using it unless you are in an interactive prompt situation.

Wrapping up

I think that the iTerm documentation is not yet perfect for this feature, so setting this up for my environment took a little time. But now that it’s written up, hopefully you can see how a setup like this works and can customize it for your environment with less effort. It’s not a perfect solution, but it has already been helpful. Also, it’s just cool to see your background color change when you run a command. I’d say the fifteen minute investment is worth the effort to not do something silly in a live server.


  1. See earlier note about having insufficient monitoring. If someone physically tells you your service is down or broken before you know about it, you don’t have enough monitoring in place! 

  2. Searching through previous history is especially awesome with fzf. I highly recommend it. 

  3. It subtly reminds me to drink more water.