How I Fix Issues On Open Source Projects

Here's a post detailing how I typically think about fixing issues for open source projects.

Identify an issue

There are two main ways that I identify issues on code repos.

When I evaluate a new (to me) project, I need to figure out whether it will meet my needs. Often this will be functionality, but I also want to determine how well supported the project is. I will usually look at:

  • the README / wiki
  • when the repo was created and when it was last updated
  • a few of the most recent commits
  • the repo's issues / pull requests tab (including recently closed ones)
Read on →

Caching OpenAI Embeddings API Calls In-memory

I recently posted about a job posting search engine I prototyped that used OpenAI's Embeddings API.

As I tested this out in a Google Colab notebook, I guessed that the same text would always result in the same embedding. I compared embeddings of the same text and found that they were indeed identical. I also added a space or words to the text and saw that it resulted in a different embedding.

I started by saving the embeddings in the dataframe. This worked, but I would have to call the API again if I wanted the same embedding again (which happened a couple of times as my code was not robust enough the first couple of runs.) I also wanted a way to also have search queries that were previously requested return faster.

Since I was going to embed many job postings and might run the notebook multiple times, I wanted to cache the results to save a little money and increase the speed of future runs. This was helpful when I was iterating on the code and running over many postings, since some of the postings caused my code to error.

One solution to this is to store the embeddings in a database, perhaps a vector database. This would be more persistent, and would be a more production-friendly approach. For the time being, I decided to keep things simple and just cache the results in memory until I saw that the overall approach would work.

Read on →

Creating A Job Posting Search Engine Using OpenAI Embeddings

I recently worked on a job posting search engine and wanted to share how I approached it and some findings.

Motivation

I had a data set of job postings and wanted to provide a way to find jobs using natural language queries. So a user could say something like "job posting for remote Ruby on Rails engineer at a startup that values diversity" and the search engine would return relevant job postings.

This would enable the user to search for jobs without having to know what filters to use. For example, if you wanted to search for remote jobs, typically you would have to check the "remote" box. But if you could just say "remote" in your query, that would be much easier. Also, you could query for more abstract terms like "has good work/life balance" or some of the attributes that something like { key: values } would give.

Approach

We could potentially use something like Elasticsearch or create our own job search engine with rules, but I wanted to see how well embeddings would work. These models are typically trained on internet-scale data, so they might capture some nuances of job postings that would be difficult for us to model.

When you embed a string of text, you get a vector that represents the meaning of the text. You can then compare the embeddings of two strings to see how similar they are. So my approach was to first get embeddings for a set of job postings. This could be done once per posting. Then, when a user enters a query, I would embed the user's query and find the job posting vectors that were closest using cosine similarity.

Read on →

Using a Redlock Mutex to Avoid Duplicate Requests

I somewhat recently ran into an issue where our system was incorrectly creating duplicate records. Here's a writeup of how we found and fixed it.

Creating duplicate records

After reading through the request logs, I saw that we were receiving intermittent duplicate requests from a third-party vendor (applicant tracking system) for certain webhook events. We already had a check to see if records exist in the database before creating them, but this check didn't seem to prevent the problem. After looking closely, the duplicate requests were coming in very short succession (< 100 milliseconds apart) and potentially processed by different processes, so the simple check would not reliably catch the duplicate requests.

Read on →