I somewhat recently ran into an issue where our system was incorrectly creating duplicate records. Here’s a writeup of how we found and fixed it.
Creating duplicate records
After reading through the request logs, I saw that we were receiving intermittent duplicate requests from a third-party vendor (an applicant tracking system) for certain webhook events. We already checked whether a record existed in the database before creating it, but that check didn't prevent the problem. Looking more closely, I saw that the duplicate requests were arriving in very short succession (< 100 milliseconds apart) and were potentially handled by different processes, so the simple existence check could not reliably catch them.
In effect, we were seeing the following race condition:
t1: receive request 1 (create record 123)
t2: receive request 2 (create record 123)
t3: process request 1: does record 123 already exist? no
t4: process request 2: does record 123 already exist? no <-- race condition
t5: process request 1: create record 123
t6: process request 2: create record 123 <-- duplicate record created
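In code, the vulnerable pattern looked roughly like the following. This is only a sketch: WebhookEvent and external_id are hypothetical names for an ActiveRecord model and column, not our actual code.

# Sketch of the check-then-create pattern that is vulnerable to the race.
# WebhookEvent and external_id are hypothetical names for illustration.
def handle_webhook(payload)
  external_id = payload.fetch('id')

  # Two processes can both pass this check before either has written a row.
  return if WebhookEvent.exists?(external_id: external_id)

  WebhookEvent.create!(external_id: external_id, payload: payload)
end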
We could not determine whether this webhook behavior was due to a customer misconfiguration or some bug in the applicant tracking system's webhook notifications. But it was impacting our customers, so we needed to find a way to handle it.
Fixing the problem
I decided that using a mutex would be a good way to handle this, since it would let us lock reliably across processes.
I found Redlock, a distributed lock algorithm that uses Redis as a data store for mutex locks. Since we’re already using Redis and Ruby in our system, I decided to use the redlock-rb library.
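Setting up the client is straightforward; it looks something like this (a sketch: REDIS_URL is an assumed environment variable, and the exact configuration depends on your Redis setup):

require 'redlock'

# Redlock can coordinate across one or more Redis instances; here we point it
# at the Redis instance we already run.
redlock_client = Redlock::Client.new([ENV.fetch('REDIS_URL', 'redis://localhost:6379')])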
The basic algorithm would be:
redlock_client.lock('unique_id_from_request', 5_000) do |lock|
  if lock
    # we could successfully acquire the lock, so
    # process the request...
  else
    # we couldn't acquire the lock, so
    # assume that this is a duplicate request and drop it
    return
  end
end
When we receive a request, we try to acquire a lock keyed on a unique identifier from the request. If the lock is already held, we assume we've recently seen the same request and discard it. If we acquire the lock, we process the request, and the lock is released once we're done.
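In a webhook handler, that looks roughly like the following sketch. event_key and process_event are hypothetical names, the 5_000 is the lock's time-to-live in milliseconds, and it assumes the redlock_client from above is available to the handler:

def handle_webhook(payload)
  # Key the lock on whatever uniquely identifies the event in the payload.
  event_key = "webhook:#{payload.fetch('id')}"

  redlock_client.lock(event_key, 5_000) do |lock|
    # lock is false when another process already holds it for this event,
    # in which case we treat the request as a duplicate and skip it.
    process_event(payload) if lock
  end
end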
I made this change and deployed it, and it seemed to successfully reduce the number of duplicate requests!
We ended up seeing this issue with other applicant tracking systems, so we implemented the same locking in their webhook handlers as well.
Side quest
I will often look through the issues and pull requests of a new project before adopting it to see how active the project is and whether there are any known issues with it. As I read through the Redlock issues list, I found an issue where the lock would potentially not be acquired if there was a problem with the Redis connection.
Thinking about it, this would be a problem for us because it could lead to requests being dropped if our Redis connection had issues. We would conclude that another process already held the lock, when in fact the lock call had failed for an entirely different reason: a connection error.
This failure mode was rare and recoverable enough that I thought continuing to use the mutex was worth the risk, but I wanted to see if I could fix the issue.
I responded to the issue with a specific case that illustrated the problem and asked whether the maintainers would be open to a pull request to fix it. I got some positive feedback, then dove into the code and submitted a pull request.
The pull request took a little while to merge, due to the review process and probably because it changed the behavior of the library: instead of returning false when a connection error occurred, we would now raise the connection exception. It's possible that someone was relying on the previous behavior, but it seemed more correct to raise an error for an exceptional case than to return the same value as a lock that simply couldn't be acquired. The change was approved, merged, and released in version 1.3.1 of the library. I then updated our code to use the new version (we had been pointing at my fork of the changes in the meantime, both because the change seemed correct and to test it out further).
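With the new behavior, our handler can treat the two cases differently, roughly like this (a sketch: the exact exception class depends on the Redis client library in use, so Redis::BaseConnectionError here is an assumption):

begin
  redlock_client.lock(event_key, 5_000) do |lock|
    if lock
      process_event(payload)
    else
      # The lock is genuinely held by another process: treat the request as a
      # duplicate and drop it.
    end
  end
rescue Redis::BaseConnectionError
  # Redis connectivity problems now surface as an exception instead of looking
  # like "lock not acquired", so the request can fail and be retried by the
  # sender rather than being silently dropped.
  raise
end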
Conclusion
Overall, I thought this was a good approach. I first made sure to understand the underlying cause of the problem, then found a solution that worked for us, and along the way fixed a small upstream issue that could potentially cause data loss. The maintainers of the library were very accommodating and communicative throughout the process.