Make django-cacheops work on a cluster


Author: Long Hoang
Technology 08 Aug 2022

Redis in Parcel Perform

At Parcel Perform, we use Redis both to cache API results, reducing the load on other services, and as an ORM caching layer that distributes the database's workload. The latter is the most essential to our system. We cache at the ORM level because our database is usually busy: we handle hundreds of transactions per second, and they come not only from the front-end APIs but also from background workers. Without ORM-level caching, we could not absorb such heavy query traffic from workers on top of the API load. For our Django backend, we implement ORM caching with a library called django-cacheops. It's a very decent library: it supports caching on selected tables and automatically updates or expires the cache upon updates and deletions. Unfortunately, as we scaled up our system, django-cacheops slowly showed its weaknesses.

Problems of django-cacheops with scaling up

Parcel Perform has been growing very fast lately, and our data has grown with it. With our traffic, Redis had to process more and more caching requests from the backend. As a result, the CPU of our Redis instance kept hitting its maximum and slowed down our caching. But scaling up wasn't straightforward: Redis processes commands on a single CPU core, so even on a machine with 10 cores, Redis effectively uses only one. To solve this, we could either move to a server with a much faster CPU or switch to Redis Cluster mode and distribute the CPU load across several machines. Scaling a single Redis node was extremely costly for a minimal performance gain, so we favored the Redis Cluster option.

The problems appeared when we decided to move from a standalone Redis to a Redis Cluster: django-cacheops didn't support Redis Cluster. Any Redis command that involves cluster redirection (e.g. SET key) raised a MovedError exception, because the library couldn't handle the redirect. We submitted this issue to the library maintainer, but he didn't want to implement cluster support due to the many problems involved. However, with Redis's hash tag syntax, we managed to patch the library to work in cluster mode. In short, if a key contains a hash tag, e.g. {ParcelPerform}:123, only the text inside the braces (ParcelPerform) is used to calculate the hash slot. Hence, we can tell which node in the cluster should hold the key without any redirection.
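
To make the hash tag mechanism concrete, here is a minimal sketch (ours, not part of django-cacheops or redis-py) of how Redis Cluster maps a key to one of its 16384 hash slots. Per the cluster specification, the hash is CRC16-XMODEM, and when a key contains a non-empty `{hashtag}`, only the text inside the first pair of braces is hashed:

```python
def crc16(data: bytes) -> int:
    """CRC16-XMODEM (polynomial 0x1021), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 Redis Cluster hash slots.

    If the key contains a {hashtag}, only the text inside the first
    non-empty pair of braces is hashed, so keys that share a hash tag
    are guaranteed to land on the same slot (and thus the same node).
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # braces present and non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

For instance, `{ParcelPerform}:123` and `{ParcelPerform}:456` hash to the same slot, because only `ParcelPerform` is fed into the CRC.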

At first glance, our patch worked pretty well. However, while monitoring, we noticed that some Redis cluster nodes had much higher CPU usage than others. Digging in, we found that the problem came from the hash tags themselves. Our patch used the table name as the hash tag, e.g. {user_tbl}:202cb962ac59075b964b07152d234b70. Some tables in our database are hit hard, creating a hotspot, and that pattern was mirrored in the Redis cluster, where the nodes holding those tables' keys were hit far more often than the rest. We didn't want this; we wanted the keys to be distributed evenly across all nodes of the cluster.

A solution for distribution caching

To distribute the Redis keys evenly, we could either randomize the value inside the hash tag or remove the hash tag entirely and let Redis decide. The idea is very simple. However, django-cacheops doesn't let us customize key generation at this level, so no amount of patching would work. We searched for a replacement library, but django-cacheops is the only one that fits our project's requirements. With no option left, we decided to fork the django-cacheops repo and build our own custom library.
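
The two options can be illustrated with the following sketch (the helper names are ours, not the library's API):

```python
import uuid

def hotspot_key(db_table: str, query_hash: str) -> str:
    """Our original patch: the table name is the hash tag, so every key
    of a busy table lands on the same cluster node (the hotspot)."""
    return "{%s}:%s" % (db_table, query_hash)

def distributed_key(db_table: str, query_hash: str) -> str:
    """No hash tag: Redis hashes the whole key, spreading a table's
    keys across all nodes."""
    return "%s:%s" % (db_table, query_hash)

def randomized_key(db_table: str, query_hash: str) -> str:
    """Randomized hash tag: also spreads keys evenly, at the cost of no
    longer being able to co-locate a table's keys deterministically."""
    return "{%s}:%s:%s" % (uuid.uuid4().hex, db_table, query_hash)
```

Either of the last two shapes removes the per-table hotspot; the trade-off is that invalidation can no longer assume all of a table's keys live on one node.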

Django-cacheops overview

Before diving into the Redis Cluster implementation, let's first walk through the structure of django-cacheops to understand the library.

  • redis.py - contains the main Redis client used to cache query results. This is the first place we aim to update.
  • invalidation.py - the logic used to invalidate cached data. We need to tackle this file to make everything work as intended.
  • query.py - the core of the package. This is where the package attaches a bunch of hooks to the query actions of Django models.
  • lua folder - contains the scripts used by invalidation.py and query.py to invalidate and cache data. These Lua scripts ensure that caching and invalidating data are atomic. Hence, no race conditions!
  • conf.py - hosts the cacheops configuration. Cluster mode can be enabled or disabled here.

All of the files and folders above are our targets for adding Redis Cluster support. There are more files, but they aren't essential to this implementation, so we skipped them. Feel free to explore them yourself.

Updating django-cacheops to add cluster support

To add cluster support, we first tried the simplest solution: replacing the Redis client with a cluster-aware one. However, things weren't that easy. The core of django-cacheops, the invalidation and caching Lua scripts, no longer worked as intended, because a Lua script can only operate on keys that live on the server executing it. Moving to a cluster means working with multiple servers, which breaks all the existing Lua scripts.

Luckily, the solution was relatively simple: re-implement the Lua script logic with plain Redis commands. This meant accepting the loss of the Lua scripts' atomicity. That was not the only drawback, but more on that later. For example, the Lua script has this snippet:

redis.call('sadd', prefix .. 'schemes:' .. db_table, conj_schema(conj))
-- Add new cache_key to list of dependencies
local conj_key = conj_cache_key(db_table, conj)
redis.call('sadd', conj_key, key)

We would convert it into:

redis.sadd(prefix + 'schemes:' + db_table, conj_schema(conj))
# Add new cache_key to list of dependencies
conj_key = conj_cache_key(db_table, conj)
redis.sadd(conj_key, key)

Straightforward!

Things didn't stop there. Remember the atomicity of the Lua scripts? We tried to restore it with Redis distributed locking, but we didn't get the expected result: performance turned out to be 10x worse. The main reason was that our lock serialized all of the library's Redis access while every command still paid the network latency. Previously, a Lua script executed as a single Redis command, which was very fast because there was no request-response cycle for each command inside the script. Our lock only ensured that no other commands could run while one client held it; it didn't make a batch of commands execute as a single operation. So for each command, the library had to wait for the Redis server's response before continuing to the next one, creating a big bottleneck in the caching process. At the end of the day, we accepted foregoing atomicity in our library to preserve performance.
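
The round-trip arithmetic behind this can be shown with a toy model (ours, not real redis-py code): a Lua script costs one network round trip no matter how many commands it contains, while lock-wrapped plain commands pay one round trip each, plus the lock's own acquire and release:

```python
class CountingClient:
    """Toy Redis client that only counts network round trips."""
    def __init__(self):
        self.round_trips = 0

    def command(self, *args):
        # Each plain command waits for the server's reply.
        self.round_trips += 1

    def eval_script(self, num_commands_inside: int):
        # However many commands the script contains, the whole script
        # is sent and executed as a single request.
        self.round_trips += 1

# Lua path: 4 commands inside one script -> 1 round trip.
lua = CountingClient()
lua.eval_script(4)

# Lock path: acquire + 4 plain commands + release -> 6 round trips,
# and no other client may proceed while the lock is held.
locked = CountingClient()
locked.command("SET", "lock", "token", "NX")   # acquire the lock
for _ in range(4):
    locked.command("SADD", "some_key", "member")
locked.command("DEL", "lock")                  # release the lock
```

With per-command latencies of around a millisecond, that difference multiplies across every cached query, which matches the order-of-magnitude slowdown we observed.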

At this point, I could say the hardest problems were solved. All that remained was to separate our patch into its own files, so that the original files stayed intact for later updates from the main repo. We added a dedicated mode for Redis Cluster, since not every project will use cluster mode, and created a proxy to switch easily between cluster and normal modes. The idea was simple: we created two separate classes, e.g. Invalidation and InvalidationCluster, then another class called InvalidationProxy that shares the same interface as both. The InvalidationProxy holds an Invalidation or an InvalidationCluster, and we changed the code to use the InvalidationProxy class instead of Invalidation directly. Now, every time we invoke a function on the proxy, it delegates to the right implementation based on our configuration. Let's check the code for a closer look.

# Invalidation and InvalidationCluster are defined elsewhere and share
# the same interface.
class InvalidationProxy:
    def __init__(self):
        if settings.CLUSTER_ENABLED:
            self.invalidation_obj = InvalidationCluster()
        else:
            self.invalidation_obj = Invalidation()

    def invalidate(self, key):
        self.invalidation_obj.invalidate(key)

Finally, we added a few more extensions and enhancements, such as a key prefix and key TTL configuration, fixed the unit tests, and we were good to go!

Drawbacks

Removing the Lua scripts made our django-cacheops non-atomic, which can create plenty of race conditions. However, our monitoring showed that while these race conditions do happen, they don't impact our cache results much. Let's walk through them:

  1. Delete, then insert/update. The deleted key gets recreated, polluting the keyspace with orphan keys. Since we set a very short TTL on every Redis key, usually 60-120 seconds, these orphan keys are automatically wiped out after one to two minutes.
  2. Update/insert, then delete. We lose the cached data. However, we rarely delete database records, especially in the tables big enough to need cacheops, so this situation is relatively rare. If it does happen, all we need is for another query to come along and re-cache the result. Our queries are heavy, but not so heavy that a few cache misses would kill the database.
  3. Dirty writes/reads. Two updates or inserts happen at the same time, and one write overwrites the other. Right now we don't have a solution for this; we accept it as a drawback and keep the TTL of the keys short, so if a dirty write happens, it only lives for a short period.
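
The TTL mitigation from points 1 and 3 can be sketched with a toy in-memory cache (an illustration only, not Redis itself): a key recreated by a losing racer simply expires on its own.

```python
import time

class ToyCache:
    """In-memory stand-in for Redis SET with EX, for illustration only."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, ex):
        self._data[key] = (value, time.monotonic() + ex)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry, as Redis does on access
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)

cache = ToyCache()
# Race from point 1: the DELETE is processed first, then a stale
# concurrent write recreates the key as an orphan.
cache.delete("user_tbl:202cb962ac59075b964b07152d234b70")
cache.set("user_tbl:202cb962ac59075b964b07152d234b70", "stale row",
          ex=0.1)  # 60-120 s in production; shortened for the demo
time.sleep(0.2)
assert cache.get("user_tbl:202cb962ac59075b964b07152d234b70") is None
```

The orphan key is wrong for at most one TTL window, which is the bound we decided we could live with.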

Did we try to fix the race conditions? Yes. We tried Redlock and plenty of other locks to improve our distributed locking performance, but the results were not significantly better. We concluded that it was not our algorithm but the locking mechanism itself that hurt performance.

Conclusion

Our solution is serving its purpose well, although there are still drawbacks and room for improvement. We opened a PR to contribute our work back, but sadly it was rejected because our tweak follows a different philosophy from the owner's: we are willing to accept dirty data as a trade-off for better performance, while the owner prefers the opposite. Shortly after that, however, the maintainer announced that he would add Redis Cluster support himself. That release will eventually replace our fork, but we learned a lot throughout the implementation. I hope this article helps you somehow and someway.