Optimize application memory usage on Amazon ElastiCache for Redis and Amazon MemoryDB for Redis

{"value":"[Amazon MemoryDB for Redis](https://aws.amazon.com/memorydb/) and [Amazon ElastiCache for Redis](https://aws.amazon.com/elasticache/redis/) are in-memory data stores. While ElastiCache is commonly used as a cache, MemoryDB is a durable database designed for applications with high performance requirements.\n\nCustomers [love Redis](https://insights.stackoverflow.com/survey/2021#most-loved-dreaded-and-wanted-database-love-dread) as an in-memory data engine. As data used and accessed grows exponentially, making the most of the memory available becomes increasingly important. In this post, I provide multiple strategies with code snippets to help you reduce your application’s memory consumption when using MemoryDB and ElastiCache for Redis. This helps to optimize costs and allows you to fit more data within your instances in your existing cluster.\n\nBefore going into these optimizations, remember ElastiCache for Redis supports [data tiering](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/data-tiering.html), which automatically places data across memory and local, high-performance solid state drives (SSD). Data tiering is ideal for applications that access up to 20% of their datasets regularly. ElastiCache for Redis provides a convenient way to scale your clusters at a lower cost to up to a petabyte of data. It can enable [over 60% savings per GB of capacity](https://aws.amazon.com/blogs/database/scale-your-amazon-elasticache-for-redis-clusters-at-a-lower-cost-with-data-tiering/) while having minimal performance impact for workloads that access a subset of their data regularly. ElastiCache for Redis also supports [auto scaling to automatically adjust your cluster horizontally](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoScaling.html) by adding or removing shards or replica nodes.\n\n#### **Prerequisites**\n\nFor this walkthrough, you need the following prerequisites:\n\n1. An [AWS account](https://portal.aws.amazon.com/billing/signup) (you can use the [AWS Free Tier](https://aws.amazon.com/memorydb/))\n2. An ElastiCache for Redis or MemoryDB cluster (a single instance is enough)\n3. Your local machine or a remote environment such as [AWS Cloud9](https://aws.amazon.com/cloud9/) with connectivity to your cluster\n4. The redis-cli client to [connect remotely](https://docs.aws.amazon.com/memorydb/latest/devguide/getting-startedclusters.connecttonode.html) to your instance\n5. Python 3.5 or newer with the following libraries\n\nTo run some of the examples in this post, you need the following Python libraries:\n\n```\\npip install redis-py-cluster # to connect to your elasticache or memorydb cluster\\npip install faker # to simulate different types of data\\npip install msgpack # to serialize complex data in binary format\\npip install lz4 pyzstd # to compress long and short data types\\n```\n \nTo know how much memory is used, [redis-cli](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/GettingStarted.ConnectToCacheNode.html) has the ```memory usage``` command:\n\n```\\nredis-cli -h <your_instance_endpoint> --tls -p 6379\\n>>memory usage \\"my_key\\"\\n(integer) 153\\n```\n\nTo connect to our Redis cluster from Python, we use [redis-py-cluster](https://redis-py-cluster.readthedocs.io/en/stable/). In our examples, we ignore the creation of the ```RedisCluster``` object for simplicity. 
If you're performing multiple operations, consider using pipelines. [Pipelines](https://redis-py-cluster.readthedocs.io/en/stable/pipelines.html) allow you to batch multiple operations and save network round trips for higher performance. See the following code:

```
>>pipe = redis_db.pipeline()
>>pipe.set(key_1, value_1)
...
>>pipe.set(key_n, value_n)
>>pipe.execute()
```

To check the size of an item before inserting it into Redis, we use the following code:

```
>>import sys
>>x = 2 # x can be any python object
>>sys.getsizeof(x) # returns the size in bytes of the object x
24
```

To simulate more realistic data, we use the library Faker:

```
>>from faker import Faker
>>fake = Faker()
>>fake.name()
'Lucy Cechtelar'
>>fake.address()
'426 Jordy Lodge'
```

#### **Basic optimizations**

Before moving on to advanced optimizations, we apply the basic ones. These are easy manipulations, so we only show brief code for them.

In our example, we assume we have a big list of key-value pairs. As the keys, we use the IP addresses of hypothetical visitors to our website. As the values, we store a counter of visits, the visitor's reported name, and their recent actions:

```
IP:123.82.92.12 → {"visits":"1", "name":"John Doe", "recent_actions": "visit,checkout,purchase"},
IP:3.30.7.124 → {"visits":"12", "name":"James Smith", "recent_actions": "purchase,refund"},
...
IP:121.66.3.5 → {"visits":"5", "name":"Peter Parker", "recent_actions": "visit,visit"}
```

Use the following code to insert these programmatically:

```
redis_db.hset("IP:123.82.92.12", mapping={"visits":"1", "name":"John Doe", "recent_actions": "visit,checkout,purchase"})
redis_db.hset("IP:3.30.7.124", mapping={"visits":"12", "name":"James Smith", "recent_actions": "purchase,refund"})
...
redis_db.hset("IP:121.66.3.5", mapping={"visits":"5", "name":"Peter Parker", "recent_actions": "visit,visit"})
```

#### **Reduce field names**

Redis field names consume memory each time they're used, so you can save space by keeping names as short as possible. In our previous example, instead of ```visits``` as the field name, we may want to use ```v```. Similarly, we can use ```n``` instead of ```name```, and ```r``` instead of ```recent_actions```. We can also shorten the key prefix to ```i``` instead of ```IP```.

Within the fields themselves, you can also replace common words with symbols. For example, we can switch each recent action to its initial character (```v``` instead of ```visit```).

The following is how our previous example looks after this simplification:

```
i:123.82.92.12 → {"v":"1", "n":"John Doe", "r": "vcp"},
i:3.30.7.124 → {"v":"12", "n":"James Smith", "r": "pr"},
...
i:121.66.3.5 → {"v":"5", "n":"Peter Parker", "r": "vv"}
```

This results in 23% memory savings in our specific example.
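The shortened writes could look like the following minimal sketch. It reuses the ```redis_db``` client from before; the ```ACTIONS``` table and the ```write_visitor``` helper are ours, introduced only for illustration:

```
ACTIONS = {"visit": "v", "checkout": "c", "purchase": "p", "refund": "r"} # our own one-letter action codes

def write_visitor(ip: str, visits: int, name: str, recent_actions: list):
    """Store a visitor hash using a one-letter key prefix, field names, and action codes."""
    short_actions = "".join(ACTIONS[a] for a in recent_actions)
    redis_db.hset("i:" + ip, mapping={"v": str(visits), "n": name, "r": short_actions})

write_visitor("123.82.92.12", 1, "John Doe", ["visit", "checkout", "purchase"])
```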
#### **Use position to indicate the data type**

If all fields are always present, we can use a list instead of a hash, and the position tells us which field a value belongs to. This allows us to remove the field names altogether. See the following code:

```
i:123.82.92.12 → [1, "John Doe", "vcp"],
i:3.30.7.124 → [12, "James Smith", "pr"],
...
i:121.66.3.5 → [5, "Peter Parker", "vv"]
```

This results in an additional 14% memory savings in our case.

#### **Serialize complex types**

There are different ways to serialize complex objects so that you can store them efficiently. Most languages have their own serialization libraries (pickle in Python, Serializable in Java, and so on). Some libraries work across languages and are often more space efficient, such as Protobuf or MsgPack.

The following code shows an example using [MsgPack](https://msgpack.org/):

```
import msgpack

def compress(data: object) -> bytes:
    return msgpack.packb(data, use_bin_type=True)

def write(key, value):
    key_bytes = b'i:' + compress(key) # we can serialize the key too
    value_bytes = compress(value)
    redis_db.set(key_bytes, value_bytes)

write([121,66,3,5], [134,"John Doe","vcp"])
```

In this case, the original object is 73 bytes, whereas the serialized object is 49 bytes (a 33% space reduction).

Recovering the value is very convenient with MsgPack, which returns a Python object ready to use:

```
def decompress(data: bytes) -> object:
    return msgpack.unpackb(data, raw=False)

def read(key):
    value_bytes = redis_db.get(b'i:' + compress(key)) # rebuild the same serialized key
    return decompress(value_bytes)

# now we can recover the value object
value = read([121,66,3,5])
```

#### **Redis-specific optimizations**

To provide some of its functionality, such as fast access and TTL, Redis may need additional memory beyond the space the data itself requires. The next two sections help reduce that overhead to a minimum. After that, we show some probabilistic structures that can further reduce memory.

#### **Move from strings or lists to hashes**

In our initial example, we have many small records, each stored under its own key. Each key in Redis carries between 60 bytes (without expire) and 100 bytes (with expire) of additional overhead, which is meaningful if we store several million items (100 million entries x 100 bytes = 10 GB of overhead). In the first example, these are stored as Redis lists:

```
"i:123.82.92.12" → [1, "John Doe", "vcp"], 
"i:3.30.7.124" → [12, "James Smith", "pr"],
...
"i:121.66.3.5" → [5, "Peter Parker", "vv"]
```

In the resulting optimization, all entries are stored as fields of a single hash:

```
mydata → {
    "i:123.82.92.12" : "1, John Doe, vcp",
    "i:3.30.7.124" : "12, James Smith, pr",
    ...
    "i:121.66.3.5" : "5, Peter Parker, vv"
}
```

This allows us to save about 90 bytes per entry. In the preceding example, because each entry is relatively small, this represents about 40% less memory usage.
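A minimal sketch of this migration could look like the following. The ```mydata``` hash name and the comma-separated value format are just the conventions used in this example:

```
visitors = {
    "i:123.82.92.12": "1, John Doe, vcp",
    "i:3.30.7.124": "12, James Smith, pr",
    "i:121.66.3.5": "5, Peter Parker, vv",
}

# One hash field per visitor instead of one top-level key per visitor
redis_db.hset("mydata", mapping=visitors)

# Reading a single visitor back
value = redis_db.hget("mydata", "i:121.66.3.5")
```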
#### **Convert to smaller hashes (using ziplist)**

Hashes in Redis can be encoded in memory either as a hash table or as a ziplist (listpacks in Redis 7). Which encoding is used depends on two parameters in your [parameter group](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ParameterGroups.html):

- ```hash-max-ziplist-value``` (64 by default)
- ```hash-max-ziplist-entries``` (512 by default)

If a hash exceeds either of these limits, it's automatically stored as a hash table instead of a ziplist. A hash table uses roughly twice as much memory as a ziplist, but it can be faster for big hashes. The idea is to keep using ziplists while keeping the number of items in each hash to a reasonable number.

To store items efficiently in ziplists, we split our single big hash into many similarly sized small hashes:

```
mydata:001 → {
    i:12.82.92.12 : [1, "John Doe", "vcp"],
    i:34.30.7.124 : [12, "James Smith", "pr"],
    i:121.66.3.5 : [5, "Peter Parker", "vv"]
}
mydata:002 → {
    i:1.82.92.12 : [1, "John Doe", "vcp"],
    i:9.30.7.124 : [12, "James Smith", "pr"],
    i:11.66.3.5 : [5, "Peter Parker", "vv"]
}
...
mydata:999 → {
    i:23.82.92.12 : [1, "John Doe", "vcp"],
    i:33.30.7.124 : [12, "James Smith", "pr"],
    i:21.66.3.5 : [5, "Peter Parker", "vv"]
}
```

To achieve this space efficiency, we use the following code:

```
import binascii

SHARDS = 1000 # number of buckets; aim at less than 1,000 items per bucket
PREFIX = "mydata:" # prefix to find the data easily in our database

def get_shard_key(key: bytes) -> str:
    """
    Computes the shard key for the given key, based on its CRC and the number of shards.
    """
    shard_id = binascii.crc32(key) % SHARDS # use modulo to get exactly 1000 buckets
    return PREFIX + str(shard_id)

def write(key, value):
    shard_key = get_shard_key(key) # the shard is a function of the key
    redis_db.hset(shard_key, key, value)

write(b'i:21.66.3.5', b'1, John Doe, vcp')
```

To read the values back, we use the following code:

```
def read(key):
    shard_key = get_shard_key(key)
    return redis_db.hget(shard_key, key)

value = read(b'i:21.66.3.5')
```

The following screenshot shows an example of how to edit a parameter group on the [Amazon ElastiCache](https://aws.amazon.com/elasticache/) console.

![image.png](https://dev-media.amazoncloud.cn/5bceea2f33394a528f3dd62e8e0cbcd3_image.png)

To make sure you're actually using a ziplist, you can use the ```object encoding``` command:

```
>>object encoding "mydata:001"
"ziplist"
>>memory usage "mydata:001"
(integer) 5337
```

Memory usage should also be about 40% less than if we stored the data as a hash table. If you don't see the encoding reported as ziplist, check the two parameters and make sure both conditions are satisfied.

#### **Use probabilistic structures**

If you need to count the number of items in a set but don't need the count to be exact, a [HyperLogLog](https://redis.com/redis-best-practices/counting/hyperloglog/) is a natively supported probabilistic data structure for counting unique items. The [HyperLogLog algorithm](https://en.wikipedia.org/wiki/HyperLogLog) can estimate cardinalities of over 10^9 with a typical accuracy (standard error) of 2%, using about 1.5 kB of memory.
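As an illustration, here is a minimal sketch that counts unique visitor IPs with a HyperLogLog, using the client's ```pfadd``` and ```pfcount``` methods (thin wrappers over the PFADD and PFCOUNT commands) and the same ```redis_db``` client as before; the ```unique_visitors``` key name is ours:

```
# Add visitor IPs to a HyperLogLog; duplicates don't increase the count
redis_db.pfadd("unique_visitors", "123.82.92.12", "3.30.7.124", "121.66.3.5")
redis_db.pfadd("unique_visitors", "123.82.92.12") # already counted

# Approximate number of distinct IPs (2% typical error), using ~1.5 kB regardless of volume
print(redis_db.pfcount("unique_visitors")) # 3
```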
[Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) are another highly space- and time-efficient probabilistic data structure. Although not natively supported, they can be implemented on top of Redis. Bloom filters test whether an element is a member of a set with a small chance of false positives (but no false negatives). They let you store items in less than 10 bits each for a 1% false positive probability.

If you simply need to check for equivalence with a minimal chance of collision, you can store a hash of the content. For example, you can use a fast, non-cryptographic hash function such as [xxHash](https://github.com/Cyan4973/xxHash):

```
>>import xxhash
>>hashed_content = xxhash.xxh32_intdigest(b'134, John William Doe, vcpvvcppvvcc')
>>redis_db.set("my_key", hashed_content)
```

There is no way to recover the original value (unless you stored it elsewhere), but we can use the hashed version to check whether the content already exists:

```
>>new_content = xxhash.xxh32_intdigest(b'131, Peter Doe, vcpvvcppvvcc')
>>if int(redis_db.get("my_key")) == new_content: # the stored digest comes back as bytes
>>    print("No need to store")
```

These techniques are very effective (often reducing the memory required by 99%) but have some trade-offs, such as the possibility of false positives.

#### **Data compression**

One of the easiest ways to reduce memory consumption is to reduce the size of keys and values. This isn't specific to Redis, but it applies particularly well to it.

In general, compression for Redis (and for semistructured and structured databases) is very different from file compression. This is because in Redis we usually store short fields.

#### **Compression of long data**

To compress long data, you can use popular compression algorithms such as Gzip, LZO, or Snappy. In this example, we use [LZ4](https://github.com/lz4/lz4), which has a good balance between speed and compression ratio:

```
import lz4.frame
from faker import Faker

fake = Faker()

def compress(text: str) -> bytes:
    text_bytes = str.encode(text) # convert to bytes
    return lz4.frame.compress(text_bytes)

text = fake.paragraph(nb_sentences=100) # generate a paragraph with 100 sentences
compressed_bytes = compress(text)
```

To recover the original value, we simply pass the compressed bytes back to the LZ4 library:

```
def decompress(data: bytes) -> str:
    return lz4.frame.decompress(data).decode() # decompress and convert back to str

original = decompress(compressed_bytes)
```

In this example, the compressed string is 20% smaller than the original one. Note that for small strings (which are more common in a database), this is unlikely to produce good results. If your data is very big, you may also consider storing it elsewhere, such as [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3).

As mentioned earlier, for small strings (something more common in Redis), this doesn't produce great results. We could try to compress groups of values (for example, a shard of values), but that adds complexity to our code, so the next options are more likely to help.

#### **Custom base encoding**

One option to reduce the space used is to take advantage of the limited range of characters of a particular data type. For example, our application may enforce that user names can only contain lowercase letters (a-z) and numbers (0–9). If that is the case, [base36 encoding](https://en.wikipedia.org/wiki/Base36) is more efficient than simply storing these values as strings.
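A minimal sketch of base36 encoding for such user names follows. Python's built-in ```int(s, 36)``` does the encoding; the ```ALPHABET36``` constant and the helper names are ours, introduced only for illustration:

```
ALPHABET36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def encode_username(username: str) -> int:
    # Interpret the lowercase alphanumeric name as a base-36 number
    return int(username, 36)

def decode_username(n: int) -> str:
    # Rebuild the string digit by digit (note: leading zeros are not preserved)
    chars = ""
    while n:
        chars = ALPHABET36[n % 36] + chars
        n //= 36
    return chars or "0"

encoded = encode_username("john42") # 1189991234, which fits in 4 bytes instead of 6 characters
decoded = decode_username(encoded)  # 'john42'
```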
You can also create your own base for your own alphabet. In this example, we assume we're recording all the moves of a player in the game Rock, Paper, Scissors. To encode the moves efficiently, we store each move as a single-character string: rock as 0, paper as 1, and scissors as 2. We can then compress the sequence further by storing it as an integer that uses all potential values:

```
>>BASE=3
>>ALPHABET="012"
>>player_moves=b"20110012200120000110" # 0 is rock, 1 is paper, 2 is scissors
>>compressed=int(player_moves, BASE) # convert to a base-3 number
>>compressed
2499732264
```

Although the original ```player_moves``` string would take 20 bytes, the compressed integer can be stored in just 4 bytes (a 75% space reduction). Decompressing it is slightly more involved:

```
def decompress(s: int) -> str:
    res = ""
    while s:
        res += ALPHABET[s % BASE]
        s //= BASE
    return res[::-1] or "0"

decompress(2499732264) # This will return '20110012200120000110'
```

Using base36 can provide around a 64% reduction in space used. There are some [commonly used encodings](https://en.wikipedia.org/wiki/Binary-to-text_encoding) and some [easy-to-use libraries](https://www.npmjs.com/package/base-x).

#### **Compression of short strings with a domain dictionary**

Most compression algorithms (such as Gzip, LZ4, and Snappy) are not very effective at compressing small strings (such as a person's name or a URL). This is because they learn about the data as they compress it.

An effective technique is to use a [pre-trained dictionary](https://pyzstd.readthedocs.io/en/latest/#dictionary). In the following example, we use Zstandard to train a dictionary that we then use to compress and decompress a job name:

```
import pyzstd
from faker import Faker

fake = Faker()
Faker.seed(0)

def data() -> list:
    samples = []
    for i in range(1024): # use 1,000+ jobs to train
        tmp_bytes = str.encode(fake.job())
        samples.append(tmp_bytes)
    return samples

job_dictionary = pyzstd.train_dict(data(), 100*1024)
```

After you have the ```job_dictionary```, compressing any job is straightforward:

```
def compress(job_string) -> bytes:
    return pyzstd.compress(job_string, 10, job_dictionary)

compressed_job = compress(b'Armed forces technical officer') # we can compress any job
```

Decompressing is also straightforward, but you need the original dictionary:

```
uncompressed_job = pyzstd.decompress(compressed_job, job_dictionary)
```

In this simple example, the original uncompressed job was 30 bytes, whereas the compressed job was 21 bytes (30% space savings).
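Putting this together with Redis, a minimal sketch could store the dictionary-compressed bytes as values and decompress them on read. The ```write_job```/```read_job``` helpers and the ```job:1234``` key are ours for illustration, and the trained ```job_dictionary``` must be available wherever the values are read:

```
def write_job(key: str, job: str):
    # Store the dictionary-compressed bytes instead of the raw string
    redis_db.set(key, pyzstd.compress(job.encode(), 10, job_dictionary))

def read_job(key: str) -> str:
    return pyzstd.decompress(redis_db.get(key), job_dictionary).decode()

write_job("job:1234", "Armed forces technical officer")
print(read_job("job:1234")) # 'Armed forces technical officer'
```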
#### **Other encoding techniques for short fields**

Short-string compression is less common than file compression, but you can use other techniques depending on the use case. You can get inspiration from [Amazon Redshift](http://aws.amazon.com/redshift) [compression encodings](https://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html):

- [Byte-dictionary](https://docs.aws.amazon.com/redshift/latest/dg/c_Byte_dictionary_encoding.html) – If the cardinality of your data is small, you can store an index into a dictionary of symbols instead of the full string. An example is the country of a customer: because there is only a small number of countries, you can save a lot of space by storing just the index of the country.
- [Delta](https://docs.aws.amazon.com/redshift/latest/dg/c_Delta_encoding.html) – Values like the time of purchase can be stored as the difference from a reference timestamp instead of a full timestamp, which takes less space.
- [Mostly](https://docs.aws.amazon.com/redshift/latest/dg/c_MostlyN_encoding.html) – Mostly encodings are useful when the data type is larger than most of the stored values require.
- [Run length](https://docs.aws.amazon.com/redshift/latest/dg/c_Runlength_encoding.html) – Run length encoding replaces a value that is repeated consecutively with a token consisting of the value and the number of consecutive occurrences (the length of the run); a short sketch follows below.
- [Text255 and text32k](https://docs.aws.amazon.com/redshift/latest/dg/c_Text255_encoding.html) – Text255 and text32k encodings are useful for compressing string fields in which the same words recur often.

In general, short-string compression benefits from structural constraints (for example, an IPv4 address is composed of four numbers between 0–255) and from distribution information (for example, many consumer emails use a free domain like gmail.com, yahoo.com, or hotmail.com).
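To illustrate the run-length idea applied to our earlier recent-actions strings, here is a minimal, hand-rolled sketch. The ```rle_encode```/```rle_decode``` helpers are ours, not a Redis or Amazon Redshift feature:

```
from itertools import groupby

def rle_encode(actions: str) -> str:
    # "vvvvvcp" -> "5v1c1p": store each run as <count><symbol>
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(actions))

def rle_decode(encoded: str) -> str:
    result, count = "", ""
    for ch in encoded:
        if ch.isdigit():
            count += ch
        else:
            result += ch * int(count)
            count = ""
    return result

print(rle_encode("vvvvvcpvv")) # '5v1c1p2v'
print(rle_decode("5v1c1p2v")) # 'vvvvvcpvv'
```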
#### **Decision table**

The following table presents a summary of the techniques and the requirements to apply each of them.

![image.png](https://dev-media.amazoncloud.cn/86b2092ead58459ea3ce7196956e452c_image.png)

#### **Conclusion**

In this post, I showed you multiple ways to optimize the memory consumption of your Redis instances, which can have a positive effect on both cost and performance. As with any optimization, there are trade-offs between space and the time spent reading and writing. You may also lose some native Redis capabilities (such as eviction policies) depending on how you store the data. Redis also has a good [memory optimization guide](https://docs.redis.com/latest/ri/memory-optimizations/) that overlaps in part with this post.

How do you efficiently store data in your Redis-compatible database? Let me know in the comments section.

#### **About the Author**

![image.png](https://dev-media.amazoncloud.cn/6f03b5d33f4b43d4a5a028f7336b20e7_image.png)

**Roger Sindreu** is a Solutions Architect with Amazon Web Services. He was a database engineer and has led engineering teams for the last 20 years. His interests include AI/ML, databases, FSI, and everything related to AWS.