压缩列表简介:
是由特殊编码的内存块构成的列表,其是Redis的列表建和哈希键的底层实现之一。和整数集合一样,二者都是为Redis节省内存而开发的数据结构。ziplist可以用来存放字符串或者整数,其存储数据的特点是:比较小的整数或比较短的字符串。Redis的列表建,哈希键,有序集合的底层实现都用到了ziplist。
ziplist.c上的介绍
/* The ziplist is a specially encoded dually linked list that is designed
* to be very memory efficient. It stores both strings and integer values,
* where integers are encoded as actual integers instead of a series of
* characters. It allows push and pop operations on either side of the list
* in O(1) time. However, because every operation requires a reallocation of
* the memory used by the ziplist, the actual complexity is related to the
* amount of memory used by the ziplist.
翻译:
ziplist是一个特殊编码的双链表,设计的非常省内存。它同时存储字符串和整数值,其中整数编码为实际的整数而不是一系列字符。它允许在列表的两侧进行推送和弹出操作,时间只要O(1)。然而因为每个操作都需要重新分配内存,这个实际的复杂度与ziplist目前用的内存大小有关。
ZIPLIST OVERALL LAYOUT
* ======================
*
* The general layout of the ziplist is as follows:
*
* <zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
*
* NOTE: all fields are stored in little endian, if not specified otherwise.
*
* <uint32_t zlbytes> is an unsigned integer to hold the number of bytes that
* the ziplist occupies, including the four bytes of the zlbytes field itself.
* This value needs to be stored to be able to resize the entire structure
* without the need to traverse it first.
*
* <uint32_t zltail> is the offset to the last entry in the list. This allows
* a pop operation on the far side of the list without the need for full
* traversal.
*
* <uint16_t zllen> is the number of entries. When there are more than
* 2^16-2 entries, this value is set to 2^16-1 and we need to traverse the
* entire list to know how many items it holds.
*
* <uint8_t zlend> is a special entry representing the end of the ziplist.
* Is encoded as a single byte equal to 255. No other normal entry starts
* with a byte set to the value of 255.
翻译:
zpilist的整体结构:
<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
<uint32_t zlbytes>:是一个无符号整数保存着ziplist占的大小,包括四个字节自己(uint32_t)的长度。它需要存的原因是当resizeziplist的时候不需要遍历。
<uint32_t zltail>:是最后一个entry节点的偏移量,这使得最远端弹出操作不需要全部的遍历
<uint16_t zllen>:entry数量,当有超过2^16-1个entry的时候,它只会保存2^16-1,所以我们需要遍历才知道具体个个数
<uint8_t zlend>:是一个特殊的entry在代表着ziplist的末尾,编码为等于255的字节。没有其他的entry会被设置成255.
ZIPLIST ENTRIES
* ===============
*
* Every entry in the ziplist is prefixed by metadata that contains two pieces
* of information. First, the length of the previous entry is stored to be
* able to traverse the list from back to front. Second, the entry encoding is
* provided. It represents the entry type, integer or string, and in the case
* of strings it also represents the length of the string payload.
* So a complete entry is stored like this:
*
* <prevlen> <encoding> <entry-data>
*
* Sometimes the encoding represents the entry itself, like for small integers
* as we'll see later. In such a case the <entry-data> part is missing, and we
* could have just:
*
* <prevlen> <encoding>
*
* The length of the previous entry, <prevlen>, is encoded in the following way:
* If this length is smaller than 254 bytes, it will only consume a single
* byte representing the length as an unsinged 8 bit integer. When the length
* is greater than or equal to 254, it will consume 5 bytes. The first byte is
* set to 254 (FE) to indicate a larger value is following. The remaining 4
* bytes take the length of the previous entry as value.
*
* So practically an entry is encoded in the following way:
*
* <prevlen from 0 to 253> <encoding> <entry>
*
* Or alternatively if the previous entry length is greater than 253 bytes
* the following encoding is used:
*
* 0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>
*
* The encoding field of the entry depends on the content of the
* entry. When the entry is a string, the first 2 bits of the encoding first
* byte will hold the type of encoding used to store the length of the string,
* followed by the actual length of the string. When the entry is an integer
* the first 2 bits are both set to 1. The following 2 bits are used to specify
* what kind of integer will be stored after this header. An overview of the
* different types and encodings is as follows. The first byte is always enough
* to determine the kind of entry.
*
* |00pppppp| - 1 byte
* String value with length less than or equal to 63 bytes (6 bits).
* "pppppp" represents the unsigned 6 bit length.
* |01pppppp|qqqqqqqq| - 2 bytes
* String value with length less than or equal to 16383 bytes (14 bits).
* IMPORTANT: The 14 bit number is stored in big endian.
* |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
* String value with length greater than or equal to 16384 bytes.
* Only the 4 bytes following the first byte represents the length
* up to 32^2-1. The 6 lower bits of the first byte are not used and
* are set to zero.
* IMPORTANT: The 32 bit number is stored in big endian.
* |11000000| - 3 bytes
* Integer encoded as int16_t (2 bytes).
* |11010000| - 5 bytes
* Integer encoded as int32_t (4 bytes).
* |11100000| - 9 bytes
* Integer encoded as int64_t (8 bytes).
* |11110000| - 4 bytes
* Integer encoded as 24 bit signed (3 bytes).
* |11111110| - 2 bytes
* Integer encoded as 8 bit signed (1 byte).
* |1111xxxx| - (with xxxx between 0000 and 1101) immediate 4 bit integer.
* Unsigned integer from 0 to 12. The encoded value is actually from
* 1 to 13 because 0000 and 1111 can not be used, so 1 should be
* subtracted from the encoded 4 bit value to obtain the right value.
* |11111111| - End of ziplist special entry.
*
* Like for the ziplist header, all the integers are represented in little
* endian byte order, even when this code is compiled in big endian systems.
翻译:
每一个entry在ziplist以包含两条数据的元数据开头。第一个是前一个entry的长度,只有可以让从后往前遍历,第二个是目前这个entry的编码格式。它代表这个entry的type,整数还是字符串,在字符串的情况,它同样表示了字符串负载?长。所以一个完整的entry是这样子的:
<prevlen> <encoding> <entry-data>
有时候encoding表示这个entry,例如我们后面会提到的小整数,在这种情况<entry-data>会消失。就变成如下样子:
<prevlen> <encoding>
前一个entry的长度被以下几种方式编码:
如果长度小于254字节,它只会消耗一个字节,u_int8。当比254要长的时候,它会消耗五个字节。第一个字节被设置为254(FE)去表明它会包含较大的长度。剩下的四个字节就表示entry的数目。
所以实际上一个entry的编码是:
<prevlen from 0 to 253> <encoding> <entry>
或者是(比253要多)
0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>
entry的编码取决于entry的内容。当entry是一个字符串,最开始的两个比特会保存字符串长度的编码类别,后面是字符串的真实长度。
当entry是一个整数,最先两个比特会被设置为1,后面的两个比特用来表示具体是什么类型的整数。
下面是不同类型和不同编码的概览,第一个字节是足够来标明这些信息的
|00pppppp| - 1 byte
entry是一个字符串,长度小于63
|01pppppp|qqqqqqqq| - 2 bytes
字符串小于16383byte
|10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
字符串的长度大于16384字节,只有后面四个字节表示长度
|11000000| - 3 bytes
2个字节的整数 int16_t
|11010000| - 5 bytes
四个字节的整数 int32_t
|11100000| - 9 bytes
八个字节的整数 int64_t
|11110000| - 4 bytes
三个字节的整数
|11111110| - 2 bytes
一个字节的整数
|1111xxxx|
4个字节的整数 只能在0001-1101:1-13
|11111111|
特殊的entry表示ziplist结束
* EXAMPLES OF ACTUAL ZIPLISTS
* ===========================
*
* The following is a ziplist containing the two elements representing
* the strings "2" and "5". It is composed of 15 bytes, that we visually
* split into sections:
*
* [0f 00 00 00] [0c 00 00 00] [02 00] [00 f3] [02 f6] [ff]
* | | | | | |
* zlbytes zltail entries "2" "5" end
*
* The first 4 bytes represent the number 15, that is the number of bytes
* the whole ziplist is composed of. The second 4 bytes are the offset
* at which the last ziplist entry is found, that is 12, in fact the
* last entry, that is "5", is at offset 12 inside the ziplist.
* The next 16 bit integer represents the number of elements inside the
* ziplist, its value is 2 since there are just two elements inside.
* Finally "00 f3" is the first entry representing the number 2. It is
* composed of the previous entry length, which is zero because this is
* our first entry, and the byte F3 which corresponds to the encoding
* |1111xxxx| with xxxx between 0001 and 1101. We need to remove the "F"
* higher order bits 1111, and subtract 1 from the "3", so the entry value
* is "2". The next entry has a prevlen of 02, since the first entry is
* composed of exactly two bytes. The entry itself, F6, is encoded exactly
* like the first entry, and 6-1 = 5, so the value of the entry is 5.
* Finally the special entry FF signals the end of the ziplist.
*
* Adding another element to the above string with the value "Hello World"
* allows us to show how the ziplist encodes small strings. We'll just show
* the hex dump of the entry itself. Imagine the bytes as following the
* entry that stores "5" in the ziplist above:
*
* [02] [0b] [48 65 6c 6c 6f 20 57 6f 72 6c 64]
*
* The first byte, 02, is the length of the previous entry. The next
* byte represents the encoding in the pattern |00pppppp| that means
* that the entry is a string of length <pppppp>, so 0B means that
* an 11 bytes string follows. From the third byte (48) to the last (64)
* there are just the ASCII characters for "Hello World".
翻译:
ziplist的例子:
下面的ziplist包含两个元素字符串‘2’和字符串‘5’.它包括15个字节它会被分成这五个部分
* [0f 00 00 00] [0c 00 00 00] [02 00] [00 f3] [02 f6] [ff]
* | | | | | |
* zlbytes zltail entries "2" "5" end
一开始的四个字节表示数字15,表面整个ziplist的字节数。
第二块表示ziplist的最后一个元素的位置:12
第三块表示entry的个数:2
第四块的比特是:00000000 11110010 第一个字节因为前面没有entry所以为空,第二个字节的前四个比特可以对于上文中的分类
第五块的比特是00000010 11110110 第一个字节表示前一个entry的长度
第六块就是ziplist的固定结尾
增加一个元素‘hello world’在5后面
[02] [0b] [48 65 6c 6c 6f 20 57 6f 72 6c 64]
02是前一个entry的长度
0b表示是一个小于63的字符串
后面就是helloworld的的ascii码
ziplist的内存分配:
数据结构:
/* We use this function to receive information about a ziplist entry.
* Note that this is not how the data is actually encoded, is just what we
* get filled by a function in order to operate more easily. */
typedef struct zlentry {
unsigned int prevrawlensize; /* Bytes used to encode the previous entry len*/
unsigned int prevrawlen; /* Previous entry len. */
unsigned int lensize; /* Bytes used to encode this entry type/len.
For example strings have a 1, 2 or 5 bytes
header. Integers always use a single byte.*/
unsigned int len; /* Bytes used to represent the actual entry.
For strings this is just the string length
while for integers it is 1, 2, 3, 4, 8 or
0 (for 4 bit immediate) depending on the
number range. */
unsigned int headersize; /* prevrawlensize + lensize. */
unsigned char encoding; /* Set to ZIP_STR_* or ZIP_INT_* depending on
the entry encoding. However for 4 bits
immediate integers this can assume a range
of values and must be range-checked. */
unsigned char *p; /* Pointer to the very start of the entry, that
is, this points to prev-entry-len field. */
} zlentry;
翻译:
我们使用这个函数(不知道为啥结构体翻译成函数)去存储ziplist的entry中的信息。注意这里的结构不是entry真正的编码结构,而是为了我们抽象出来使我们操作更容易。
typedef struct zlentry {
unsigned int prevrawlensize; /* 编码上一个entry长度所需的字节数*/
unsigned int prevrawlen; /* 上一个entry的长度. */
unsigned int lensize; /* encoding占的长度*/
unsigned int len; /* 字节去表示真实的entry,对于string这个就是string的长度,而对于整数这个是012348都有可能 */
unsigned int headersize; /* prevrawlensize + lensize. */
unsigned char encoding; /* 被设定为ZIP_STR_* or ZIP_INT_*,但对于立即整数,这个被设置为一个范围内的值,必须被检查 */
unsigned char *p; /* 指向entry的最开始 */
} zlentry;
API
创建:
/* The size of a ziplist header: two 32 bit integers for the total
* bytes count and last item offset. One 16 bit integer for the number
* of items field. */
#define ZIPLIST_HEADER_SIZE (sizeof(uint32_t)*2+sizeof(uint16_t))
/* Size of the "end of ziplist" entry. Just one byte. */
#define ZIPLIST_END_SIZE (sizeof(uint8_t))
/* Return total bytes a ziplist is composed of. */
#define ZIPLIST_BYTES(zl) (*((uint32_t*)(zl)))
/* Return the offset of the last item inside the ziplist. */
#define ZIPLIST_TAIL_OFFSET(zl) (*((uint32_t*)((zl)+sizeof(uint32_t))))
/* Return the length of a ziplist, or UINT16_MAX if the length cannot be
* determined without scanning the whole ziplist. */
#define ZIPLIST_LENGTH(zl) (*((uint16_t*)((zl)+sizeof(uint32_t)*2)))
/* Create a new empty ziplist. */
unsigned char *ziplistNew(void) {
unsigned int bytes = ZIPLIST_HEADER_SIZE+ZIPLIST_END_SIZE;
unsigned char *zl = zmalloc(bytes);
ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
ZIPLIST_LENGTH(zl) = 0;
zl[bytes-1] = ZIP_END;
return zl;
}
插入元素
#define ZIPLIST_HEAD 0
#define ZIPLIST_TAIL 1
/* Return the pointer to the first entry of a ziplist. */
#define ZIPLIST_ENTRY_HEAD(zl) ((zl)+ZIPLIST_HEADER_SIZE)
/* Return the pointer to the last byte of a ziplist, which is, the
* end of ziplist FF entry. */
#define ZIPLIST_ENTRY_END(zl) ((zl)+intrev32ifbe(ZIPLIST_BYTES(zl))-1)
unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where) {
unsigned char *p;
p = (where == ZIPLIST_HEAD) ? ZIPLIST_ENTRY_HEAD(zl) : ZIPLIST_ENTRY_END(zl);
return __ziplistInsert(zl,p,s,slen);
}
这里的where只有两种选择一个使head一个使tail,不能中间插入
这里的p是指向第一个entry或者指向最后一个entry(ziplist没有值的时候指向ZIP_END)
/* Return the pointer to the first entry of a ziplist. */
#define ZIPLIST_ENTRY_HEAD(zl) ((zl)+ZIPLIST_HEADER_SIZE)
/* Return the pointer to the last entry of a ziplist, using the
* last entry offset inside the ziplist header. */
#define ZIPLIST_ENTRY_TAIL(zl) ((zl)+intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)))
#define ZIP_BIG_PREVLEN 254
/* Return the number of bytes used to encode the length of the previous
* entry. The length is returned by setting the var 'prevlensize'. */
#define ZIP_DECODE_PREVLENSIZE(ptr, prevlensize) do { \
if ((ptr)[0] < ZIP_BIG_PREVLEN) { \
(prevlensize) = 1; \
} else { \
(prevlensize) = 5; \
} \
} while(0)
/* Return the length of the previous element, and the number of bytes that
* are used in order to encode the previous element length.
* 'ptr' must point to the prevlen prefix of an entry (that encodes the
* length of the previous entry in order to navigate the elements backward).
* The length of the previous entry is stored in 'prevlen', the number of
* bytes needed to encode the previous entry length are stored in
* 'prevlensize'. */
返回entry中的prevlen和prevlen所占的size
#define ZIP_DECODE_PREVLEN(ptr, prevlensize, prevlen) do { \
ZIP_DECODE_PREVLENSIZE(ptr, prevlensize); \
if ((prevlensize) == 1) { \
(prevlen) = (ptr)[0]; \
} else { /* prevlensize == 5 */ \
(prevlen) = ((ptr)[4] << 24) | \
((ptr)[3] << 16) | \
((ptr)[2] << 8) | \
((ptr)[1]); \
} \
} while(0
/* Insert item at "p". */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
unsigned int prevlensize, prevlen = 0;
size_t offset;
int nextdiff = 0;
unsigned char encoding = 0;
long long value = 123456789; /* initialized to avoid warning. Using a value
that is easy to see if for some reason
we use it uninitialized. */
zlentry tail;
/* Find out prevlen for the entry that is inserted. */
if (p[0] != ZIP_END) { //如果ziplist有值
ZIP_DECODE_PREVLEN(p, prevlensize, prevlen); //获得p当前entry的prevlen
} else { //如果ziplist没有值
unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
if (ptail[0] != ZIP_END) {
prevlen = zipRawEntryLength(ptail); //最后一个entry的真实长度
}
}
/* See if the entry can be encoded */看这个entry能否编码
if (zipTryEncoding(s,slen,&value,&encoding)) {
/* 'encoding' is set to the appropriate integer encoding */
reqlen = zipIntSize(encoding); //存整数情况下的encoding长度
} else {
/* 'encoding' is untouched, however zipStoreEntryEncoding will use the
* string length to figure out how to encode it. */
reqlen = slen;
}
/* We need space for both the length of the previous entry and
* the length of the payload. */
reqlen += zipStorePrevEntryLength(NULL,prevlen); //前一个长度编码:1:5
reqlen += zipStoreEntryEncoding(NULL,encoding,slen); //encoding长度
/* When the insert position is not equal to the tail, we need to
* make sure that the next entry can hold this entry's length in
* its prevlen field. */
//当插入的地方不是最后面,我们需要保证下一个entry能保存前一个的长度 连锁升级
int forcelarge = 0;
nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
if (nextdiff == -4 && reqlen < 4) { //原来的entry要更大,但新entry<4
nextdiff = 0;
forcelarge = 1;
}
/* Store offset because a realloc may change the address of zl. */
offset = p-zl;
zl = ziplistResize(zl,curlen+reqlen+nextdiff); //新长度
p = zl+offset;
/* Apply memory move when necessary and update tail offset. */
if (p[0] != ZIP_END) {
/* Subtract one because of the ZIP_END bytes */
memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);
/* Encode this entry's raw length in the next entry. */
if (forcelarge)
zipStorePrevEntryLengthLarge(p+reqlen,reqlen); 存entry之后的pre
else
zipStorePrevEntryLength(p+reqlen,reqlen);
/* Update offset for tail */
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);
/* When the tail contains more than one entry, we need to take
* "nextdiff" in account as well. Otherwise, a change in the
* size of prevlen doesn't have an effect on the *tail* offset. */
zipEntry(p+reqlen, &tail);
if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
}
} else {
/* This element will be the new tail. */
ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
}
/* When nextdiff != 0, the raw length of the next entry has changed, so
* we need to cascade the update throughout the ziplist */
if (nextdiff != 0) {
offset = p-zl;
zl = __ziplistCascadeUpdate(zl,p+reqlen);
p = zl+offset;
}
entry的value赋值
/* Write the entry */
p += zipStorePrevEntryLength(p,prevlen);
p += zipStoreEntryEncoding(p,encoding,slen);
if (ZIP_IS_STR(encoding)) {
memcpy(p,s,slen);
} else {
zipSaveInteger(p,value,encoding);
}
ZIPLIST_INCR_LENGTH(zl,1);
return zl;
}
unsigned char *__ziplistCascadeUpdate(unsigned char *zl, unsigned char *p) {
size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), rawlen, rawlensize;
size_t offset, noffset, extra;
unsigned char *np;
zlentry cur, next;
while (p[0] != ZIP_END) {
zipEntry(p, &cur);
rawlen = cur.headersize + cur.len;
rawlensize = zipStorePrevEntryLength(NULL,rawlen);
/* Abort if there is no next entry. */
if (p[rawlen] == ZIP_END) break;
zipEntry(p+rawlen, &next);
/* Abort when "prevlen" has not changed. */
if (next.prevrawlen == rawlen) break;
if (next.prevrawlensize < rawlensize) {
/* The "prevlen" field of "next" needs more bytes to hold
* the raw length of "cur". */
offset = p-zl;
extra = rawlensize-next.prevrawlensize;
zl = ziplistResize(zl,curlen+extra);
p = zl+offset;
/* Current pointer and offset for next element. */
np = p+rawlen;
noffset = np-zl;
/* Update tail offset when next element is not the tail element. */
if ((zl+intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))) != np) {
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+extra);
}
/* Move the tail to the back. */
memmove(np+rawlensize,
np+next.prevrawlensize,
curlen-noffset-next.prevrawlensize-1);
zipStorePrevEntryLength(np,rawlen);
/* Advance the cursor */
p += rawlen;
curlen += extra;
} else {
if (next.prevrawlensize > rawlensize) {
/* This would result in shrinking, which we want to avoid.
* So, set "rawlen" in the available bytes. */
zipStorePrevEntryLengthLarge(p+rawlen,rawlen);
} else {
zipStorePrevEntryLength(p+rawlen,rawlen);
}
/* Stop here, as the raw length of "next" has not changed. */
break;
}
}
return zl;
}
这里的连锁升级比较好理解,需要注意的是,当插入的元素要比原来的len要小。从5byte变1byte的时候,redis不会做缩容的操作,而是选择依然用5byte来进行保存,具体调用的就是zipStorePrevEntryLengthLarge这个方法。
/* Encode the length of the previous entry and write it to "p". This only
* uses the larger encoding (required in __ziplistCascadeUpdate). */
int zipStorePrevEntryLengthLarge(unsigned char *p, unsigned int len) {
if (p != NULL) {
p[0] = ZIP_BIG_PREVLEN;
memcpy(p+1,&len,sizeof(len));
memrev32ifbe(p+1);
}
return 1+sizeof(len);
}
除了这里的不缩容的一个场景,上一篇学的intset也不支持缩容的操作,redis虽然是一个主打内存消耗小的产品,但是在内存的申请和释放还是比较保守的,之后谈的string类型中编码方式也有涉及内存的特殊操作。