Prometheus time series database - storage structure on disk

Preface

In the previous article, the author described in detail how Prometheus organizes monitoring data in memory. The storage structure on disk is just as interesting, and it is the subject of this article.

Disk directory structure

First, let's look at the directory structure that Prometheus creates once it is running. On the author's machine it looks like this:

prometheus-data
	|-01EY0EH5JA3ABCB0PXHAPP999D (block)
	|-01EY0EH5JA3QCQB0PXHAPP999D (block)
		|-chunks
			|-000001
			|-000002
			.....
			|-000021
		|-index
		|-meta.json
		|-tombstones
	|-wal
	|-chunks_head

Block

A Block is an independent small database that stores all the information needed to answer queries for a period of time, including label, index, and symbol table data. In essence, a Block is the in-memory data of a time window organized into files and persisted to disk.

The most recent Blocks usually store 2 hours of data each, while older Blocks are merged by the compactor, so a single Block may cover many hours. It is worth noting that compaction mainly shrinks the index (above all by merging the symbol tables), while the chunk data itself stays roughly the same size.

meta.json

We can inspect meta.json to see the meta information of the current Block.

{
	"ulid": "01EY0EH5JA3QCQB0PXHAPP999D",
	// maxTime - minTime = 7,200,000 ms => 2 h
	"minTime": 1611664000000,
	"maxTime": 1611671200000,
	"stats": {
		"numSamples": 1505855631,
		"numSeries": 12063563,
		"numChunks": 12063563
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01EY0EH5JA3QCQB0PXHAPP999D"
		]
	},
	"version": 1
}

The meta information is quite clear: this Block holds 2 hours of data.
Let's find an older Block and look at its meta.json:

	"ulid":"01EXTEH5JA3QCQB0PXHAPP999D",
	// maxTime - maxTime =>162h
	"minTime":1610964800000,
	"maxTime":1611548000000
	......
	"compaction":{
		"level": 5,
		"sources: [
			31 01 EX......
		]
	},
	"parents: [
		{	
			"ulid": 01EXTEH5JA3QCQB1PXHAPP999D
			...
		}
		{	
			"ulid": 01EXTEH6JA3QCQB1PXHAPP999D
			...
		}
				{	
			"ulid": 01EXTEH5JA31CQB1PXHAPP999D
			...
		}
	]

From this we can see that the Block was compacted 5 times from 31 original blocks. The ULIDs of the three Blocks that entered the final compaction round are recorded in parents.

Chunks structure

Chunk file segmentation (cut)

No chunk file on disk exceeds 512 MB; the corresponding source code is:

func (w *Writer) WriteChunks(chks ...Meta) error {
	......
	for i, chk := range chks {
		// Cut a new segment when adding this chunk would push the
		// current batch past the segment size limit (512 MB by default).
		cutNewBatch := (i != 0) && (batchSize+SegmentHeaderSize > w.segmentSize)
		......
		if cutNewBatch {
			......
		}
		......
	}
}

When a write would push a single file past 512 MB, a new segment file is automatically cut.

A chunks file contains many chunks written out of memory, laid out one after another. Locating a specific chunk works as follows: the segment file name (e.g. 000001) is encoded into the upper 32 bits of a 64-bit reference id, and the offset within that file into the lower 32 bits, so the corresponding chunk data can be fetched directly through this refId.
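
A minimal sketch of this packing, with illustrative helper names (Prometheus keeps equivalent logic in its tsdb/chunks package):

// packChunkRef packs a segment file index (e.g. 1 for file 000001)
// into the upper 32 bits and the in-file byte offset into the lower
// 32 bits of a 64-bit chunk reference.
func packChunkRef(segment, offset uint32) uint64 {
	return uint64(segment)<<32 | uint64(offset)
}

// unpackChunkRef reverses the packing, recovering the segment file
// index and the byte offset inside that file.
func unpackChunkRef(ref uint64) (segment, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}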

chunks files are accessed through mmap

Since chunk files have an essentially fixed maximum size (512 MB), we can conveniently access them through mmap, handing the actual file reads over to the operating system, which is both simple and efficient. The corresponding code is:

func NewDirReader(dir string, pool chunkenc.Pool) (*Reader, error) {
	......
	for _, fn := range files {
		// mmap each chunk segment file and keep its byte slice around.
		f, err := fileutil.OpenMmapFile(fn)
		......
		bs = append(bs, realByteSlice(f.Bytes()))
	}
	......
}
The corresponding segment data can then be obtained directly via sgmBytes := s.bs[offset].

Index structure

Now that we have covered the chunk files, we can move on to the most complex part: the index structure.

Addressing process

The purpose of the index is to let us find the data we want quickly. To make this easier to understand, the author explores Prometheus's on-disk index structure by tracing one complete data lookup. Consider querying all the data in the time range [start, end] for a series carrying three labels:

({__name__:http_requests}{job:api-server}{instance:0})

We start by selecting the Block: traverse the meta.json of every Block and pick out those whose time range overlaps the query, as sketched below.
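
A minimal sketch of that selection, assuming an illustrative blockMeta type with millisecond timestamps (not Prometheus's actual API):

type blockMeta struct {
	ULID             string
	MinTime, MaxTime int64 // milliseconds, as in meta.json
}

// overlappingBlocks returns the Blocks whose time range intersects
// the query range [start, end].
func overlappingBlocks(metas []blockMeta, start, end int64) []blockMeta {
	var selected []blockMeta
	for _, m := range metas {
		if m.MinTime <= end && m.MaxTime >= start {
			selected = append(selected, m)
		}
	}
	return selected
}
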
As mentioned earlier, finding data through labels goes through the inverted index, which is stored in the index file. So how do we locate the inverted index inside this single file? This is where the TOC (Table Of Content) comes in.

TOC(Table Of Content)

Since the index file never changes once it is written, Prometheus again uses mmap to operate on it. Reading the TOC with mmap is very easy:

func NewTOCFromByteSlice(bs ByteSlice) (*TOC, error) {
	......
	// indexTOCLen = 6*8 + 4 = 52: six 8-byte offsets plus a 4-byte CRC32,
	// always located at the very end of the index file.
	b := bs.Range(bs.Len()-indexTOCLen, bs.Len())
	......
	return &TOC{
		Symbols:           d.Be64(),
		Series:            d.Be64(),
		LabelIndices:      d.Be64(),
		LabelIndicesTable: d.Be64(),
		Postings:          d.Be64(),
		PostingsTable:     d.Be64(),
	}, nil
}

Posting offset table and posting inverted index

First, we access the postings offset table. The inverted index has one entry per LabelPair (name/value), so the postings offset table determines which postings entry to access: the offset it stores is the position of that postings entry within the index file.
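
As a mental model, the postings offset table can be treated as a map from label pair to file offset. The sketch below uses hypothetical type and function names, not Prometheus's actual ones:

// labelPair identifies one entry of the inverted index.
type labelPair struct {
	Name, Value string
}

// lookupPostings returns the index-file offset at which the postings
// list for the given label pair starts; the reader then seeks there
// and decodes the list of series references.
func lookupPostings(offsets map[labelPair]uint64, lp labelPair) (uint64, bool) {
	off, ok := offsets[lp]
	return off, ok
}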

Series

We take the intersection of the three postings inverted indexes:

{Series1, Series2, Series3, Series4}
∩
{Series1, Series2, Series3}
∩
{Series2, Series3}
=
{Series2, Series3}
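
Since each postings list stores series references in increasing order, the intersection can be computed with a linear-time merge. Below is a minimal two-list sketch under that assumption (Prometheus's real implementation intersects any number of postings iterators lazily):

// intersect returns the intersection of two sorted postings lists
// (series references in increasing order) in O(m+n) time.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // present in both lists
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}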

That is, we need to read the data of Series2 and Series3; Ref(Series2) and Ref(Series3) in the postings are the offsets of these two series entries in the index file.
A series entry records its chunk ids and the time range covered by each chunk, stored in delta form. This lets us easily filter out the chunks we need and then go to the chunk files to read the final raw data.

SymbolTable

It is worth noting that, to keep the file as small as possible, the limited set of strings such as label names and values is stored once, in sorted order, in a symbol table. Because it is sorted, the symbol table can be treated directly as a []string slice, and each string is obtained through its slice index (see the sketch below).
When the index file is read, the symbol table is loaded into memory and organized as a symbols []string slice, so that every label value in a Series can be resolved through a slice subscript.
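
A minimal sketch of this resolution, using a hypothetical six-entry symbol table and made-up symbol references:

// resolveLabel shows how a label pair stored as two symbol references
// is turned back into strings via the in-memory symbol table.
func resolveLabel() string {
	// The decoded symbol table: every label name and value that occurs
	// in the Block, stored once, in sorted order.
	symbols := []string{"0", "__name__", "api-server", "http_requests", "instance", "job"}

	// A series entry stores integer symbol references instead of the
	// strings themselves; resolving them is just slice indexing.
	nameRef, valueRef := 5, 2 // hypothetical references
	return symbols[nameRef] + "=" + symbols[valueRef] // "job=api-server"
}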

Label Index and Label Index Table

In fact, the walkthrough above already covers a complete data-addressing pass. But the index file also contains the label index and the label index table, which record all the possible values under a label. With them we can easily find the label pairs that match a regular expression.

The real Label Index is actually a bit more complex than described here: the format allows a single LabelIndex entry to cover combinations of multiple labels. In the Prometheus code, however, only the form that stores all values of one label is ever used.

Complete index file structure

The complete structure of the index file is given below; it is taken from the index.md document in the Prometheus repository.

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │               Label Index Table              │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Postings Table               │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

tombstones

Because the data in a Prometheus Block never changes once written, deleting data can only be done by recording the ranges that were deleted; the data is physically removed the next time the compactor forms a new Block. The file that records this information is tombstones.
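
As an illustration of the idea (not Prometheus's exact types), a tombstone can be modelled as deleted time intervals keyed by series reference; readers skip the covered samples until compaction physically drops them:

// interval marks a closed time range [Mint, Maxt] as deleted.
type interval struct {
	Mint, Maxt int64
}

// tombstones maps a series reference to its deleted intervals.
type tombstones map[uint64][]interval

// isDeleted reports whether sample timestamp t of series ref falls
// inside a deletion mark.
func (ts tombstones) isDeleted(ref uint64, t int64) bool {
	for _, iv := range ts[ref] {
		if iv.Mint <= t && t <= iv.Maxt {
			return true
		}
	}
	return false
}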

Summary

As a time series database, Prometheus designs a variety of file structures to store huge volumes of monitoring data while keeping performance in mind. Only by thoroughly understanding its storage structure can we make better use of it!

Welcome to follow my official account, where you will find all kinds of useful content and giveaways.
