Index Optimization Practice for Tens of Millions of MongoDB Documents

Xiao Li is the back-end lead at his company. One afternoon, customer-service reports suddenly poured in: users could not use the app, and many operations and page loads were timing out on network waits.

Xiao Li immediately began investigating. Before long he found the cause: a large number of slow queries in the database were overloading the server and driving CPU usage high. Why had this happened? Flustered at first, he dug into the slow queries and discovered that business data had grown rapidly, and the collection being queried had no index covering the query. The corners of his mouth rose; confident, he added the missing indexes to the relevant collections. But the situation did not improve, and the service was still down. Drawing on experience, the team could only fall back on a downgrade plan (switching off the query features that hit this collection) to restore service temporarily.

But the matter was not over: the problem had not been fundamentally solved, and both the company and Xiao Li himself cared about fixing it properly. Over dinner that evening, Xiao Li remembered an industry veteran he knew, Lao Bai. After hearing about the problem, Lao Bai quickly and professionally pointed out which operations had gone wrong and how to solve the problem correctly: first learn to analyze queries with explain, then understand the ESR best-practice rule (explained below). Xiao Li did not feel discouraged by his shortcomings; instead, they filled him with a thirst for knowledge.

So what does good database indexing practice look like?

MongoDB index type

Single-field index

db.user.createIndex({createdAt: 1}) 

This creates a single-field index on createdAt, which can quickly serve all kinds of queries on that field. {createdAt: 1} is an ascending index; {createdAt: -1} is descending. For a single-field index, ascending and descending are equally effective, because MongoDB can traverse the index in either direction.
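As a small sketch of that last point (mongosh, assuming a user collection), either sort direction can use the same single-field index:

```javascript
db.user.createIndex({createdAt: 1})

// Both sorts can use the index above:
// a single-field index is simply walked forward or backward
db.user.find().sort({createdAt: 1})
db.user.find().sort({createdAt: -1})
```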

Compound index

db.user.createIndex({age: 1, createdAt: 1}) 

Multiple fields can be indexed together: entries are sorted first by the first field; documents with equal first-field values are sorted by the second field, and so on. Because of this, matching the index order in your query conditions and sorts matters a great deal.

In real scenarios this is the most commonly used index type. MongoDB can use a compound index for any query that matches a prefix of the index. For example, given db.user.createIndex({age: 1, createdAt: 1}), there is no need to create db.user.createIndex({age: 1}) separately: a query on age alone can still hit {age: 1, createdAt: 1}. A query on createdAt alone, however, cannot use {age: 1, createdAt: 1}, because createdAt is not a prefix of that index.
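The prefix rule can be sketched like this (mongosh; the collection and values are illustrative):

```javascript
// Compound index: {age: 1, createdAt: 1}
db.user.createIndex({age: 1, createdAt: 1})

// Can use the index: {age} is a prefix of {age, createdAt}
db.user.find({age: 18})

// Can use the index fully: both fields are constrained
db.user.find({age: 18, createdAt: {$gte: ISODate("2022-01-01")}})

// Cannot use the index: createdAt alone is not a prefix,
// so explain() on this query shows COLLSCAN instead of IXSCAN
db.user.find({createdAt: {$gte: ISODate("2022-01-01")}})
```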

Multikey index

When an indexed field is an array, the resulting index is called a multikey index: MongoDB creates an index entry for every element of the array.

// User's social login information,
schema = {
    ...
    snsPlatforms:[{
        platform:String, // Login platform
        openId:String, // Login unique identifier
    }]
}
// This is also the column-to-row document design, described later
db.user.createIndex({"snsPlatforms.openId": 1}) 

TTL index

A TTL index lets you declare an expiry on a date field: documents whose time has passed are deleted automatically, which suits data that is only valid for a period of time.
The deletion is safe: it is performed by a background task that runs periodically rather than inline with your writes, so expiring a large number of documents does not cause a sudden IO spike that would seriously hurt query performance.
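The text above has no accompanying snippet; a minimal TTL sketch (mongosh; the collection name and 7-day window are illustrative) looks like this:

```javascript
// TTL index: documents are removed once createdAt is older than expireAfterSeconds
db.loginRecord.createIndex(
    {createdAt: 1},
    {expireAfterSeconds: 7 * 24 * 3600}  // keep records for 7 days
)

// Note: the TTL field must hold a BSON date value, e.g.
db.loginRecord.insertOne({userId: 1, createdAt: new Date()})
```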

Partial index

This feature has been available since version 3.2: it indexes only the documents that meet a filter condition, in order to save index storage space and write cost.

db.user.createIndex({"sns.qq.openId": 1})
/**
 * Only a small fraction of users in the system actually sign in with QQ,
 * so only those documents have this field. There is no need to index every
 * document; a partial index covers just the documents that match the filter.
 */
db.user.createIndex(
    {"sns.qq.openId": 1},
    {partialFilterExpression: {"sns.qq.openId": {$exists: true}}}
)

Sparse index

A sparse index only contains entries for documents that have the indexed field, even if the field's value is null.
The index skips all documents that are missing the indexed field.

db.user.createIndex({"sns.qq.openId": 1}, {sparse: true})

Note: since version 3.2, partial indexes have been available as a superset of sparse indexes; the official documentation recommends preferring partial indexes over sparse indexes.

ESR index rules

Index field order: Equality > Sort > Range

Fields matched exactly (Equality) come first, sort fields go in the middle, and range-matched fields go last. The same rule applies to the degenerate cases ES (equality + sort) and ER (equality + range).

A practical example: from a grade sheet, fetch the students in a Senior 2 class whose math score is greater than 120, sorted by score from high to low.
It is easy to see that class and subject (math) are exact matches, while score is both a range condition and the sort key.
Following the ESR rule, we can build the index like this:
{"class": 1, "subject": 1, "score": 1}
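A sketch of the index and its matching query (mongosh; the collection and field values are illustrative) might be:

```javascript
// ESR: equality (class, subject) first, then score,
// which serves both the sort and the range condition
db.score.createIndex({class: 1, subject: 1, score: 1})

// Equality on class/subject, range + sort on score;
// the descending sort can still use the index by walking it backward
db.score
  .find({class: "Senior 2 - Class 1", subject: "math", score: {$gt: 120}})
  .sort({score: -1})
```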

How do we analyze the hit and validity of this index?

The db.collection.explain() function outputs the execution plan of a find and helps us make better choices.
It returns a lot of data, but we mainly need to focus on this section:
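A typical invocation that produces output like the excerpt below (mongosh):

```javascript
// "executionStats" runs the winning plan and collects the statistics
db.user.find({age: 13}).explain("executionStats")

// or, equivalently
db.user.explain("executionStats").find({age: 13})
```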

executionStats execution statistics

{
    "queryPlanner": {
        "plannerVersion": 1,
        "namespace": "test.user",
        "indexFilterSet": false,
        "parsedQuery": {
            "age": {
                "$eq": 13
            }
        },
        "winningPlan": { ... },
        "rejectedPlans": []
    },
    "executionStats": {
        "executionSuccess": true,
        "nReturned": 100,
        "executionTimeMillis": 137,
        "totalKeysExamined": 48918,
        "totalDocsExamined": 48918,
        "allPlansExecution": []
    },
    "ok": 1
}

nReturned: the number of documents actually returned

executionTimeMillis: total execution time of the command, in milliseconds

totalKeysExamined: the number of index entries MongoDB scanned. When the number of keys examined equals the number of documents returned, mongod only had to walk matching index keys to produce the result: it did not scan the whole collection, and only the N matching documents were pulled into memory. A query like this is very efficient.

totalDocsExamined: the number of documents scanned

The closer these values are to each other, the better the efficiency. The ideal state is
nReturned = totalKeysExamined = totalDocsExamined
If they differ widely, there is still a lot of room for optimization; the specifics have to be analyzed case by case against the business. The optimal execution plan chosen by the query optimizer is detailed in queryPlanner.winningPlan.

stage

COLLSCAN: full collection scan; this is the worst case
IXSCAN: index scan
FETCH: fetch the documents pointed to by the index entries
SHARD_MERGE: the data returned by each shard is merged
SORT: the sort is performed in memory
LIMIT: limit is used to cap the number of results returned
SKIP: skip is used to skip documents
IDHACK: query against _id
SHARDING_FILTER: shard data is queried through mongos
COUNT: a count operation, e.g. db.coll.explain().count()
COUNTSCAN: stage returned when count does not use an index
COUNT_SCAN: stage returned when count uses an index
SUBPLAN: stage returned for an $or query that does not use an index
TEXT: stage returned when querying with a full-text index
PROJECTION: stage returned when the returned fields are limited

What we do not want to see (watch out for the following; they signal problems):

COLLSCAN (full collection scan)
SORT without a supporting index (an in-memory sort)
An oversized SKIP
SUBPLAN: an $or query that fails to hit an index
COUNTSCAN: a count that fails to hit an index

Then let's look at the actual execution order of a common query

db.user.find({age:13}).skip(100).limit(100).sort({createdAt:-1})

In its execution plan, the innermost stage runs first: IXSCAN (the index scan) comes first, and SKIP (filtering out the skipped documents) comes last.

Each stage in executionStats carries its own nReturned and executionTimeMillisEstimate, so we can read the whole query execution from the inside out and see exactly at which step it is slow.
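The figure from the original post is not reproduced here; as an illustrative (not literal) sketch, the winning plan for such a query nests roughly like this, with the innermost stage executing first:

```javascript
// hand-trimmed shape, assuming an index on age only
{
    stage: "SKIP",              // applied last
    inputStage: {
        stage: "SORT",          // in-memory sort if no index covers it
        inputStage: {
            stage: "FETCH",     // fetch matching documents
            inputStage: {
                stage: "IXSCAN",  // runs first
                keyPattern: {age: 1}
            }
        }
    }
}
```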

About column to row document design pattern

First of all, more indexes are not always better. MongoDB caps the number of indexes on a single collection at 64, and some well-known large companies recommend keeping it under 10.

In a main collection, a redundant document design leaves a lot of information that needs indexing. Let's take social login as an example.

Conventional design

schema = {
    ...
    qq: {
        openId: String
    },
    wxapp: {
        openId: String
    },
    weibo: {
        openId: String
    }
    ...
}

// Each time a new login type is added, the document schema must be modified and an index added
db.user.createIndex({"qq.openId": 1})
db.user.createIndex({"wxapp.openId": 1})
db.user.createIndex({"weibo.openId": 1})

Column to row design

schema = {
...
 snsPlatforms:[{
    platform:String, // Login platform
    openId:String, // Login unique identifier
 }]
}
// Now, whether login platforms are added or removed, the index design never changes; one index solves every problem of this type
db.user.createIndex({"snsPlatforms.openId": 1, "snsPlatforms.platform": 1})

Question: why is openId placed before platform?
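Assuming the snsPlatforms design above, a login lookup could be sketched as follows ($elemMatch ensures both conditions apply to the same array element):

```javascript
// Find the user who logged in with this QQ openId;
// the value "xxxx" is a placeholder
db.user.findOne({
    snsPlatforms: {$elemMatch: {openId: "xxxx", platform: "qq"}}
})
```

As for the question above, one likely reason: openId is nearly unique per user, while platform has only a handful of distinct values, so putting the far more selective field first narrows the index scan much faster.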

This little story shows how Xiao Li thought about and handled a problem beyond his own knowledge. Everyone meets problems beyond their ability; in such cases, the priority is to restore service, or at least to narrow the blast radius of the incident, before pursuing the root cause.

Posted by Shaun13 on Sun, 17 Apr 2022 16:25:49 +0930