# volatile memory barrier pit

#### Look at the following code and try to guess the output:

You may give up reading the following code, but if you want to fully understand volatile, you need patience. The following code is very simple!

In the following code, we define four fields x, y, a and b, which are initialized to 0
Then, we create two tasks that call Test1 and Test2 respectively, and wait for the two tasks to complete.
After completing the two tasks, we check whether a and b are still 0,
If so, print their values.
Finally, we reset everything to 0 and run the same loop again and again.

```using System;

namespace MemoryBarriers
{
class Program
{
static volatile int x, y, a, b;
static void Main()
{
while (true)
{
var t1 = Task.Run(Test1);
var t2 = Task.Run(Test2);
if (a == 0 && b == 0)
{
Console.WriteLine("{0}, {1}", a, b);
}
x = y = a = b = 0;
}
}

static void Test1()
{
x = 1;
// Interlocked.MemoryBarrierProcessWide();
a = y;
}

static void Test2()
{
y = 1;
b = x;
}
}
```

If you run the above code (preferably in Release mode), you will see many outputs with outputs of 0 and 0, as shown in the following figure. #### Let's analyze ourselves according to the code

In Test1, we set x to 1 and a to y, while Test2 set Y to 1 and b to x
Therefore, these four statements will compete in two threads
List the following possible situations:

#### 1. Test1 is executed before Test2:

```x = 1
a = y
y = 1
b = x
```

In this case, if Test1 is completed before Test2, the final value will be

```x = 1，a = 0，y = 1，b = 1
```

#### 2. execute Test1 after test2:

```y = 1
b = x
x = 1
a = y
```

In this case, the final value will be

```x = 1，a = 1，y = 1，b = 0
```

#### 2. Execute Test2 during test1 execution:

```x = 1
y = 1
b = x
a = y
```

In this case, the final value will be

```x = 1，a = 1，y = 1，b = 1
```

#### 3. Execute Test1 during test2 execution

```y = 1
x = 1
a = y
b = x
```

In this case, the final value will be

```x = 1，a = 1，y = 1，b = 1
```

#### 4. Test1 and Test2

```x = 1
y = 1
a = y
b = x
```

In this case, the final value will be

```x = 1，a = 1，y = 1，b = 1
```

#### 5.Test2 Test1

```y = 1
x = 1
b = x
a = y
```

In this case, the final value will be

```x = 1，a = 1，y = 1，b = 1
```

I think it's listed above
All possible situations have been covered,
But no matter what kind of competition happens,
It seems that once both tasks are completed,
It is impossible to make both a and b zero at the same time,
But miraculously, it has been printing 0,0 (please see the dynamic diagram above. If you doubt it, try to copy the code) #### There is always only one truth

Let's reveal the answer first: the out of order execution of cpu.

Let's take a look at the IL intermediate code of Test1 and Test2.
I added comments in the relevant sections.

```#ConsoleApp9.Program.Test1()
#function prolog ommitted
L0015: mov dword ptr [rax+8], 1   # Upload value 1 to memory address' x '
L001c: mov edx, [rax+0xc]         # Download the value from memory address' y 'and put it into EDX (register)
L001f: mov [rax+0x10], edx.       # Upload value from (edx) register to memory address' a '
L0022: add rsp, 0x28.
L0026: ret

#ConsoleApp9.Program.Test2()
#function prolog
L0015: mov dword ptr [rax+0xc], 1  # Upload value 1 to memory address' y '
L001c: mov edx, [rax+8].           # Download the value from memory address' x 'and put it into EDX (register)
L001f: mov [rax+0x14], edx.        # Upload value from (edx) register to memory address' b '
L0022: add rsp, 0x28
L0026: ret
```

To read a value from a variable and assign it to another storage location,
We must read it into the CPU register (such as edx above),
It can then be assigned to the target variable.
Because the CPU operation is very fast, the reading or writing of memory is really slow compared with the operation performed in the CPU.
Therefore, I use "Upload" and "download", which are relative to the CPU cache [the behavior of reading and writing to memory]
As slow as we upload or download to or from a remote Web service.

#### The following are the indicators (2020 data) (ns is nanosecond)

L1 cache reference: 1 ns
L2 cache reference: 4 ns
Branch mispredict: 3 ns
Mutex lock/unlock: 17 ns
Main memory reference: 100 ns
Compress 1K bytes with Zippy: 2000 ns
Send 2K bytes over commodity network: 44 ns
Read 1 MB sequentially from memory: 3000 ns
Round trip within same datacenter: 500,000 ns
Disk seek: 2,000,000 ns
Read 1 MB sequentially from disk: 825,000 ns
Read 1 MB sequentially from SSD: 49000 ns

#### It can be seen that accessing the main memory is 100 times slower than accessing the content in the CPU cache

How would you design this? You want to save time by parallelizing threads!
This is the function of CPU. The CPU is very smart designed by us,
In actual operation, it can be determined that some "Upload" and "download" operations (instructions) will not affect each other,
In order to save time, the CPU performs (optimized) parallel processing on them (instructions),
Also called out of order

I said above: in the actual operation, it can be determined that some "Upload" and "download" operations (instructions) will not affect each other,
There is a precondition here: this assumption is only based on per thread basis dependency checks.
Although a single thread can be determined as instruction independence, the CPU cannot consider the situation of multiple threads, so [volatile keyword] is provided

#### Let's go back to the above example. Although we have marked the field volatile, it doesn't feel to work. Why?

Generally speaking, volatile, I will give the following example (memory visibility)

```using System;
public class C {
bool completed;
static void Main()
{
C c = new C();
var t = new Thread (() =>
{
bool toggle = false;
while (!c.completed) toggle = !toggle;
});
t.Start();
c.completed = true;
t.Join();        // Blocks indefinitely
}
}

```

If you run the above code in release mode, it will also loop indefinitely.
The CPU is not guilty this time, but the culprit is JIT optimization.

If you put:

```bool completed;
```

Change to

```volatile bool completed;
```

There will be no dead cycle.
Let's take a look at the IL codes of [without volatile] and [with volatile]:

# No volatile

```L0000: xor eax, eax
L0002: mov rdx, [rcx+8]
L0006: movzx edx, byte ptr [rdx+8]
L000a: test edx, edx
L000c: jne short L001a
L000e: test eax, eax
L0010: sete al
L0013: movzx eax, al
L0016: test edx, edx # < -- look here
L0018: je short L000e
L001a: ret
```

```L0000: xor eax, eax
L0002: mov rdx, [rcx+8]
L0006: cmp byte ptr [rdx+8], 0
L000a: jne short L001e
L000c: mov rdx, [rcx+8]
L0010: test eax, eax
L0012: sete al
L0015: movzx eax, al
L0018: cmp byte ptr [rdx+8], 0  <-- Look here
L001c: je short L0010
L001e: ret
```

Pay attention to the line I annotated. These IL code lines above are actually where the code is checked:

```        while (!c.completed)
```

When volatile is not used, the JIT caches the completed value into the register (edx), and then uses only the value of the edx register to judge (while (!c.completed)).
However, when we use volatile, we will force JIT not to cache,
Instead, each time we need to read the value of its direct access to memory (cmp byte ptr [rdx+8], 0)

JIT caches to registers because it finds that the speed of memory access is more than 100 times slower. Just like CPU, JIT caches variables for good intentions.
Therefore, it cannot detect changes in other threads.
volatile solves the problem here, forcing the JIT not to cache.

Having said that, let's talk about another feature of volatile: memory barrier

1. Ensure that the download instruction from the volatile variable is completed before the next upload / download instruction is executed.

2. Ensure that the last upload / download instruction is completed before executing the current upload instruction to the {volatile variable.

However, volatile does not prohibit completing the download instruction of volatile variables before completing the last upload instruction.
The CPU can execute in parallel and can continue to perform any operation performed first.
Because the volatile keyword cannot be blocked, this is what happens here:

```mov dword ptr [rax+0xc], 1  # Upload value 1 to memory address' y '
mov edx, [rax+8].           # Download the value from memory address' x 'and put it into EDX (register)
```

Become this

```mov edx, [rax+8].           # Download the value from memory address' x 'and put it into EDX (register)
mov dword ptr [rax+0xc], 1  # Upload value 1 to memory address' y '
```

Therefore, since the CPU considers these instructions to be independent, it reads x before y update. Similarly, it reads y before x update in Test1 method.
That's why the pit of this example appears ~ ~!

#### How to solve it?

Input memory barrier memory barrier is a special locking instruction to the CPU, which prohibits the reordering of instructions on this barrier. Therefore, the program will run as expected, but the disadvantage is that it will be tens of nanoseconds slower.

In our example, a line of code is annotated:

```   //Interlocked.MemoryBarrierProcessWide();
```

If you uncomment this line, the program will run normally~~~~~

#### summary

In general, when we say volatile, it is easy to understand its memory visibility. It is difficult to understand the concept of memory barrier. The assignment of volatile variables in the concept of memory barrier,
Volatile does not prohibit the download instruction of volatile variables before the last upload instruction is completed. You must pay attention to this in a multithreaded environment!

Tags: volatile

Posted by tengkie on Tue, 19 Apr 2022 05:06:51 +0930