System Hardware & 计算机体系结构

System Hardware & 计算机体系结构

CPU

The CPU has several layers of cache between it and main memory (RAM), because even accessing main memory is too slow.

The closer the cache is to the CPU, the faster it is and the smaller it is. L1 cache is small and very fast, and right next to the core that uses it. L2 is bigger and slower, and still only used by a single core. L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket. Finally you have main memory, which is shared across all cores and all sockets.

Latency from CPU to… Approx. number of CPU cycles Approx. time in nanoseconds
Main memory ~60-80ns
QPI transit (between sockets, not drawn) ~20ns
L3 cache ~40-45 cycles, ~15ns
L2 cache ~10 cycles, ~3ns
L1 cache ~3-4 cycles, ~1ns
Register 1 cycle



(1) Cache lines

Now the interesting thing to note is that it’s not individual items that get stored in the cache - i.e. it’s not a single variable, a single pointer. The cache is made up of cache lines, typically 64 bytes, and it effectively references a location in main memory. A Java long is 8 bytes, so in a single cache line you could have 8 long variables.

参考 Dissecting the Disruptor: Why it’s so fast (part two) - Magic cache line padding

线程

It is a basic Law of Computing that given a single CPU resource, executing A and B sequentially will always be faster than executing A and B “simultaneously” through time-slicing. Once the number of threads exceeds the number of CPU cores, you’re going slower by adding more threads, not faster.

Don’t be tricked into thinking, “SSDs are faster and therefore I can have more threads”. That is exactly 180 degrees backwards. Faster, no seeks, no rotational delays means less blocking and therefore fewer threads [closer to core count] will perform better than more threads. More threads only perform better when blocking creates opportunities for executing.

False Sharing

多个线程修改共享一个 Cache Line 的独立变量的时候,会造成这个 Cache Line 中的其他变量缓存失效,这被称之为 False sharing

在核心 1 上运行的线程想更新变量 X,同时核心2上的线程想要更新变量 Y。不幸的是,这两个变量在同一个缓存行中。每个线程都要去竞争缓存行的所有权来更新变量。如果核心 1 获得了所有权,缓存子系统将会使核心 2 中对应的缓存行失效。当核心 2 获得了所有权然后执行更新操作,核心 1 就要使自己对应的缓存行失效。这会来来回回的经过 L3 缓存,大大影响了性能。如果互相竞争的核心位于不同的插槽,就要额外横跨插槽 (socket interconnect) 连接,问题可能更加严重。

False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

(1) Java 6 通过填充缓存行 (padding the cache) 来避免:

1
2
3
abstract class RingBufferPad {
protected long p1, p2, p3, p4, p5, p6, p7;
}

参考 RingBuffer.java

(2) Java 7

It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing.

1
2
3
4
5
6
7
8
public static long sumPaddingToPreventOptimisation(final int index) {
PaddedAtomicLong v = longs[index];
return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
}

public static class PaddedAtomicLong extends AtomicLong {
public volatile long p1, p2, p3, p4, p5, p6 = 7L;
}

参考 False Sharing && Java 7

(3) Java 8 通过声明 sun.misc.Contended 注解加启动参数 -XX:-RestrictContended 来搞定:

1
2
3
4
5
6
7
8
9
10
public class ConcurrentHashMap<K,V> extends AbstractMap<K,V>
implements ConcurrentMap<K,V>, Serializable {

@sun.misc.Contended
static final class CounterCell {
volatile long value;
CounterCell(long x) { value = x; }
}

}

参考 Java8 中用 sun.misc.Contended 避免伪共享(false sharing)

位数

虽然 32 位使用较小的指针大小可以节约内存,但由于操作系统的寻址受限,它在缓冲区大小设置上存在固有的约束,理论上,每个进程在 32 为系统上的最大可用内存为 4GB,实际上在很多系统上这个数值很小。

磁盘

关于磁盘,你需要关注磁盘读延迟 (每次读访问需要多长时间) 和 fsync 延迟 (每个 fsync 耗时多少),大多数存储引擎是针对硬盘读写优化的,所以不要指望固态硬盘能够出现什么性能上的奇迹。


  • Hd(0,0) is the 1st partition of the 1st physical disk.
  • Hd(0,1) is the 2nd partition of the 1st physical disk.

I believe that

1
2
sda = hd0, 
sdb = hd1,

so on etc. etc. sdc = hd2 When it says sda1 that would be (hd0, 0) and sda2 (hd0, 1) and sda3 (hd0, 2) and sdb1 (hd1, 0) so on and so forth…

Hello World 的运行过程

信息的表示和处理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <stdio.h>

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, int len) {
int i;
for (i=0; i<len; i++) {
printf("%.2x", start[i]);
}
printf("\n");
}

void show_int(int x) {
show_bytes((byte_pointer)&x, sizeof(int));
}

void show_float(float x) {
show_bytes((byte_pointer)&x, sizeof(float));
}

void show_pointer(void *x) {
show_bytes((byte_pointer)&x, sizeof(void *));
}
  • %.2x 表示整数必须用至少两个数字的十六进制格式输出。

1
2
3
4
5
6
int val = 0x87654321;
byte_pointer = (byte_pointer) &val;

show_bytes(valp, 1);
show_bytes(valp, 2);
show_bytes(valp, 3);
  • 小端法: 21 大端法: 87
  • 小端法: 21 43 大端法: 87 65
  • 小端法: 21 43 65 大端法: 87 65 43
  • 小端法: 21 43 65 87 大端法: 87 65 43 21

整数和浮点数的编码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
void test_show_bytes(int val) {
int ival = val;
float fval = (float) val;
int *pval = &ival;

show_int(ival);
show_float(fval);
show_pointer(pval);
}

int main() {
test_show_bytes(12345);
return 0;
}
  • 整数 12345 的编码: 39300000
  • 浮点数 12345.0 的编码: 00e44046
1
2
0011 1001 0011 0000 0000 0000 0000 0000
0000 0000 1110 0100 0100 0000 0100 0110

你会发现整数和浮点数有一部分是交叉重复的。


位级运算一个常见用法就是实现掩码运算,表示从一个字中选出的位的集合。

  • 0xFF 表示一个字的低位字节
  • x & 0xFF 生成一个由 x 的最低有效字节组成的值,而其他字节置位 0
  • ~0xFF 表示 8 个最低位为 0,而其余的位都是 1
  • ~0 生成一个全 1 的掩码,不管机器的字是多少。尽管对于一个 32 位的机器来说,同样的掩码可以写成 0xFFFFFFFF,但是这样的代码是不可移植的。

实际中较为常用的几种模式:

  • x 的最低有效字节,其他位均置位 0: x &0xFF
  • 除了 x 的最低有效字节外,其他的位都取补,最低有效字节保持不变: x ^ ~0xFF
  • x 的最低有效字节设置成全 1,其他字节保持变: x | 0xFF

对于 0x87 ([10000111]) 来说,不同的移位操作:

  • x << 3: [00111000]
  • x >> 2 (逻辑): [00100001]
  • x >> 2 (算术): [11100001]

反汇编器是一种将可执行程序文件转换回可读性更好的 ASCII 码形式的程序。这些文件包含许多十六进制数字,都是用典型的补码形式来表示这些值。能够认识这些数字并理解它们的意义 (例如,它们是正数还是负数),是一项重要的技能:

1
8048337: 81 ec b8 01 00 00    sub    $0x1b8,%esp    440

81 ec b8 01 00 00 是一条指令的字节级表示,取出后 4 个字节 b8 01 00 00 按照相反的顺序写出来,我们得到 00 00 01 b8,这就是等价的十进制 440

1
8048343: 8b 85 58 fe ff ff    mov    0xfffffe58(%ebp),%eax   -424

58 fe ff ff 相反顺序:

1
2
3
4
5
ff ff fe 58
// 以二进制表示
11111111 11111111 11111110 01011000
// 取反 + 1,这个是正数 424,因此原值为 -424
00000000 00000000 00000001 10101000

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// 在无符 word 上进行的,算是逻辑移位
// 从参数的低 8 位中提取一个值,得到范围 0 ~ 155 之间的一个整数
// 0x00000076 -> 0x00000076
// 0x87654321 -> 0x00000021
// 0x000000C9 -> 0x000000C9
// 0xEDCBA987 -> 0x00000087
int fun1(unsigned word) {
return (int) ((word << 24) >> 24);
}

// word 强制转换为 int
// 进行算术移位
// 从这个参数的低 8 位中提取一个值,但是它还要执行符号扩展。结果将是介于 -128 ~ 127 之间的一个数
// 0x00000076 -> 0x00000076
// 0x87654321 -> 0x00000021
// 0x000000C9 -> 0xFFFFFFC9
// 0xEDCBA987 -> 0xFFFFFF87
int fun2(unsigned word) {
return ((int) word << 24) >> 24;
}

参数 length 是无符的,计算 0-1 将进行无符号运算,这等价于模数加法。结果得到 unsigned max,任何数都是小于或者等于 unsigned max 的,所以这个比较总是真!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// Bug fix (1):
// float sum_elements(float a[], int length) {}
//
// Bug fix (2):
// for (i=0; i<length; i++) {}
float sum_elements(float a[], unsigned length) {
int i;

float result = 0;
for (i=0; i<=length - 1; i++) {
result += a[i];
}

return result;
}

阅读下面这个小测试:

1
2
3
4
5
6
7
8
9
#include <stdio.h>

int main() {
unsigned length = 0;
unsigned ret = length - 1;
// 4294967295
printf("%u", ret);
return 0;
}

1
2
3
4
5
6
7
8
9
10
11
size_t strlen(const char *s);

// 当 `s` 比 `t` 短的时候,该函数会不正确的返回 1
// strlen(s) - strlen(t) 会为负
// 变成了一个很大的无符号数,且大于 0
//
// Bug Fix (1):
// return strlen(s) > strlen(t);
int strlonger(char *s, char *t) {
return strlen(s) - strlen(t) > 0;
}

size_t 在 32 位机器上典型的被定义为 unsigned int,看下面这个示例:

1
2
3
4
5
6
7
8
9
10
#include <stdio.h>
#include <string.h>

int main() {
size_t lenA = strlen("a");
size_t lenB = strlen("abcdefg");
// 18446744073709551610
printf("%lu", lenA - lenB);
return 0;
}
  • %lu for unsigned long
  • %llu for unsigned long long

虚拟内存 Virtual Memory

In computing, virtual memory (also virtual storage) is a memory management technique (内存管理技巧) that provides an “idealized abstraction of the storage resources that are actually available on a given machine” which “creates the illusion to users of a very large (main) memory.” (给用户一种我有很大内存的感觉)

Virtual memory combines active RAM and inactive memory on DASD to form a large range of contiguous addresses.

Virtual_Memory

Address translation hardware (地址转换硬件) in the CPU, often referred to as a memory management unit or MMU, automatically translates virtual addresses to physical addresses.

Interrupt 中断

In system programming, an interrupt is a signal (信号) to the processor (处理器) emitted by hardware or software (硬件或者软件发出的) indicating an event that needs immediate (立即处理) attention. The processor responds by suspending (挂起) its current activities, saving its state, and executing a function called an interrupt handler (中断回调) (or an interrupt service routine, ISR) to deal with the event. This interruption is temporary, and, after the interrupt handler finishes, the processor resumes (恢复) normal activities.

Types of interrupts:

(1) Level-triggered (电平触发,也被称为条件触发): A level-triggered interrupt is an interrupt signaled by maintaining the interrupt line at a high or low logic level. 电平触发是在高或低电平保持的时间内触发。只要满足条件,就触发一个事件(只要有数据没有被获取,内核就不断通知你)

(2) Edge-triggered (边沿触发): An edge-triggered interrupt is an interrupt signalled by a level transition on the interrupt line, either a falling edge (high to low) or a rising edge (low to high). 边沿触发是由高到低或由低到高这一瞬间触发。每当状态变化时,触发一个事件。

参考

推荐文章