Operating System (操作系统)

What is wrong with the code `x = x + a` in the context of a concurrent system?

At the machine level, x = x + a is implemented as multiple instructions:

  • Load the value of x into a register.
  • Add a to the register value.
  • Store the register value back into x.

In a concurrent system, a race condition can occur: multiple processes/threads may read the old value simultaneously and update it independently, overwriting each other's results. To implement the code safely, we need to acquire a write lock (a mutually exclusive lock) before writing our value. Updates then happen sequentially, which avoids the race condition:

void add(x, a) {
    lockW(x);
    x = x + a;
    unlockW(x);
}

What is inode in file system?

An inode is a data structure in a UNIX-style file system, used to store a file's metadata such as block locations, owner, time of last change, etc.

In Linux, two files can have the same inode number: inode numbers are filesystem-specific and are created to index files locally. There is no mechanism for filesystems to coordinate across partitions, devices, and future-connected devices to assign globally unique inode numbers. A file is identified by first identifying the device and then an inode within that device; therefore two files on different devices or partitions can have the same inode number. Conversely, hard links are an example of multiple file names sharing a single inode on the same filesystem.

Scheduling

The scheduler is an operating system module that selects the next jobs to be admitted into the system and the next process to run.

  • Long-term scheduling: decides which jobs or processes are to be admitted (from the secondary memory) to the ready queue (in main memory).
  • Medium-term scheduling: The medium-term scheduler temporarily removes processes from main memory and places them in secondary memory.
  • Short-term scheduling: decides which of the ready, in-memory processes is to be executed (allocated a CPU) after a clock interrupt, an I/O interrupt, an operating system call or another form of signal.

Example of fork() system call

// From Advanced Programming in the UNIX Environment, Figure 8.1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int glob = 6; /* external variable, lives in initialized data */

int main(void)
{
    int   var;   /* automatic variable, lives on the stack */
    pid_t pid;

    var = 88;
    printf("before fork\n");
    if ((pid = fork()) < 0) {
        perror("fork error"); /* APUE uses its own err_sys(); perror keeps this self-contained */
        exit(1);
    } else if (pid == 0) {  /* child */
        glob++;
        var++;
    } else {                /* parent */
        sleep(2);
    }
    printf("pid = %d, glob = %d, var = %d\n", (int)getpid(), glob, var);
    exit(0);
}

The fork() system call is called once and returns twice. In the child process, when we modify the local and global variables (located on the stack and in the data section, respectively), only the child's copies are affected, giving the output:

before fork
pid = 9894, glob = 7, var = 89 // child output
pid = 9893, glob = 6, var = 88 // parent output after 2 seconds

Zombie Processes & Prevention

When a process is created in UNIX using the fork() system call, the address space of the parent process is replicated. If the parent process calls the wait() system call, its execution is suspended until the child terminates. At the termination of the child, a SIGCHLD signal is generated and delivered to the parent by the kernel. On receipt of SIGCHLD, the parent reaps the status of the child from the process table. Even though the child has terminated, an entry corresponding to the child remains in the process table, where its status is stored. When the parent collects the status, this entry is deleted, and all traces of the child process are removed from the system. If the parent decides not to wait for the child's termination and executes its subsequent tasks instead, then at the termination of the child the exit status is not read, and an entry remains in the process table even after the child has terminated. This state of the child process is known as the zombie state.

Different ways in which creation of Zombie can be prevented:-

  • Using the wait() system call: when the parent process calls wait() after creating the child, it indicates that it will wait for the child to complete and will reap the child's exit status. The parent process is suspended (waits in a waiting queue) until the child terminates. Note that during this period the parent does nothing but wait.
// A C program to demonstrate working of
// fork()/wait() and zombie processes
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>

int main()
{
    int i;
    int pid = fork();
    if (pid == 0)
    {
        for (i = 0; i < 20; i++)
            printf("I am Child\n");
    }
    else
    {
        wait(NULL);
        printf("I am Parent\n");
        while (1);
    }
}
  • By ignoring the SIGCHLD signal: when a child terminates, a corresponding SIGCHLD signal is delivered to the parent. If we call signal(SIGCHLD, SIG_IGN), the SIGCHLD signal is ignored by the system and the child's process-table entry is deleted, so no zombie is created. However, in this case the parent cannot know the exit status of the child.
// A C program to demonstrate ignoring
// SIGCHLD signal to prevent zombie processes
#include <stdio.h>
#include <unistd.h>
#include <signal.h>    /* for signal(); missing in the original listing */
#include <sys/wait.h>
#include <sys/types.h>

int main()
{
    int i;
    int pid = fork();
    if (pid == 0)
        for (i = 0; i < 20; i++)
            printf("I am Child\n");
    else
    {
        signal(SIGCHLD, SIG_IGN);
        printf("I am Parent\n");
        while (1);
    }
}
  • By using a signal handler: the parent process installs a signal handler for the SIGCHLD signal, and the handler calls the wait() system call. In this scenario, when the child terminates, SIGCHLD is delivered to the parent. On receipt of SIGCHLD, the corresponding handler is activated, which in turn calls wait(). Hence the parent collects the exit status almost immediately and the child's entry in the process table is cleared, so no zombie is created.
// A C program to demonstrate handling of
// SIGCHLD signal to prevent zombie processes.
#include <stdio.h>
#include <unistd.h>
#include <signal.h>    /* for signal(); missing in the original listing */
#include <sys/wait.h>
#include <sys/types.h>

void func(int signum)
{
    wait(NULL);
}

int main()
{
    int i;
    int pid = fork();
    if (pid == 0)
        for (i = 0; i < 20; i++)
            printf("I am Child\n");
    else
    {
        signal(SIGCHLD, func);
        printf("I am Parent\n");
        while (1);
    }
}

What happens under the hood after typing ls?

First of all, whenever we press a key on the keyboard, the keyboard controller emits an interrupt signal to the processor (CPU), indicating there is an event that needs immediate attention. As interrupts usually have high priority, the processor suspends its current execution, saves its state, and calls an interrupt handler (the one that handles keyboard interrupts). Suppose we type ‘l’: this character is written to the file that the stdout fd points to, and since the shell’s stdout usually points to the screen (a device -- in the *nix family, everything “looks” like a file), “l” is shown on the screen. After the interrupt handler finishes its job, the process resumes its original work.

We type ‘ls’ and hit enter; the shell then checks the $PATH environment variable to see if there is a program named ‘ls’ under each listed path. Suppose it finds /usr/bin/ls. The shell calls fork(), followed by execve("/usr/bin/ls"). fork() creates an identical child process and returns twice. The parent (the shell) typically calls wait() to wait for the child process to complete. The child executes execve(), and a successful execve() replaces the original contents of the child’s address space (text, data, heap, stack, etc.) with those of the new executable. Note that file descriptors opened by the parent are kept (which is why output from ls is displayed on the screen, just like the shell’s).

The child process then runs the /usr/bin/ls code; it makes system calls (open(2), printf(3), etc.) to list the directory entries in the current working directory. When the child finishes its job, it calls exit() (usually implicitly via ‘return’ in main()). All of its fds and streams are then closed. The parent process (here, the shell) is notified of the child’s termination, so wait() returns and the child’s exit code can be read from it. The parent (the shell) can then proceed, waiting for the next command to run.

What will happen if another interrupt is received while the processor is running interrupt handler code?

A: Different OSes deal with this situation differently. In Linux, the work of an interrupt handler is split into two halves, the top half and the bottom half. The top half runs with interrupts disabled and responds to the interrupt as fast as possible; the bottom half then runs with interrupts enabled for as long as it needs and can be preempted.

How does the shell implement I/O redirection if we want to feed the output of ls to another command as its input, as in ls | sort?

A: Briefly speaking, the shell calls pipe() before fork() to get two fds, rfd for the read end and wfd for the write end, then calls dup2(wfd, 1) in the ls child and dup2(rfd, 0) in the sort child.

There are two types of interrupts: hardware interrupts (caused by external devices such as the keyboard, mouse, or disk) and software interrupts (caused by programs, e.g. system calls or divide-by-zero).

Processes

If a process wishes to access another process’s resources, inter-process communication has to be used. Mechanisms include pipes, files, sockets, and other forms.

Context switch: how do we measure its time? - Cracking The Coding Interview

A context switch is the time spent switching between two processes. This happens in multitasking. The operating system must bring the state information of waiting processes into memory and save the state information of the currently running process.

How do we know a context switch happened? We don't -- not directly.

Another issue is that swapping is governed by the scheduling algorithm of the operating system, and there may be many kernel-level threads also performing context switches.

To overcome these obstacles, we construct an environment such that after P1 executes, the task scheduler immediately selects P2 to run. This may be accomplished by constructing a data channel, such as a pipe, between P1 and P2 and having the two processes play a game of ping-pong with a data token.

That is, let’s allow P1 to be the initial sender and P2 to be the receiver. Initially, P2 is blocked (sleeping) as it awaits the data token. When P1 executes, it delivers the token over the data channel to P2 and immediately attempts to read a response token. However, since P2 has not yet had a chance to run, no such token is available for P1 and the process is blocked. This relinquishes the CPU.

A context switch results and the task scheduler must select another process to run. Since P2 is now in a ready-to-run state, it is a desirable candidate to be selected by the task scheduler for execution. When P2 runs, the roles of P1 and P2 are swapped. P2 is now acting as the sender and P1 as the blocked receiver. The game ends when P2 returns the token to P1.

  • Process P2 blocks waiting for data from process P1
  • P1 records the start time
  • P1 sends a token to P2
  • P1 tries to read a response from P2; this triggers a context switch
  • P2 is scheduled and receives the token
  • P2 sends a response token to P1
  • P2 tries to read a response token from P1; this triggers a context switch
  • P1 is scheduled and receives the token
  • P1 records the end time
T = 2 * (Td + Tc + Tr)   // Td = time to deliver, Tc = context switch time, Tr = time to receive

P1 ---token---> CPU Context Switch ---token---> P2
P2 ---response---> CPU Context Switch ---response---> P1

P1 can easily compute T, since it is just the time between events 3 and 8. So, to solve for Tc, we must first determine the value of Td + Tr.

How can we do this? By measuring the length of time it takes P1 to send and receive a token to itself. This does not induce a context switch, because P1 is already running on the CPU when it sends, so the operation does not block it.

The game is played for a number of iterations to average out any variability in the elapsed time between steps 2 and 9 that may result from unexpected kernel interrupts and additional kernel threads contending for the CPU. We select the smallest observed context switch time as our final answer.

However, all we can ultimately say is that this is an approximation that depends on the underlying system. For example, we assume that P2 is scheduled to run as soon as the data token is ready. However, this depends on the implementation of the task scheduler and we cannot make any guarantees.

That’s okay; it’s important in an interview to recognize when your solution might not be perfect.

Dining philosophers: how to prevent deadlock - Cracking The Coding Interview

(1) All or nothing:

public class Chopstick {
    /* same as before */
    public boolean pickUp() {
        return lock.tryLock();
    }
}
public class Philosopher extends Thread {

    public void eat() {
        if (pickUp()) {
            chew();
            putDown();
        }
    }

    public boolean pickUp() {
        if (!left.pickUp()) {
            return false;
        }

        if (!right.pickUp()) {
            left.putDown();
            return false;
        }

        return true;
    }

}

(2) Prioritized chopsticks:

Assign each chopstick a number from 0 to N-1; every philosopher must always pick up the lower-numbered chopstick first.

public class Philosopher extends Thread {

    private Chopstick lower, higher;

    public Philosopher(Chopstick left, Chopstick right) {
        if (left.getNumber() < right.getNumber()) {
            this.lower = left;
            this.higher = right;
        } else {
            this.lower = right;
            this.higher = left;
        }
    }

    public void eat() {
        pickUp();
        chew();
        putDown();
    }

    public void pickUp() {
        lower.pickUp();
        higher.pickUp();
    }

    public void putDown() {
        higher.putDown();
        lower.putDown();
    }

}

Grant a lock only when it cannot lead to deadlock - Cracking The Coding Interview

There are several common ways to prevent deadlocks. One popular way is to require a process to declare upfront what locks it will need. We can then verify whether a deadlock would be created by issuing these locks, and fail if so.

Ensure three threads execute in order - Cracking The Coding Interview

A lock in Java is owned by the same thread that locked it, which is why semaphores are used here: a semaphore may be released by a different thread than the one that acquired it.

public class Foo {
    public Semaphore sem1, sem2;

    public Foo() {
        try {
            sem1 = new Semaphore(1);
            sem2 = new Semaphore(1);
            // ...
            sem1.acquire();
            sem2.acquire();
        } catch (...) { ... }
    }

    public void first() {
        try {
            // ...
            sem1.release();
        } catch (...) { ... }
    }

    public void second() {
        try {
            sem1.acquire();
            sem1.release();
            // ...
            sem2.release();
        } catch (...) { ... }
    }

    public void third() {
        try {
            sem2.acquire();
            sem2.release();
            // ...
        } catch (...) { ... }
    }
}

File system formats

  • Maximum file size supported by FAT32: the directory entry stores a file's size in a 32-bit field, so the largest file is 2^32 − 1 bytes (4 GiB − 1). FAT32 was implemented on 32-bit hardware for a 32-bit operating system with a 32-bit compiler, so 32 bits was the obvious choice, and anything beyond would have cost precious processor cycles, disk space, and programming time. (In those days, 4 GB was considered large for a whole hard disk, let alone a single file.)

Particularly good operating systems courses

How process memory is divided

  • Process memory is divided into four sections, as shown in Figure 3.1 below:
    • The text section comprises the compiled program code, read in from non-volatile storage when the program is launched.
    • The data section stores global and static variables, allocated and initialized prior to executing main.
    • The heap is used for dynamic memory allocation, and is managed via calls to new, delete, malloc, free, etc.
    • The stack is used for local variables. Space on the stack is reserved for local variables when they are declared (at function entrance or elsewhere, depending on the language), and the space is freed up when the variables go out of scope. Note that the stack is also used for function return values, and the exact mechanisms of stack management may be language-specific.
    • Note that the stack and the heap start at opposite ends of the process’s free space and grow towards each other. If they should ever meet, then either a stack overflow error will occur, or else a call to new or malloc will fail due to insufficient memory available.

Process termination

child code:

int exitCode;
exit( exitCode ); // return exitCode; has the same effect when executed from main( )

parent code:

pid_t pid;
int status;
pid = wait( &status );
// pid indicates which child exited; exitCode is in the low-order bits of status
// macros can test the high-order bits of status for why it stopped

Processes may also be terminated by the system for a variety of reasons, including:

  • The inability of the system to deliver necessary system resources.
  • In response to a KILL command, or other unhandled process interrupt.
  • A parent may kill its children if the task assigned to them is no longer needed.
  • If the parent exits, the system may or may not allow the child to continue without a parent. (On UNIX systems, orphaned processes are generally inherited by init, which then reaps them when they terminate. The UNIX nohup command allows a child to continue executing after its parent has exited.)

Process coordination

Communications models:

  • Message passing
  • Shared memory

  • Shared Memory is faster once it is set up, because no system calls are required and access occurs at normal memory speeds. However it is more complicated to set up, and doesn’t work as well across multiple computers. Shared memory is generally preferable when large amounts of information must be shared quickly on the same computer.
  • Message Passing requires system calls for every message transfer, and is therefore slower, but it is simpler to set up and works well across multiple computers. Message passing is generally preferable when the amount and/or frequency of data transfers is small, or when multiple computers are involved.

Shared memory producer-consumer

Which of the following methods is most efficient for reading multiple sequential data blocks from disk?

  • Programmed (polling) I/O: direct program-controlled access and busy-wait polling are essentially the same idea, and the oldest method. The CPU and the I/O device run serially: for every byte (or word) read, the CPU repeatedly checks the busy flag of the status register. busy = 1 means the I/O has not finished; busy = 0 means it has completed, finishing the read of one word before starting the next.
  • Interrupt-driven I/O: an improvement over polling; the I/O device and the CPU can work in parallel, with the CPU needed only when an I/O starts and when it completes. However, only one word can be transferred at a time.
  • DMA (Direct Memory Access): improves on interrupts in that a whole block can be transferred at a time, rather than one word.
  • Channel I/O: improves on DMA in that multiple blocks can be handled at a time, not just one.

Endianness


Context Switching

1. What is a context switch?

2. What triggers a context switch?

Scenario 1 (when a process goes from RUNNING to BLOCKED):

int main() {
    char str[10];
    // At this line the process must wait for user input,
    // so its state goes RUNNING -> BLOCKED.
    // The OS triggers a context switch and schedules
    // another process onto the CPU.
    scanf("%s", str);
}

Scenario 2 (when a process exits):

int main() {
    int a = 4;
    printf("%d\n", a);
    // RUNNING -> ZOMBIE
    exit(0);
}

Scenario 3 (RUNNING -> READY):

A (hardware) interrupt occurs. A very common one is the timer interrupt, which is delivered to the CPU periodically (every 10 ms ~ 200 ms).

For each process, the kernel stores some metadata; essentially there are three pieces:

  • process control block
  • the kernel stack
  • the page tables corresponding to that process

When a context switch happens:

  • The entire context is stored on the kernel stack, in a structure known as the trap frame.
  • The CPU then switches to the scheduler’s context.

3. Context Switching Overhead

Memory page size

A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory, described by a single entry in the page table. It is the smallest unit of data for memory management in a virtual memory operating system. Similarly, a page frame is the smallest fixed-length contiguous block of physical memory into which memory pages are mapped by the operating system.

Detecting the page size on Linux:

#include <stdio.h>
#include <unistd.h> /* sysconf(3) */

int main(void) {
    // The page size for this system is 4096 bytes.
    printf("The page size for this system is %ld bytes.\n",
           sysconf(_SC_PAGESIZE)); /* _SC_PAGE_SIZE is OK too. */

    return 0;
}

In many Unix systems, the command line utility getconf can be used. For example, getconf PAGESIZE will return the page size in bytes.

Nowadays most machines use a 4 KB page size.


In the Linux implementation, the file cache is split into two levels: the Page Cache and the Buffer Cache, where each Page Cache page contains several Buffer Cache blocks. The relationship between the Page Cache, the Buffer Cache, files, and the disk is shown in the figure:

file_page_buffer_block_disk

The Linux kernel's file readahead algorithm works as follows. For the first read request on a file, the system reads the requested page plus a few pages immediately following it (at least one, usually three); this is called synchronous readahead. For the second read request, if the requested page is not in the cache (i.e. not in the previously read-ahead group), the file access is not sequential, and the system continues with synchronous readahead. If the requested page is in the cache, the previous readahead was a hit: the operating system doubles the readahead group and has the underlying file system read in the blocks of the group that are not yet cached; this is called asynchronous readahead. Whether or not the second read request hits, the system updates the current readahead group size. In addition, the system defines a window consisting of the previous readahead group and the current one. Any subsequent read request falls into one of two cases: if the requested page is inside the readahead window, asynchronous readahead continues and the window and group are updated accordingly; if it is outside the window, the system performs synchronous readahead and resets the window and group. Figure 5 is a diagram of the kernel readahead mechanism, where (a) is the state before some read, (b) is the case where the requested page is outside the window, and (c) is the case where it is inside the window.

The Linux kernel provides many APIs related to file Cache operations. By usage they fall into two classes: one class operates by copying, such as read/write/sendfile (sendfile is no longer supported in the 2.6 kernel series); the other operates by address mapping, such as mmap.

The first class of APIs copies data between the Caches of different files, or between the Cache and a user-space buffer provided by the application. Its implementation principle is shown in the figure:

read_write_sendfile

The second class of APIs maps the Cache into user space, so the application can access the file as if through an ordinary memory pointer. Memory-mapped access to the Cache is implemented in the kernel with a demand-paging mechanism, whose workflow is shown in the figure:

memory_map

First, the application calls mmap (step 1 in the figure), traps into the kernel, and the kernel calls do_mmap_pgoff (step 2). That function allocates a region of the application's address space as the mapped memory, represents the region with a VMA (vm_area_struct) structure, and then returns to the application (step 3). When the application accesses the address pointer returned by mmap (step 4), the virtual-to-physical mapping has not yet been established, so a page fault is triggered (step 5). The system then calls the page-fault handler (step 6); inside it, the kernel determines from the region's VMA structure that the region is a file mapping, so it calls the specific file system's interface to read in the corresponding Page Cache entries (steps 7, 8, 9) and fills in the corresponding mapping tables. After these steps, the application can access the memory region normally.


On Linux, the Page Cache speeds up access to files on disk. When data is first read from or written to a storage medium such as a hard disk, Linux stores the data in otherwise idle memory regions, which act as a cache; if the data is read again later, it can be served quickly from this cache:

On Linux, the amount of memory used by the Page Cache appears in the Cached column of free -m, for example:

page_cache_linux_memory

Writing data

When data is written, it first goes into the Page Cache and is managed there as a dirty page. “Dirty” means the data is stored in the Page Cache but still has to be written to the underlying storage device; the contents of dirty pages are periodically flushed to the underlying device (and on demand via the sync and fsync system calls).

The following example creates a 10 MB file. It is first written into the Page Cache, and the number of dirty pages grows until the data is written to disk; in this case we use the sync command:

dirty_page_linux_sync

Reading data

File operations are not only writes; reads benefit too. For example, when you read the same 40 MB file twice, the second read is faster, because the file's blocks come directly from the Page Cache in memory instead of being read from disk.



Reading a file:

  1. The user issues a read.
  2. The operating system looks up the page cache:
    a. On a miss, a page fault occurs; the OS creates a page-cache entry and fills it from the corresponding page on disk.
    b. On a hit, the requested content is returned directly from the page cache.
  3. The user's read call completes.

Writing a file:

  1. The user issues a write.
  2. The operating system looks up the page cache:
    a. On a miss, a page fault occurs; the OS creates a page-cache entry and writes the user's data into it.
    b. On a hit, the user's data is written directly into the page cache.
  3. The user's write call completes.
  4. A modified page becomes a dirty page; the OS has two mechanisms for writing dirty pages back to disk:
  5. The user calls fsync() manually.
  6. The pdflush process periodically writes dirty pages back to disk.

Besides reducing read, write, and other system calls, mmap also reduces the number of memory copies. In a read call, the complete flow is: the operating system reads the disk file into the page cache, then copies the data from the page cache into the buffer passed to read. With mmap, the operating system only needs to read the disk into the page cache; the user can then operate on the mapped memory directly through a pointer, eliminating the data copy from kernel space to user space.

mmap suits cases where the same region is read and written frequently. For example, if a 64 MB file stores index information that we need to modify frequently and persist to disk, we can map the file into user virtual memory via mmap and modify the region through pointers; the operating system automatically flushes the modified parts back to disk, or we can call fsync ourselves to flush manually.


Recommended reading