Operating Systems

The more you know about the operating system, the better you can cooperate with it.

Operating System Overview

Role

Role 1

The Operating System is an Abstract Machine

  • Extends the basic hardware with added functionality
  • Provides high-level abstractions
    • More programmer friendly
    • Common core for all applications
  • It hides the details of the hardware
    • Makes application code portable

Role 2

The Operating System is a Resource Manager

  • Responsible for allocating resources to users and processes
  • Must ensure
    • No Starvation
    • Progress
    • Allocation is according to some desired policy
      • First-come, first-served; Fair share; Weighted fair share; limits (quotas), etc…
    • Overall, that the system is efficiently used

Structural (Implementation) View

The Operating System is the software that runs in privileged mode.

Operating System Kernel

  • Portion of the operating system that is running in privileged mode
  • Usually resident (stays) in main memory
  • Contains fundamental functionality
    • Whatever is required to implement other services
    • Whatever is required to provide security
  • Contains most-frequently used functions
  • Also called the nucleus or supervisor

The Operating System is Privileged

Applications should not be able to interfere with or bypass the operating system

OS can enforce the “extended machine”

OS can enforce its resource allocation policies

Prevent applications from interfering with each other

The Structure of a Computer System

  • Applications interact with each other and with system libraries via function calls to library procedures
  • Applications and the OS interact via system calls

A Note on System Libraries

System libraries are just that, libraries of support functions (procedures, subroutines)

Only a subset of library functions are actually system calls, and system call functions are in the library for convenience

Privilege-less OS

Can implement OS functionality, but cannot enforce it.

Some Embedded OSs have no privileged component

  • All software runs together
  • No isolation
  • One fault potentially brings down entire system

Operating System Software

  • Fundamentally, the OS functions in the same way as ordinary computer software.

    • It is machine code that is executed (same machine instructions as application).
    • It has more privileges (extra instructions and access).
  • The operating system relinquishes control of the processor to execute other programs; it re-establishes control after

    • System calls
    • and Interrupts (especially timer interrupts).

Operating System Internal Structure

The monolithic operating system structure is also called the “spaghetti nest” approach (everything is tangled up with everything else).

Processes and Threads

Major Requirements of an Operating System

  • Interleave the execution of several processes to maximize processor utilization while providing reasonable response time
  • Allocate resources to processes
  • Support interprocess communication and user creation of processes

Processes and Threads

Processes

  • Also called a task or job
  • Execution of an individual program
  • “Owner” of resources allocated for program execution
    • the OS tracks resource usage and reclaims resources (e.g. memory) when execution finishes
  • Encompasses one or more threads

Threads

The sequence of execution through an application

  • Unit of execution
  • Can be traced
    • list the sequence of instructions that execute
  • Belongs to a process
    • The process provides the environment, and all threads running in the process share that environment
    • Executes within it

Process

The Process Model

Single process machine
the process model

  • Multiprogramming of four programs
  • Conceptual model of 4 independent, sequential processes(with a single thread each)
  • Only one program active at any instant

threads and processes

the boxes represent processes, the lines represent threads, and the dotted lines represent what the OS can choose to do

Process and thread models of selected OSes

  • Single process, single thread
    • MSDOS
  • Single process, multiple threads
    • OS/161 as distributed
  • Multiple processes, single thread
    • Traditional UNIX
  • Multiple processes, multiple threads
    • Modern Unix(Linux, Solaris), Windows

Process Creation

Principal events that cause process creation

  • System initialization
    • Foreground processes(interactive programs)
    • Background processes
      • Email server, web server, print server, etc.
      • Called a daemon(unix) or service(Windows)
  • Execution of a process creation system call by running a process
    • New login shell for an incoming ssh connection
  • User request to create a new process
  • Initiation of a batch job

Note: Technically, all these cases use the same system mechanism to create new processes

Process Termination

Conditions which terminate processes

  • Normal exit(voluntary)
  • Error exit(voluntary)
  • Fatal error(involuntary)
    • e.g. segmentation fault
  • Killed by another process(involuntary)

Implementation of Processes

  • A process's information is stored in a process control block (PCB)
  • The PCBs form a process table
    • Reality can be complex(hashing, chaining, allocation bitmaps,…)
    • Size can be dynamic

Example fields of a process table entry

Process management: Registers, Program counter, Program status word, Stack pointer, Process state, Priority, Scheduling parameters, Process ID, Parent process, Process group, Signals, Time when process started, CPU time used, Children's CPU time, Time of next alarm

Memory management: Pointer to text segment, Pointer to data segment, Pointer to stack segment

File management: Root directory, Working directory, File descriptors, User ID, Group ID

Process/Thread States

Three-state process model (the minimal model that OSes implement)

In the image the numbers represent:

  1. Process blocks for input
  2. Scheduler picks another process
  3. Scheduler picks this process
  4. Input becomes available
  • Possible process/thread states
    • running
    • blocked
    • ready

Generally, if there is no I/O, almost all processes sit in the Running and Ready states

  • Transitions between states shown

Some Transition-Causing Events

  • Running -> Ready
    • Voluntary Yield()
    • End of timeslice
  • Running -> Blocked
    • Waiting for input
      • File, network
    • Waiting for a timer(alarm signal)
    • Waiting for a resource to become available

Scheduler

Responsible for deciding what to run next

  • Sometimes also called the dispatcher
    • The literature is also a little inconsistent with the terminology
  • Has to choose a Ready process to run
    • It is inefficient to search through all processes

The Ready Queue
ready queue

What about blocked processes

When an unblocking event occurs, we also wish to avoid scanning all processes to select one to make Ready

using two queues(One option)
blocked queue

another option(using a queue for every event)
blocked queues

Implementation

  • A process minimally consists of three memory segments
    • Text
      • contains the code (instructions)
    • Data
      • Global variables
    • Stack
      • Activation records of procedure/function/method calls
      • Local variables
  • User-mode
    • Processes (programs) scheduled by the kernel
    • Isolated from each other
    • No concurrency issues between each other
  • System-calls transition into and return from the kernel
  • Kernel-mode
    • Nearly all activities still associated with a process
    • Kernel memory shared between all processes
    • Concurrency issues exist between processes concurrently executing in a system call

Thread

The Thread Model

thread model

(a) Three processes each with one thread
(b) One process with three threads

Separating execution from the environment

Per-process items (shared by all threads in a process): Address space, Global variables, Open files, Child processes, Pending alarms, Signals and signal handlers, Accounting information

Per-thread items (private to each thread): Program counter, Registers, Stack, State

  • The Thread Model

    • State implicitly stored on the stack
    • the state model
    • Local variables are per thread
      • Allocated on the stack
      • Don’t have concurrency issues
    • Global variables are shared between all threads
      • Allocated in the data section
      • Concurrency control is an issue
    • Dynamically allocated memory (malloc) can be global or local
      • Program defined (the pointer can be global or local)
      • If the pointer is a global variable there are concurrency issues, while local pointers don't have them

Thread Usage

Example

// dispatcher thread
while (TRUE) {
    get_next_request(&buf);
    handoff_work(&buf);
}

// worker thread - can overlap disk I/O with the execution of other threads
while (TRUE) {
    wait_for_work(&buf);
    look_for_page_in_cache(&buf, &page);
    if (page_not_in_cache(&page))
        read_page_from_disk(&buf, &page);
    return_page(&page);
}

Three models compared:

  • Threads: parallelism, blocking system calls
  • Single-threaded process: no parallelism, blocking system calls
  • Finite-state machine: parallelism, nonblocking system calls, interrupts

Why Threads?

  • Simpler to program than a state machine
  • Fewer resources are associated with a thread than with a complete process
    • Cheaper to create and destroy
    • Resources (especially memory) are shared between them
  • Performance: Threads waiting for I/O can be overlapped with computing threads
    • Note if all threads are compute bound, then there is no performance improvement(on a uniprocessor)
  • Threads can take advantage of the parallelism available on machines with more than one CPU(multiprocessor)

User-level Threads

  • Implementation at user-level
    • User-level Thread Control Block (TCB), ready queue, blocked queue, and dispatcher
    • Kernel has no knowledge of the threads (it only sees a single process)
    • If a thread blocks waiting for a resource held by another thread inside the same process, its state is saved and the dispatcher switches to another ready thread
    • Thread management (create, exit, yield, wait) are implemented in a runtime support library
  • Pros
    • Thread management and switching at user level is much faster than doing it in kernel level
      • No need to trap (take syscall exception) into kernel and back to switch
    • Dispatcher algorithm can be tuned to the application
    • Can be implemented on any OS (thread or non-thread aware)
    • Can easily support massive numbers of threads on a per-application basis
      • Use normal application virtual memory
      • Kernel memory more constrained. Difficult to efficiently support wildly differing numbers of threads for different applications.
  • Cons
    • Threads have to yield() manually (no timer interrupt delivery to user level)
      • Co-operative multithreading
        • A single poorly designed/implemented thread can monopolise the available CPU time
      • There are work-arounds (e.g. a timer signal per second to enable preemptive multithreading), but they are coarse-grained and a kludge.
    • Does not take advantage of multiple CPUs (in reality, we still have a single threaded process as far as the kernel is concerned)
    • If a thread makes a blocking system call (or takes a page fault), the process (and all the internal threads) blocks
      • Can’t overlap I/O with computation

Kernel-level threads

  • Cons

    • Thread creation and destruction, and blocking and unblocking threads requires kernel entry and exit.
      • More expensive than user-level equivalent
  • Pros

    • Preemptive multithreading
    • Parallelism
      • Can overlap blocking I/O with computation
      • Can take advantage of a multiprocessor

Context Switch

A context switch can refer to: a switch between threads, involving saving and restoring the state associated with a thread; or a switch between processes, involving the above plus the extra state associated with a process.

Context Switch Occurrence

A switch between process/threads can happen any time the OS is invoked

  • On a system call
    • Mandatory if system call blocks or on exit()
  • On an exception
    • Mandatory if offender is killed
  • On an interrupt
    • Triggering a dispatch is the main purpose of the timer interrupt

A thread switch can happen between any two instructions

Note instructions do not equal program statements

Context Switch

  • Context switch must be transparent for processes/threads
    • When dispatched again, process/thread should not notice that something else was running in the meantime (except for elapsed time)
    • OS must save all state that affects the thread
  • This state is called the process/thread context
  • Switching between process/threads consequently results in a context switch.

Concurrency and Synchronisation

Concurrency Example

// count is a global variable shared between two threads
int count = 0;

void increment(void)
{
    int t;
    t = count;
    t = t + 1;
    count = t;
}

void decrement(void)
{
    int t;
    t = count;
    t = t - 1;
    count = t;
}

This code has a race condition, which occurs when global variables are shared between threads.

There is in-kernel concurrency even for single-threaded processes:
the OS has to deal with concurrency even if the application itself is single-threaded.

Critical Region

  • We can control access to the shared resource by controlling access to the code that accesses the resource

A critical region is a region of code where shared resources are accessed(e.g. Variables, memory, files, etc…)

  • Uncoordinated entry to the critical region results in a race condition
    • => Incorrect behavior, deadlock, lost work,…

Identifying critical regions

  • Critical regions are regions of code that:
    • Access a shared resource,
    • and whose correctness relies on the shared resource not being concurrently modified by another thread/process/entity.
// count is a global variable shared between two threads
int count = 0;

void increment(void)
{
    int t;
    t = count;  // start of critical region
    t = t + 1;
    count = t;  // end of critical region
}

void decrement(void)
{
    int t;
    t = count;  // start of critical region
    t = t - 1;
    count = t;  // end of critical region
}

If we can make the critical regions never overlap (in time), we won't have concurrency problems.
mutual exclusion using critical regions

Example critical regions

struct node {
    int data;
    struct node *next;
};
struct node *head;

void init(void)
{
    head = NULL;
}

void insert(struct node *item)
{
    item->next = head;
    head = item;
}

struct node *remove(void)
{
    struct node *t;
    t = head;
    if (t != NULL) {
        head = head->next;
    }
    return t;
}
  • Race example
// thread 1
void insert(struct node *item)
{
    item->next = head; // 1
    head = item;       // 2
}

// thread 2
void insert(struct node *item)
{
    item->next = head; // 3
    head = item;       // 4
}

One possible sequence is 1 -> 3 -> 2 -> 4; the item inserted by thread 1 is then lost from the list (which can later cause a memory leak or a segfault).
Both 1-2 and 3-4 are critical regions.

Critical regions solutions

  • We seek a solution to coordinate access to critical regions
    • Also called critical sections
  • Conditions required of any solution to the critical region problem
    • Mutual Exclusion: No two processes simultaneously in critical region
    • No assumptions made about speeds or numbers of CPUs
    • Progress
      • No process running outside its critical region may block another process
    • Bounded waiting
      • No process waits forever to enter its critical region

A solution?

  • A lock variable
    • if lock == 1, somebody is in the critical section and we must wait
    • if lock == 0, nobody is in the critical section and we are free to enter
// A problematic execution sequence

// thread A
while (TRUE) {
    while (lock == 1); // 1
    lock = 1;          // 2
    critical();        // 3
    lock = 0;          // 4
    non_critical();    // 5
}

// thread B
while (TRUE) {
    while (lock == 1); // 6
    lock = 1;          // 7
    critical();        // 8
    lock = 0;          // 9
    non_critical();    // 10
}
/* does not work: if execution interleaves as
 * 6 -> 1 -> 2 -> 3
 * both threads see that the lock is 0 and may both enter their critical sections
 */

In the example above, there is a race condition between observing that the lock variable is 0 and setting it to 1 to prevent somebody else from entering.

Mutual Exclusion by Taking Turns

// process 0
while (TRUE) {
    while (turn != 0);
    critical_region();
    turn = 1;
    noncritical_region();
}

// process 1
while (TRUE) {
    while (turn != 1);
    critical_region();
    turn = 0;
    noncritical_region();
}

This example works because the turn variable is only updated by the thread whose turn it currently is.

  • Works due to strict alternation
    • Each process takes turns
  • Cons
    • Busy waiting(loop)
      • possible fix: when it is not my turn, do something else and come back when it is my turn
    • Process must wait its turn even while the other process is doing something else.
      • With many processes, must wait for everyone to have a turn
        • Does not guarantee progress if a process no longer needs a turn
      • Poor solution when processes require the critical section at differing rates

Mutual Exclusion by Disabling Interrupts

Only used for very short critical sections, because it delays the servicing of other devices in the machine and makes them less responsive.

  • Before entering a critical region, disable interrupts
  • After leaving the critical region, enable interrupts
  • Pros
    • simple
  • Cons
    • Only available in the kernel
    • Blocks everybody else, even with no contention
      • slows interrupt response time
    • Does not work on a multiprocessor

Hardware Support for mutual exclusion

  • Test and set instruction
    • can be used to implement lock variables correctly
      • It loads the value of the lock
      • If lock == 0,
        • set the lock to 1
        • return the result 0 – we acquire the lock
      • If lock == 1
        • return 1 – another thread/process has the lock
  • Hardware guarantees that the instruction executes atomically(atomically: as an indivisible unit)
enter_region:
    TSL REGISTER,LOCK  | copy lock to register and set lock to 1
    CMP REGISTER,#0    | was lock zero?
    JNE enter_region   | if it was non zero, lock was set, so loop
    RET                | return to caller; critical region entered

leave_region:
    MOVE LOCK,#0       | store a 0 in lock
    RET                | return to caller
  • Pros
    • Simple(easy to show it’s correct)
    • Available at user-level
      • To any number of processors
      • To implement any number of lock variables
  • Cons
    • Busy waits(also termed a spin lock)
      • consumes CPU
      • starvation is possible when a process leaves its critical section and more than one process is waiting
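
The same test-and-set idea can be written portably with C11 atomics; this is a minimal sketch mirroring the assembly above (the names enter_region/leave_region are kept from the example, not from any library):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* spin until our test-and-set reads 0, i.e. we are the ones who set the flag */
void enter_region(void)
{
    while (atomic_flag_test_and_set(&lock))
        ;                      /* busy wait (spin lock) */
}

/* clear the flag so another thread's test-and-set can succeed */
void leave_region(void)
{
    atomic_flag_clear(&lock);
}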

Tackling the Busy-Wait Problem

  • Sleep / Wake up
    • The idea
      • When a process is waiting for an event, it calls sleep to block, instead of busy waiting
      • When the event happens, the event generator (another process) calls wakeup to unblock the sleeping process
      • Waking a ready/running process has no effect

The Producer-Consumer Problem

Also called the bounded buffer problem
A producer produces data items and stores the items in a buffer
A consumer takes the items out of the buffer and consumes them.

  • Two problems: concurrency (the buffer is a shared resource), and the producer must block when the buffer is full while the consumer must block when it is empty (see the sleep/wakeup sketch below)
    • We must keep an accurate count of items in the buffer
    • The consumer can call wakeup when it removes the first entry from a full buffer
    • The producer can call wakeup when it adds the first item to an empty buffer
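
A minimal sleep/wakeup sketch of the bounded buffer (pseudocode in the style of the other examples; produce_item, insert_item, remove_item, sleep and wakeup are assumed primitives). Note that it contains the classic lost-wakeup race: the consumer can read count == 0, be preempted before it sleeps, miss the producer's wakeup, and then sleep forever.

#define N 100                     /* number of slots in the buffer */
int count = 0;                    /* number of items currently in the buffer */

void producer(void)
{
    while (TRUE) {
        item = produce_item();
        if (count == N) sleep();              /* buffer full, block */
        insert_item(item);
        count = count + 1;
        if (count == 1) wakeup(consumer);     /* buffer was empty, wake consumer */
    }
}

void consumer(void)
{
    while (TRUE) {
        if (count == 0) sleep();              /* buffer empty, block */
        item = remove_item();
        count = count - 1;
        if (count == N - 1) wakeup(producer); /* buffer was full, wake producer */
        consume_item(item);
    }
}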

Semaphores

  • Dijkstra introduced two primitives that are more powerful than simple sleep and wakeup alone.
    • P(): proberen, from Dutch to test
    • V(): verhogen, from Dutch to increment
    • Also called wait & signal, down & up

How do they work

  • If a resource is not available, the corresponding semaphore blocks any process waiting for the resource
  • Blocked processes are put into a process queue maintained by the semaphore(avoids busy waiting)
  • When a process releases a resource, it signals this by means of the semaphore
  • Signalling resumes a blocked process if there is any
  • Wait(P) and signal(V) operations cannot be interrupted
  • Complex coordination can be implemented by multiple semaphores

Semaphore implementation

  • Define a semaphore as a record

Each primitive (wait and signal) is atomic; inside the OS, disabling interrupts can be used to implement the semaphore operations atomically.

typedef struct {
    int count;
    struct process *L;   /* queue of blocked processes */
} semaphore;

// Semaphore operations now defined as
wait(S):
    S.count--;
    if (S.count < 0) {
        add this process to S.L;
        sleep;
    }

signal(S):
    S.count++;
    if (S.count <= 0) {
        remove a process P from S.L;
        wakeup(P);
    }

Semaphore as a General Synchronization Tool

  • Execute B in Pj only after A has executed in Pi
  • Use a semaphore count initialised to 0
  • Code (see the sketch below):
    • A first: Pi runs A then signal (count becomes 1); Pj's wait then succeeds (count returns to 0) and B runs after A, as desired
    • B first: Pj's wait decrements count below 0 and sleeps; when Pi finishes A, its signal wakes Pj, so B still runs after A
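
A minimal sketch of the code referred to above: count is a semaphore initialised to 0, and the two fragments run in Pi and Pj respectively.

semaphore count = 0;

/* in process Pi */
A();
signal(count);   /* announce that A has finished */

/* in process Pj */
wait(count);     /* blocks until Pi has signalled */
B();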

Semaphore implementation of a Mutex

  • Mutex is short for Mutual Exclusion
    • Can also be called a lock
semaphore mutex;
mutex.count = 1; /* init the mutex */
wait(mutex); /* enter the critical region */
Blahblah();
signal(mutex); /* exit the critical region */

Notice that the initial count determines how many waits can progress before blocking and requiring a signal => mutex.count is initialised to 1

Solve the producer-consumer problem

#define N 4
semaphore mutex = 1;

/* count empty slots */
semaphore empty = N;

/* count full slots */
semaphore full = 0;

prod() {
    while (TRUE) {
        item = produce();
        wait(empty);
        wait(mutex);
        insert_item();
        signal(mutex);
        signal(full);
    }
}

con() {
    while (TRUE) {
        wait(full);
        wait(mutex);
        remove_item();
        signal(mutex);
        signal(empty);
    }
}

Summary

  • Semaphores can be used to solve a variety of concurrency problems
  • However, programming with them can be error-prone
    • E.g. for mutexes, must signal for every wait
      • Too many or too few signals or waits, or signals and waits in the wrong order, can have catastrophic results

Monitors

  • To ease concurrent programming, Hoare(1974) proposed monitors
    • A higher level synchronisation primitive
    • Programming language construct
  • Idea
    • A set of procedures, variables, data types are grouped in a special kind of module, a monitor.
      • Variables and data types only accessed from within the monitor
    • Only one process/thread can be in the monitor at any one time
      • Mutual exclusion is implemented by the compiler(which should be less error prone)
        monitor
  • When a thread calls a monitor procedure that has a thread already inside, it is queued and it sleeps until the current thread exits the monitor
    example
monitor example
    integer i;
    condition c;

    procedure producer();
    .
    .
    .
    end;

    procedure consumer();
    .
    .
    .
    end;
end monitor;

How do we block waiting for an event?

  • We need a mechanism to block waiting for an event (in addition to ensuring mutual exclusion)
    • for the producer-consumer problem: when the buffer is empty or full
  • Condition variables (see the pthread-based sketch below)
    • To allow a process to wait within the monitor, a condition variable must be declared, as in: condition x, y;
    • A condition variable can only be used with the operations wait and signal
      • The operation x.wait();
        • means that the process invoking this operation is suspended until another process invokes x.signal()
        • Another thread can enter the monitor while the original is suspended
      • x.signal();
        • resumes exactly one suspended process. If no process is suspended, then the signal operation has no effect.
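
C has no language-level monitors, but the structure can be sketched with POSIX threads: the mutex plays the role of the monitor lock and pthread_cond_wait/pthread_cond_signal play the role of x.wait()/x.signal(). This is only an approximation (pthread condition variables have Mesa-style semantics, hence the while loops rechecking the condition); buffer, count and the function names are made up for the example.

#include <pthread.h>

#define N 100

static int buffer[N], count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void produce(int item)
{
    pthread_mutex_lock(&lock);                 /* enter the "monitor" */
    while (count == N)
        pthread_cond_wait(&not_full, &lock);   /* releases lock while waiting */
    buffer[count++] = item;
    pthread_cond_signal(&not_empty);           /* wake a waiting consumer, if any */
    pthread_mutex_unlock(&lock);               /* leave the "monitor" */
}

int consume(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int item = buffer[--count];
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}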

The Readers and Writers Problem

  • Models access to a data base
    • E.g. airline reservation system
    • Can have more than one concurrent reader
      • To check schedules and reservations
    • Writers must have exclusive access
      • To book a ticket or update a schedule

A solution

typedef int semaphore;   /* use your imagination */
semaphore mutex = 1;     /* controls access to 'rc' */
semaphore db = 1;        /* controls access to the database */
int rc = 0;              /* # of processes reading or wanting to */

void reader(void) {
    while (TRUE) {                  /* repeat forever */
        down(&mutex);               /* get exclusive access to 'rc' */
        rc = rc + 1;                /* one reader more now */
        if (rc == 1) down(&db);     /* if this is the first reader */
        up(&mutex);                 /* release exclusive access to 'rc' */
        read_data_base();           /* access the data */
        down(&mutex);               /* get exclusive access to 'rc' */
        rc = rc - 1;                /* one reader fewer now */
        if (rc == 0) up(&db);       /* if this is the last reader */
        up(&mutex);                 /* release exclusive access to 'rc' */
        use_data_read();            /* noncritical region */
    }
}

void writer(void) {
    while (TRUE) {                  /* repeat forever */
        think_up_data();            /* noncritical region */
        down(&db);                  /* get exclusive access */
        write_data_base();          /* update the data */
        up(&db);                    /* release exclusive access */
    }
}

Deadlock

Resources

  • Examples of computer resources
    • printers
    • tape drives
    • Tables in a database
  • Processes need access to resources in reasonable order
  • Preemptable resources
    • e.g. virtual memory
    • can be taken away from a process with no ill effects
  • Nonpreemptable resources
    • will cause the process to fail if taken away

Resources & Deadlocks

  • Suppose a process holds resource A and requests resource B
    • at same time another process holds B and requests A
    • both are blocked and remain so – Deadlock
  • Deadlocks occur when …
    • processes are granted exclusive access to devices, locks, tables, etc..
    • we refer to these entities generally as resources

Resource Access

  • Sequence of events required to use a resource
  1. request the resource
  2. use the resource
  3. release the resource
  • Must wait if request is denied
    • requesting process may be blocked
    • may fail with error code

Deadlock

A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.

  • Usually the event is release of a currently held resource
  • None of the processes can …
    • run
    • release resources
    • be awakened

Four Conditions for Deadlock

  1. Mutual exclusion condition
    • each resource assigned to 1 process or is available
  2. Hold and wait condition
    • process holding resources can request additional
  3. No preemption condition
    • previously granted resources cannot be forcibly taken away
  4. Circular wait condition
    • must be a circular chain of 2 or more processes
    • each is waiting for resource held by next member of the chain

Strategies for dealing with Deadlocks

  1. Just ignore the problem altogether

  2. Prevention

    Negating one of the four necessary conditions

  3. Detection and recovery

  4. Dynamic avoidance

    Careful resource allocation

Approach 1: The Ostrich Algorithm

  • Pretend there is no problem
  • Reasonable if
    • deadlocks occur very rarely
    • Cost of prevention is high
      • Example of “cost”, only one process runs at a time
  • UNIX and Windows take this approach for some of the more complex resource relationships they manage
  • It’s a trade off between
    • Convenience (engineering approach)
    • Correctness(mathematical approach)

Approach 2: Deadlock Prevention

The most commonly used strategy

Resource allocation rules prevent deadlock by preventing one of the four conditions required for deadlock from occurring

  • Mutual exclusion

    • Not feasible in general
      • Some devices/resource are intrinsically not shareable
  • Hold and wait

    • Require processes to request resources before starting

      • a process never has to wait for what it needs
    • Issues

      • may not know required resources at start of run

        • must be determined at run time or by user input(e.g. using Word to open a file)

        • $\Rightarrow$ not always possible

      • Also ties up resources other processes could be using

    • Variations:

      • Process must give up all resources if it would block holding a resource

      • then request all immediately needed

      • prone to livelock

        • Livelocked processes are not blocked, change state regularly, but never make progress.

  • No preemption

    • This is not a viable option
  • Circular Wait

    • acquire resources in the same (globally defined) order, i.e. impose an ordering on resources

Approach 3: Detection and Recovery

  • Need a method to determine if a system is deadlocked
  • Assuming deadlock is detected, we need a method of recovery to restore progress to the system.

Modeling resources and processes as a directed graph

(a) represents everything in the system

(b) is a deadlock

What about resources with multiple units?

  • Some examples of multi-unit resources
    • RAM
    • Blocks on a hard disk drive
    • Slots in a buffer
  • We need an approach for dealing with resources that consist of more than a single unit

Modeling using matrix

Data structures needed by the deadlock detection algorithm

sum of current resource allocation + resources available = resources that exist
$$
\sum_{i=1}^{n}{C_{ij}}+A_j = E_j
$$
Example

  • Process 1: has 1 scanner
  • Process 2: has 2 tape drives and 1 CD-ROM
  • Process 3: has 1 plotter and 2 scanners

Detection Algorithm

  1. Look for an unmarked process $P_i$ for which the $i$-th row of R is less than or equal to A
  2. If found, add the $i$-th row of C to A, mark $P_i$, and go to step 1
  3. If no such process exists, terminate

Remaining processes are deadlocked.
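
A minimal C sketch of the detection algorithm for n processes and m resource classes; C is the current-allocation matrix, R the request matrix and A the available vector, matching the formula above (the sizes are made up for the example).

#include <stdbool.h>

#define NPROC 3                   /* n processes (example sizes) */
#define NRES  4                   /* m resource classes */

/* returns true and fills deadlocked[] if any process cannot finish */
bool detect(int C[NPROC][NRES], int R[NPROC][NRES], int A[NRES],
            bool deadlocked[NPROC])
{
    bool marked[NPROC] = { false };
    bool progress = true;

    while (progress) {
        progress = false;
        for (int i = 0; i < NPROC; i++) {
            if (marked[i]) continue;
            /* step 1: can process i's outstanding request be satisfied? */
            bool can_run = true;
            for (int j = 0; j < NRES; j++)
                if (R[i][j] > A[j]) { can_run = false; break; }
            if (can_run) {
                /* step 2: assume it runs to completion and releases everything */
                for (int j = 0; j < NRES; j++)
                    A[j] += C[i][j];
                marked[i] = true;
                progress = true;
            }
        }
    }
    /* step 3: any process still unmarked is deadlocked */
    bool any = false;
    for (int i = 0; i < NPROC; i++) {
        deadlocked[i] = !marked[i];
        if (deadlocked[i]) any = true;
    }
    return any;
}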

Recovery from Deadlock

  • Recovery through preemption
    • Take a resource from some other process
    • Depends on nature of the resource
  • Recovery through rollback
    • Checkpoint a process periodically
      • does not come for free; often not practical
    • Use this saved state
    • Restart the process if it is found deadlocked
      • No guarantee it won't deadlock again
  • Recovery through killing processes
    • Crudest but simplest way to break a deadlock
    • Kill one of the processes in the deadlock cycle
    • The other processes get its resources
    • Choose process that can be rerun from the beginning

Approach 4 Deadlock Avoidance

Not practical for all systems; we need to know enough information in advance

Two process resource trajectories

Safe and Unsafe States

  • A state is safe if
    • The system is not deadlocked
    • There exists a scheduling order that results in every process running to completion, even if they all request their maximum resources immediately
  • Unsafe states are not necessarily deadlocked
    • with a lucky sequence, all processes may complete
    • However, we cannot guarantee that they will complete(not deadlock)
  • Safe states guarantee we will eventually complete all processes
  • Deadlock avoidance algorithm
    • Only grant requests that result in safe states

Bankers Algorithm

  • Modelled on a Banker with Customers
    • The banker has a limited amount of money to loan customers
      • Limited number of resources
    • Each customer can borrow money up to the customer’s credit limit
      • Maximum number of resources required
  • Basic idea
    • keep the bank in a safe state
      • So all customers are happy even if they all request to borrow up to their credit limit at the same time
    • Customers wishing to borrow such that the bank would enter an unsafe state must wait until somebody else repays their loan such that the transaction becomes safe.
  • Bankers Algorithm is not commonly used in practice
    • It is difficult (sometimes impossible ) to know in advance
      • the resources a process will require
      • the number of processes in a dynamic system

Starvation

A process never receives the resource it is waiting for: despite the resource (repeatedly) becoming free, it is always allocated to another waiting process.

One solution: a first-come, first-served policy

System Call

System Calls

  • Can be viewed as special function calls
    • Provides for a controlled entry into the kernel
    • While in kernel, they perform a privileged operation
    • Returns to original caller with the result
  • The system call interface represents the abstract machine provided by the operating system.

A brief overview

From the user’s perspective

  • Process Management
  • File I/O
  • Directory management
  • Some other selected Calls
  • There are many more
    • On Linux, see man syscalls for a list

Process Management

Call : Description
pid = fork() : Create a child process identical to the parent
pid = waitpid(pid, &statloc, options) : Wait for a child to terminate
s = execve(name, argv, environp) : Replace a process's core image
exit(status) : Terminate process execution and return status

File Management

Call : Description
fd = open(file, how, ...) : Open a file for reading, writing or both
s = close(fd) : Close an open file
n = read(fd, buffer, bytes) : Read data from a file into a buffer
n = write(fd, buffer, bytes) : Write data from a buffer into a file
position = lseek(fd, offset, whence) : Move the file pointer
s = stat(name, &buf) : Get a file's status information

A stripped down shell:

while (TRUE) {                          /* repeat forever */
    type_prompt();                      /* display prompt */
    read_command(command, parameters);
    if (fork() != 0) {                  /* fork off child process */
        /* Parent code */
        waitpid(-1, &status, 0);        /* wait for child to exit */
    } else {
        /* Child code */
        execve(command, parameters, 0); /* execute command */
    }
}

System Call Implementation

A Simple Model of CPU Computation

The fetch-execute cycle

  • Load memory contents from address in program counter (PC)
    • The instruction
  • Execute the instruction
  • increment PC
  • Repeat

A Simple Model of CPU Computation

  • Stack Pointer(SP)

  • Status Register

  • Condition codes

    • Positive result
    • Zero result
    • Negative result
  • General Purpose Register

    • Holds operands of most instructions
    • Enables programmers (compiler) to minimise memory references

Privileged-mode Operation

To protect operating system execution, two or more CPU modes of operation exist

  • Privileged mode(system, kernel-mode)
    • All instructions and registers are available
  • User-mode
    • Uses a ‘safe’ subset of the instruction set
      • Only affects the state of the application itself
      • Cannot be used to uncontrollably interfere with the OS
    • Only ‘safe’ registers are accessible

System Call

System Call Mechanism Overview

  • System call transitions triggered by special processor instructions
    • User to Kernel
      • System call instruction
    • Kernel to User
      • Return from privileged mode instruction
  • Processor mode
    • Switched from user-mode to kernel-mode
      • Switched back when returning to user mode
  • Stack Pointer (SP)
    • User-level SP is saved and a kernel SP is initialised
      • User-level SP restored when returning to user-mode
  • Program Counter (PC)
    • User-level PC is saved and PC set to kernel entry point
      • User-level PC restored when returning to user-level
    • Kernel entry via the designated entry point must be strictly enforced
  • Registers
    • Set at user-level to indicate system call type and its arguments
      • A convention between applications and the kernel
    • Some registers are preserved at user-level or kernel-level in order to restart user-level execution
      • Depends on language calling convention etc
    • Result of system call placed in registers when returning to user-level
      • Another convention

Why do we need system calls?

Why not simply jump into the kernel via a function call?

Function calls do not

  • Change from user to kernel mode
    • and eventually back again
  • Restrict possible entry points to secure locations
    • To prevent entering after any security checks

Steps in Making a System Call

File Management

Overview of the FS abstraction

User’s view : Under the hood
Uniform namespace : Heterogeneous collection of storage devices
Hierarchical structure : Flat address space (block numbers)
Arbitrarily-sized files : Fixed-size blocks
Symbolic file names : Numeric block addresses
Access control : No access control
Plus tools under the hood for formatting, defragmentation, backup, and consistency checking

File Names

File system must provide a convenient naming scheme

  • Textual Names
  • May have restrictions
    • Only certain characters
    • Limited length
    • Only certain format
  • Case (in)sensitive
  • Names may obey conventions
    • Interpreted by tools
    • interpreted by operating system

File Structure

Sequence of Bytes

  • OS consider a file to be unstructured
  • Applications can impose their own structure
  • Used by UNIX, Windows, and most modern OSes

File Types

Regular File

Directories

Device Files

  • May be divided into
    • Character Devices – stream of bytes
    • Block Devices
  • Some systems distinguish between regular file types
    • ASCII text files, binary files

File Access Types(Patterns)

Sequential access

  • read all bytes/records from the beginning
  • cannot jump around, could rewind or back up
  • convenient when medium was magnetic tape

Random access

  • bytes/records read in any order
  • essential for data base systems
  • read can be …
    • move file pointer (seek), then read or
      • lseek(location,…);read(…)
    • each read specifies the file pointer
      • read(location,…)
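
A minimal POSIX sketch of the two random-access styles: seek-then-read with lseek()/read(), or pread(), which takes the offset in the call itself (the file name is made up for the example).

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[512];
    int fd = open("data.bin", O_RDONLY);

    /* style 1: move the file pointer, then read */
    lseek(fd, 4096, SEEK_SET);             /* jump to byte offset 4096 */
    read(fd, buf, sizeof buf);

    /* style 2: pass the offset explicitly */
    pread(fd, buf, sizeof buf, 4096);      /* does not move the file pointer */

    close(fd);
    return 0;
}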

Typical File Operations

Create, Delete, Open, Close, Read, Write, Append, Rename, Seek, Get attributes, Set attributes

File Organisation and Access

(Programmer’s Perspective)

Given an operating system supporting unstructured files that are a stream of bytes, how can one organise the contents of the files?

E.g. Executable Linkable Format (ELF)

Some possible access patterns:

  • Read the whole file
  • Read individual records from a file
    • record = a sequence of bytes containing the record
  • Read records preceding or following the current one
  • Retrieve a set of records
  • Write a whole file sequentially
  • Insert/delete/update records in a file

Programmers are free to structure the file to suit the application

Criteria for File Organization

Things to consider when designing file layout

  • Rapid access
    • Needed when accessing a single record
    • Not needed for batch mode
      • read from start to finish
  • Ease of update
    • File on CD-ROM will not be updated, so this is not a concern
  • Economy of storage
    • Should be minimum redundancy in the data
    • Redundancy can be used to speed access such as an index

File Directories

Provide mapping between file names and the files themselves

Contain information about files

  • Attributes
  • Location
  • Ownership

Directory itself is a file owned by the operating system

Hierarchical (Tree-Structured) Directory

Files can be located by following a path from the root, or master, directory down various branches

  • This is the absolute pathname for the file

Can have several files with the same file name as long as they have unique path names

Current Working Directory

Always specifying the absolute pathname for a file is tedious!

Introduce the idea of a working directory

  • Files are referenced relative to the working directory

Relative and Absolute Pathnames

Absolute pathname

  • A path specified from the root of the file system to the file

A Relative pathname

  • A pathname specified from the cwd

Note: ‘.’ (dot) and ‘..’ (dotdot) refer to current and parent directory

Typical Directory Operations

Create, Delete, Opendir, Closedir, Readdir, Rename, Link, Unlink

Nice properties of UNIX naming

Simple, regular format

  • Names referring to different servers, objects, etc., have the same syntax.
    • Regular tools can be used where specialised tools would otherwise be needed.

Location independent

  • Objects can be distributed or migrated, and continue with the same names.

File Sharing

In a multiuser system, we want to allow files to be shared among users

Two issues

  • Access rights
    • None
      • User may not know of the existence of the file
      • User is not allowed to read the directory that includes the file
    • Knowledge
      • User can only determine that the file exists and who its owner is
    • Execution
      • The user can load and execute a program but cannot copy it
    • Reading
      • The user can read the file for any purpose, including copying and execution
    • Appending
      • The user can add data to the file but cannot modify or delete any of the file’s contents
    • Updating
      • The user can modify, delete, and add to the file’s data. This includes creating the file, rewriting it, and removing all or part of the data
    • Changing protection
      • User can change access rights granted to other users
    • Deletion
      • User can delete the file
    • Owners
      • Has all rights previously listed
      • May grant rights to others using the following classes of users
        • Specific user
        • User groups
        • All for public files
  • Management of simultaneous access
    • Most OSes provide mechanisms for users to manage concurrent access to files
      • Example: the flock() and lockf() system calls
    • Typically
      • User may lock entire file when it is to be updated
      • User may lock the individual records (i.e. ranges) during the update
    • Mutual exclusion and deadlock are issues for shared access

File System

  • The FS must map symbolic file names into a collection of block addresses

  • The FS must keep track of

    • which blocks belong to which files
    • in what order the blocks form the file
    • which blocks are free for allocation
  • Given a logical region of a file, the FS must track the corresponding block(s) on disk

    • Stored in file system metadata

File Allocation Methods

A file is divided into “blocks” – the unit of transfer to storage

External and internal fragmentation

External fragmentation

  • The space wasted external to the allocated memory regions
  • Memory space exists to satisfy a request but it is unusable as it is not contiguous

Internal fragmentation

  • The space wasted internal to the allocated memory regions
  • Allocated memory may be slightly larger than requested memory; this size difference is wasted memory internal to a partition

Contiguous Allocation

Easy bookkeeping (need to keep track of the starting block and length of the file)

Increases performance for sequential operations

Need the maximum size for the file at the time of creation

As files are deleted, free space becomes divided into many small chunks (external fragmentation)

Dynamic Allocation Strategies

Disk space allocated in portions as needed

Allocation occurs in fixed-size blocks

No external fragmentation

Does not require pre-allocating disk space

Partially filled blocks (internal fragmentation)

File blocks are scattered across the disk

Complex metadata management (maintain the collection of blocks for each file)

Dynamic allocation: Linked list allocation

Each block contains a pointer to the next block in the chain. Free blocks are also linked in a chain.

Only single metadata entry per file

Best for sequential files

Poor for random access

Blocks end up scattered across the disk due to free list eventually being randomised

Dynamic Allocation: File Allocation Table (FAT)

Keep a map of the entire FS in a separate table

  • A table entry contains the number of the next block of the file
  • The last block in a file and empty blocks are marked using reserved values

The table is stored on the disk and is replicated in memory

Random access is fast (following the in-memory list)

Requires a lot of memory for large disks

Free block lookup is slow
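
A minimal sketch of walking a FAT chain in the in-memory table: fat[] maps a block number to the next block number of the file, with a reserved value marking end-of-file (the names and the EOF marker are illustrative, not any particular FAT variant).

#define FAT_EOF (-1)                  /* reserved value: last block of a file */

void read_block_from_disk(int block); /* assumed helper */

/* read every block of a file, given the first block number from its directory entry */
void read_file(int fat[], int first_block)
{
    for (int b = first_block; b != FAT_EOF; b = fat[b])
        read_block_from_disk(b);
}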

Dynamic Allocation: inode-based FS structure

Idea: separate table (index-node or i-node) for each file.

  • Only keep table for open files in memory
  • Fast random access

The most popular FS structure today

i-nodes occupy one or several disk areas

i-nodes are allocated dynamically, hence free-space management is required for i-nodes

  • Use fixed-size i-nodes to simplify dynamic allocation
  • Reserve the last i-node entry for a pointer (a block number) to an extension i-node.

i-node implementation issues

Free-space management

  • Approach 1: linked list of free blocks, stored in the free blocks themselves on disk
    • List of all unallocated blocks
    • Background jobs can re-order list for better contiguity
    • Store in free blocks themselves
      • Does not reduce disk capacity
    • Only one block of pointers need be kept in the main memory
  • Approach 2: keep bitmaps of free blocks and free i-nodes on disk
    • Individual bits in a bit vector flags used/free blocks
    • May be too large to hold in main memory
    • Expensive to search
    • Concentrating (de)allocations in a portion of the bitmap has desirable effect of concentrating access
    • Simple to find contiguous free space
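
A minimal sketch of approach 2: scan a bitmap of block-usage bits for the first free block (0 = free, 1 = used), assuming the whole bitmap is held in memory as an array of bytes.

/* returns the number of the first free block, or -1 if the disk is full */
int find_free_block(unsigned char bitmap[], int nblocks)
{
    for (int b = 0; b < nblocks; b++) {
        int byte = b / 8, bit = b % 8;
        if ((bitmap[byte] & (1 << bit)) == 0) {
            bitmap[byte] |= (1 << bit);   /* mark the block allocated */
            return b;
        }
    }
    return -1;
}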

Implementing directories

Directories are stored like normal files

  • directory entries are contained inside data blocks

The FS assigns special meaning to the content of these files

  • a directory file is a list of directory entries
  • a directory entry contains file name, attributes, and the file i-node number
    • maps human-oriented file name to a system-oriented name

Fixed-size vs variable-size directory entries

  • Fixed-size directory entries
    • Either too small
    • Or waste too much space
  • Variable-size directory entries
    • Freeing variable length entries can create external fragmentation in directory blocks
      • Can compact when block is in RAM

Searching Directory Listings

Locating a file in a directory

  • Linear scan
    • Implement a directory cache in software to speed-up search
  • Hash lookup
  • B-tree (100’s of thousands entries)

Storing file attributes

(a) disk addresses and attributes in the directory entry (FAT)

(b) directory in which each entry just refers to an i-node (UNIX)

Inode Contents

Size

  • Offset of the highest byte written

Block count

  • Number of disk blocks used by the file.
  • Note that number of blocks can be much less than expected given the file size

Files can be sparsely populated

  • E.g. write(f,“hello”); lseek(f, 1000000); write(f, “world”);
  • Only needs to store the start and end of file, not all the empty blocks in between.
  • Size = 1000005
  • Blocks = 2 + any indirect blocks

Direct Blocks

  • Block numbers of first 12 blocks in the file
  • Most files are small – We can find blocks of file directly from the inode

Single Indirect Block – Block number of a block containing block numbers

  • Requires two disk access to read – One for the indirect block; one for the target block
  • Max File Size – Assume 1Kbyte block size, 4 byte block numbers
    • 12 * 1K + 1K/4 * 1K = 268 KiB
  • For large majority of files (< 268 KiB), given the inode, only one or two further accesses required to read any block in file

Double Indirect Block – Block number of a block containing block numbers of blocks containing block numbers

Triple Indirect
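
Extending the earlier calculation (1 KiB blocks, 4-byte block numbers, so 256 block numbers per indirect block), each level multiplies the reachable file size by 256:
$$
\begin{aligned}
\text{direct} &= 12 \times 1\,\text{KiB} = 12\,\text{KiB} \\
\text{single indirect} &= 256 \times 1\,\text{KiB} = 256\,\text{KiB} \\
\text{double indirect} &= 256^2 \times 1\,\text{KiB} = 64\,\text{MiB} \\
\text{triple indirect} &= 256^3 \times 1\,\text{KiB} = 16\,\text{GiB}
\end{aligned}
$$
so with all four levels in use the maximum file size is roughly 16 GiB + 64 MiB + 268 KiB.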

Hard Links

Note that inodes can have more than one name; this is called a hard link

Symbolic links

A symbolic link is a file that contains a reference to another file or directory

  • Has its own inode and data block, which contains a path to the target file
  • Marked by a special file attribute
  • Transparent for some operations
  • Can point across FS boundaries

FS reliability

Disk writes are buffered in RAM

  • OS crash or power outage ==> lost data
  • Commit writes to disk periodically
  • Use the sync command to force a FS flush

FS operations are non-atomic

  • Incomplete transaction can leave the FS in an inconsistent state

Journaling file systems

  • Keep a journal of FS updates
  • Before performing an atomic update sequence, write it to the journal
  • Replay the last journal entries upon an unclean shutdown

The ext3 journal

Option 1: Journal FS data structure updates

Efficient use of journal space; hence faster journaling

Individual updates are applied separately

The journaling layer must understand FS semantics

Option 2: Journal disk block updates

Ext3 implements Option 2

Even a small update adds a whole block to the journal

Multiple updates to the same block can be aggregated into a single update

The journaling layer is FS-independent (easier to implement)

Journaling Block Device (JBD)

JBD interface (continued)

  • Commit: write transaction data to the journal (persistent storage)
    • Multiple FS transactions are committed in one go
  • Checkpoint: flush the journal to the disk
    • Used when the journal is full or the FS is being unmounted

Trade-off in FS block size

  • File systems deal with 2 types of blocks
    • Disk blocks or sectors (usually 512 bytes)
    • File system blocks 512 * 2^N bytes
  • Larger blocks require less FS metadata
  • Smaller blocks waste less disk space (less internal fragmentation)
  • Sequential Access
    • The larger the block size, the fewer I/O operations required
  • Random Access
    • The larger the block size, the more unrelated data loaded
    • Spatial locality of access improves the situation
  • Choosing an appropriate block size is a compromise

Virtual File System(VFS)

  • Provides single system call interface for many file systems
  • Transparent handling of network file systems
  • File-based interface to arbitrary device drivers
  • File-based interface to kernel data structures
  • Provides an indirection layer for system calls
    • File operation table set up at file open time
    • Points to actual handling code for particular type
    • Further file operations redirected to those functions

VFS Interface

Two major data types

VFS

  • Represents all file system types
  • Contains pointers to functions to manipulate each file system as a whole (e.g. mount, unmount)

Vnode

  • Represents a file (inode) in the underlying filesystem
  • Points to the real inode
  • Contains pointers to functions to manipulate files/inodes (e.g. open, close, read, write,…)
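
A minimal sketch of the indirection: each vnode carries a table of function pointers that the generic system-call layer dispatches through (the field and function names here are illustrative, not the actual Linux or BSD VFS interface).

#include <stddef.h>

struct vnode;   /* forward declaration */

/* per-filesystem operations, filled in by e.g. ext3, NFS, or a device driver */
struct vnode_ops {
    int (*vop_open) (struct vnode *v, int flags);
    int (*vop_read) (struct vnode *v, void *buf, size_t len, size_t offset);
    int (*vop_write)(struct vnode *v, const void *buf, size_t len, size_t offset);
    int (*vop_close)(struct vnode *v);
};

struct vnode {
    const struct vnode_ops *ops;   /* set up when the file is opened */
    void *fs_data;                 /* pointer to the real (on-disk) inode */
};

/* the generic layer just forwards to the concrete implementation */
int vfs_read(struct vnode *v, void *buf, size_t len, size_t offset)
{
    return v->ops->vop_read(v, buf, len, offset);
}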

File Descriptors

  • Each open file has a file descriptor
  • Read/Write/lseek/…. use them to specify which file to operate on.

State associated with a file descriptor

  • File pointer
    • Determines where in the file the next read or write is performed
  • Mode
    • Was the file opened read-only, etc….

Memory Management

OS Memory Management

  • Keeps track of what memory is in use and what memory is free

  • Allocates free memory to process when needed

    • And deallocates it when they no longer need it
  • Manages the transfer of memory between RAM and disk.

  • Two broad classes of memory management systems

    • Those that transfer processes to and from external storage during execution.
      • Called swapping or paging
    • Those that don’t
      • Simple
      • Might find this scheme in an embedded device, dumb phone, or smartcard.

Basic Memory Management

Monoprogramming without Swapping or Paging

Three simple ways of organizing memory

Monoprogramming

Okay if

  • Only have one thing to do
  • Memory available approximately equates to memory required
  • Otherwise,
    • Poor CPU utilisation in the presence of I/O waiting
    • Poor memory utilisation with a varied job mix

Fixed partitioning

Simple MM: Fixed, equal-sized partitions

  • Any unused space in the partition is wasted
    • Called internal fragmentation
  • Processes smaller than main memory, but larger than a partition cannot run

Simple MM: Fixed, variable-sized partitions

  • Divide memory at boot time into a selection of different sized partitions

    • Can base sizes on expected workload
  • Each partition has queue:

    • Place process in queue for smallest partition that it fits in.
    • A process waits until its assigned partition is free before it can start
  • Issue

    • Some partitions may be idle
      • Small jobs available, but only large partition free
      • Workload could be unpredictable

Alternative queue strategy

Single queue, search for any jobs that fit

  • Small jobs in large partition if necessary

  • Increases internal memory fragmentation

Fixed Partition Summary

  • Simple
  • Easy to implement
  • Can result in poor memory utilisation
    • Due to internal fragmentation
  • Used on IBM System 360 operating system (OS/MFT)
    • Announced 6 April, 1964
  • Still applicable for simple embedded systems
    • Static workload known in advance

Dynamic Partitioning

  • Partitions are of variable length

    • Allocated on-demand from ranges of free memory
  • Process is allocated exactly what it needs

    • Assumes a process knows what it needs
  • We end up with unusable holes (external fragmentation)

Classic Approach

Represent available memory as a linked list of available “holes” (free memory ranges).

  • Base, size
  • Kept in order of increasing address
    • Simplifies merging of adjacent holes into larger holes.
  • List nodes are stored in the “holes” themselves

Dynamic Partitioning Placement Algorithm

First-fit algorithm

  • Scan the list for the first entry that fits
    • If greater in size, break it into an allocated and free part
    • Intent: Minimise amount of searching performed
  • Aims to find a match quickly
  • Biases allocation to one end of memory
  • Tends to preserve larger blocks at the end of memory
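
A minimal sketch of first-fit over the hole list described above (the hole struct and the allocator are assumptions for the example; a zero-sized hole would also need to be unlinked, which is omitted here).

#include <stddef.h>

struct hole {
    size_t base, size;     /* start address and length of the free range */
    struct hole *next;     /* list kept in order of increasing address */
};

/* first-fit: return the base of an allocated region of 'want' bytes, or -1 */
long alloc_first_fit(struct hole *list, size_t want)
{
    for (struct hole *h = list; h != NULL; h = h->next) {
        if (h->size >= want) {
            long addr = (long)h->base;
            h->base += want;       /* shrink the hole from the front */
            h->size -= want;
            return addr;
        }
    }
    return -1;                     /* no hole large enough */
}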

Next-fit

  • Like first-fit, except it begins its search from the point in list where the last request succeeded instead of at the beginning.
    • (Flawed) Intuition: spread allocation more uniformly over entire memory to avoid skipping over small holes at start of memory
    • Performs worse than first-fit as it breaks up the large free space at end of memory.

Best-fit algorithm

  • Chooses the block that is closest in size to the request
  • Performs worse than first-fit
    • Has to search complete list
    • does more work than first-fit
    • Since the smallest block that fits is chosen, the smallest amount of external fragmentation is left over
    • Creates lots of unusable holes

Worst-fit algorithm

  • Chooses the block that is largest in size (worst-fit)
    • (whimsical) idea is to leave a usable fragment left over
  • Poor performer
    • Has to do more work (like best fit) to search complete list
    • Does not result in significantly less fragmentation

Dynamic Partition Allocation Algorithm Summary

  • First-fit generally better than the others and easiest to implement

Compaction

  • We can reduce external fragmentation by compaction

    • Shuffle memory contents to place all free memory together in one large block.
    • Only possible if we can relocate running programs
      • What about pointers?
  • Generally requires hardware support

When are memory addresses bound?

  • Compile/link time
    • Compiler/Linker binds the addresses
    • Must know “run” location at compile time
    • Recompile if location changes
  • Load time
    • Compiler generates relocatable code
    • Loader binds the addresses at load time
  • Run time
    • Logical compile-time addresses translated to physical addresses by special hardware.

Hardware Support for Runtime Binding and Protection

  • For process B to run using logical addresses
    • Process B expects to access addresses from zero to some limit of memory size
    • Need to add an appropriate offset to its logical addresses
      • Achieve relocation
      • Protect memory “lower” than B
    • Must limit the maximum logical address B can generate
      • Protect memory “higher” than B
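
A minimal sketch of what the hardware does on every memory access under base/limit relocation (fault() stands in for raising a protection exception):

#include <stdint.h>

uint32_t base, limit;   /* loaded by the OS on each context switch */

void fault(void);       /* assumed: raise a protection exception */

/* translate a logical address, trapping if the process strays outside its region */
uint32_t translate(uint32_t logical)
{
    if (logical >= limit)
        fault();
    return base + logical;   /* relocate into physical memory */
}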

Base and Limit Register

  • Also called
    • Base and bound registers
    • Relocation and limit registers
  • Base and limit registers
    • Restrict and relocate the currently active process
    • Base and limit registers must be changed at
      • Load time
      • Relocation (compaction time)
      • On a context switch
  • Pro
    • Supports protected multi-processing (-tasking)
  • Cons
    • Physical memory allocation must still be contiguous
    • The entire process must be in memory
    • Do not support partial sharing of address spaces
      • No shared code, libraries, or data structures between processes

Thus far, we have a system suitable for a batch system

  • Limited number of dynamically allocated processes
    • Enough to keep CPU utilised
  • Relocated at runtime
  • Protected from each other

Swapping

  • A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.

  • Backing store – fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.

  • Can prioritize – lower-priority process is swapped out so higher-priority process can be loaded and executed.

  • Major part of swap time is transfer time; total transfer time is directly proportional to the amount of memory swapped.

    • Swapping is therefore slow
  • We still assume each process is smaller than physical memory

Virtual Memory – Paging Overview

  • Partition physical memory into small equal sized chunks

    • Called frames
  • Divide each process’s virtual (logical) address space into same size chunks

    • Called pages
    • Virtual memory addresses consist of a page number and offset within the page
  • OS maintains a page table

    • contains the frame location for each page

    • Used by hardware to translate each virtual address to physical address

    • The relation between virtual addresses and physical memory addresses is given by page table

  • Process’s physical memory does not have to be contiguous

  • No external fragmentation

  • Small internal fragmentation (in last page)

  • Allows sharing by mapping several pages to the same frame

  • Abstracts physical organisation

    • The programmer only deals with virtual addresses
  • Minimal support for logical organisation

    • Each unit is one or more pages
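
To make the page-number/offset split concrete, here is a hedged sketch assuming 4 KiB pages and a simple one-level array page table; the sizes and names are illustrative only.

```c
#include <stdint.h>

#define PAGE_BITS 12u                     /* 4 KiB pages */
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES 1024u                   /* illustrative address-space size */

static uint32_t page_table[NUM_PAGES];    /* frame number for each page */

/* Translate a virtual address to a physical address. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t page   = vaddr >> PAGE_BITS;         /* page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* offset within the page */
    uint32_t frame  = page_table[page];           /* look up the frame */
    return (frame << PAGE_BITS) | offset;         /* physical address */
}
```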

Virtual Memory

Memory Management Unit (or TLB)

The position and function of the MMU

Page-based VM

Virtual Memory

  • Divided into equal-sized pages
  • A mapping is a translation between
    • A page and a frame
    • A page and null
  • Mappings defined at runtime – They can change
  • Address space can have holes
  • Process does not have to be contiguous in physical memory

Physical Memory

  • Divided into equal-sized frames

Typical Address Space Layout

  • Stack region is at top, and can grow down
  • Heap has free space to grow up
  • Text is typically read-only
  • Kernel is in a reserved, protected, shared region
  • 0-th page typically not used

Page Faults

Referencing an invalid page triggers a page fault – An exception handled by the OS

Broadly, two standard page fault types

  • Illegal Address (protection error)
    • Signal or kill the process
  • Page not resident
    • Get an empty frame
    • Load page from disk
    • Update page (translation) table (enter frame #, set valid bit, etc.)
    • Restart the faulting instruction

Shared Pages

  • Private code and data
    • Each process has own copy of code and data
    • Code and data can appear anywhere in the address space
  • Shared code
    • Single copy of code shared between all processes executing it
    • Code must not be self modifying
    • Code must appear at same address in all processes

Page Table Structure

Page table is (logically) an array of frame numbers

  • Index by page number

Each page-table entry (PTE) also has other bits

  • Present/Absent bit
    • Also called valid bit, it indicates a valid mapping for the page
  • Modified bit
    • Also called dirty bit, it indicates the page may have been modified in memory
  • Reference bit
    • Indicates the page has been accessed
  • Protection bits
    • Read permission, Write permission, Execute permission
    • Or combinations of the above
  • Caching bit
    • Used to indicate the processor should bypass the cache when accessing memory
      • Example: to access device registers or memory
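
One possible way these bits could be packed into a 32-bit PTE, purely as an illustration (real architectures each define their own layout).

```c
#include <stdint.h>

/* Illustrative PTE layout: control bits in the low bits, frame number above. */
#define PTE_VALID    (1u << 0)   /* present/absent: mapping is valid */
#define PTE_DIRTY    (1u << 1)   /* modified bit: page may have been written */
#define PTE_REF      (1u << 2)   /* reference bit: page has been accessed */
#define PTE_READ     (1u << 3)   /* protection bits */
#define PTE_WRITE    (1u << 4)
#define PTE_EXEC     (1u << 5)
#define PTE_NOCACHE  (1u << 6)   /* bypass the cache (e.g. device memory) */
#define PTE_FRAME_SHIFT 12u      /* frame number stored in the upper bits */

static inline uint32_t pte_frame(uint32_t pte) { return pte >> PTE_FRAME_SHIFT; }
static inline int      pte_valid(uint32_t pte) { return pte & PTE_VALID; }
```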

Address Translation

Every (virtual) memory address issued by the CPU must be translated to physical memory

  • Every load and every store instruction
  • Every instruction fetch

Need Translation Hardware

In a paging system, translation involves replacing the page number with a frame number

Page Tables

  • Page tables are implemented as data structures in main memory
  • Most processes do not use the full 4GB address space
  • We need a compact representation that does not waste space
  • Three basic schemes
    • Use data structures that adapt to sparsity
    • Use data structures which only represent resident pages
    • Use VM techniques for page tables (details left to extended OS)

Two-level Page Table

Second-level page tables representing unmapped pages are not allocated – the corresponding entries in the top-level page table are null
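
A hedged sketch of the two-level lookup, assuming an illustrative 10/10/12-bit split of a 32-bit virtual address; the split and the table names are not taken from any particular architecture.

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t *top_level[1024];   /* pointers to 2nd-level tables, NULL = unmapped */

/* Returns 1 and fills *paddr on success, 0 => page fault. */
int translate_two_level(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t top    = vaddr >> 22;            /* top 10 bits: top-level index */
    uint32_t second = (vaddr >> 12) & 0x3ff;  /* next 10 bits: 2nd-level index */
    uint32_t offset = vaddr & 0xfff;          /* low 12 bits: offset in page */

    uint32_t *pt2 = top_level[top];
    if (pt2 == NULL)                          /* 2nd-level table never allocated */
        return 0;                             /* page fault */

    uint32_t frame = pt2[second];
    *paddr = (frame << 12) | offset;
    return 1;
}
```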

Alternative: Inverted Page Table

“Inverted page table” is an array of page numbers sorted (indexed) by frame number (it’s a frame table).

Algorithm

  • Compute hash of page number
  • Extract index from hash table
  • Use this to index into inverted page table
  • Match the PID and page number in the IPT entry
  • If match, use the index value as frame # for translation
  • If no match, get next candidate IPT entry from chain field
  • If NULL chain entry $\Rightarrow$ page fault
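
A self-contained sketch of that lookup; the hash function, sizes, and field names are illustrative only, and initialisation of the tables (filling the anchors and chains with -1) is omitted.

```c
#include <stdint.h>

#define NFRAMES    1024
#define HASH_SLOTS NFRAMES

struct ipt_entry {
    int      pid;      /* owning process */
    uint32_t page;     /* virtual page number mapped into this frame */
    int      next;     /* chain: next IPT index with the same hash, or -1 */
};

static struct ipt_entry ipt[NFRAMES];   /* indexed by frame number */
static int hash_anchor[HASH_SLOTS];     /* hash -> first IPT index, or -1 */

static unsigned hash(int pid, uint32_t page)
{
    return (page ^ (uint32_t)pid) % HASH_SLOTS;   /* toy hash function */
}

/* Returns the frame number for (pid, page), or -1 => page fault. */
int ipt_lookup(int pid, uint32_t page)
{
    int i = hash_anchor[hash(pid, page)];          /* extract index from hash table */
    while (i != -1) {
        if (ipt[i].pid == pid && ipt[i].page == page)
            return i;                              /* IPT index == frame number */
        i = ipt[i].next;                           /* follow the chain */
    }
    return -1;                                     /* NULL chain entry: page fault */
}
```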

Properties of IPTs

  • IPT grows with size of RAM, NOT virtual address space
  • Frame table is needed anyway (for page replacement, more later)
  • Need a separate data structure for non-resident pages
  • Saves a vast amount of space (especially on 64-bit systems)
  • Used in some IBM and HP workstations

Improving the IPT: Hashed Page Table

  • Retain fast lookup of IPT – A single memory reference in best case
  • Retain page table sized based on physical memory size (not virtual)
    • Enable efficient frame sharing
    • Support more than one mapping for same frame

Sizing the Hashed Page Table

  • HPT sized based on physical memory size
  • With sharing
    • Each frame can have more than one PTE
    • More sharing increases number of slots used – Increases collision likelihood
  • However, we can tune HPT size based on:
    • Physical memory size
    • Expected sharing
    • Hash collision avoidance.
  • Make the HPT size a power-of-2 multiple of the number of physical memory frames

VM Implementation Issue

Performance

Each virtual memory reference can cause two physical memory accesses

  • One to fetch the page table entry
  • One to fetch/store the data

$\Rightarrow$ Intolerable performance impact!!

Solution:

  • High-speed cache for page table entries (PTEs)
    • Called a translation look-aside buffer (TLB)
    • Contains recently used page table entries
    • Associative, high-speed memory, similar to cache memory
    • May be under OS control (unlike memory cache)

TLB

Translation Lookaside Buffer

  • Given a virtual address, processor examines the TLB
  • If matching PTE found (TLB hit), the address is translated
  • Otherwise (TLB miss), the page number is used to index the process’s page table
    • If PT contains a valid entry, reload TLB and restart
    • Otherwise, (page fault) check if page is on disk
      • If on disk, swap it in
      • Otherwise, allocate a new page or raise an exception

TLB properties

  • Page table is (logically) an array of frame numbers

  • TLB holds a (recently used) subset of PT entries

    • Each TLB entry must be identified (tagged) with the page # it translates
    • Access is by associative lookup:
      • All TLB entries’ tags are concurrently compared to the page #
      • TLB is associative (or content-addressable) memory
  • TLB may or may not be under direct OS control

    • Hardware-loaded TLB
      • On miss, hardware performs PT lookup and reloads TLB
      • Example: x86, ARM
    • Software-loaded TLB
      • On miss, hardware generates a TLB miss exception, and exception handler reloads TLB
      • Example: MIPS, Itanium (optionally)
  • TLB size: typically 64-128 entries

  • Can have separate TLBs for instruction fetch and data access

  • TLBs can also be used with inverted page tables (and others)

TLB and context switching

  • TLB is a shared piece of hardware
  • Normal page tables are per-process (address space)
  • TLB entries are process-specific
    • On context switch need to flush the TLB (invalidate all entries)
      • high context-switching overhead (Intel x86)
    • or tag entries with address-space ID (ASID)
      • called a tagged TLB
      • used (in some form) on all modern architectures
      • TLB entry: ASID, page #, frame #, valid and write-protect bits

TLB effect

  • Without TLB

    • Average number of physical memory references per virtual reference

      = 2

  • With TLB (assume 99% hit ratio)

    • Average number of physical memory references per virtual reference

      = 0.99 * 1 + 0.01 * 2

      = 1.01

Demand Paging/Segmentation

  • With VM, only parts of the program image need to be resident in memory for execution
  • Can transfer presently unused pages/segments to disk
  • Reload non-resident pages/segment on demand.
    • Reload is triggered by a page or segment fault
    • Faulting process is blocked and another scheduled
    • When page/segment is resident, faulting process is restarted
    • May require freeing up memory first
      • Replace current resident page/segment
      • How do we determine the replacement “victim”?
    • If victim is unmodified (“clean”) can simply discard it
      • This is reason for maintaining a “dirty” bit in the PT

Why does demand paging/segmentation work?

  • Program executes at full speed only when accessing the resident set.
  • TLB misses introduce delays of several microseconds
  • Page/segment faults introduce delays of several milliseconds

Answer

  • Less physical memory required per process
    • Can fit more processes in memory
    • Improved chance of finding a runnable one
  • Principle of locality

Principle of Locality

  • An important observation comes from empirical studies of the properties of programs.

    • Programs tend to reuse data and instructions they have used recently.
    • 90/10 rule – “A program spends 90% of its time in 10% of its code”
  • We can exploit this locality of references

  • An implication of locality is that we can reasonably predict what instructions and data a program will use in the near future based on its accesses in the recent past

  • Two different types of locality have been observed:

    • Temporal locality: states that recently accessed items are likely to be accessed in the near future
    • Spatial locality: says that items whose addresses are near one another tend to be referenced close together in time

Working Set

The pages/segments required by an application in a time window ($\Delta$) is called its memory working set.

The working set is an approximation of a program’s locality

  • if $\Delta$ too small will not encompass entire locality
  • if $\Delta$ too large will encompass several localities
  • if $\Delta = \infty \Rightarrow$ will encompass entire program.
  • $\Delta$’s size is an application specific tradeoff

System should keep resident at least a process’s working set – Process executes while it remains in its working set

Working set tends to change gradually

  • Get only a few page/segment faults during a time window
  • Possible (but hard) to make intelligent guesses about which pieces will be needed in the future – May be able to pre-fetch page/segments

Thrashing

CPU utilisation tends to increase with the degree of multiprogramming – number of processes in system

Higher degrees of multiprogramming – less memory available per process

Some process’s working sets may no longer fit in RAM – Implies an increasing page fault rate

Eventually many processes have insufficient memory

  • Can’t always find a runnable process
  • Decreasing CPU utilisation
  • System become I/O limited

This is called thrashing

$\sum$ working set sizes $>$ total physical memory size

Recovery From Thrashing

In the presence of increasing page fault frequency and decreasing CPU utilisation

  • Suspend a few processes to reduce degree of multiprogramming
  • Resident pages of suspended processes will migrate to backing store
  • More physical memory becomes available
  • Resume suspended processes later when memory pressure eases

VM Management Policies

Operation and performance of VM system is dependent on a number of policies:

  • Page table format (may be dictated by hardware)
    • Multi-level
    • Inverted/Hashed
  • Page size (may be dictated by hardware)
  • Fetch Policy
  • Replacement policy
  • Resident set size
    • Minimum allocation
    • Local versus global allocation
  • Page cleaning policy

Page Size

Increasing page size

Increases internal fragmentation –  reduces adaptability to working set size

Decreases number of pages – Reduces size of page tables

Increases TLB coverage – Reduces number of TLB misses

Increases page fault latency – Need to read more from disk before restarting process

Increases swapping I/O throughput – Small I/Os are dominated by seek/rotation delays

Optimal page size is a (work-load dependent) trade-off

Multiple page sizes provide flexibility to optimise the use of the TLB

  • Large page sizes can be used for code
  • Small page sizes for thread stacks

Most operating systems support only a single page size

  • Dealing with multiple page sizes is hard!

Fetch Policy

Determines when a page should be brought into memory

Demand paging only loads pages in response to page faults – Many page faults when a process first starts

Pre-paging brings in more pages than needed at the moment

Improves I/O performance by reading in larger chunks

Wastes I/O bandwidth if pre-fetched pages aren’t used

Especially bad if we eject pages in working set in order to pre-fetch unused pages.

Hard to get right in practice

Replacement Policy

  • The page removed should be the page least likely to be referenced in the near future
  • Most policies attempt to predict the future behaviour on the basis of past behaviour
  • Constraint: locked frames
    • Kernel code
    • Main kernel data structures
    • I/O buffers
    • Performance-critical user-pages (e.g. for DBMS)
  • Frame table has a lock (or pinned) bit

Optimal Replacement policy

Toss the page that won’t be used for the longest time

Impossible to implement

Only useful as a theoretical reference point: the closer a practical algorithm gets to optimal, the better

FIFO Replacement Policy

First-in, first-out: Toss the oldest page

Easy to implement

The age of a page isn’t necessarily related to its usage

Least Recently Used (LRU)

Toss the least recently used page

Assumes that page that has not been referenced for a long time is unlikely to be referenced in the near future

Will work if locality holds

Implementation requires a time stamp to be kept for each page, updated on every reference

Impossible to implement efficiently

Most practical algorithms are approximations of LRU

Clock Page Replacement

Clock policy, also called second chance.

Employs a usage or reference bit in the frame table.

Set to one when page is used

While scanning for a victim, reset the reference bits of the pages passed over

Toss the first page with a zero reference bit
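
A minimal sketch of the clock scan over a frame table; `NFRAMES`, `ref_bit`, and `clock_hand` are illustrative names, and a real kernel would combine this with the rest of the frame/page-table bookkeeping.

```c
#define NFRAMES 64

static int ref_bit[NFRAMES];   /* 1 = frame was referenced since last scan */
static int clock_hand;         /* current position of the "clock hand" */

/* Returns the frame number chosen as the replacement victim. */
int clock_choose_victim(void)
{
    for (;;) {
        if (ref_bit[clock_hand] == 0) {            /* not recently used: victim */
            int victim = clock_hand;
            clock_hand = (clock_hand + 1) % NFRAMES;
            return victim;
        }
        ref_bit[clock_hand] = 0;                   /* give it a second chance */
        clock_hand = (clock_hand + 1) % NFRAMES;   /* advance the hand */
    }
}
```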

Issue

How do we know when a page is referenced?

Use the valid bit in the PTE:

  • When a page is mapped (valid bit set), set the reference bit
  • When resetting the reference bit, invalidate the PTE entry
  • On page fault
    • Turn on valid bit in PTE
    • Turn on reference bit
  • We thus simulate a reference bit in software

Performance

In terms of selecting the most appropriate page to replace, the algorithms rank as follows:

  1. Optimal
  2. LRU
  3. Clock
  4. FIFO

Resident Set Size

How many frames should each process have?

Fixed Allocation

Gives a process a fixed number of pages within which to execute.

Isolates processes’ memory usage from each other

When a page fault occurs, one of the pages of that process must be replaced.

Achieving high utilisation is an issue – Some processes have a high fault rate while others don’t use their full allocation.

Variable Allocation

Number of pages allocated to a process varies over the lifetime of the process

Global Scope

Operating system keeps global list of free frames

Free frame is added to resident set of process when a page fault occurs

If no free frame, replaces one from any process

Easiest to implement

Adopted by many operating systems

Automatic balancing across system

Does not provide guarantees for important activities

Local Scope

Allocate number of page frames to a new process based on

  • Application type
  • Program request
  • Other criteria (priority)

When a page fault occurs, select a page from among the resident set of the process that suffers the page fault

Re-evaluate allocation from time to time!

Cleaning Policy

Observation – Clean pages are much cheaper to replace than dirty pages

Demand cleaning

  • A page is written out only when it has been selected for replacement
  • High latency between the decision to replace and availability of free frame.

Precleaning

  • Pages are written out in batches (in the background, the pagedaemon)
  • Increases likelihood of replacing clean frames
  • Overlap I/O with current activity

Multiprocessor Systems

Amdahl’s law

Given a proportion P of a program that can be made parallel, and the remaining serial portion (1-P), the speedup from using N processors is $\frac{1}{(1-P)+\frac{P}{N}}$
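
As a worked instance of the formula above: with P = 0.9 and N = 10 processors, the speedup is $\frac{1}{0.1 + 0.9/10} = \frac{1}{0.19} \approx 5.3$, well short of the 10× one might naively expect.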

Types of Multiprocessors (MPs)

  • UMA MP (Uniform Memory Access)
    • Access to all memory occurs at the same speed for all processors.
  • NUMA MP (Non-uniform memory access)
    • Access to some parts of memory is faster for some processors than other parts of memory

Bus Based UMA

  • The simplest MP is more than one processor on a single bus connected to memory
    • Bus bandwidth becomes a bottleneck with more than just a few CPUs
  • Each processor has a cache to reduce its need for access to memory
    • Hope is most accesses are to the local cache
    • Bus bandwidth still becomes a bottleneck with many CPUs
    • With only a single shared bus, scalability can be limited by the bus bandwidth of the single bus - Caching only helps so much
    • Alternative bus architectures do exist.
      • They improve bandwidth available
      • Don’t eliminate constraint that bandwidth is limited

Multi-core Processor

Multi-core processors share the same bus interface.

Multiprocessors can

  • Increase computation power beyond that available from a single CPU
  • Share resources such as disk and memory

However

  • Assumes parallelizable workload to be effective
  • Assumes not I/O bound
  • Shared buses (bus bandwidth) limits scalability
    • Can be reduced via hardware design
    • Can be reduced by carefully crafted software behaviour
      • Good cache locality together with limited data sharing where possible

How do we construct an OS for a multiprocessor?

Each CPU has its own OS?

  • Statically allocate physical memory to each CPU
  • Each CPU runs its own independent OS
  • Share peripherals
  • Each CPU (OS) handles its own processes’ system calls

Used in early multiprocessor systems to ‘get them going’

  • Simpler to implement
  • Avoids CPU-based concurrency issues by not sharing
  • Scales – no shared serial sections
  • Modern analogy, virtualisation in the cloud.

Issues

  • Each processor has its own scheduling queue
    • We can have one processor overloaded, and the rest idle
  • Each processor has its own memory partition
    • We can have one processor thrashing while the others have free memory
      • No way to move free memory from one OS to another

Symmetric Multiprocessors (SMP)

  • OS kernel runs on all processors
    • Load and resources are balanced across all processors
      • Including kernel execution

Issue: Real concurrency in the kernel

  • Need carefully applied synchronisation primitives to avoid disaster
  • One alternative: A single mutex that makes the entire kernel a large critical section
    • Only one CPU can be in the kernel at a time
    • The “big lock” becomes a bottleneck when in-kernel processing exceeds what can be done on a single CPU
  • Better alternative: identify largely independent parts of the kernel and make each of them their own critical section
    • Allows more parallelism in the kernel
  • Issue: Difficult task
    • Code is mostly similar to uniprocessor code
    • Hard part is identifying independent parts that don’t interfere with each other
      • Remember all the inter-dependencies between OS subsystems.
    • Lock contention can limit overall system performance

Test-and-Set

Hardware guarantees that the instruction executes atomically on a CPU.
Atomically: As an indivisible unit.
The instruction can not stop half way through

Test-and-Set on SMP

It does not work without some extra hardware support

A solution:

Hardware blocks all other CPUs from accessing the bus during the TSL instruction to prevent memory accesses by any other CPU.

  • TSL has mutually exclusive access to memory for duration of instruction.

Test-and-Set is a busy-wait synchronisation primitive – called a spinlock
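
A minimal spinlock sketch built on an atomic test-and-set. This uses C11’s `atomic_flag` rather than a hardware TSL instruction directly, but it expresses the same idea in portable C.

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_acquire(void)
{
    /* test-and-set returns the old value: keep spinning while it was already set */
    while (atomic_flag_test_and_set(&lock))
        ;   /* busy-wait */
}

void spin_release(void)
{
    atomic_flag_clear(&lock);
}
```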

Issue:

  • Lock contention leads to spinning on the lock
    • Spinning on a lock requires blocking the bus which slows all other CPUs down
      • Independent of whether other CPUs need the lock or not
      • Caching does not help reduce bus contention
        • Either TSL still blocks the bus
        • Or TSL requires exclusive access to an entry in the local cache

Reducing Bus Contention

  • Read before TSL
    • Spin reading the lock variable waiting for it to change
    • When it does, use TSL to acquire the lock (see the sketch after this list)
  • Allows the lock to be shared read-only in all caches until it is released
    • no bus traffic until actual release
  • No race conditions, as acquisition is still with TSL.
  • Test-and-set performs poorly once there are enough CPUs to cause contention for the lock
    • Expected
  • Read before Test and Set performs better
    • Performance less than expected
    • Still significant contention on lock when CPUs notice release and all attempt acquisition
  • Critical section performance degenerates
    • Critical section requires bus traffic to modify shared structure
    • The lock holder competes for the bus with waiting CPUs as they test-and-set, so the lock holder is slowed down
    • Slower lock holder results in more contention
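
A hedged sketch of the “read before TSL” (test-and-test-and-set) variant described above, using a C11 `atomic_int` so the lock word can be read without setting it; the names are illustrative.

```c
#include <stdatomic.h>

static atomic_int ttas_lock = 0;        /* 0 = free, 1 = held */

void ttas_acquire(void)
{
    for (;;) {
        while (atomic_load(&ttas_lock) != 0)
            ;                               /* spin on a plain read: can hit the local cache */
        if (atomic_exchange(&ttas_lock, 1) == 0)
            return;                         /* acquisition is still atomic (the TSL step) */
        /* someone else won the race: go back to local spinning */
    }
}

void ttas_release(void)
{
    atomic_store(&ttas_lock, 0);
}
```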

Cache Consistency

Cache consistency is usually handled by the hardware.

Spinning versus Blocking and Switching

  • Spinning (busy-waiting) on a lock makes no sense on a uniprocessor
    • There was no other running process to release the lock
    • Blocking and (eventually) switching to the lock holder is the only sensible option.
  • On SMP systems, the decision to spin or block is not as clear.
    • The lock is held by another running CPU and will be freed without necessarily switching away from the requestor

Spinning versus Switching

Blocking and switching

  • to another process takes time
    • Save context and restore another
    • Cache contains current process not new process
      • Adjusting the cache working set also takes time
    • TLB is similar to cache
    • Switching back when the lock is free encounters the same again

Spinning wastes CPU time directly

Trade off

If lock is held for less time than the overhead of switching to and back

$\Rightarrow$ It’s more efficient to spin
$\Rightarrow$ Spinlocks expect critical sections to be short
$\Rightarrow$ No waiting for I/O within a spinlock
$\Rightarrow$ No nesting locks within a spinlock

Preemption and Spinlocks

  • Critical sections synchronised via spinlocks are expected to be short
    • Avoid other CPUs wasting cycles spinning
  • What happens if the spinlock holder is preempted at the end of its timeslice?
    • Mutual exclusion is still guaranteed
    • Other CPUs will spin until the holder is scheduled again!!!!!

$\Rightarrow$ Spinlock implementations disable interrupts in addition to acquiring locks to avoid lock-holder preemption

Scheduling

The scheduler decides which process or thread to run next. This decision process is called scheduling.

Application behaviour

Bursts of CPU usage alternate with periods of I/O wait
a) CPU-Bound process

  • Spends most of its time computing
  • Time to completion largely determined by received CPU time

b) I/O-Bound process

  • Spend most of its time waiting for I/O to complete
    • Small bursts of CPU to process I/O and request next I/O
  • Time to completion largely determined by I/O request time
  • We need a mix of CPU-bound and I/O-bound processes to keep both CPU and I/O systems busy
  • Process can go from CPU- to I/O-bound (or vice versa) in different phases of execution

Key Insights

Choosing to run an I/O-bound process delays a CPU-bound process by very little
Choosing to run a CPU-bound process prior to an I/O-bound process delays the next I/O request significantly

  • No overlap of I/O waiting with computation
  • Results in device (disk) not as busy as possible
  • Generally, favour I/O-bound processes over CPU-bound processes
  • Generally, a scheduling decision is required when a process (or thread) can no longer continue, or when an activity results in more than one ready process

Preemptive versus Non-preemptive Scheduling

Non-preemptive

  • Once a thread is in the running state, it continues until it completes, blocks on I/O, or voluntarily yields the CPU
  • A single process can monopolise the entire system

Preemptive Scheduling (responsive system)

  • Current thread can be interrupted by OS and moved to ready state.
  • Usually after a timer interrupt and process has exceeded its maximum run time
  • Can also be as a result of higher priority process that has become ready (after I/O interrupt).
  • Ensures fairer service as single thread can’t monopolise the system
  • Requires a timer interrupt

Categories of Scheduling Algorithms

Batch Systems

  • No users directly waiting, can optimise for overall machine performance

Interactive Systems

  • Users directly waiting for their results, can optimise for users’ perceived performance

Realtime Systems

  • Jobs have deadlines, must schedule such that all jobs (predictably) meet their deadlines

Goals of Scheduling Algorithms

All Algorithms

  • Fairness
    • Give each process a fair share of the CPU
  • Policy Enforcement
    • Whatever policy is chosen, the scheduler should ensure it is carried out
  • Balance/Efficiency
    • Try to keep all parts of the system busy

Interactive Algorithms

  • Minimise response time
    • Response time is the time difference between issuing a command and getting the result
      • E.g. selecting a menu, and getting the result of that selection
    • Response time is important to the user’s perception of the performance of the system.
  • Provide Proportionality
    • Proportionality is the user expectation that short jobs will have a short response time, and long jobs can have a long response time.
    • Generally, favour short jobs

Real-time Algorithms

  • Must meet deadlines
    • Each job/task has a deadline.
    • A missed deadline can result in data loss or catastrophic failure
      • Aircraft control system missed deadline to apply brakes
  • Provide Predictability
    • For some apps, an occasional missed deadline is okay – E.g. DVD decoder
    • Predictable behaviour allows smooth DVD decoding with only rare skips

Interactive scheduling

Round Robin Scheduling

  • Each process is given a timeslice to run in
  • When the timeslice expires, the next process preempts the current process, and runs for its timeslice, and so on
    • The preempted process is placed at the end of the queue
  • Implemented with
    • A ready queue
    • A regular timer interrupt

Pros – Fair, easy to implement
Con – Assumes everybody is equal
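
A toy simulation of round robin, just to show the queue rotation; the burst lengths and the timeslice are made up for illustration.

```c
#include <stdio.h>

#define TIMESLICE 4   /* ms, illustrative */

int main(void)
{
    int remaining[] = {10, 3, 7};            /* remaining CPU burst per process, in ms */
    int n = sizeof remaining / sizeof remaining[0];
    int done = 0, clock = 0;

    while (done < n) {
        for (int p = 0; p < n; p++) {        /* sweep approximates the circular ready queue */
            if (remaining[p] == 0)
                continue;                    /* already finished */
            int run = remaining[p] < TIMESLICE ? remaining[p] : TIMESLICE;
            clock += run;
            remaining[p] -= run;
            printf("t=%2d: ran P%d for %d ms\n", clock, p, run);
            if (remaining[p] == 0) {
                printf("      P%d finished\n", p);
                done++;
            }
        }
    }
    return 0;
}
```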

What should the timeslice be?

Too short

  • Waste a lot of time switching between processes
  • Example: timeslice of 4ms with 1 ms context switch = 20% round robin overhead

Too long

  • System is not responsive
  • Example: timeslice of 100ms – If 10 people hit “enter” key simultaneously, the last guy to run will only see progress after 1 second.
  • Degenerates into FCFS if timeslice longer than burst length

Priorities

  • Each Process (or thread) is associated with a priority
  • Provides basic mechanism to influence a scheduler decision:
    • The scheduler will always choose a thread of higher priority over one of lower priority
  • Priorities can be defined internally or externally
    • Internal: e.g. I/O bound or CPU bound
    • External: e.g. based on importance to the user
  • Usually implemented by multiple priority queues, with round robin on each queue
  • Con
    • Low priorities can starve
      • Need to adapt priorities periodically
        • Based on ageing or execution history

Traditional UNIX Scheduler

Two-level scheduler

  • High-level scheduler schedules processes between memory and disk
  • Low-level scheduler is CPU scheduler
    • Based on a multilevel queue structure with round robin at each level
  • The highest priority (lowest number) is scheduled
  • Priorities are re-calculated once per second, and re-inserted in appropriate queue
  • Avoid starvation of low priority threads
  • Penalise CPU-bound threads
  • Priority = CPU_usage +nice +base
  • CPU_usage = number of clock ticks
  • Nice is a value given to the process by a user to permanently boost or reduce its priority (e.g. reduce the priority of background jobs)
  • Base is a set of hardwired, negative values used to boost priority of I/O bound system activities
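
A tiny illustration of that formula; the struct and field names are made up, and the real scheduler also decays the accumulated CPU usage over time, which this sketch omits.

```c
/* priority = CPU_usage + nice + base (lower value = higher priority) */
struct proc {
    int cpu_ticks;   /* clock ticks charged to this process */
    int nice;        /* user-supplied adjustment */
    int base;        /* hardwired (often negative) kernel boost */
};

static int recalc_priority(const struct proc *p)
{
    return p->cpu_ticks + p->nice + p->base;
}
```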

Single Shared Ready Queue

Pros

  • Simple
  • Automatic load balancing

Cons

  • Lock contention on the ready queue can be a major bottleneck
    • Due to frequent scheduling or many CPUs or both
  • Not all CPUs are equal
    • The last CPU a process ran on is likely to have more related entries in the cache

Affinity Scheduling

Basic Idea – Try hard to run a process on the CPU it ran on last time

One approach: Multiple Queue Multiprocessor Scheduling

Multiple Queue SMP Scheduling

  • Each CPU has its own ready queue
  • Coarse-grained algorithm assigns processes to CPUs
    • Defines their affinity, and roughly balances the load
  • The bottom-level fine-grained scheduler:
    • Is the frequently invoked scheduler (e.g. on blocking on I/O, a lock, or exhausting a timeslice)
    • Runs on each CPU and selects from its own ready queue
    • Ensures affinity
    • If nothing is available from the local ready queue, it runs a process from another CPU’s ready queue rather than going idle
    • Termed “Work stealing”

No lock contention on per-CPU ready queues in the (hopefully) common case
Load balancing to avoid idle queues
Automatic affinity to a single CPU for more cache friendly behaviour

I/O Management

I/O Devices

There exists a large variety of I/O devices:

  • Many of them with different properties
  • They seem to require different interfaces to manipulate and manage them
    • We don’t want a new interface for every device
    • Diverse, but similar interfaces leads to code duplication
  • Challenge:
    • Uniform and efficient approach to I/O

Device Drivers

Drivers classified into similar categories

  • Block devices and character (stream-of-data) devices

OS defines a standard (internal) interface to the different classes of devices

  • Example: USB Human Input Device (HID) class specifications
    • human input devices follow a set of rules making it easier to design a standard interface.

Device driver’s job

  • Translate requests through the device-independent standard interface (open, close, read, write) into appropriate sequences of commands (register manipulations) for the particular hardware
  • Initialise the hardware at boot time, and shut it down cleanly at shutdown

After issuing the command to the device, the device either

  • Completes immediately and the driver simply returns to the caller
  • Or, device must process the request and the driver usually blocks waiting for an I/O complete interrupt.

Drivers must be thread-safe, as they can be called by another process while one process is already blocked in the driver

  • Thread-safe: synchronised…

Device-Independent I/O Software

There is commonality between drivers of similar classes

Divide I/O software into device-dependent and device-independent I/O software

Device independent software includes

  • Buffer or Buffer-cache management
  • TCP/IP stack
  • Managing access to dedicated devices
  • Error reporting

I/O Device Handling

Data rate

  • May be differences of several orders of magnitude between the data transfer rates

Driver $\Leftrightarrow$ Kernel Interface

Major Issue is uniform interfaces to devices and kernel

  • Uniform device interface for kernel code
    • Allows different devices to be used the same way
    • Allows internal changes to device drivers without fear of breaking kernel code
  • Uniform kernel interface for device code
    • Drivers use a defined interface to kernel services (e.g. kmalloc, install IRQ handler, etc.)
    • Allows kernel to evolve without breaking existing drivers
  • Together both uniform interfaces avoid a lot of programming implementing new interfaces
    • Retains compatibility as drivers and kernels change over time.

Interrupts

  • Devices connected to an Interrupt Controller via lines on an I/O bus (e.g. PCI)
  • Interrupt Controller signals interrupt to CPU and is eventually acknowledged.
  • Exact details are architecture specific.

I/O Interaction

Programmed I/O

  • Also called polling, or busy waiting
  • I/O module (controller) performs the action, not the processor
  • Sets appropriate bits in the I/O status register
  • No interrupts occur
  • Processor checks status until operation is complete
    • Wastes CPU cycles

Interrupt-Driven I/O

  • Processor is interrupted when I/O module (controller) ready to exchange data
  • Processor is free to do other work
  • No needless waiting
  • Consumes a lot of processor time because every word read or written passes through the processor

Direct Memory Access (DMA)

  • Transfers data directly between Memory and Device
  • CPU not needed for copying
  • Transfers a block of data directly to or from memory
  • An interrupt is sent when the task is complete
  • The processor is only involved at the beginning and end of the transfer


DMA Considerations

✅ Reduces number of interrupts

  • Fewer (expensive) context switches or kernel entry-exits

❌ Requires contiguous regions (buffers)

  • Copying
  • Some hardware supports “Scatter-gather”

Synchronous/Asynchronous

Shared bus must be arbitrated (hardware)

– CPU caches reduce (but do not eliminate) the CPU’s need for the bus

I/O Management Software

I/O Software Layers


Interrupt Handlers

  • Interrupt handlers
    • Can execute at (almost) any time
      • Raise (complex) concurrency issues in the kernel
      • Can propagate to userspace (signals, upcalls), causing similar issues
      • Generally structured so I/O operations block until interrupts notify them of completion (see kern/dev/lamebus/lhd.c)

Interrupt Handler Steps

  • Save Registers not already saved by hardware interrupt mechanism
  • (Optionally) set up context for interrupt service procedure
  • Typically, handler runs in the context of the currently running process
    • No expensive context switch
  • Set up stack for interrupt service procedure
    • Handler usually runs on the kernel stack of current process
    • Or “nests” if already in kernel mode running on kernel stack
  • Ack/Mask interrupt controller, re-enable other interrupts
    • Implies potential for interrupt nesting.
  • Run interrupt service procedure
    • Acknowledges interrupt at device level
    • Figures out what caused the interrupt
      • Received a network packet, disk read finished, UART transmit queue empty
    • If needed, it signals blocked device driver
  • In some cases, will have woken up a higher priority blocked thread
    • Choose newly woken thread to schedule next.
    • Set up MMU context for process to run next
    • What if we are nested?
  • Load new/original process’ registers
  • Re-enable interrupt; Start running the new process

Sleeping in Interrupts

  • An interrupt generally has no context of its own (it runs on the current kernel stack)

    • Unfair to sleep on interrupted process (deadlock possible)
    • Where to get context for long running operation?
    • What goes into the ready queue?
  • What to do?

    • Top and Bottom Half
    • Linux implements with tasklets and workqueues
    • Generically, in-kernel thread(s) handle long running kernel operations.

Top Half / Bottom Half


  • Top Half
    • Interrupt handler
    • remains short
  • Bottom half
    • Is preemptable by top half (interrupts)
    • performs deferred work (e.g. IP stack processing)
    • Is checked prior to every kernel exit
    • signals blocked processes/threads to continue
  • Enables low interrupt latency
  • Bottom half can’t block

Deferring Work on In-kernel Threads

  • Interrupt
    • handler defers work onto in-kernel thread
  • In-kernel thread handles deferred work (DW)
    • Scheduled normally
    • Can block
  • Both low interrupt latency and blocking operations


Buffering

No Buffering

  • Process must read/write a device a byte/word at a time
    • Each individual system call adds significant overhead
    • Process must wait until each I/O is complete
      • Blocking/interrupt/waking adds to overhead.
      • Many short runs of a process is inefficient (poor CPU cache temporal locality)

User-level Buffering

  • Process specifies a memory buffer that incoming data is placed in until it fills
    • Filling can be done by interrupt service routine
    • Only a single system call, and block/wakeup per data buffer
      • Much more efficient
  • Issues
    • What happens if buffer is paged out to disk
      • Could lose data while unavailable buffer is paged in
      • Could lock buffer in memory (needed for DMA), however many processes doing I/O reduce RAM available for paging. Can cause deadlock as RAM is limited resource
    • Consider write case
      • When is buffer available for re-use?
        • Either the process must block until the potentially slow device drains the buffer
        • or deal with asynchronous signals indicating buffer drained

Single Buffer

  • Operating system assigns a buffer in kernel’s memory for an I/O request

  • In a stream-oriented scenario

    • Used a line at a time
    • User input from a terminal is one line at a time with carriage return signaling the end of the line
    • Output to the terminal is one line at a time
  • Block-oriented

    • Input transfers made to buffer
    • Block copied to user space when needed
    • Another block is written into the buffer
      • Read ahead
  • User process can process one block of data while next block is read in

  • Swapping can occur since input is taking place in system memory, not user memory

  • Operating system keeps track of assignment of system buffers to user processes

  • Assume

    • T is transfer time for a block from device
    • C is computation time to process incoming block
    • M is time to copy kernel buffer to user buffer
  • Computation and transfer can be done in parallel

  • Speed up with buffering


  • What happens if the kernel buffer is full and more data arrives?

    • e.g. the user buffer is swapped out, or
    • the application is slow to process the previous buffer
    • $\Rightarrow$ We start to lose characters or drop network packets

Double Buffer

  • Use two system buffers instead of one
  • A process can transfer data to or from one buffer while the operating system empties or fills the other buffer

Double Buffer Speed Up

  • Computation and Memory copy can be done in parallel with transfer

  • Speed up with double buffering

  • Usually M is much less than T giving a favourable result


  • May be insufficient for really bursty traffic

    • Lots of application writes between long periods of computation
    • Long periods of application computation while receiving data
    • Might want to read-ahead more than a single block for disk
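
The speed-up figures omitted above are commonly expressed as follows (a hedged reconstruction based on the standard textbook treatment, not copied from these slides): without buffering each block costs roughly $T + C$; with a single system buffer roughly $\max(C, T) + M$, since the next transfer overlaps with computation, giving a speed-up of about $\frac{T+C}{\max(C,T)+M}$; with double buffering roughly $\max(C + M, T)$, so when $M \ll T$ the speed-up approaches $\frac{T+C}{T}$.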

Circular Buffer

  • More than two buffers are used
  • Each individual buffer is one unit in a circular buffer
  • Used when I/O operation must keep up with process

Notice that buffering, double buffering, and circular buffering are all Bounded-Buffer Producer-Consumer Problems
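
A minimal, single-threaded sketch of the circular (bounded) buffer itself; a kernel version would add locking plus sleep/wakeup (or condition variables) to handle the producer-consumer synchronisation. Sizes and names are illustrative.

```c
#include <stddef.h>
#include <stdbool.h>

#define BUF_SLOTS 8

static char   buf[BUF_SLOTS];
static size_t in, out, count;      /* next free slot, next full slot, #items */

bool buf_put(char c)               /* producer side */
{
    if (count == BUF_SLOTS)
        return false;              /* full: producer must wait (or data is dropped) */
    buf[in] = c;
    in = (in + 1) % BUF_SLOTS;     /* wrap around: "circular" */
    count++;
    return true;
}

bool buf_get(char *c)              /* consumer side */
{
    if (count == 0)
        return false;              /* empty: consumer must wait */
    *c = buf[out];
    out = (out + 1) % BUF_SLOTS;
    count--;
    return true;
}
```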

Reference:
UNSW COMP3231 2020 lecture slides