Operating System Notes
This is the aggregation of all my notes from the University of Washington's CSE 451 Operating Systems course.
Week 1
Physical memory
- Slow compared to CPU caches (pull memory into caches)
- “Byte addressable”
Storage devices
- Persistent, slower, non-volatile
- HDD vs SSD
- Sector vs page
Other devices
- Network interface card (NIC)
- Input
- Mouse, keyboard
- Output
- Headphones, speakers
Typical abstractions
- CPU
- Process
- DRAM
- Virtual memory
- Storage devices
- Files, directories (FileSystem)
- Network
- Network stack, TCP/IP protocols
Why provide abstractions?
OS Roles
Illusionist, Glue, Referee
Ease of use (illusionist)
- Provides simplified services to user
- Storage → FileSystem: creates named files, folders to organize them
- DRAM (shared by all processes) → Virtual memory (all memory appears to be yours, no matter where you are placed in physical memory)
- Mask hardware limitations
- FileSystem: hides device block sizes behind byte-level access
- Virtual memory
Common interface (glue)
- allows processes to share files and other services
Managed access (layout and permissions) (referee)
- Resource management (process uses registers, memory)
- Schedule processes into a single CPU
- Sharing of physical memory
- Isolation
- Process can’t read another’s memory
- Can’t read OS memory
- Managed/Controlled sharing
- Shared memory (shm call)
Motivation
- OS manages and abstracts hardware resources
- Three benefits of abstraction:
- illusionist
- common interface
- managed access
- Must prevent processes from violating this management
- OS could monitor every user instruction and memory access
- But how can every instruction be monitored? Does every memory access need to be monitored?
- Efficiency: monitoring in software is too slow, so we rely on hardware support (protection rings)
Protection Rings
- Supported by hardware
- Privilege level shown by lower 2 bits of the cs register
- Ring 0 (kernel mode)
- Kernel code executes
- Access to privileged instructions
- Halting the processor, updating virtual memory mappings, I/O (moving data to/from I/O ports), disabling interrupts
- Access to all mapped virtual memory
- Ring 1 & 2
- Where device drivers execute
- Can only access non-privileged instructions & some I/O sensitive instructions
- Access to all mapped virtual memory
- Ring 3 (user mode)
- Where the user code executes
- Can only access non-privileged instructions
- add, mov, call, ret
- Can only access virtual memory mapped with user permission
Types of Mode Transfer
- Exception (synchronous)
- A problem caused by current instruction
- Divide by 0
- Null pointer access
- Segfault
- Execute privileged instruction
- Process may be terminated
- Process may resume on the faulting instruction if the kernel can handle the exception
- System call (synchronous)
- Process requests kernel services (open, close, read, write)
- syscall, sysret: the instruction to enter/exit the kernel for a system call
- A process resumes at the next instruction upon a syscall return
- Interrupt (asynchronous)
- Hardware events/notifications, often time sensitive
- I/O completion (disk write complete, packet arrived), timer interrupt, keyboard input
- Unrelated to the current instruction
- Interrupt is checked and handled before the current instruction is executed
- Resumes on the current instruction since it was interrupted
- Who executes the kernel code when a mode switch occurs?
- The currently running process is what executes the kernel code
- Once inside the kernel, a process can only execute kernel code
- Upon a user → kernel mode switch, %rip is set to kernel code by hardware
PC - Program Counter
Types of mode transfers
- Interrupts
- Issued by hardware (timely hardware event) high priority
- Also called HW Interrupts / external interrupts
- One at a time, End of Interrupt (EOI) after handling
- Preempt exception & syscall handling
- Example: disk finished writing, report back to kernel of completion
- Exceptions
- Problem caused by the current instruction
- Exception behavior varies (trap 13, trap 14 are common, invalid memory access)
- What if an exception occurs inside your interrupt handler? (nested trap)
- Syscalls
- Requested by the user; historically a software interrupt (INT #n), now the dedicated syscall instruction
Mode Transfer Mechanisms
- When entering kernel, hardware overwrites %rip to kernel handler address
- We must save the process's user %rip so user code can resume after kernel execution completes
- Must save user’s registers
- Execution needs a stack
- User stack is available
- Who has access to this stack? The user process itself
- If the kernel uses the stack, what information might be on the stack?
- This might lead to granting other processes visibility of sensitive kernel information
- Example: Kernel performs decryption, stores result onto user stack, other process can view data in stack
- Example: When kernel procedure finishes executing and calls a return address, another thread can overwrite the return address to elsewhere (arbitrary code)
- To fix this, we allocate a separate kernel stack (the user has no permissions to it)
- Stack switch occurs on mode switch (carried out by hardware)
- How many kernel stacks are there?
- One stack per process (thread) (because they each do their own task)
- Which handler to load in %rip?
- Interrupt vector table / x86 Interrupt descriptor table
- A fixed-size array of handler entries, indexed by the interrupt/exception vector number
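A minimal sketch of the dispatch idea in C, using a simplified table of function pointers (the real x86 IDT stores descriptors that the hardware indexes directly; names here are made up):
#define NUM_VECTORS 256

struct trapframe;                                   // saved user state
void (*handlers[NUM_VECTORS])(struct trapframe *);  // one handler entry per vector

void dispatch(int vector, struct trapframe *tf) {
    handlers[vector](tf);   // e.g. vector 14 → page fault handler, 13 → GP fault
}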
Week 2
Mode Transfer control flow
- Switch into kernel mode, hw saves some registers, switch to kernel stack, push saved states onto the kernel stack → kernel code (IDT) handler pushes rest of the process’s states → execute handler logic → restore the states → switch into user mode, hw restores rsp, rip
- 256 vector functions, handlers in IDT entries
- Vector handler stubs jump into trapasm.S
- trapasm.S pushes registers into the trap frame to prepare for entering the kernel
- %rsp points to location that stores %rax
- Then we call trap(), the unified interrupt handler that contains the kernel's C logic
- Trap frame definition
- Includes all the registers that are pushed by the kernel
- rax, rbx, rcx, r8-r15 … (saved by the kernel)
- err, rip, cs, rflags (saved by the hardware)
- When we return from kernel back into user mode, we restore the trap frame values back into the corresponding registers ensuring users have the exact same view before entering the kernel
- System calls (syscall) include arguments
- x86 rdi, rsi…
- When we enter into the kernel, reading trapframe for arguments
- Come in many different types, void*, integers, addresses
- Kernel must validate that these addresses are valid - user might pass in kernel memory address (kernel must prevent this from occurring)
- Kernel performs validation, OS typically makes a copy of the arguments
- String is null terminated, kernel must go through the entire string
- If we validate the size of a string, but then allow other processes to manipulate the original memory location then we get corruption and security issues
- xk processes are single threaded; an OS that supports multithreading must guard against this race or it opens security vulnerabilities
- Return value (rax) set within the trap frame, allowing user to see when resuming execution
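A rough sketch of this dispatch flow (my own, not the real xk code; SYS_write, is_user_address, and sys_write are made-up names):
// Reads the syscall number and arguments out of the trap frame, validates a
// user pointer, and writes the return value back into the saved %rax so the
// user sees it when execution resumes.
long syscall_dispatch(struct trapframe *tf) {
    switch (tf->rax) {                       // syscall number
    case SYS_write: {
        int   fd  = (int)tf->rdi;            // x86-64 argument registers
        void *buf = (void *)tf->rsi;
        long  len = (long)tf->rdx;
        if (!is_user_address(buf, len))      // reject kernel addresses
            return -1;
        return sys_write(fd, buf, len);
    }
    default:
        return -1;                           // unknown syscall
    }
}
// Caller: tf->rax = syscall_dispatch(tf);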
Processes
- A process is a running instance of a program
- Execution stream
- rip, stack, registers used by that process
- virtual memory (address space)
- Process metadata, management data tracked by the OS, known as the PCB (process control block)
- In Lab 1, we must add syscall support for each process, requiring tracking of additional metadata per process
- struct proc stores this information
- pid, process kernel stack, OS abstractions (file descriptor table)
- Boundary for isolation and protection
- A process only has visibility into its own state, unless permitted by the kernel
- Kernel memory at the top of the address space, mapped with supervisor (kernel-only) permission
- Mapping of virtual memory to physical memory is done by the kernel through a translation (page) table
- All kernel data is within kernel memory; this is a design choice made by modern OSes (security vulnerability in 2018, Meltdown)
- By mapping the kernel and user memory in the same virtual address space, it makes it easier for kernel to access user memory
- Assume kernel memory is stored elsewhere, kernel must work harder to fetch
- Separate translation tables for process kernel memory and user memory comes with a nontrivial cost of switching
- Some 4-5x performance overhead
- TLB (translation lookaside buffer): a cache of memory translation results
- Modern OS don’t map entire kernel memory into user memory, only part of kernel memory to enter the kernel itself
Process Implementation
- ELF loading (executable and linkable format)
- How to set up virtual address space (VAS) of a process
- Entry point of that process
- Stores program headers
- File offset and size: which bytes of the file to load
- Virtual address
- Size of virtual memory region
- Code section, initial read-only data in VAS
- Initial stack page is set up by the OS
- When we first start executing, let’s say we have an int main(int argc, char** argv) {}
- We store char** argv data on the initial stack frame, and the pointer argv itself on the initial stack frame
- rsp points to this location
- There is a size for how much of the file we should read, and how big the VAS is
- PCB
- pid, kernel stack, initialize file descriptor table (any data structure), virtual memory data
- Kernel responsible for setting up trapframe to the starting state of the process
- rip, entry point, rsp in trap frame
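A rough sketch of that last step (my own; field and constant names are illustrative, not the exact xk definitions):
#include <stdint.h>

// After ELF loading, fill in a fresh trap frame so that the first
// "return to user mode" starts the program at its entry point.
void setup_initial_trapframe(struct trapframe *tf,
                             uint64_t elf_entry, uint64_t user_stack_top) {
    tf->rip    = elf_entry;        // first user instruction (ELF entry point)
    tf->rsp    = user_stack_top;   // initial user stack (argc/argv already laid out)
    tf->cs     = UCODE_SEL | 0x3;  // user code segment, ring 3
    tf->ss     = UDATA_SEL | 0x3;  // user data/stack segment, ring 3
    tf->rflags = FLAGS_IF;         // interrupts enabled once in user mode
}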
How processes share a CPU (core)
- Scheduling: OS policy on who should run on the CPU
- Let processes take turns using the CPU (not all programs process to completion)
- Round Robin (10-100ms)
- Fixed time slice (how long each turn is)
- FIFO order through all processes
- Context switching takes place in kernel mode
- Mode switch into kernel first, then context switch
Process creation transition
- Elf loading
- Setting virtual address space (VAS)
- Initialize process control block (PCB)
Ready State (RUNNABLE)
- Ready to get scheduled on the CPU
- OS keeps track of ready queue
- Gets dispatched (scheduled)
Running state
- Single core, single running state
- Gets terminated (ZOMBIE) (will never go back into ready queue)
- Timeout, preempt, time slice expires (time slice/time quantum)
Blocked State (waiting)
- Running enters a blocked state
- Do not schedule
- Examples: IO Request, timer (sleeptimer)
- We can only be blocked by one event at a time, single execution stream
- Unblocking, event has taken place
Grocery Line Example
- Checkout is execution
- Standing in line is process ready queue
- Kernel must track waiting event
- procstate is the scheduling state (runnable, waiting, zombie)
- chan (channel) tracks the waiting event: if the process is waiting, chan is a pointer identifying that event
- xk: sleep(*chan) → block on this event
- wakeup(*chan) → unblock procs/threads waiting on chan
- A process may block before its time quantum is used up (the quantum is larger than the time the process actually runs)
- Example: Time Q = 10ms, program exec = 1ms, blocking = 9ms
- CPU is a scarce resource, so when a process enters a blocked state, the scheduler maximizes CPU utilization by scheduling another process
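A small sketch of the sleep/wakeup pattern in xk style (the disk buffer, lock, and helper names are illustrative, not the real ones):
// A thread blocks on a "channel" (any kernel address identifying the event)
// and is moved back to RUNNABLE when some other thread calls wakeup on it.
void wait_for_disk(struct buf *b) {
    acquire(&disk_lock);
    while (!b->done)            // recheck: a wakeup does not guarantee the condition
        sleep(b, &disk_lock);   // block on channel b; the lock is released while asleep
    release(&disk_lock);
}

void disk_interrupt(struct buf *b) {
    acquire(&disk_lock);
    b->done = 1;
    wakeup(b);                  // unblock all threads sleeping on channel b
    release(&disk_lock);
}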
Process APIs
- fork()
- Creates a new process, a child process, that’s identical to the parent at the time of fork
- No params
- Two different/separate processes, different VAS, different physical memory
- Separate memory translation table (page table)
- Child inherits OS resources from parent (open files)
- Child will have its own fd array (copied from parent)
- Different kernel stack, different processes enter the kernel at different times
- Child will not execute code before the fork() call from parent
- %rax will hold different return values for parent and child, >0 and 0 respectively
- exec() (system call)
- Loads a program into an existing process
- The process keeps the same PCB/pid, but its VAS contents are replaced: new code, new data, new stack, new entry point
- Does not create a new process
- The argument specifies which new program the process should execute
- Example:
pid = fork();
if (pid == 0) {
    fd = open("output.txt", O_WRONLY | O_CREAT);  // new file for redirected output
    close(1);         // close stdout (fd 1)
    dup(fd);          // dup picks the lowest free fd, so fd 1 now refers to output.txt
    exec("ls", argv); // the child's stdout is now redirected into output.txt
}
// Redirection logic is written under the child case (pid == 0)
// Fork gives you the opportunity to configure a bunch of things
// without the knowledge of another program being involved
- Fork & Exec combo
- We spent time copying parent’s memory content, set up a translation table
- Set up process’s new VAS, resource-intensive
- Copy on write (COW)
- Share the same physical memory for as long as possible (while only servicing reads)
- The moment a write occurs, we need to make a copy
- For both of the above, HW detects the memory permission violation
- Mark shared memory as read-only, even if the parent's VAS memory was writable
- The moment a write occurs → page fault, detected by hardware (the kernel must determine whether it is copy-on-write or a real permission violation)
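A rough sketch of that COW fault path (my own; lookup_page, alloc_frame, map_page, etc. are made-up helpers, not the actual xk API):
void handle_write_fault(struct proc *p, uint64_t fault_addr) {
    struct vpage *vp = lookup_page(p, fault_addr);
    if (vp == NULL || !vp->is_cow) {     // real permission violation
        kill_process(p);
        return;
    }
    void *copy = alloc_frame();          // private frame for this process
    memcpy(copy, vp->frame, PAGE_SIZE);
    map_page(p, fault_addr, copy, PERM_USER | PERM_READ | PERM_WRITE); // restore write perm
    decref_frame(vp->frame);             // old frame is freed once no sharer remains
    flush_tlb_entry(fault_addr);         // permissions changed, so flush the stale entry
    vp->frame = copy;
    vp->is_cow = 0;
}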
Process APIs
fork, exec, cow fork (performance optimization)
- fork inherits OS resources
- Windows uses spawn() instead of fork and exec
- Combination of both
- Linux additionally provides clone()
- Creates a new process
- Takes several params to pass which resources it requires
exit()
- How a process voluntarily terminates
- Recall when a new process is initialized, we set up the PCB, VAS, KStack, open files
- All open files are closed upon exit()
- Exit handler lives in kernel mode; if it tried to clear everything itself it would lose the kernel memory it is running on, so a process cannot naively free its own memory as it exits
- It can definitely free the VAS excluding the kernel memory
OS is responsible for reclaiming these resources after termination
- One option to free the process memory is to make the next process clear it, or let the parent free the memory
- Another option is to switch to kernel only VAS
- Kernel stack must still be freed by parent (or some other process)
wait() (xk = waitpid(-1)) / waitpid() (unix)
- Allows calling process to wait until a child has exited
- Example: Parent waits for child process to finish execution until it itself will resume executing
- Kernel needs to track the parent/child relationship for each process
- Parent pointer being stored in PCB
- Potentially blocking operation (blocking syscall), cannot return until condition is true
- No longer running on CPU, in blocked state
- When child has called exit, process must be able to unblock
- Using sleep() and wakeup() to implement wait()
- Initial process: when a parent exits, make the init process adopt its children
- Init proc is responsible for cleaning up abandoned children
- Alternative strategy instead of making the next process clean up the children's memory
- Better to just make the parent responsible for cleaning up the children, with init stepping in for children the parent never waits on
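A small user-level example of the fork/wait relationship described above (standard POSIX calls, not xk-specific):
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {                  // child
        printf("child running\n");
        _exit(0);                    // exit: kernel/parent reclaim the resources
    }
    int status;
    waitpid(pid, &status, 0);        // parent blocks until this child has exited
    printf("child %d finished\n", (int)pid);
    return 0;
}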
Scheduling
- Policy (OS level) that decides who to run next on the CPU
- Processes and threads run on the CPU
- Task based evaluation rather than policy based
- Metrics
- Latency (turnaround time)
- User perceived time of how long the task took (start at arrival includes wait time & run time)
- Throughput
- Rate of task completion, how many tasks to be completed in a certain amount of time
- Fairness & Starvation
- Fairness is similar time on CPU & similar time waiting
- Starvation is asking: does the policy cause a process to wait forever
- Scheduling overhead
- Cost of doing scheduling itself (includes time of run scheduling policy & context switching time)
- Scheduling policies
- FIFO
- Run each task to completion (or until a process voluntarily yields) in FIFO order, kernel never stops process from running (preemption)
- No starvation, low overhead, minimal task switching (scheduling overhead)
- Performance is highly dependent on the arrival order
- A: 10ms B: 20ms C: 100ms
- [A, B, C] - scheduled as such
- Latency A = 10, Latency B = 30, Latency C = 130
- Total Latency = 170ms
- [C, B, A] - scheduled opposite
- Latency C = 100, Latency B = 120, Latency A = 130
- Total Latency = 350ms
- Preemptive Shortest Job First
- OS might preempt processes
- Attempts to minimize average latency, avoid having shorter task wait behind longer task
- Always schedule the job with shortest CPU time
- If a new task arrives and has a shorter time, then the new task will preempt the current running task, and be the new scheduled task (runs)
- Arrival order matters less; shorter tasks preempt and overtake longer-latency processes
- More switching is involved during preemption
- But it does produce optimal average latency, and has good throughput
- Not a fair policy, and can lead to starvation (if shorter jobs come in infinitely)
- What happens when task blocks?
- When the task blocks, it goes into a blocked state, removed from the processing queue
- But when the task unblocks and reenters the processing queue, it might preempt the current running task
- Round Robin
- FIFO with a time quantum (10ms - 100ms)
- Preemptive scheduling policy
- More realistic than PSJF, does not require knowing how long a task will take to process
- Example:
- [C, B, A], time slice = 10ms (task lengths from above: A = 10ms, B = 20ms, C = 100ms)
- Latency A = 30ms, Latency B = 50ms, Latency C = 130ms
- Total latency = 210ms
- Scheduling overhead goes up because every 10ms we switch tasks
- A compromise over the naive FIFO ordering
- Relatively fair, throughput is acceptable
Week 3
Scheduling methods
- Round Robin
- FIFO w/ time quantum, preemptive
- Unfair to jobs not using the full time quantum
- Must balance IO bound and CPU bound jobs
- Option 1:
- Jobs with less time on CPU get scheduled first
- Completely fair scheduling (CFS)
- Time ordered red black tree
- Option 2: Different time quantums
- Different levels of queues, different time quantums
- Multi-level feedback queue (MLFQ)
- Short: less wait time for IO bound (interactive) jobs
- High priority, will work its way down
- Long: less interruption for longer jobs (less scheduling overhead)
- Upon init, all new tasks are short. We don’t know what tasks will take what amount of time.
- If a task blocks, the time slice is larger than what it actually needs (execution time < time quantum), ensuring short tasks get highest priority and lowest total latency
- During execution, if a task blocks, it will stay at the same execution level on the back of the queue
- Else if a task completes its time quantum / the time slice is used entirely (execution time = time quantum), the task is moved down a queue (needs longer time on CPU without re-scheduling overhead)
- There is a concern for lower queued tasks being starved of CPU by alternating tasks in the highest queues
- Potential priority boost
- Resource accounting, maximum time limit for each queue (if you have spent over X amount of time, move the program down)
- This prevents exploitation from user processes
- Even if a lower-priority task is currently running, a higher-priority task will preempt it
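A rough sketch of the demotion rule above (my own; the queue structures and names are illustrative):
// Tasks start at the highest-priority (shortest-quantum) level.
// Using the whole time slice demotes a task; blocking early keeps it
// at the same level. A periodic priority boost (not shown) fights starvation.
void on_timer_tick(struct task *t) {
    t->ticks_used++;
    if (t->ticks_used >= quantum[t->level]) {   // used the entire slice
        if (t->level + 1 < NUM_LEVELS)
            t->level++;                         // demote: it wants long CPU bursts
        t->ticks_used = 0;
        enqueue(&queues[t->level], t);
        schedule_next();                        // pick from the highest non-empty queue
    }
}

void on_block(struct task *t) {
    t->ticks_used = 0;    // gave up the CPU early: stays at the same level
    schedule_next();
}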
Threads
- A unit of task/execution
- Multithreaded program → a program that logically divides into independent tasks
- Parallelism: things executing at the same time (not possible for xk, with only one core)
- Concurrency: even with one core, still switches between tasks, each making progress
- Kernel is concurrent
Process
- An address space + OS resources (open files) + 1 or more threads
Thread
- Shares access to entire address space, code, heap, data
- But each has their own stack (execution stream) in user space
- Has its own pc (program counter), sp (stack pointer), registers
- Threads are managed by the kernel
- Managed by thread control block (TCB)
- Switching between tasks is context switching
- Threads are the actual execution, context switching is switching thread states
- Save thread execution state
- Run scheduling policy
- Switch to another thread
- Restore states from the thread to run (each kernel stack of each thread)
- xk: old thread → scheduler thread → next thread (if it belongs to the same process, no need to switch address space, otherwise need to reload memory translation table and flush TLB)
- Threads share the same VAS, but have different stacks (sections of VAS, could be in heap or elsewhere)
- Threads are cheaper to create compared to a process
- Each thread just needs its own stack that accesses the VAS, and kernel TCB
- New process requires a whole new address space and set up a new translation table, PCB, etc.
- If we have a number of tasks to perform within the process, instead of forking a new process to perform those tasks, the cost is much smaller if we just created another thread to perform the task
Creating threads, pthread API
- Not like fork (per-process); pthread_create(&tid, attr, thread_function, args)
- thread_function: the starting point of the new thread (unlike fork, which resumes at the parent's ip/sp)
- Returns 0 on success and writes the new thread's tid through the pointer
- pthread_join(tid); // waitpid
- pthread_exit();
- pthread_detach(); // can’t be waited on, cannot join it, resources cleaned up upon exit
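A small example of the API above (POSIX threads):
#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    long id = (long)arg;               // threads share the whole address space
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid;
    // returns 0 on success; the new thread's id is written through &tid
    pthread_create(&tid, NULL, worker, (void *)1);
    pthread_join(tid, NULL);           // analogous to waitpid for processes
    return 0;
}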
Threads
- Results from threads differ based on scheduling order (interleaving) → race condition if we run without protections
- We might observe undefined behavior due to race conditions
- Thread reasoning is very difficult without the support of thread coordination (synchronization)
- Synchronization primitive
Locks
- A synchronization primitive that guarantees exclusive access to a designated section of code
- Exclusive access (mutual exclusion)
- Designated section of code (critical section)
- APIs
- lock_acquire(*lock_pointer): does not return until we actually acquire the lock
- If multiple threads attempt to grab the lock, none will return until success
- lock_release(): how we release the lock, available for other threads to acquire
- Safety contract: only one thread in the critical section at a time
- Progress (liveness) property: a thread can enter the critical section if no one else is in it; someone must be able to acquire the lock eventually
- Bounded wait (fairness): upper bound to how long a thread has to wait before entering
- In practice, lk_acquire to access shared state, then lk_release
Lock Implementation
struct lk { // this struct itself is a shared state
    int locked;  // boolean
    int holder;  // debugging: who holds the lock
    char *name;  // debugging: id/name
};
// What protects lock_acquire from other threads attempting
// to lock simultaneously?
void lock_acquire(struct lk *lk) {
    while (lk->locked) {
        continue;       // spin until the lock looks free
    }
    lk->locked = 1;     // Lock the lock, how we acquire it
    return;
}
// Alternate implementation (same race)
void lock_acquire(struct lk *lk) {
    if (!lk->locked) {
        lk->locked = 1; // Lock the lock
        return;
    } else {
        while (lk->locked) {
            continue;
        }
        // would still need to set locked = 1 here, same race as above
    }
}
void lock_release(struct lk *lk) {
    lk->locked = 0;     // mark the lock as free again
}
Problem: multiple threads can attempt to lock simultaneously
Solution: requires hardware support for atomically performing read & modify
- Test & set instruction
- Sets the value to 1 only if the current value is 0 (hardware, atomically)
- Otherwise, it will not write a value, but returns the value that was read
- Takes longer to execute this instead of a normal instruction (load & store), performance overhead
void lock_acquire(struct lk *lk) {
    // test_and_set atomically writes 1 and returns the old value;
    // we exit the loop only when the old value read was 0 (lock was free),
    // at which point locked has already been set to 1 for us
    while (test_and_set(&lk->locked)) {
    }
}
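The same idea with a real compiler builtin (GCC/Clang __sync_lock_test_and_set), as a small self-contained sketch:
typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} spinlock_t;

void spin_acquire(spinlock_t *lk) {
    // Atomically write 1 and return the previous value; keep spinning
    // until the previous value was 0 (the lock was free).
    while (__sync_lock_test_and_set(&lk->locked, 1) != 0)
        ;   // busy wait
}

void spin_release(spinlock_t *lk) {
    __sync_lock_release(&lk->locked);   // atomically set back to 0
}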
Types of locks
- Spinlock: when the lock is not free, keep checking the status (running on CPU), until it acquires the lock (spinning, busy waiting)
- When we expect the critical section to be short (execution time), smaller CPU busy waiting overhead
- Few waiters/contenders
- Sleeplock/posix mutex: when the lock is not free, block/sleeps until it’s free
- When we expect the critical section to be long, IO access
- Many waiters/contenders
Types of locks
- Spinlock: wait (spins) on CPU while the lock is busy
- Why should we use spinlock in a single core system?
- Because we need live CPU in places where we cannot block, like the interrupt handler
- Interrupts need to be serviced very timely, otherwise interrupts might get lost
- xk: disables interrupts to acquire spinlock
- Sleeplock: blocks/sleeps while the lock is busy
- Context switch (scheduling, switch VAS)
- Unblock
Lock granularity
- Protect how much shared data with a single lock?
- A single lock for an entire array
- Must acquire the lock, then perform operations, then release the lock
- Works for safety, poor performance for large arrays
- Coarse-grained vs fine-grained locking
- Coarse-grained: used for simplicity
- Fine-grained: acquire multiple locks, undesirable behavior, deadlock
- Example use cases:
- icache, uses a spinlock because functions like idup have fast critical sections
- inode uses a sleeplock because functions like concurrent_readi take longer (disk access), and the CPU should not be wasted busy waiting, so the new read call is put to sleep
- Both functions can process concurrently
Disable interrupt
- Avoid preemption
- Does not enable mutual exclusion like locks because
- It is a per-core operation, multiple cores can still have access to the same memory
- It is a privileged instruction, only kernel can use it
- pushcli();
We need to coordinate threads based on events → Monitors
Monitors
- Design pattern & synchronization construct that provides coordination between threads based on events
- Monitor → a lock + resource state (track event/conditions) + condition variables (synchronization primitive)
- Manages waiters of a condition
- Lock: does not matter what type of lock, xk only uses spinlock
- Resource state: synchronize on a particular state (resource we are tracking)
- Condition variable: tasked with managing waiters; wait() blocks the caller (called when the condition is false) and keeps track of the list of waiters
- When true, call signal() on condition variable, removes a waiter and unblocks it
- broadcast(), wakes up all waiters, removes everybody from waiting list
- If we have N threads to sleep for X amount of time, and upon time X we can unblock all N threads, it is safe because all threads will recheck the condition variable
Week 4
The Morning Coffee Problem
Lock lk;
int coffee = 0;
Condvar coffee_cv;
get_coffee() {
lk.acquire();
// For condition variables, we should expect
// spurious/fake wakeups with missing resources
while (coffee <= 0) {
coffee_cv.wait(lk);
}
coffee--;
lk.release();
}
make_coffee() {
lk.acquire();
coffee++;
coffee_cv.signal();
lk.release();
}
Implement a Sleeplock
Lock lk;
bool free = True; // the lock starts out free
Condvar lock_cv;
lock_acquire() {
lk.acquire();
while (!free) {
lock_cv.wait(lk);
}
free = False;
lk.release();
}
lock_release() {
lk.acquire(); // assert(!free)
free = True;
lock_cv.signal();
lk.release();
}
Implement a Fair Sleeplock with Bounded Waiting
// Lock must be acquired in FIFO order
// New thread can't acquire the lock until
// all prior waiters have done so.
// Maybe we could implement a queue?
Lock lk;
bool free = True; // the lock starts out free
Condvar lock_cv;
wait_queue = [];
Condvar wait_cv;
lock_acquire() {
lk.acquire();
wait_queue.append(self); // Wait for our turn to acquire the lock
while (wait_queue.head() != self) {
wait_cv.wait(lk);
}
while (!free) {
lock_cv.wait(lk);
}
free = False;
wait_queue.remove_head();
wait_cv.signal();
lk.release();
}
lock_release() {
lk.acquire();
free = True;
lock_cv.signal();
lk.release();
}
Bounded buffer problem
Common system design regarding locking
Lock lk;
Condvar not_full_cv;
Condvar not_empty_cv;
buffer[N];
int consume_ofs = 0;
int produce_ofs = 0;
int total_items = 0;
produce(item) {
    lk.acquire();
    while (total_items == N) {
        not_full_cv.wait(lk);  // Releases the lock while waiting
    }
    buffer[produce_ofs] = item;
    produce_ofs = (produce_ofs + 1) % N;
    total_items++;
    not_empty_cv.signal();     // Signal consumers
    lk.release();
}
consume() {
    lk.acquire();
    while (total_items == 0) {
        not_empty_cv.wait(lk); // Releases the lock while waiting
    }
    item = buffer[consume_ofs];
    consume_ofs = (consume_ofs + 1) % N;
    total_items--;
    not_full_cv.signal();      // Signal producers
    lk.release();
    return item;
}
Pipe
- Interprocess communication (IPC)
- Will open a pipe, with file descriptors for the read end and the write end
- Writes cannot be interleaved; writes must be atomic: “aaabbb” is valid while “ababab” is not
- Need to make sure bytes from the same writer stay together
- A pipe is only valid while at least one of its fds is still open (the ref count)
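A small user-level example of the pipe IPC described above (POSIX API):
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                          // fds[0] = read end, fds[1] = write end
    if (fork() == 0) {                  // child: writer
        close(fds[0]);
        write(fds[1], "aaa", 3);        // a single write is not interleaved
        close(fds[1]);                  // pipe stays alive while any fd is open
        _exit(0);
    }
    close(fds[1]);                      // parent: reader
    char buf[4] = {0};
    read(fds[0], buf, 3);
    printf("got: %s\n", buf);
    close(fds[0]);
    return 0;
}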
Read writer locks
- Multiple threads accessing shared data, same kernel memory
- PCB is private, looking at its own fields
- Shared data is something like file open, file close
- Lock
- If we have array of structs read only, no need for protection
- But for write, modifiable data, we need protection
Top Comment Example (scrolling down)
- We have mostly reads, and occasional write
- One implementation: Lock on every read and write
- Poor performance
- Don’t lock reads, lock writes
- Simultaneous reads, serial writes
- Problem is possibly reading partial writes, maybe half of a comment A and half of a comment B, very bad
- Reader writer locks
- Concurrent reads
- No writes can occur while there are reads, avoiding reading partial write
- Exclusive write
- No reads or writes can happen when a writer writes
- Read acquire and release + Write acquire and release
- Concurrent reads
Read writer locks (RWL) in depth
- read_acquire, read_release, write_acquire, write_release
- Do we prefer readers or writers? Read preferring/write preferring
- Read-preferring: lets new reads join in even when there are waiting writes
- Write-preferring: pauses NEW readers; waiting writers run after all read locks are released
Write Preferring Reader Writer Locks (RWL)
Read and write acquire spec:
Lock lk;
Condvar reader_cv;
Condvar writer_cv;
// Need to know if reader active to block
int active_readers = 0; // Don't need waiting readers
int waiting_writers = 0; // Need to track number to wake readers later
bool active_write = False; // Check for current writer
read_acquire() {
lk.acquire();
// Waiting writers condition prioritizes writers
// Notice: Readers can be interleaved with no active writers
while (active_write || waiting_writers > 0) {
reader_cv.wait(lk);
}
active_readers++;
...
lk.release();
}
write_acquire() {
lk.acquire();
waiting_writers++; // Write about to happen, prioritize writers
// Check if another writer inprogress
// But wait to finish up readers before writing new
while (active_write || active_readers > 0) {
writer_cv.wait(lk);
}
waiting_writers--;
active_write = True;
...
lk.release();
}
Read and write release spec:
read_release() {
lk.acquire();
active_readers--;
// Signal write because we write-preferring
if (active_readers == 0 && waiting_writers > 0) {
writer_cv.signal();
}
lk.release();
}
write_release() {
lk.acquire();
active_write = False;
// Writers can block both readers and writers
// Checking and waking writers before waking readers
// is the implementation of write-preferring policy
if (waiting_writers > 0) {
writer_cv.signal();
} else {
// As a writer, we can unblock any number of concurrent readers
// We do not care if multiple readers, as long as no current writers
// Or no waiting writers
reader_cv.broadcast();
}
lk.release();
}
Read Preferring vs Write Preferring
- If we read-prefer, we might starve writers: writers must wait until all readers are serviced
- Variations in policy (track # of waiting writers), but we don't want writers to block forever
- Possibly set a threshold of waiting writers before we let them write (e.g. once 10 writers are waiting, let them go)
- Possibly track the wait time
- For write-preferring, we could similarly track the # of waiting readers (a threshold to avoid reader starvation, since new readers are being blocked)
- Hybrid policies are common in practice
Deadlocks
- Cycle of waiting among a set of threads, threads mutually waiting for one another to take some action
Deadlock Example 1:
// Classic deadlock
Lock A;
Lock B;
thread_func1() {
A.acquire();
B.acquire();
}
thread_func2() {
B.acquire();
A.acquire();
}
Deadlock example 2: Even without locks, blocking conditions
// 2 Bounded Buffer Deadlock
buffer A[];
buffer B[];
thread_func1() {
A.consume(); // Blocks because A is empty, waits
B.produce(item);
}
thread_func2() {
B.consume(); // Blocks because B is empty, waits
A.produce(item);
}
Deadlock example 3: Monitor lock vs General Lock
// Nested locking issue, common issue
// Can happen with general lock and pipe lock etc.
Lock lk;
Boundedbuf A[];
thread_func1() {
lk.acquire(); // General lock
...
A.consume(); // Monitor-specific lock release
...
lk.release();
}
thread_func2() {
// General lock
lk.acquire(); // Cannot be acquired because t1 holds
...
A.produce(item); // Will not be run
...
lk.release();
}
Dining philosophers
- 5 diners, 5 chopsticks
- Each philosopher picks their left chopstick, then waits for their right
- Will each wait indefinitely because they wait on one another
- 4 necessary conditions of deadlock
- Bounded resource: finite number of resources (infinite number is no problem)
- No preemption, cannot forcibly take resource away
- Hold & wait: philosophers won't give up their own resource upon learning the next resource is blocked
- Circular wait, a cycle of dependency, each philosopher waits for another
- Are these 4 conditions sufficient for deadlock? Is implication true?
- For single instance resource (it is sufficient)
- Multiple instances of a resource (not sufficient)
- A new resource may break a deadlock condition
- What to do to break deadlock?
- Breaking any of the necessary conditions
- Exists 3 types of approaches
- Deadlock prevention by limiting system/program behaviors
- Bounded resources → reserve some resources (not always possible)
- No preemption → allow for resource to be preempted (timer, timeouts, leases)
- Hold & wait → acquire all locks up front, or release all of them if any acquisition fails
- try_acquire: if any lock can't be taken, release everything and retry
- Circular wait → global locking order, e.g. compare lock IDs and always acquire in order A, B, C (most common); see the sketch at the end of this section
- Deadlock avoidance by letting the program do whatever, but system determines when it is safe to grant resources at runtime (Admission control)
- Remember, it is not that the logic is faulty 100% of the time, but the interleaving of events will possibly cause deadlock
- All resources must be granted by the system
- Chopstick fairy analogy, hands out one chopstick at a time, make sure a philosopher finishes eating before handing resource to another (maximally permissive)
- Hand out chopstick freely until the last one
- Handout the last one to someone already holding a chopstick
- Bankers algorithm
- When is it safe for a system to grant a particular request
- 3 states, safe, unsafe, deadlock
- Safe → at least 1 ordering to grant request such that everyone can finish
- Unsafe → At least 1 possible future request that will lead to deadlock, no matter scheduling order
- Deadlock → the future request has actually been made, and the system is now deadlocked
- Can move from unsafe state to safe if thread finishes future request
- But, we should try to keep the system in a continually safe state
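A rough sketch of the two prevention ideas mentioned above (try_acquire is an assumed primitive that returns failure instead of blocking):
// Breaking hold & wait: acquire all or release all.
void acquire_both(struct lock *a, struct lock *b) {
    for (;;) {
        acquire(a);
        if (try_acquire(b))
            return;           // got both
        release(a);           // couldn't get b: give a back and retry
    }
}

// Breaking circular wait: a global lock order (here, by address),
// so every thread acquires any pair of locks in the same order.
void acquire_in_order(struct lock *a, struct lock *b) {
    if (a < b) { acquire(a); acquire(b); }
    else       { acquire(b); acquire(a); }
}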
Week 5
Safe, unsafe, and deadlock states
- Safe → At least one possible way of granting resource that lets every member finish their request
- Unsafe → At least one future request can cause system to deadlock, unsafe but not deadlock (future request has not been made yet)
Example:
- Available is the actual current state of the system
- To-be-available is the prophecy: imagining the potential future state of the system
- 3 chopsticks, 3 philosophers; each needs a max of 2 chopsticks, but will request one at a time
- Let philosophers A, B, C start with 0 chopsticks
- 1. Grant A 1 chopstick: available = 2, to be available = 2
1.1 When A gets another chopstick, to be available = 3, because 2 (TBA) - 1 (give to A) + 2 (A returns 2) = 3
1.2 We can check this for each member A, B, C from the current state, and determine that we remain in a safe state
- 2. Grant C a chopstick: available = 1, to be available = 2 or 3
2.1 To be available = 2 assumes that we give the next chopstick to either A or C before B
- 3. Grant B a chopstick: available = 0, to be available = 0
3.1 No party can finish (to be available will never be > 0), so the system is in an unsafe state
3.2 It stays merely unsafe until a party actually makes its future (second) request, at which point it becomes a deadlock state
The last approach (besides prevention and avoidance): deadlock detection
- If it is rare, detect and recover when deadlock occurs
- Detection can be done via resource allocation graph
- P1 waiting for → r1 held by → P2 waiting for → r2 held by → P1 (cycle)
- Detection can produce false positives (with multi-instance resources) and comes with performance penalties
- Abort & recover after detection
- Ostrich algorithm
- Just let user kill or reboot, relieves responsibility from system
Physical memory management
- Volatile, byte addressable, order of GBs
- Access latency is ~200 cycles, L1 cache is (~7 cycles/single digits)
- Process’s code & data needs to be in memory
Problem: Many processes, limited physical memory; possible solutions include:
- Attempt 1: If we are single core, maybe a process can use the entire physical memory
- Removes the need for memory translation and overhead
- Context switch, very high scheduling overhead
- Write A's VAS somewhere else (disk, slow but lots of room)
- Then load B's VAS into memory
- How do we enforce kernel/user memory separation?
- Cannot do this in a multicore setting
- Attempt 2: Divide up physical memory, support multiple processes, divvy up memory per process into A, B, C regions
- Less context switch overhead
- Still needs virtual → physical translation
- Base = where a process's memory starts in DRAM (add the base as a constant offset to get the proper physical location in DRAM)
- Bound = size of VAS (size of the program region)
- How does this support growth? We get lucky if another process terminates. Otherwise we reject growing the memory space.
- If we terminate a region, let's say B, where B lies between A and C (A, B, C), and a new process D requires more memory than B's hole while A and C prevent growth, D must stall until a suitable region becomes available
- External fragmentation, common for memory allocation
- We can perform compaction (squishing A and C together) to avoid fragmentation; must update base and bound registers for each compacted process
- Expensive operation, must wait for compaction, variability in performance, non-consistent results
- Often we want to share library code (cow fork), but the base and bound approach makes this impossible; we cannot specify which parts of each region are shared
- Translation is done on the entire VAS, cannot share part of it
Attempt 3: Finer-grained translation & permission management
- Process’s VAS can map its memory to different regions in DRAM
- More control of sharing, many multiple processes can point to same physical location
- Want to use a fixed size chunk (page, typically 4096 bytes / 4 KB)
- No more external fragmentation
- But can result in internal fragmentation, because let’s say our process only uses 1 byte of a virtual page, but allocated a whole page to it, wasted space because too much
- Versus external fragmentation, where there is space but not enough, both lead to wasted space
- As long as any room is physically available, then process can grow its own VAS, DRAM no longer contiguous
- We can only load pages in use to memory, we can have holes in the VAS and only allocate to DRAM when we need to access it
- Demand paging
- Access a new page with no existing translation → page fault (trap 14); hardware stores the faulting address in %cr2 and the %rip of the instruction that caused the access
Kernel must track what VAS are valid and should be brought into memory, versus what address should not be accessed
- Segfault is when process tries to access invalid memory
- A page fault is seamless to the process: the MMU finds no mapping when a pointer accesses a page of the VAS not mapped to physical memory, and the kernel fetches the page (possibly from disk) and resumes
- https://stackoverflow.com/questions/6950549/segmentation-fault-vs-page-fault
Strategy is called paging: translation table
Attempt 3: divide virtual & physical memory
- Mapping VAS to physical memory, called frames or physical pages
- Different processes can share the read-only portions of code
- Having finer granularity, more complicated management, requires translation on a page level
- Versus the naive approach where we only maintain a VA + base lookup
- Memory translation table
- Maps a page to a physical page (frame)
- Any offset in a page will be the same offset in the physical page, no need for translation
- Offset will be the lower bits (from the right side, typically 12 bits because 2^12 = 4096)
Page table, storing translations for all pages
- Giant array [], indexed by page #; if the page is mapped to a physical address, store the frame number in the corresponding entry
- Then take that frame number and concatenate it with the offset to find the physical address
- Memory translation is done by hardware (page table walk)
- The page table's address is stored in the page table base register, %cr3, as a physical address
- Each entry has an access mode (R/W permissions)
- We are doubling memory accesses every time, slow without caching
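A minimal sketch of the single-level translation just described, assuming 4 KB pages (the PTE layout here is simplified; a real walk also checks a present bit and permissions):
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

uint64_t translate(uint64_t *page_table, uint64_t va) {
    uint64_t vpn    = va >> PAGE_SHIFT;       // virtual page number = array index
    uint64_t offset = va & (PAGE_SIZE - 1);   // low 12 bits pass through unchanged
    uint64_t pte    = page_table[vpn];
    uint64_t frame  = pte >> PAGE_SHIFT;      // physical frame number stored in the entry
    return (frame << PAGE_SHIFT) | offset;
}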
Translation Lookaside Buffer (TLB)
- Dedicated cache for memory translation
- After a TLB miss and the resulting page table walk, we store the translation back in the TLB
- Cost of single array page table
- Per-process data structure, same kernel mapping in VAS maps to same physical address
- Large array can take up a lot of space, we reserve space for it regardless of if it is even being used
- We want to reduce the space of this page table
- Some may create super page/huge page with 2 MB and 1 GB, less bits available to represent pages because larger chunks
- Internal fragmentation, low memory utilization, but cache hits would be more frequent because larger pages, less total number of pages in the array
- Still many page tables, lots of memory
- Inverted page table
- Typically we map a virtual page → frame, but instead we have a global inverted page table
- Indexed by frame #, storing page #, must also store PID because pages are per process-dependent
Multilevel page table
- In a traditional 1-level page table, we may have 2^52 pages (64 address bits - 12 offset bits for a 4 KB page = 52 bits of page number), so we break the table into subsets of tables
- Let's say we only use a single page: instead of allocating an array of 2^52 entries, we only allocate one level 1 page table, one level 2 page table, and one level 3 page table (the upper levels are used as indices to get to the last level)
Example:
- We have a 21-bit virtual address, page size 4096 bytes → 12-bit offset
- Index 1 is 4 bits, index 2 is 5 bits, offset is 12 bits = 21 bits VA
- Let code region be 0x1000 - 0x1fff
- Let stack region be 0xfe000 - 0xfefff
- First, we allocate a first level page table, 2^4 (16) entries
- Then, we allocate a second level page table, 2^5 (32) entries
- Maximum number of pages = 512 (4 + 5 = 9 index bits, 2^9 = 512)
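A tiny sketch of splitting that 21-bit virtual address into its indices (the exact bit widths follow the example above):
#include <stdint.h>
#include <stdio.h>

void split_va(uint32_t va) {
    uint32_t offset = va & 0xFFF;         // low 12 bits
    uint32_t index2 = (va >> 12) & 0x1F;  // next 5 bits (level 2 index)
    uint32_t index1 = (va >> 17) & 0xF;   // top 4 bits (level 1 index)
    printf("va=0x%x idx1=%u idx2=%u off=0x%x\n", va, index1, index2, offset);
}
// split_va(0x1000)  → idx1 = 0, idx2 = 1,  off = 0   (code region)
// split_va(0xfe000) → idx1 = 7, idx2 = 30, off = 0   (stack region)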
x86 Page Table
- Trap 13 (general protection fault) is for memory accesses outside the canonical 48-bit VAS
- The upper bits must be a sign extension of bit 47 (all 1s or all 0s); addresses with random upper bits are invalid and invoke a general protection fault
- At any point, if we traverse and the present bit is 0, then invoke a page fault
Page Faults
- The instruction at %rip accesses memory; hardware stores the faulting address into a register (%cr2)
- Exception: problem with page table walk, or there is a permission mismatch (marked read/write only or kernel only page)
- Error code in tf->err encodes whether the page was present, user/kernel access, and read/write
- Demand paging is one cause of page fault (pages that kernel knows exists)
- Growing the stack, as we push more items to stack it will be page fault if exceeds size, not currently mapped
- OS is responsible for determining fault address
- Heap growth may also cause page fault
- Memory map
- Permission mismatching
- Copy on write (cow) fork, when process has write access, we map it to read only to detect when a write occurs, then establish new mapping with write permissions
- Kernel must track whether page is COW or not
- Write to read only page (actual permission violation)
- TLB caches permissions, so when we update the permissions we must flush the TLB
- When we make a new mapping and add to the TLB, we don’t need to flush, but when we change the permission such as when we do COW, then we need to flush
- Kernel needs to track additional info about VAS
- Machine-independent part: a set of kernel bookkeeping structures (not dictated by the architecture) that describes the state of the virtual address space
- Machine dependent page table → loaded in page table base register (cr3)
- vspaceupdate() updates entries into the machine dependent page table (presumably after updating some page update such as a permission update)
Week 6
Page Fault
- Valid page fault, kernel can do something about it (demand paging, mmap, cow)
- Invalid page fault (NULL pointer), terminate the process
Handling valid page fault
- Allocate a frame, frame not built yet, what do we do if we have no memory available?
- Perhaps we can block until memory becomes available
- Perhaps we can forcefully make memory available
- Write and store the frame to a different storage device
- Perhaps we could kill a process, general purpose OS will avoid this (hardware caches pages, tries to swap pages)
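A rough sketch of the valid-fault handling described above (my own; find_region, alloc_frame, evict_a_frame, swap_in, map_page are made-up helpers, not the real xk functions):
void page_fault_handler(struct proc *p, uint64_t fault_addr) {
    struct vregion *vr = find_region(p, fault_addr);
    if (vr == NULL) {                    // not part of any valid region: segfault
        kill_process(p);
        return;
    }
    void *frame = alloc_frame();
    if (frame == NULL)
        frame = evict_a_frame();         // under memory pressure: swap something out
    if (page_is_swapped_out(vr, fault_addr))
        swap_in(frame, swap_slot_of(vr, fault_addr));  // bring the page back from disk
    else
        memset(frame, 0, PAGE_SIZE);     // fresh page (e.g. stack or heap growth)
    map_page(p, fault_addr, frame, vr->perms);
}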
Eviction mechanism (swapping)
- We can write a frame to somewhere on disk (disk has 512 B sectors, SSD has 4 KB pages)
- Disk is persistent, memory is not
- Scans through the blocks (dedicate a separate swap partition not used by default file system)
- If we run out of memory space and swap partition upper limit, nothing left to do
- Swap partition is a range of blocks
- Bitmap: a giant bit array (one 4096-byte page holds 4096 * 8 bits); bit = 1 if a block is used, 0 if unused, stored somewhere in kernel memory
For eviction, we need to allocate swap space from the bitmap (allocate on disk/persistent memory)
- Must remove the old page to frame mapping from the page table, and TLB shootdown (remember to acquire lock here), possible that multiple pages are mapped to the same frame (per frame metadata core map entry)
- Then write out the frame to the disk block
- Must track swap location of the faulting page
- Stored in the PTE field
- After that, reallocate the frame to the faulting page, establish mapping
Eviction policy
- How do we choose an eviction candidate
- Kernel frame eviction? Pinned into physical memory, kernel will bypass
- What if it is a shared library?
- Frames within the faulting process? (local page replacement policy) Good for performance isolation, Windows uses this and global mix
- Frames across the system other process memory overhead? (global page replacement policy), Linux uses this
- Page replacement policies
- FIFO → a queue of frames as we allocate them, evict from the front of the queue
- Easiest algorithm to implement
- Access pattern, cyclical
- Belady’s anomaly, increasing # of frames, causes more page faults if we use FIFO
When we are under memory pressure, we need to allocate a frame when there aren't any free
- Allocate a frame when there are none
- Track a threshold of free frames
- Possibly mark the frame as free
Eviction with swap partition
- Can use swapfile or swap block region
- Evict a frame → requires us to allocate a swap space (write down where a page is being swapped to)
- When we need the swap page → page fault handler swap it back from swap partition into memory
- If the process is killed/exited then we might never bring it back from partition
- But must remember to free up the swap entries
Eviction policies
- FIFO → evicting frames based on allocation order
- Queue at the time of allocation, low overhead
- Suffers from Belady’s anomaly, more page faults and evictions as we add more frames available
Optimal algorithm for page replacement policy
- We know our frame #, all memory access, we want to reduce the # of page faults
- Evict furthest in the future, but we don’t know which memory access will occur (non-deterministic)
- Least recently used (LRU)
- Evict oldest accessed page
How to implement LRU
- Count, for each page, the # of memory accesses across the system since that page was last accessed
- On every memory access, perform an arbitrary amount of counter updates
- Queue LRU
- Quick access, move a page to the end of the queue, less metadata
- At eviction time, scan through all PTEs to find the page to evict
Clock
- Approximates LRU
- Uses the accessed bit
- The accessed bit is set when the memory access is performed, such as on a page table walk; a TLB hit assumes the bit is already set because the translation was put in the TLB
- When looking for frames/pages to evict, look at the accessed bit: if it is 0 then evict, otherwise clear the accessed bit and continue searching
- The clock hand points into a list of frames; each frame has a mapping to a page; check the PTE for the accessed bit
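A small sketch of that clock (second-chance) sweep (my own; frames[] and the accessed-bit helpers are illustrative names):
int clock_hand = 0;

struct frame *clock_evict(struct frame frames[], int nframes) {
    for (;;) {
        struct frame *f = &frames[clock_hand];
        clock_hand = (clock_hand + 1) % nframes;
        if (pte_accessed(f->pte)) {
            pte_clear_accessed(f->pte);   // recently used: give it a second chance
        } else {
            return f;                     // not accessed since the last sweep: evict
        }
    }
}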
Cost of eviction
- Eviction of a code page, program already on disk and is loaded into memory, so because it is already in disk we do not need to re-write to swap
Signals: Interprocess communication (IPC), a predefined set of events that processes can use to communicate to each other
- Pipe is a form of IPC with variable content
- Signal is a form of IPC with a fixed set of events
Mechanism of signals (software exceptions)
- We want to send signals
- kill(pid, signumber)
- SIGSEGV, invalid mem ref.
- SIGFPE, floating point error
- SIGILL, illegal instruction
- These are typically sent by the kernel upon hardware exceptions to process causing problem
- We send these signals to the process because some exceptions might be handled by the process
How would we implement the “kill” call?
- Track a pending set of signals for each process (possibly in the PCB), signals sent but not yet delivered
- We don’t want arbitrary processes to send signals to each other, especially the kill signal
- We don’t want only the parent to be able to kill the process, but maybe admin privileges?
- Draw the boundary at the user level, processes at the same level (within the user level) can communicate to one another, example a root process in the user level can kill
And receive signals
- Signal delivery is implicit to the receiver
- OS tracks pending signals as a set (multiple sends results in a single delivery)
- Default actions defined by the kernel for each signal (default handler)
- User process can define custom handlers for most signals (but cannot define custom behaviors for sigkill and sigstop)
- Handlers defined in user code, in user memory
- How might we implement this?
- Must be in the kernel first to be able to deliver the signal
- When the receiver next enters the kernel (e.g. on a timer interrupt / context switch), the kernel delivers pending signals right before returning to user mode
- The user registers a handler via the signal(signum, handler) syscall; on delivery, %rip is pointed at the handler, and control goes back into the kernel once the handler returns
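A small user-level example of registering a custom handler (POSIX signal API, not xk):
#include <signal.h>
#include <unistd.h>

void on_sigint(int signum) {
    // only async-signal-safe work here; write() is safe, printf() is not guaranteed
    write(STDOUT_FILENO, "got SIGINT\n", 11);
}

int main(void) {
    signal(SIGINT, on_sigint);   // SIGKILL and SIGSTOP cannot be overridden
    for (;;)
        pause();                 // wait for signals to arrive
}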
User level threads
- Kernel level threads (everything so far), context switch goes into the kernel (not cheap operation, costs hundreds of cycles, must push states and pop states)
- Managed and scheduled by the kernel
- TCB, kernel stack, kernel scheduler
- Threads enable us to write concurrent code, structure our code into independent tasks makes program more logically reasonable, but it is expensive to create and schedule threads
- User level threads (all done within user space), context switching is much cheaper
- Threads managed and scheduled by user libraries/runtime
- Created and maintained by user code, kernel is unaware of their existence
Week 7
User threads
- Managed & scheduled by the user libraries/runtime (user space)
- Why use them?
- Cheaper, no need for mode switching
- User can specify custom scheduling policy, more efficient
- Kernel uses preemption (fair sharing of CPU) vs User thread cooperative scheduling (voluntarily yield)
- How do they even run?
- Kernel thread (every process starts with one)
- Runs and switches between different user level threads
- N:1 user threads to kernel thread
- N:M model (threadpool)
- N number of user threads (tasks), M number of kernel (worker) threads
- M kernel threads proportional to number of CPU cores
What happens if a user thread blocks?
- Assume N:1 model
- Depends on how a user thread blocks
- If it blocks due to a synchronization primitive like sleeplock, we can transfer control to user-level scheduler, to schedule other user level threads
- If it blocks due to an IO, syscall, or page fault (kernel exception), can cause other user threads to block
- Maybe use a nonblocking system call
- Before making a syscall, perhaps schedule another thread to run?
- Wake up another kernel level thread to take over remaining user threads
Storage Devices
- Persistent, nonvolatile, large capacity
- Hard drive/spinning disk
- Cheap per GB ($10-20 per TB)
- Physical motion (slow access latency, 10-20 ms)
- SSD
- More expensive (~3x)
- No physical motion, faster access (~10-100 microseconds)
Hard drives
- Track, divided into sectors
- Sector is the unit of read
- Outer tracks hold more sectors (larger storage space)
- Disk request
- Request sent to controller
- Move the arm to track (seek time: 1-20ms)
- Wait for sector to spin under the disk head (rotational time: 4-15ms on RPM)
- Transfer time back to host (transfer rate/time)
- Total: on the order of 10-20 ms per request (see calculation below)
- Latency
- Total time = seek time + rotational time + transfer time
- Cost of reading 1 sector (512 bytes)
- Seek time = 10ms
- Rotation: 7200 RPM = 120 rotations/s → ≈8.3 ms per rotation; average rotational latency is half a rotation ≈ 4 ms
- Bandwidth = 120 MiB/s → transfer time for 512 B = 512 / 125,829,120 s × 1000 ≈ 0.004 ms
- 10ms + 4ms + 0.004ms = 14.004 ms for a single read
- Reading consecutive sectors is better: only the transfer time increases
- 10ms + 4ms + 10 * 0.004 ms = 14.04 ms for 10 consecutive sector reads
- Reading random sectors multiplies the entire per-request latency (seek + rotation + transfer)
- 10 * (10 ms + 4 ms + 0.004 ms) = 140.04 ms for 10 random sector reads
IOPS
- IO operations per second
- Number of IO requests / Total latency
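A quick sanity check of the numbers above, assuming the same 10 ms seek, 7200 RPM, and 120 MiB/s figures used in the worked example:

```c
#include <stdio.h>

int main(void) {
    /* Assumptions taken from the worked example above. */
    double seek_ms    = 10.0;
    double rpm        = 7200.0;
    double bw_bytes_s = 120.0 * 1024 * 1024;            /* 120 MiB/s */
    double sector     = 512.0;

    double rot_ms  = (60.0 * 1000.0 / rpm) / 2.0;        /* average: half a rotation */
    double xfer_ms = sector / bw_bytes_s * 1000.0;       /* transfer one sector */
    double one     = seek_ms + rot_ms + xfer_ms;

    printf("one random sector read : %.3f ms\n", one);                       /* ~14.2 ms */
    printf("10 sequential sectors  : %.3f ms\n", seek_ms + rot_ms + 10 * xfer_ms);
    printf("10 random sectors      : %.3f ms\n", 10 * one);
    printf("random-read IOPS       : %.1f\n", 1000.0 / one);                 /* ~70 IOPS */
    return 0;
}
```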
Solid State Drive
- Persistent, block-addressable, large capacity
- Multiple NAND flash memory packages
- Blocks, group of pages
- Block contains number of pages, page = 4KB
- Read/write page, page level operation is atomic
- NAND flash packages connected by multiple channels
- Highly concurrent/parallel architecture
- No moving parts, much better access latency
- Operations the SSD performs
- Read a page (4KB): fast access latency (tens of microseconds)
- Write (program) a page (4KB): can only program a clean page, one that has not been written since its last erase
- Programming sets bits to 0; a clean page has all bits set to 1 by default (hundreds of microseconds)
- Erase operation on a block (several MB, 1-8): sets all bits of all pages within the block back to 1 (slow operation, ms)
- A block CONTAINS pages, so each 1-8 MB block holds many 4 KB pages
- To overwrite a pre-existing location, the page cannot be programmed in place: the still-valid pages of the block are written elsewhere, the block is erased, and the content is restored (the temporary copies are cleaned up afterwards)
- We can only write to clean pages, and within a block we have to write sequentially
- Any free page is valid as temporary storage
- The whole process is invisible to users
Reliability of SSD
- Lots of erase & program repeatedly on a single page
- Endurance problem, 10-100k writes/programs
- As a user, it is nice to have overwrites but SSD lifespan is reduced upon each write
- Logical block address → translate it to its physical block address
- The user (and kernel) only ever interact with logical block addresses; the drive handles the translation
- Wear leveling, implemented through the flash translation layer (FTL)
- Translation/management of mappings
- Garbage collection
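A toy sketch of the remapping idea behind the FTL (not a real FTL): names, sizes, and the sequential clean-page allocator are illustrative, and real drives layer wear leveling and garbage collection on top of this.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_LOGICAL  8
#define NUM_PHYSICAL 16
#define UNMAPPED     UINT32_MAX

/* Toy flash translation layer: logical page -> physical page mapping. */
static uint32_t l2p[NUM_LOGICAL];
static int      stale[NUM_PHYSICAL];   /* pages holding out-of-date data (GC erases later) */
static uint32_t next_clean = 0;        /* clean pages consumed sequentially */

static void ftl_init(void) {
    for (int i = 0; i < NUM_LOGICAL; i++)
        l2p[i] = UNMAPPED;
}

/* An overwrite never programs the old page in place: the data goes to a
   fresh clean page, the map is updated, and the old page becomes garbage
   to be reclaimed by a later block-granularity erase. */
static uint32_t ftl_write(uint32_t lpn) {
    uint32_t ppn = next_clean++;        /* real FTLs also do wear leveling here */
    if (l2p[lpn] != UNMAPPED)
        stale[l2p[lpn]] = 1;            /* old copy: garbage-collected later */
    l2p[lpn] = ppn;
    return ppn;
}

int main(void) {
    ftl_init();
    printf("write lpn 3 -> ppn %u\n", ftl_write(3));
    printf("write lpn 3 -> ppn %u (overwrite goes to a new page)\n", ftl_write(3));
    return 0;
}
```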
Latency of SSD request
- Read a 4KB page: ~10 µs access latency
- total latency = access latency + transfer time (bandwidth)
- Typical SSD bandwidth is ~500MiB/s → 7.8 microseconds of transfer time, total = 17.8 microseconds
- Much smaller gap for sequential & random access
- Random access can sometimes be as fast as, or faster than, sequential
- Depends on the access pattern (random requests can be spread across channels/packages and served in parallel)
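The same arithmetic spelled out, assuming the 10 µs access latency and ~500 MiB/s bandwidth above:

```c
#include <stdio.h>

int main(void) {
    double access_us = 10.0;                    /* assumed page access latency */
    double bw        = 500.0 * 1024 * 1024;     /* ~500 MiB/s */
    double xfer_us   = 4096.0 / bw * 1e6;       /* transfer one 4 KB page */
    printf("4KB read: %.1f us\n", access_us + xfer_us);   /* ~17.8 us */
    return 0;
}
```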
Abstraction
- Files = named data, persistent (shm = named memory)
- Composed of data and metadata
- Metadata includes size of file, owner of file, types of files, accessed/creation time, location of actual data (data layout)
- Directories
- Way to organize multiple files
- A directory is represented like a file; a type field in the metadata distinguishes directories from regular files
- Metadata and data of a directory
- Metadata of dir includes similar info, size of the dir, owner, type, access/creation time
- What data?
- Other files and directories under this directory
- Content (directory entries) stored in an array in a directory data block
- Directory entry includes filename and location of file metadata (on disk)
- First two entries in dir include . and .., the current dir and the parent dir
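A sketch of what a fixed-size on-disk directory entry might look like; the field names and the 60-byte name limit are illustrative, not a specific file system's layout.

```c
#include <stdint.h>

#define DIRENT_NAME_LEN 60

/* One entry in a directory's data block: maps a name to the location of
   the file's metadata (its inode number / index into the inode table).
   Entries 0 and 1 are conventionally "." and "..". */
struct dirent {
    uint32_t inum;                    /* which inode holds this file's metadata */
    char     name[DIRENT_NAME_LEN];   /* file name within this directory */
};

/* A directory's data block is then just an array of these:
   struct dirent entries[BLOCK_SIZE / sizeof(struct dirent)]; */
```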
Path
- Absolute path: starts with a slash, resolved starting from the root directory
- Relative path: resolved from the current working directory
Filesys Implementation
- Manages disk blocks usage and allocation
- Usage info stored in bitmap, single bit for each block, persisted, stored on disk
- Bitmap tracks its own sectors/blocks
- Metadata for each file/directory
- Because of the access pattern of metadata, we reserve dedicated sectors as a metadata table
- On an SSD, allocation is tracked at 4 KB page granularity (instead of 512 B sectors)
- Depending on metadata size, we can fit >1 metadata unit inside a data sector (whether it be 512 byte disk block or 4 KB SSD page)
- The inodetable (metadata table) is a set of pre-allocated sectors that holds the inodes (metadata)
- Root directory inode stored at a known pre-defined location
- Metadata of filesystem (superblock)
- Stores the start of the bitmap sectors & its size in sectors
- Stores the start sector of the inodetable & its size in blocks
- Superblock stored at a known pre-defined location (block 0 or block 1 etc.)
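A sketch of the kind of information a superblock might hold; the field names are illustrative, not the exact xk layout.

```c
#include <stdint.h>

/* Lives at a well-known location (e.g., block 0 or 1) so the kernel can
   bootstrap: it says where the bitmap and the inode table start. */
struct superblock {
    uint32_t total_blocks;     /* size of the whole file system */
    uint32_t bitmap_start;     /* first block of the free-block bitmap */
    uint32_t bitmap_blocks;    /* how many blocks the bitmap occupies */
    uint32_t inode_start;      /* first block of the inode table */
    uint32_t inode_blocks;     /* how many blocks the inode table occupies */
    uint32_t root_inum;        /* inode number of the root directory */
};
```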
Data layout
- How data content is stored on disk
- Contiguous allocation
- No matter how large our content, we know start and end
- Simple, small storage space, easy to compute data block location
- Given the offset, we know exactly which block to look at (remember the +1 edge case); see the sketch after this list
- xk starter code uses contiguous
- Downside: cannot easily grow the file beyond its allocated space
- Linked allocation
- Metadata stores head (or tail too), and each block reserves some pointer data to the next block
- Take into account reserved space for pointer (subtract size of a pointer like 4 B)
- Upside: can continue growing without limits, arbitrary size
- Downside: bad for performance; random access must follow the chain of pointers, adding seek overhead
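A sketch of the offset-to-block computation for contiguous allocation, assuming 512 B blocks; the struct and function names are made up for illustration.

```c
#include <stdint.h>

#define BLOCK_SIZE 512

/* Contiguous allocation: the inode only records where the data starts,
   how many blocks it spans, and the logical file size. */
struct contig_inode {
    uint32_t start_block;   /* first data block on disk */
    uint32_t num_blocks;    /* length of the contiguous run */
    uint32_t size_bytes;    /* logical file size */
};

/* Which disk block holds byte `offset`?  Returns -1 if past the end. */
static int64_t block_for_offset(const struct contig_inode *in, uint32_t offset) {
    if (offset >= in->size_bytes)
        return -1;
    return (int64_t)in->start_block + offset / BLOCK_SIZE;
}
```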
Week 8
Data layout
- Contiguous allocation, linked allocation
- Maybe we can combine the two
- Contiguous chunk = extent (contiguous region of sectors)
- Variable sized contiguous region
- An extent needs to track its start and size
- Group by extents (groups of sectors) instead of individual sectors
- Maybe we can just have an array of extents
- But to find the block for a given offset, because each extent’s size is variable, we must scan through the sizes of the extents (see the lookup sketch below)
- And because we’d store the extent array in the inode, it may become too big: the inode would no longer fit in its fixed-size slot and would spill into more than one data block
- One way to fix this problem is an index block / indirection
- Store a block pointer to an indirection block
- But this costs more, because we must also read the indirection block itself
- This extends to multilevel indexing: each pointer may point to another indirection block (doubly indirect, triply indirect, etc.)
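A sketch of the extent-array lookup described above; the sizes and names are illustrative, and a real design would also handle spilling extra extents into an indirect block.

```c
#include <stdint.h>

#define BLOCK_SIZE  512
#define NUM_EXTENTS 8

struct extent {
    uint32_t start;      /* first block of this contiguous region */
    uint32_t nblocks;    /* length of the region, in blocks */
};

/* Inode stores a small, fixed array of extents. */
struct inode {
    struct extent extents[NUM_EXTENTS];
};

/* Walk the extents until the block containing `offset` is found,
   because each extent's size is variable. */
static int64_t block_for_offset(const struct inode *in, uint64_t offset) {
    uint64_t blockno = offset / BLOCK_SIZE;   /* logical block index in the file */
    for (int i = 0; i < NUM_EXTENTS; i++) {
        if (blockno < in->extents[i].nblocks)
            return in->extents[i].start + blockno;
        blockno -= in->extents[i].nblocks;    /* skip past this extent */
    }
    return -1;   /* offset beyond allocated data */
}
```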
Case Study: Fast File System (FFS)
- ext, ext2 linux file systems
- Designed for fast disk performance
- Keep contiguous data as much as possible
- Delay allocation, just buffer as much memory as possible, no contract for instant persistency, allows batching of many bytes
- Mostly small files, need support for large files (most files are <20 MB)
- An inode contains a data layout array
- Contains 12 direct pointers (to blocks)
- 1 indirect block pointer
- 1 doubly indirect block pointer
- 1 triply indirect block pointer
- Even though a disk sector is 512 B, the file system uses 4096 B blocks (8 consecutive sectors), so each pointer refers to an 8-sector chunk; see the block-index sketch after this list
- The bitmap also gets smaller, since it tracks 4 KB blocks instead of individual sectors
- To create a new file
- We must allocate a new inode (an update to the inodetable)
- Set up the inode
- Parent directory data block change/metadata change
- Say we want to write to the file: we allocate a block for the first direct pointer, record it in the inode, then write the data to that block on disk; all other pointers remain NIL/0
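A sketch of routing a logical block number through FFS-style direct/indirect pointers, assuming 4 KB blocks and 4-byte block pointers (so 1024 pointers per indirect block); the function is illustrative, not the real FFS code.

```c
#include <stdio.h>
#include <stdint.h>

#define NDIRECT        12
#define PTRS_PER_BLOCK 1024           /* 4096-byte block / 4-byte pointer */

/* How many levels of indirection does logical block `bn` need?
   0 = one of the 12 direct pointers, 1 = via the indirect block, etc. */
static int levels_for_block(uint64_t bn) {
    if (bn < NDIRECT) return 0;
    bn -= NDIRECT;
    if (bn < PTRS_PER_BLOCK) return 1;
    bn -= PTRS_PER_BLOCK;
    if (bn < (uint64_t)PTRS_PER_BLOCK * PTRS_PER_BLOCK) return 2;
    return 3;
}

int main(void) {
    /* With 4 KB blocks: direct pointers cover 48 KB, the indirect block adds
       4 MB, doubly indirect adds 4 GB, triply indirect adds 4 TB. */
    printf("block 5      -> %d levels\n", levels_for_block(5));       /* 0 */
    printf("block 1000   -> %d levels\n", levels_for_block(1000));    /* 1 */
    printf("block 500000 -> %d levels\n", levels_for_block(500000));  /* 2 */
    return 0;
}
```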
POSIX API: lseek (lets a process set the file offset to anywhere)
- We can set the offset past the end of the file
- Sparse file: only the blocks actually written get allocated; the data layout records the block at the written offset and leaves the gap as NIL pointers
- If the user accesses NIL pointers within the gap, the kernel returns zeros as the content; actually allocating the gap would waste space, so it is never stored, and the kernel may later report there is no more space when the gap is filled in, even though the gap reads as just zeroes
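A small example of creating a sparse file with lseek; the file name and gap size are arbitrary.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("sparse.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) return 1;

    write(fd, "start", 5);                 /* data at offset 0 */
    lseek(fd, 1024 * 1024, SEEK_SET);      /* jump 1 MiB past the end */
    write(fd, "end", 3);                   /* file size is now ~1 MiB + 3 */

    /* Reading anywhere in the gap returns zeros, but (on file systems that
       support sparse files) no data blocks are allocated for it. */
    close(fd);
    return 0;
}
```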
FFS Locality Heuristics
- Sequential access is faster
- Place related things together (same block group)
- Each file’s data and metadata are allocated in the same block group
- Files within the same directory
- Place unrelated things into different groups
- Different dirs and files from different dirs
- Now we have multiple block groups; each block group tracks its own free-space bitmap, inodes, and data blocks
- How do we deal with large files (GBs)?
- Exception for large files: spread their data across different block groups so they don’t flood out smaller files
NTFS: New Technology File System
- Master file table
- Entries for file record (like an inode)
- 1 KB sized record
- First half of record includes metadata
- Small data can live (resident) within the record itself
- If the data doesn’t fit in the record, store extents in the record to track where the data lives
- If even that isn’t enough, add another level of indirection: the record points to other records, which in turn point to extents
Filesys operations
- Create, read, write
- inodes (inodetable)
- bitmap
- data blocks
- When we create a new file (updating multiple structures)
- Allocate a new inode (inode bitmap), write new inode into inodetable
- Data block of the parent directory, new directory entry
- Update parent’s inode with size (modification time)
- Write to the new file:
- Allocate data block (changes to the data bitmap)
- Update the file’s inode to track the newly allocated data (new extent)
- Write to the data blocks!
Filesys = inode cache, block cache (bio.c)
- fs operations will operate on cached blocks
- With modifications, when do we synchronize with disk?
- Write through cache: changes immediately persist on disk
- Makes every fs op persistent upon return
- Problem: can’t buffer updates to frequently modified blocks
- Slow
The fs ops above are not actually write-through by default; data is not all persisted immediately, and control over persistence is instead given to the application
- Only persisted upon a user request
- sync: flushes all cached (dirty) blocks to disk
- fsync(fd): flushes cached blocks related to a specific file (fd)
- For a new file, you must fsync the parent directory yourself (the new directory entry lives in the parent’s data block)
- By default, the kernel also periodically flushes all cached pages (every 10-30s)
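A small example of the application-controlled persistence described above: fsync the file, then fsync the parent directory so the new name survives a crash (the paths are illustrative).

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("dir/newfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;

    write(fd, "important\n", 10);
    fsync(fd);                     /* flush this file's dirty blocks (data + inode) */
    close(fd);

    /* The new directory entry lives in the parent's data block, so the
       parent directory must be fsync'd too for the *name* to survive a crash. */
    int dirfd = open("dir", O_RDONLY | O_DIRECTORY);
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }
    return 0;
}
```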
Crash Consistency
- Upon fsync we issue disk writes
- inode, data bitmap, data block (different disk writes)
- The disk may reorder the requests (could be data bitmap, data block, then inode)
- Completion is signaled by a disk IO interrupt
- What if we crash before all requests are completed… (before fsync returns)
- Maybe no writes made it to disk → nothing changes! (OK)
- Only data bitmap is written → data block leak, inconsistency in fs metadata (NOT OK)
- Inconsistent in the filesystem’s metadata (bitmap)
- What if only the file inode is written? (NOT OK)
- New data layout (tracking new data block)
- When the user reads through the inode’s new pointers, they will see garbage (the data block was never written)
- The data bitmap still thinks the block is free, so we could have an information leak: the inode can effectively hijack data from another inode that had allocated those “free” data blocks
- What if only the data block is written? (OK)
- inode doesn’t know that the data exists, fs metadata is consistent
Resolving inconsistency
- The file system checker (fsck)
- We have superblock: metadata for fs
- Should tell us where to find the inodetable, and the bitmaps
- bitmap vs inodes correction
- If a block is not marked allocated in the bitmap yet is referenced by an inode, we just go ahead and mark it allocated, even if it holds garbage data
- If the bitmap marks a block allocated but no inode references it, nothing points to it, so mark it unallocated (unmark it)
- fsck
- Also cares about fs namespace
- Users can only find files through a path: what happens when an allocated inode exists in the inodetable, but no directory entry points to it? (Remember, the kernel can still reach every inode via the inodetable)
- What if we write to all structures (inodetable, block bitmap), but we fail to update the parent directory’s data block to reflect the new inode because of a crash?
- Scans through each directory’s dir entry to see if all allocated inodes are referenced (if there is a path to the inode), if no pointers to inode, then:
- Put the file in the lost+found directory
- To reduce the recovery overhead, can we avoid inconsistency altogether?
- The problem is that we write changes directly to their actual locations
- Instead, we would like to group these writes into a single atomic unit
Transaction: group arbitrary number of updates into a single atomic unit
- tx_begin, tx_write(b1), tx_write(b2), … , tx_commit
- Journaling/WAL
- Reserve log space
- Write the transaction to the log first, then apply the transaction to their actual location
- If we crash after the transaction is fully in the log (committed), then upon recovery we know the log is persisted and can simply replay/continue the transaction; if we crash before the commit, the partial transaction is discarded
- Downside: write twice, once to the log and once to the actual location, suffer performance loss
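A toy sketch of the write-ahead ordering. log_append/log_commit/disk_write/log_clear are made-up placeholders that just print, not a real journaling API; the point is the order of the writes.

```c
#include <stdio.h>

/* Stand-in "disk" operations: they only print what they would persist. */
static void log_append(int blockno)  { printf("log   <- block %d\n", blockno); }
static void log_commit(void)         { printf("log   <- COMMIT record\n"); }
static void disk_write(int blockno)  { printf("disk  <- block %d (in place)\n", blockno); }
static void log_clear(void)          { printf("log   <- cleared\n"); }

/* Group of updates applied as one atomic unit. */
static void tx_commit(const int *blocknos, int n) {
    for (int i = 0; i < n; i++)      /* 1. all updates go to the log first */
        log_append(blocknos[i]);
    log_commit();                    /* 2. once committed, a crash is recoverable:
                                           recovery just replays the log */
    for (int i = 0; i < n; i++)      /* 3. only now write the real locations */
        disk_write(blocknos[i]);
    log_clear();                     /* 4. reclaim the log space */
}

int main(void) {
    int updated[] = { 7 /* inode */, 2 /* bitmap */, 120 /* data block */ };
    tx_commit(updated, 3);
    return 0;
}
```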