Parallel and Distributed Computing

These are my revision notes for Parallel and Distributed Computing. This post covers the fundamental concepts and trade-offs behind parallel and distributed applications, the design and implementation of parallel and distributed applications, and performance analysis.


Parallel Programming

  • Why we need ever-increasing performance
    • As our computational power increases, the number of problems that we can seriously consider solving also increases. E.g. climate modelling / protein folding / drug discovery / energy research / data analysis
  • Why building parallel systems
    • As the speed of transistors increases, their power consumption increases. Most of this power is dissipated as heat, and when an integrated circuit gets too hot, it becomes unreliable.
    • Transistor density can still increase for a while, but more slowly than before.
  • Write parallel program
    • Basic idea: partition the work to be done among cores
    • Two widely used approaches
      • Task-parallelism: partition the various tasks carried out in solving the problem among the cores
        • E.g., each tutor grades one particular question on every paper
      • Data-parallelism: partition data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data
        • E.g., each tutor grades one pile of exam papers
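    • A minimal sketch of the data-parallel idea in C (function and variable names are illustrative, not taken from any particular program):
      /* Data-parallel sketch: core my_rank (of p cores) sums only its own
         contiguous block of the n-element array a[].                      */
      double block_sum(const double a[], int n, int p, int my_rank) {
          int block    = n / p;              /* assume p divides n evenly */
          int my_first = my_rank * block;    /* first index of my block   */
          int my_last  = my_first + block;   /* one past my last index    */
          double sum = 0.0;
          for (int i = my_first; i < my_last; i++)
              sum += a[i];
          return sum;
      }
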
  • Coordination of cores
    • Communication: one or more cores send their message to another core
    • Load balancing: make sure the cores are all assigned roughly the same amount of work
    • Synchronization: the cores need to wait for other cores at certain points in the computation
  • Two types of parallel systems
    • Shared-memory system: we can coordinate the cores by having them examine and update shared-memory locations
    • Distributed-memory system: each core has its own, private memory, and the core must communicate explicitly by doing something like sending messages across a network
  • Concurrent, parallel and distributed
    • In concurrent computing, a program is one in which multiple tasks can be in progress at any instant
    • In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem
    • In distributed computing, a program may need to cooperate with other programs to solve a problem

Parallel Hardware and Software

  • von Neumann Architecture
    • The classical von Neumann architecture consists of main memory, a central processing unit (CPU), and an interconnection between the memory and the CPU.
      • Main memory consists of a collection of locations, each of which is capable of storing both instructions and data.
      • The central processing unit is divided into a control unit and an arithmetic and logic unit (ALU).
        • The control unit is responsible for deciding which instructions in a program should be executed. The control unit has a special register called the program counter which stores the address of the next instruction to be executed.
        • The ALU is responsible for executing the actual instructions. Data in the CPU and information about the state of an executing program are stored in very fast storage called registers.
      • Instructions and data are transferred between the CPU and memory via the interconnect. This has traditionally been a bus, which consists of a collection of parallel wires and some hardware controlling access to the wires.
    • von Neumann bottleneck: The separation of memory and CPU is often called the von Neumann bottleneck, since the interconnect determines the rate at which instructions and data can be accessed.
  • Multitasking: The OS provides support for the apparent simultaneous execution of multiple programs. This is possible even on a system with a single core, since each process runs for a small interval of time (time slice). After one running program has executed for a time slice, the OS can run a different program.
  • Process: When a user runs a program, the OS creates a process which is an instance of a computer program that is being executed.
  • Threading: provides a mechanism for programmers to divide their programs into more or less independent tasks with the property that when one thread is blocked another thread can be run. In addition, it is possible to switch between threads much faster than switching between processes.
  • Thread: Threads are contained within a process, so they can use the same executable, and they usually share the same memory and the same I/O devices. The two most important exceptions are that they need a record of their own PC and their own call stacks so that they can execute independently of each other. When a thread is started, it forks off from the process; when a thread terminates, it joins the process.
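  • A minimal fork/join sketch using Pthreads (illustrative only; these notes do not prescribe a particular threads API):
    #include <pthread.h>
    #include <stdio.h>

    /* Each forked thread runs this function and then joins the main thread. */
    void *hello(void *rank) {
        long my_rank = (long) rank;
        printf("Hello from thread %ld\n", my_rank);
        return NULL;
    }

    int main(void) {
        const long thread_count = 4;
        pthread_t handles[4];
        for (long t = 0; t < thread_count; t++)          /* fork */
            pthread_create(&handles[t], NULL, hello, (void *) t);
        for (long t = 0; t < thread_count; t++)          /* join */
            pthread_join(handles[t], NULL);
        return 0;
    }
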
  • Flynn’s taxonomy: classify computer architectures
    • SISD (single instruction, single data): executes a single instruction at a time and it can fetch or store one item of data at a time. (classical von Neumann system)
    • SIMD (single instruction, multiple data): operates on multiple data streams by applying the same instruction to multiple data items.
      • Vector processors: operate on arrays or vectors of data. Vector systems have very high memory bandwidth, and every data item that is loaded is actually used. However, they do not handle irregular data structures as well as other parallel architectures, and there seems to be a limit to their scalability.
      • Graphics processing units (GPUs): the graphics processing pipeline converts an internal representation into an array of pixels that can be sent to a computer screen. Several of the stages of this pipeline are programmable. The behavior of the programmable stages is specified by functions called shader functions. All GPUs use SIMD parallelism, but GPUs are not pure SIMD systems, since current-generation GPUs can have dozens of cores, each capable of executing independent instruction streams.
    • MIMD (multiple instruction, multiple data): supports multiple simultaneous instruction streams operating on multiple data streams. Thus, MIMD systems typically consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU (asynchronous).
      • Shared-memory systems (multiple multicore processors)
        • Uniform memory access (UMA): The time to access all memory locations will be the same for all cores / Easier to program
        • Nonuniform memory access (NUMA): Faster access to the directly connected memory / Has the potential to use a larger amount of memory
      • Distributed-memory systems (clusters)
        • Clusters are composed of a collection of commodity systems (e.g., PCs) connected by a commodity interconnection network (e.g., Ethernet).
  • Difference between SPMD and SIMD
    • SPMD refers to a programming model where a single program is executed by multiple parallel processing units, such as multiple threads or processes. Each processing unit operates on its own portion of the data, but they all execute the same program. SPMD allows for flexible control flow and can be used to express both data parallel and task parallel computations.
    • SIMD refers to a type of parallelism where a single instruction is executed simultaneously by multiple processing elements on different data elements. SIMD architectures typically have a single control unit that broadcasts instructions to multiple processing units, allowing for efficient parallel execution of the same operation on multiple data elements in parallel.
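    • A sketch of the SPMD pattern in C (my_rank and p would normally be supplied by the runtime, e.g. by MPI; here they are plain parameters for illustration):
      #include <stdio.h>

      /* One program text; every running copy executes this body and
         branches on its own rank (SPMD).                              */
      void spmd_body(int my_rank, int p) {
          if (my_rank == 0)
              printf("rank 0 of %d: coordinating / handling I/O\n", p);
          else
              printf("rank %d of %d: working on my block of the data\n", my_rank, p);
      }
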
  • Communication and synchronization
    • In distributed-memory programs, we often implicitly synchronize the processes by communicating among them.
    • In shared-memory programs, we often communicate among the threads by synchronizing them.
  • Embarrassingly parallel
    • Programs that can be parallelized by simply dividing the work among the processes / threads.

Programming with Message Passing using MPI

  • Distributed-memory systems using message-passing
    • Process: a program running on one core-memory pair
    • Message-Passing Interface (MPI): one process calls a send function and the other calls a receive function. Processes typically identify each other by ranks in the range 0, 1, …, p - 1, where p is the number of processes.
  • MPI functions
    • MPI_Init(): tells MPI system to do all of the necessary setup. It might allocate storage for message buffers, and it might decide which process gets which rank. No other MPI functions should be called before the program calls MPI_Init().
    • MPI_Finalize(): tells MPI system that we are done using MPI, and that any resources allocated for MPI can be freed. No MPI functions should be called after the call to MPI_Finalize().
    • MPI_Comm_size(MPI_COMM_WORLD, &comm_size): returns in its second argument the number of processes in the communicator
    • MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank): returns in its second argument the calling process’s rank in the communicator
    • MPI_Send(msg_buf_p, msg_size, msg_type, dest, tag, communicator)
      • msg_buf_p: a pointer to the block of memory containing the contents of the message
      • msg_size: amount of data to be sent
      • msg_type: type of data to be sent
      • dest: specifies the rank of the process that should receive the message
      • tag: it can be used to distinguish messages that are otherwise identical
      • communicator: a message sent by a process using one communicator cannot be received by a process that’s using a different communicator
    • MPI_Recv(msg_buf_p, buf_size, buf_type, source, tag, communicator, status_p)
      • msg_buf_p: points to the block of memory
      • buf_size: determines the number of objects that can be stored in the block
      • buf_type: indicates the type of the objects
      • source: specifies the process from which the message should be received (can use MPI_ANY_SOURCE)
      • tag: should match the tag used by the sending process (can use MPI_ANY_TAG)
      • communicator: match the communicator used by the sending process
      • status_p: it is a structure with at least three members MPI_SOURCE, MPI_TAG, MPI_ERROR, after a call to MPI_Recv in which &status is passed as the last argument, we can determine the sender and tag.
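  • Putting these calls together: a minimal sketch along the lines of the classic MPI "greetings" program (the buffer size and tag value are illustrative). It can be built and launched with the mpicc / mpiexec commands shown below.
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(void) {
        char greeting[100];                 /* illustrative buffer size */
        int comm_size, my_rank;

        MPI_Init(NULL, NULL);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank != 0) {
            sprintf(greeting, "Greetings from process %d of %d", my_rank, comm_size);
            MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            printf("Greetings from process %d of %d\n", my_rank, comm_size);
            for (int q = 1; q < comm_size; q++) {
                MPI_Recv(greeting, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("%s\n", greeting);
            }
        }

        MPI_Finalize();
        return 0;
    }
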
  • Semantics of MPI_Send and MPI_Recv
    • The sending process will assemble the message
    • If the sending process buffers the message, the MPI system will place the message into its own internal storage
    • If the system blocks, it will wait until it can begin transmitting the message, and the call to MPI_Send may not return immediately
    • MPI_Recv always blocks until a matching message has been received
    • MPI requires that messages be non-overtaking. This means that if process q sends two messages to process r, then the first message sent by q must be available to r before the second message. However, there is no restriction on the arrival of messages sent from different processes.
  • Blocking and Non-blocking sends and receives
    • Blocking
      • Send is complete when the message buffer has been fully transferred to the MPI system.
      • Receive is complete when the message data has arrived at the destination and is available for use.
    • Non-blocking
      • The calls return immediately and the program just continues, with no regard for completion status
      • Can be useful to help avoid deadlock
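  • A sketch of a non-blocking exchange between two processes (MPI_Isend / MPI_Irecv / MPI_Waitall; the buffer contents are illustrative):
    #include <stdio.h>
    #include <mpi.h>

    #define N 4

    int main(void) {
        int my_rank, comm_size;
        double send_buf[N], recv_buf[N];
        MPI_Request reqs[2];

        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

        if (comm_size == 2) {
            int partner = 1 - my_rank;
            for (int i = 0; i < N; i++) send_buf[i] = my_rank;

            /* Post the receive and the send, then wait: neither process sits
               in a blocking send, which helps avoid deadlock. */
            MPI_Irecv(recv_buf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_buf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
            /* ... useful computation could overlap with the communication here ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            printf("process %d received %g from %d\n", my_rank, recv_buf[0], partner);
        }

        MPI_Finalize();
        return 0;
    }
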
  • Commands for compile and run
    mpicc -g -Wall -o mpi_hello mpi_hello.c
    mpiexec -n 4 ./mpi_hello
    
  • Collective operations
    • MPI_Bcast(): Copies data from the root node to the corresponding buffer in every other node
    • MPI_Gather(): Each node sends the contents of the send buffer to the root node, and root node stores them in rank order.
    • MPI_Scatter(): Root process splits buffer into equal chunks and sends one chunk to each processor
    • MPI_Alltoall(): Each node performs a Scatter operation on its own data. Thus every node receives some data from every other node.
    • MPI_Reduce(): Combines the values from all processes and produces a single result, which is typically stored on a designated root process.
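  • A sketch of one collective call (MPI_Reduce summing one double per process onto rank 0; the per-process value is purely illustrative):
    #include <stdio.h>
    #include <mpi.h>

    int main(void) {
        int my_rank, comm_size;
        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

        double local_val = my_rank + 1.0;   /* illustrative per-process value */
        double total = 0.0;

        /* Combine every process's local_val with MPI_SUM; the result is
           stored only on the root process (rank 0). */
        MPI_Reduce(&local_val, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (my_rank == 0)
            printf("sum over %d processes = %f\n", comm_size, total);

        MPI_Finalize();
        return 0;
    }
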
  • Output
    • Most MPI implementations allow all processes to execute printf and fprintf(stderr, ...).
    • However, most MPI implementations don’t provide any automatic scheduling of access to these devices: the MPI processes are competing for access to the shared output device, stdout, and it is impossible to predict the order in which the processes’ output will be queued up. Hence, we can have each process other than 0 send its output to process 0, and process 0 can print the output in process rank order.
  • Input
    • Most MPI implementations only allow process 0 in MPI_COMM_WORLD access to stdin. In order to write MPI programs that can use scanf, we need to branch on process rank, with process 0 reading in the data and then sending it to the other processes.
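  • A sketch of this input pattern (rank 0 reads a value and distributes it; MPI_Bcast is one convenient way to do the distribution):
    #include <stdio.h>
    #include <mpi.h>

    int main(void) {
        int my_rank, n = 0;
        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank == 0)
            scanf("%d", &n);                 /* only rank 0 touches stdin */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("process %d has n = %d\n", my_rank, n);
        MPI_Finalize();
        return 0;
    }
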

Interconnection Networks

  • Shared-memory interconnects
    • Bus: a collection of parallel communication wires together with some hardware that controls access to the bus. The key characteristic of a bus is that the communication wires are shared by the devices that are connected to it. However, as the number of devices connected to the bus increases, the likelihood that there will be contention for use of the bus increases, and the expected performance of the bus decreases.
    • Crossbars: switched interconnects use switches to control the routing of data among connected devices. Crossbar allows simultaneous communication among different devices, so they are much faster than buses. However, the cost of the switches and links is relatively high.
  • Distributed-memory interconnects
    • Interconnects
      • Direct interconnects: each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
        • Ring: If there are p processors, the number of links is 2p.
        • Toroidal mesh: If there are p processors, the number of links is 3p.
        • Fully connected network: It is used as a basis for evaluating other interconnects. However, it is impractical, since it requires a total of p^2/2 + p/2 links, and each switch must be capable of connecting to p links.
        • Hypercube: A hypercube of dimension d has p = 2^d nodes, and a switch in a d-dimensional hypercube is directly connected to a processor and d switches.
      • Indirect interconnects: the switches may not be directly connected to a processor. They are often shown with unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a switching network.
        • Crossbar: As long as two processors do not attempt to communicate with the same processor, all processors can simultaneously communicate with another processor. (Crossbar > MIN (multistage interconnection network) > Bus)
        • Omega network: a multistage network that is cheaper than a crossbar, but there are communications that cannot occur simultaneously.
  • Bisection width
    • It refers to a measure of the communication capacity or bandwidth between two halves of a system when it is divided into two equal parts.
    • The bisection width is a metric that quantifies the communication capacity between these two groups. It represents the maximum amount of data that can be exchanged between the two halves of the system simultaneously.
    • An alternative way of computing the bisection width is to remove the minimum number of links needed to split the set of nodes into two equal halves. The number of links removed is the bisection width.
  • Bisection bandwidth
    • often used as a measure of network quality
    • instead of counting the number of links joining the two halves, it sums the bandwidths of those links
  • Latency: the time that elapses between the source’s beginning to transmit the data and the destination’s starting to receive the first byte.
  • Bandwidth: the rate at which the destination receives data after it has started to receive the first byte
  • Time to transmit a message of n bytes
    • message transmission time = latency + n / bandwidth
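    • Worked example (assumed values): with a latency of 2 µs, a bandwidth of 1 GB/s, and a message of n = 10^6 bytes, message transmission time = 2 µs + 10^6 B / (10^9 B/s) = 2 µs + 1000 µs ≈ 1 ms.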

Parallel Program Design

  • Foster’s methodology
    • Partitioning
      • Divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.
    • Communication
      • Determine what communication needs to be carried out among the tasks identified in the previous step.
    • Agglomeration or aggregation
      • Combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.
    • Mapping
      • Assign the composite tasks identified in the previous step to processors / threads. This should be done so that communication is minimized, and each process / thread gets roughly the same amount of work.

Performance Analysis

  • Linear speedup: If we call the serial run-time Tserial and our parallel run-time Tparallel, then the best we can hope for is Tparallel = Tserial / p. However, in practice, we are unlikely to get linear speedup because the use of multiple processes / threads almost invariably introduces some overhead.
  • Speedup of a parallel program
    • Speedup refers to the performance improvement achieved by executing a program on multiple processors compared to running it on a single processor.
    • S = Tserial / Tparallel
  • Efficiency of a parallel program
    • E = S / p = (Tserial / Tparallel) / p
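    • Worked example (assumed values): if Tserial = 40 s and Tparallel = 10 s on p = 8 cores, then S = 40 / 10 = 4 and E = 4 / 8 = 0.5, i.e., on average each core spends only half its time doing useful work.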
  • Effect of problem size
    • When we increase the problem size, the speedups and the efficiencies increase, while they decrease when we decrease the problem size (when p is not small). The relationship between problem size and speedup depends on various factors, including the nature of the problem, the parallelization technique used, the hardware architecture, and the efficiency of the parallel algorithm.
    • Since Tparallel = (Tserial / p) + Toverhead, if there’s more work for the processes / threads to do, the relative amount of time spent coordinating the work of the processes / threads should be less.
  • Amdahl’s law
    • Roughly: unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available.
    • More generally, if a fraction r of our serial program remains unparallelized, then Amdahl’s law says we cannot get a speedup better than 1/r.
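    • In full form, with p cores the speedup is bounded by S ≤ 1 / (r + (1 - r) / p). For example, with r = 0.1 and p = 16, S ≤ 1 / (0.1 + 0.9 / 16) = 6.4, and even as p → ∞ the speedup can never exceed 1 / r = 10.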
  • Gustafson’s law
    • For many problems, as we increase the problem size, the ‘inherently’ serial fraction of the problem decreases in size. (No need to worry about Amdahl’s law)
  • Scalability
    • Suppose we now increase the number of processes / threads that are used by the program. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is scalable.
    • Types of scalable cases
      • Strongly scalable: if when we increase the number of processes / threads, we can keep the efficiency fixed without increasing the problem size.
      • Weakly scalable: if we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes / threads.
  • Taking timings
    • Resolution: the unit of measurement on the timer. It is the duration of the shortest event that can have a nonzero time.
    • We first execute a barrier function that approximately synchronizes all of the processes / threads. We then execute the code and each process / thread finds the time it took. Then all the processes / threads call a global maximum function, which returns the largest of the elapsed times, and process / thread 0 prints it out.
    • If we run the same experiment several times, we usually report the minimum time.
    • We rarely run more than one thread per core.
    • We usually do not include I/O in our reported run-times.
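    • A sketch of the barrier / elapsed-time / global-maximum pattern described above, using MPI (MPI_Wtime, MPI_Barrier, and MPI_Reduce with MPI_MAX; the loop is just a placeholder workload):
      #include <stdio.h>
      #include <mpi.h>

      int main(void) {
          int my_rank;
          MPI_Init(NULL, NULL);
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

          MPI_Barrier(MPI_COMM_WORLD);             /* approximately synchronize */
          double local_start = MPI_Wtime();
          volatile double x = 0.0;
          for (int i = 0; i < 1000000; i++)        /* placeholder for timed code */
              x += 1.0;
          double local_elapsed = MPI_Wtime() - local_start;

          /* The largest elapsed time across all processes is reported by rank 0. */
          double elapsed = 0.0;
          MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
          if (my_rank == 0)
              printf("Elapsed time = %e seconds\n", elapsed);

          MPI_Finalize();
          return 0;
      }
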
  • Barrier
    • A barrier is a mechanism that imposes global synchronization, ensuring that all participating threads or processes reach a specific point before continuing, thereby enabling coordination and synchronization.

Multithreading Programming

  • Dynamic and static threads
    • Dynamic thread
      • There is often a master thread and, at any given instant, a (possibly empty) collection of worker threads. The master thread typically waits for work requests, and when a new request arrives, it forks a worker thread; the worker carries out the request, and when it completes the work, it terminates and joins the master thread.
    • Static thread
      • In this paradigm, all of the threads are forked after any needed setup by the master thread and the threads run until all the work is completed.
      • The static thread paradigm has the potential for better performance than the dynamic paradigm.
      • Static thread paradigm is often used.
  • Nondeterminism
    • A computation is nondeterministic if a given input can result in different outputs.
  • Race condition
    • Occurs when threads or processes attempt to access a shared resource simultaneously and the accesses can result in an error (the outcome depends on the unpredictable timing of the accesses).
  • Critical section
    • A block of code that can only be executed by one thread at a time.
  • Mutual exclusion lock / Mutex / Lock
    • The most commonly used mechanism for ensuring mutual exclusion
    • Before a thread can execute the code in the critical section, it must ‘obtain’ the mutex by calling a mutex function, and when it finishes executing the code in the critical section, it should relinquish the mutex by calling an unlock function.
    • While one thread owns the lock, any other thread attempting to execute the code in the critical section will wait in its call to the lock function.
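    • A minimal sketch of this lock / unlock pattern with a Pthreads mutex (the counts and the number of threads are illustrative):
      #include <pthread.h>
      #include <stdio.h>

      /* Every thread must obtain the mutex before touching shared_counter. */
      pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      long shared_counter = 0;

      void *worker(void *arg) {
          for (int i = 0; i < 100000; i++) {
              pthread_mutex_lock(&mutex);      /* obtain the mutex      */
              shared_counter++;                /* critical section      */
              pthread_mutex_unlock(&mutex);    /* relinquish the mutex  */
          }
          return NULL;
      }

      int main(void) {
          pthread_t t[4];
          for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
          printf("shared_counter = %ld\n", shared_counter);   /* always 400000 */
          return 0;
      }
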
  • Busy-waiting
    • A thread enters a loop whose sole purpose is to test a condition. However, it can be very wasteful of system resources.
  • Thread safety
    • A block of code is thread-safe if it can be simultaneously executed by multiple threads without causing problems. For example, a static variable that is declared in a function persists from one call to the next; hence, static variables are effectively shared among any threads that call the function, and this can have unexpected and unwanted consequences (such a function is not thread-safe).
  • Deadlock
    • Deadlock refers to a situation in a concurrent system where two or more processes or threads are unable to proceed because each is waiting for a resource that is held by another process or thread in the system.

Programming with OpenMP

  • OpenMP provides what’s known as a “directive-based” shared-memory API. In C and C++, this means that there are special preprocessor instructions known as pragmas. Pragmas are typically added to a system to allow behaviors that are not part of the basic C specification. Compilers that do not support the pragmas are free to ignore them. So, in principle, if you have a carefully written OpenMP program, it can be compiled and run on any system with a C compiler, regardless of whether the compiler supports OpenMP.
  • Commands for compile and run
    gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
    ./omp_hello 4
    
  • Parallel for
    • Like the parallel directive, the parallel for directive forks a team of threads to execute the following structured block. However, the structured block following a parallel for directive must be a for loop, and the system parallelizes it by dividing the iterations of the loop among the threads.
  • Loop-carried dependence
    • A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.
  • Reduction clause
    • OpenMP creates a private variable for each thread, and the run-time system stores each thread’s result in this private variable. OpenMP also creates a critical section and the values stored in the private variables are added in this critical section.
    • Syntax: reduction(<operator>: <variable list>)
  • Scope of variables
    • A variable that can be accessed by all the threads in the team has shared scope, while a variable that can only be accessed by a single thread has private scope.
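  • A minimal sketch combining parallel for, the reduction clause, and default scoping (the harmonic-style sum is purely illustrative; compile with the -fopenmp flag shown above):
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The iterations are divided among the threads; each thread accumulates
           into its own private copy of sum (private scope), and the private
           copies are combined with + into the shared sum at the end. */
        #pragma omp parallel for reduction(+: sum)
        for (int i = 1; i <= n; i++)
            sum += 1.0 / i;

        printf("sum = %f\n", sum);
        return 0;
    }
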

OpenCL

  • Definition
    • OpenCL (Open Computing Language) is an open standard and framework for programming heterogeneous computing platforms, including CPUs, GPUs, and other accelerators. It provides a unified programming model and API (Application Programming Interface) that allows developers to write code that can run efficiently across different hardware architectures.
  • OpenCL memory model
    • The OpenCL memory model provides a framework for managing different types of memory in OpenCL computations, including private, local, global, and constant memory. It establishes rules for data organization, movement, and synchronization, allowing efficient utilization of memory resources and proper synchronization in parallel computations.
  • OpenCL execution model
    • Application runs on a host which submits work to devices
      • Work item: the basic unit of work on an OpenCL device
      • Kernel: the code for a work item (basically a C function)
      • Program: collection of kernels and other functions
    • Context
      • The environment within which work-items execute. It includes devices and their memories and command queues.
    • Command queue: a queue used by the host application to submit work to a device
      • Work is queued in-order, one queue per device
      • Work can be executed in-order or out-of-order
  • Steps that a host program must include, when using OpenCL for parallel work.
    • Platform discovery / Device selection / Context creation / Command queue creation / Program compilation / Kernel creation / Memory object creation and data transfer / Kernel execution / Synchronization and result retrieval / Cleanup
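  • A minimal kernel sketch (OpenCL C, kernel code only; the host-side steps above would build and launch it, and the vector-add is purely illustrative):
    /* Each work-item handles one element; its index comes from get_global_id. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
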
  • Restrictions
    • Limited Functionality of Standard C Library
      • OpenCL restricts the functionality of the standard C library that is available for use within kernel code.
      • The reason is to maintain portability across diverse hardware platforms. Different OpenCL devices may have varying capabilities and may lack support for certain standard C library functions.
    • Limited Data Types
      • OpenCL restricts the available data types in kernel code compared to standard C. More complex data types, such as structures and unions, are not directly supported in kernel code.
      • This restriction aims to facilitate efficient memory access and vectorization in parallel computations.

Performance Issues

  • High performance
    • Balance workload onto available execution resources
    • Reduce communication
    • Reduce extra work (overhead) performed to increase parallelism and to manage assignment
  • Communication to Computation Ratio
    • amount of communication / amount of computation
    • A low ratio is needed to effectively utilize modern parallel processors
  • Communication
    • Inherent communication
      • Information that fundamentally must be moved between processors to carry out the algorithm given the specified assignment
    • Artificial communication
      • All other communication
  • Contention
    • Contention occurs when many requests to a resource are made within a small window of time
  • Improving program performance
    • Identify and exploit locality
    • Reduce overhead
    • Reduce contention
    • Maximize overlap of communication and processing
  • Batch scheduling
    • Batch scheduling is a method of managing and executing computational tasks on high-performance computers in a non-interactive manner. In batch scheduling, users submit their jobs or tasks to a job scheduler, which allocates computing resources to the execution of these jobs based on various policies and priorities.
    • Advantages: efficient resource utilization, fairness, workload management capabilities, and system stability