CS 213 — Multiprocessor Architecture and Programming
Instructor: Elaheh Sadredini
Homework Assignment 1 - Basics of Parallel Processing
1. (5 points) Describe the differences between symmetric multiprocessing with uniform
memory access (UMA) and distributed shared memory with non-uniform memory access
(NUMA). Identify one advantage and one disadvantage of UMA compared to NUMA.
Solution:
UMA (Symmetric Multiprocessing / Uniform Memory Access)
What it is: All processors see the same latency to main memory (centralized memory).
Programming and data placement are simpler, but the design is less scalable because
the shared memory becomes a bottleneck as the core count grows.
NUMA (Distributed Shared Memory / Non-Uniform Memory Access)
What it is: Each processor has fast local memory; remote memory is slower, but the
system is more scalable. Data placement matters for high performance.
One advantage of UMA vs. NUMA: Simplicity. Latency is uniform, so the programmer
need not worry about where data lives.
One disadvantage of UMA vs. NUMA: Scalability and peak performance. The centralized
memory becomes a bottleneck, so peak performance is lower as core counts increase.
2. (15 points) Program P runs to completion in 100 seconds on a single, in-order
processor. When Program P is parallelized, 20 seconds must be spent in serial
execution that cannot be parallelized, and the rest of execution time can be perfectly
parallelized on any number of processors.
(a) (5 points) When Program P runs on 16 processors, how much time would it
take to run to completion?
(b) (5 points) Assume we have a system with an infinite number of processors.
How much time would it take to run Program P to completion?
(c) (5 points) If we have a fixed area on which we can build one of the following
two systems:
- System 1 contains 16 in-order single-issue processors like the ones in
part (a)
- System 2 contains 4 out-of-order 4-wide superscalar processors. Program P
runs to completion on one of these processors in 50 seconds. When parallelized,
4 seconds must be spent in serial execution, and the rest can be perfectly
parallelized. Which of these two systems would you choose to run Program
P?
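The arithmetic behind parts (a) through (c) can be checked with a minimal sketch like the one below. It assumes only what the problem states: a fixed serial time plus a remaining part that divides evenly across the processors. The function name `run_time` and its parameters are illustrative, not part of the assignment.

```python
def run_time(total_s, serial_s, processors):
    """Completion time when the non-serial part is perfectly parallelized.

    total_s:    single-processor run time in seconds
    serial_s:   time that must stay serial, in seconds
    processors: number of processors (float("inf") models the limit case)
    """
    parallel_s = total_s - serial_s
    return serial_s + parallel_s / processors


# Part (a): 16 in-order processors, 100 s total, 20 s serial.
system1 = run_time(100, 20, 16)

# Part (b): unlimited processors; only the serial part remains.
limit = run_time(100, 20, float("inf"))

# Part (c), System 2: 4 out-of-order processors, 50 s total, 4 s serial.
system2 = run_time(50, 4, 4)

print(f"System 1 (16 in-order cores):    {system1} s")
print(f"Infinite processors:             {limit} s")
print(f"System 2 (4 out-of-order cores): {system2} s")
```

Under these assumptions the sketch gives 25 s for System 1, 20 s in the infinite-processor limit, and 15.5 s for System 2, so System 2 would finish Program P sooner.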
3. (10 points) Assume that we have a function for an application of the form F(i, p),
which gives the fraction of time that exactly i processors are usable given that a
total of p processors is available. That means that
$$\sum_{i=1}^{p} F(i, p) = 1$$
Assume that when i processors are in use, the application runs i times faster.
Rewrite Amdahl’s law so it gives the speedup as a function of p for some application.
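One way to approach this: the fraction of time F(i, p) spent with i processors in use shrinks by a factor of i, so the normalized run time is the sum of F(i, p)/i and the speedup is its reciprocal, Speedup(p) = 1 / (sum over i = 1..p of F(i, p)/i). The sketch below evaluates that expression numerically. The function name `generalized_speedup` and the list encoding of F are assumptions made for illustration.

```python
def generalized_speedup(fractions):
    """Speedup for a usage profile F(i, p).

    fractions[i - 1] holds F(i, p) for i = 1..p, so len(fractions) == p.
    The fractions must sum to 1.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("F(i, p) must sum to 1 over i = 1..p")
    # Each fraction of the original time runs i times faster.
    normalized_time = sum(f / i for i, f in enumerate(fractions, start=1))
    return 1.0 / normalized_time


# Illustrative profile for p = 16: 20% of the time only 1 processor is
# usable, 80% of the time all 16 are usable.
profile = [0.2] + [0.0] * 14 + [0.8]
print(generalized_speedup(profile))
```

With only F(1, p) and F(p, p) nonzero, the expression reduces to the usual form of Amdahl's law, which is a quick sanity check on the generalization.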
4. (30 points) In this exercise, we examine the effect of the interconnection network
topology on the clock cycles per instruction (CPI) of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 3.3 GHz
and the base CPI of an application with all references hitting in the cache is 0.5.
Assume that 0.2% of the instructions involve a remote communication reference
(i.e., miss in local memory). The cost of a remote communication reference is (100
+ 10h) ns, where h is the number of communication network hops that a remote
reference has to make to the remote processor memory and back (i.e., hop count
refers to the number of network devices through which data passes from source to
destination). Assume that all communication links are bidirectional.
(a) (12 points) Calculate the worst-case remote communication cost when the
64 processors are arranged (I) as a ring, (II) as an 8×8 processor grid (i.e.,
crossbar), or (III) as a hypercube. (Hint: The longest communication path on
a $2^n$-node hypercube has n links.)
(b) (12 points) Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in part (a).
(c) (6 points) How much faster is the application with no remote communication
compared to its performance with remote communication on each of the three
topologies in part (a)?
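The calculations in parts (a) through (c) can be sketched as follows. Several choices here are assumptions rather than facts from the problem statement: h is taken as the worst-case number of links between source and destination (following the hint), the 8×8 grid is treated as a 2D mesh without wraparound links, and every remote reference is charged the worst-case cost. A different reading of h or of the grid topology changes the numbers but not the method.

```python
CLOCK_GHZ = 3.3          # cycles per nanosecond
BASE_CPI = 0.5           # CPI when every reference hits in the cache
REMOTE_FRACTION = 0.002  # 0.2% of instructions make a remote reference


def remote_cost_ns(hops):
    """Cost of one remote reference in ns: 100 + 10h."""
    return 100 + 10 * hops


def cpi_with_remote(hops):
    """Base CPI plus the average stall cycles added per instruction."""
    stall_cycles = remote_cost_ns(hops) * CLOCK_GHZ
    return BASE_CPI + REMOTE_FRACTION * stall_cycles


# Assumed worst-case hop counts for 64 processors.
worst_case_hops = {
    "ring": 64 // 2,           # bidirectional ring: halfway around
    "grid": (8 - 1) + (8 - 1), # corner to opposite corner of a mesh
    "hypercube": 6,            # 64 = 2^6, so 6 links
}

for name, hops in worst_case_hops.items():
    cpi = cpi_with_remote(hops)
    print(f"{name:9s} h={hops:2d} cost={remote_cost_ns(hops)} ns "
          f"CPI={cpi:.3f} slowdown={cpi / BASE_CPI:.3f}x")
```

The slowdown column is the ratio of the CPI with remote communication to the base CPI, which is the quantity part (c) asks about.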
5. (30 points) Consider a simple 2D finite difference scheme where at each step every point in the matrix is updated by a weighted average of its four neighbors,
A[i,j]=A[i,j]-w(A[i-1,j]+ A[i+1,j]+ A[i,j-1]+ A[i,j+1]). All the values are 64-bit
floating-point numbers.
(a) (15 points) Assuming one element per processor and 1024×1024 elements, how
much data must be communicated per step?
(b) (15 points) Explain how this computation could be mapped onto 64 processors
to minimize the data traffic. Compute how much data must be communicated
per step.
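A minimal sketch for counting the traffic in both parts is shown below. It assumes the matrix is split evenly into a rectangular arrangement of processor blocks, that the matrix edges are not periodic, and that each value on an internal block boundary is sent once per step to the neighboring block. The function name `halo_bytes_per_step` and its parameters are illustrative.

```python
def halo_bytes_per_step(n, proc_rows, proc_cols, elem_bytes=8):
    """Total bytes crossing processor boundaries in one update step.

    The n x n matrix is divided into proc_rows x proc_cols equal blocks.
    Each internal boundary line spans n elements, and values cross it in
    both directions, so it contributes 2 * n elements per step.
    """
    internal_boundaries = (proc_rows - 1) + (proc_cols - 1)
    elements = 2 * n * internal_boundaries
    return elements * elem_bytes


# Part (a): one element per processor (a 1024 x 1024 processor array).
print(halo_bytes_per_step(1024, 1024, 1024))

# Part (b): 64 processors as an 8 x 8 arrangement of 128 x 128 blocks.
print(halo_bytes_per_step(1024, 8, 8))

# For comparison: 64 processors as horizontal strips of 16 rows each.
print(halo_bytes_per_step(1024, 64, 1))
```

Under these assumptions the square 8×8 decomposition moves less data than the strip decomposition, because square blocks have a smaller perimeter for the same area.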
6. (10 points) Multiprocessors usually show performance increases as you increase the
number of processors, with the ideal being n× speedup for n processors. The
goal here is to make a program that gets worse performance as you add processors.
This means, for example, that one processor on the multiprocessor runs the program
fastest, two are slower, four are slower than two, and so on. Explain one reason
why a program's execution could show inverse linear speedup on a multiprocessor.