CS 213 Homework 1: Basics of Parallel Processing

CS 213: Multiprocessor Architecture and Programming
Instructor: Elaheh Sadredini
Homework Assignment 1 - Basics of Parallel Processing

1. (5 points) Describe the differences between symmetric multiprocessing (UMA) and asymmetric multiprocessing (NUMA). Identify one advantage and one disadvantage of UMA compared to NUMA.

Solution:

UMA (Symmetric Multiprocessing / Uniform Memory Access)
What it is: All processors see the same latency to main memory (centralized memory). It is simpler to program and to place data for, but less scalable, because the shared memory becomes a bottleneck as the core count grows.
One advantage vs. NUMA: Simplicity. Latency is uniform, so the programmer need not worry about where data lives.
One disadvantage vs. NUMA: Scalability/peak performance. Centralized memory becomes a bottleneck, so peak performance is lower as core counts increase.

NUMA (Distributed Shared Memory / Non-Uniform Memory Access)
What it is: Each processor has fast local memory; remote memory is slower, but the system is more scalable. Data placement by the programmer matters for high performance.

2. (15 points) Program P runs to completion in 100 seconds on a single, in-order processor. When Program P is parallelized, 20 seconds must be spent in serial execution that cannot be parallelized, and the rest of the execution time can be perfectly parallelized on any number of processors.

(a) (5 points) When Program P runs on 16 processors, how much time would it take to run to completion?

(b) (5 points) Assume we have a system with an infinite number of processors. How much time would it take to run Program P to completion?

(c) (5 points) Suppose we have a fixed area on which we can build one of the following two systems:
- System 1 contains 16 in-order single-issue processors like the ones in part (a).
- System 2 contains 4 out-of-order 4-wide superscalar processors. Program P runs to completion on one of these processors in 50 seconds. When parallelized, 4 seconds must be spent in serial execution, and the rest can be perfectly parallelized.
Which of these two systems would you choose to run Program P?

3. (10 points) Assume that we have a function for an application of the form F(i, p), which gives the fraction of time that exactly i processors are usable given that a total of p processors is available. That means that $\sum_{i=1}^{p} F(i, p) = 1$. Assume that when i processors are in use, the application runs i times faster. Rewrite Amdahl's law so it gives the speedup as a function of p for some application.

4. (30 points) In this exercise, we examine the effect of the interconnection network topology on the clock cycles per instruction (CPI) of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 3.3 GHz and the base CPI of an application with all references hitting in the cache is 0.5. Assume that 0.2% of the instructions involve a remote communication reference (i.e., a miss in local memory). The cost of a remote communication reference is (100 + 10h) ns, where h is the number of communication network hops that a remote reference has to make to the remote processor memory and back (i.e., hop count refers to the number of network devices through which data passes from source to destination). Assume that all communication links are bidirectional.

(a) (12 points) Calculate the worst-case remote communication cost when the 64 processors are arranged (I) as a ring, (II) as an 8×8 processor grid (i.e., crossbar), or (III) as a hypercube. (Hint: The longest communication path on a $2^n$ hypercube has n links.)

(b) (12 points) Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in part (a).

(c) (6 points) How much faster is the application with no remote communication compared to its performance with remote communication on each of the three topologies in part (a)?

5. (30 points) Consider a simple 2D finite difference scheme where at each step every point in the matrix is updated by a weighted average of its four neighbors, A[i,j] = A[i,j] - w(A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]). All the values are 64-bit floating-point numbers.

(a) (15 points) Assuming one element per processor and 1024×1024 elements, how much data must be communicated per step?

(b) (15 points) Explain how this computation could be mapped onto 64 processors to minimize the data traffic. Compute how much data must be communicated per step.

6. (10 points) Multiprocessors usually show performance increases as you increase the number of processors, with the ideal being n× speedup for n processors. The goal here is to make a program that gets worse performance as you add processors. This means, for example, that one processor on the multiprocessor runs the program fastest, two are slower, four are slower than two, and so on. Explain one reason why a program's execution could show inverse linear speedup on a multiprocessor.
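The timing model described in Problem 2 (a fixed serial portion plus a perfectly divisible parallel portion) can be checked numerically. The following is a minimal sketch under that model only; the function name `parallel_time` and the variable names are illustrative and do not come from the assignment.

```python
import math


def parallel_time(serial_s: float, parallel_s: float, processors: float) -> float:
    """Run time when only the parallel part is divided across processors."""
    return serial_s + parallel_s / processors


# Figures taken from Problem 2:
#   System 1 cores: 100 s total, of which 20 s is serial (80 s parallelizable).
#   System 2 cores:  50 s total, of which  4 s is serial (46 s parallelizable).
system1_16 = parallel_time(20, 80, 16)         # 20 + 80/16
system1_inf = parallel_time(20, 80, math.inf)  # parallel part shrinks to zero
system2_4 = parallel_time(4, 46, 4)            # 4 + 46/4

print(system1_16, system1_inf, system2_4)  # 25.0 20.0 15.5
```

Under this model the serial portion sets a floor on run time, which is why the comparison in part (c) depends on the serial time of each core type and not only on the processor count.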
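For Problem 4, the remote reference cost of (100 + 10h) ns converts to stall cycles at 3.3 cycles per nanosecond, and 0.2% of instructions pay that stall. The sketch below works through that arithmetic. The hop counts are assumptions, not values given in the assignment: h is taken as the one-way worst-case link count, the ring is assumed bidirectional (64/2 hops), the 8×8 grid is treated as a 2D mesh without wraparound (corner to corner), and the hypercube uses the hint (log2 of 64). If h is meant to count the round trip, the values passed in would change accordingly.

```python
CLOCK_GHZ = 3.3          # processor cycles per nanosecond
BASE_CPI = 0.5           # CPI with no remote references
REMOTE_FRACTION = 0.002  # 0.2% of instructions make a remote reference


def remote_cost_ns(hops: int) -> int:
    """Cost of one remote reference, from the formula in Problem 4."""
    return 100 + 10 * hops


def cpi_with_remote(hops: int) -> float:
    """Base CPI plus the average stall cycles added per instruction."""
    stall_cycles = remote_cost_ns(hops) * CLOCK_GHZ
    return BASE_CPI + REMOTE_FRACTION * stall_cycles


# ASSUMED worst-case hop counts (see the lead-in for the reasoning).
worst_hops = {
    "ring": 64 // 2,                     # halfway around a bidirectional ring
    "8x8 grid": 2 * (8 - 1),             # corner to corner, no wraparound
    "hypercube": (64).bit_length() - 1,  # log2(64) dimensions
}

for name, hops in worst_hops.items():
    cpi = cpi_with_remote(hops)
    print(f"{name}: h={hops}, cost={remote_cost_ns(hops)} ns, "
          f"CPI={cpi:.3f}, slowdown vs. base={cpi / BASE_CPI:.2f}x")
```

Keeping the hop count as a parameter makes it easy to rerun the comparison under a different reading of h.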
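For Problem 5, one way to count communication is to sum the values that cross boundaries between adjacent processors in a single update step. The sketch below does this for a square block decomposition. It rests on several assumptions that the assignment does not state: the matrix has no wraparound at its edges, each processor owns a square block, the matrix size divides evenly by the number of blocks per side, and every boundary value is sent once in each direction per step. The function name `halo_bytes_per_step` is hypothetical.

```python
def halo_bytes_per_step(n: int, blocks_per_side: int, bytes_per_value: int = 8) -> int:
    """Bytes crossing processor boundaries in one update step.

    n               -- the matrix is n x n points
    blocks_per_side -- processors are arranged blocks_per_side x blocks_per_side
    bytes_per_value -- 8 for 64-bit floating-point values
    """
    edge = n // blocks_per_side  # points along one edge of a block
    # Adjacent block pairs: horizontal neighbors plus vertical neighbors.
    neighbor_pairs = 2 * blocks_per_side * (blocks_per_side - 1)
    # Each pair exchanges one edge of values in each direction.
    values_sent = neighbor_pairs * 2 * edge
    return values_sent * bytes_per_value


one_per_processor = halo_bytes_per_step(1024, 1024)  # one point per processor
blocks_8x8 = halo_bytes_per_step(1024, 8)            # 64 processors, 128 x 128 points each

print(one_per_processor, blocks_8x8)  # 33521664 229376
```

The comparison illustrates the surface-to-volume effect: with larger blocks, only the points on block edges are communicated, while interior updates stay local.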
