Exam in Computer Architecture (EDA111)

Time: December 22, 2006 after lunch in the M building

Person in charge of the exam: M. M. Waliullah, 031-772 1653 or Per Stenström 031-772

1761

Supporting material/tools: None

**Exam Review:** January 4, 2006 between 10-11.45 in Per Stenstrom's office.

## **Grading intervals:**

• **Fail**: Result < 20

Grade 3: 20 <= Result < 29</li>Grade 4: 30 <= Result < 39</li>

• **Grade 5:** 40 <= Result

**Important note:** Answers can be given in Swedish. The reason this exam is presented in English is that the course is also offered to the International Master's Program on Dependable Computer Systems.

## **GOOD LUCK!**

Per Stenström

CHALMERS UNIVERSITY OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
412 96 Göteborg

Visiting address: Rännvägen 5

Phone: 031-772 1761 Fax: 031-772 3663

Org. Nr: 556479-5598 E-mail: pers@ce.chalmers.se



[General disclaimer: If you feel that all assumptions are not made to solve the problem 1) ask the teacher when he shows up or 2) make your own additional assumptions. This will be accepted if they are reasonable]

#### **ASSIGNMENT 1**

**A)** Three programs (A, B, and C) are run on three computer systems (C1, C2, C3) with the execution times shown in the table below:

|    | A  | В  | С |
|----|----|----|---|
| C1 | 10 | 20 | 2 |
| C2 | 20 | 10 | 1 |
| C3 | 10 | 5  | 2 |

Under the assumption that all programs are run with the same probability, which computer system is the fastest? (4 points)

# B) Establish the geometric mean of the execution time of the fastest computer system normalized to the slowest. (2 points)

C) Let us assume that the execution time for a program on a computer system is T. 50% of the program execution consists of a loop with no loop-carried dependences. Each iteration of that loop does a load, adds two numbers, and stores the result in memory. Further, the program is executed on a VLIW-architecture with M addition units and a single integer/load/store unit. The number of iterations is M. Use Amdahl's Law to calculate the speedup on the VLIW-architecture compared to an architecture that can execute a single operation at a time. In both cases, you may assume that the loop is unrolled an arbitrary number of times. (4 points)

#### **ASSIGNMENT 2**

A) The diagram on the next page shows the distribution of control instructions across Call/Return, Jump, and conditional branches. Let us assume that every fifth instruction is a control instruction. In a computer system, the prediction accuracy is 50% for the outcome of a control instruction. Further, the CPI for all instructions (except control instructions) is 1. For control instructions, the CPI is 2 if the outcome is correctly predicted whereas it is 4 if it is wrongly predicted. What is the average CPI? (4 points)



© 2003 Elsevier Science (USA). All rights reserved.

- B) For each item below, explain if the component goes up/down and why when a program is compiled with an optimizing compiler which only does redundant code elimination and register allocation: (4 points)
  - Number of executed loads and stores
  - Number of executed branches
  - Fraction of branches of all executed instructions
  - Number of executed instructions in a loop
- C) Explain how a stack-architecture executes the following code: A=B+C (2 points)

#### **ASSIGNMENT 3**



The above diagram shows a pipeline with Tomasulo's algorithm. Consider the following code:

SUB F0, F1,F2 (8 cycles) DIV F3,F0,F4 (10 cycles) ADD F4, F5,F6 (6 cycles)

- A) Show all data and name dependences in the code. (2 points)
- **B)** Assuming that there are two addition functional units and a division functional unit and that a single instruction is issued every cycle. The instruction latencies thorugh the functional units are listed above. **Establish when the second addition instruction can start its execution.** (2 points)
- C) Explain why there is no possibility that the ADD instruction, which finishes its execution before the DIV instruction starts its execution, can affect the content of F4 used by the DIV instruction. (4 points)
- D) Explain how the content of F0 is supplied to the DIV instruction. (2 points)

### **ASSIGNMENT 4**

The following loop is executed on a pipeline with that causes stall cycles as marked:

LOOP: LD F0,0(R1) Stall

ADDD F4,F0,F2

Stall

SD 0(R1),F4 SUBI R1,R1,#8 BNEZ R1,LOOP

- A) Eliminate the stall cycles by unrolling the loop once, applying register renaming, and instruction scheduling. (4 points)
- B) What is the speedup of the code in A) as compared to the original code assuming that the CPI for *all* instructions is one but taking the explicit stall cycles into account? (3 points)
- C) Assume that the code in A) will be executed on a VLIW-architecture with two memory ports, one integer functional unit, and two addition functional units. How many cycles does it take to execute the code? (3 points)

#### **ASSIGNMENT 5**

- A) A computer architect is considering two cache designs. The first cache is two-way set-associative and an access time including tag check of 2 ns. The second cache design is direct-mapped and an access time including tag check of 1 ns. In addition, the second cache design has a fully-associative victim cache and the access time of that cache is 1.5 ns. Through measurements on a set of representative benchmarks, the computer architect has observed that the miss rate of the first cache design is 99% whereas the miss rate for the second cache design (i.e., the likelihood that the block is neither found in the direct-mapped cache nor in the victim cache) is 93%. Assuming a miss penalty of 10 ns, which design yields shortest average memory access time? (2 points)
- B) Explain what the two techniques "critical-first-word" and "early restart" are and how they can reduce the miss penalty. (2 points)
- C) Explain what mechanism is used in a lock-up free cache to keep track of multiple outstanding miss requests (2 points)

- D) What does it mean that a read and a write operation are *performed* as opposed to *issued*? (2 points)
- E) Show an example of how a bus transaction can be cut into partial operations in a pipelined bus to allow for pipelining multiple bus transactions at the same time. (2 points)

\*\*\* GOOD LUCK! \*\*\*