Appears in the proceedings of the 35th International Symposium on Microarchitecture (Micro-35), November 20-22, 2002
Master/Slave Speculative Parallelization

Craig Zilles
Department of Computer Science, University of Illinois at Urbana-Champaign
[email protected]

Gurindar Sohi
Computer Sciences Department, University of Wisconsin at Madison
[email protected]
Abstract

Master/Slave Speculative Parallelization (MSSP) is an execution paradigm for improving the execution rate of sequential programs by parallelizing them speculatively for execution on a multiprocessor. In MSSP, one processor—the master—executes an approximate version of the program to compute selected values that the full program’s execution is expected to compute. The master’s results are checked by slave processors that execute the original program. This validation is parallelized by cutting the program’s execution into tasks. Each slave uses its predicted inputs (as computed by the master) to validate the input predictions of the next task, inductively validating the entire execution. The performance of MSSP is largely determined by the execution rate of the approximate program. Since approximate code has no correctness requirements (in essence it is a software value predictor), it can be optimized more effectively than traditionally generated code. It is free to sacrifice correctness in the uncommon case to maximize performance in the common case. A simulation-based evaluation of an initial MSSP implementation achieves speedups of up to 1.7 (harmonic mean 1.25) on the SPEC2000 integer benchmarks. Performance is currently limited by the effectiveness with which our current automated infrastructure approximates programs, which can likely be improved significantly.
1 Introduction

Most microprocessor vendors are shipping or have announced products that exploit explicit thread-level parallelism at the chip level either through chip multiprocessing (CMP) or simultaneous multithreading (SMT). These architectures are compelling because they enable efficient utilization of large transistor budgets—even in the presence of increasing wire delay—for multi-threaded or multi-programmed workloads. Although we expect the availability of these processors to encourage some programmers to explicitly parallelize their programs, anecdotal evidence that many software vendors ship un-optimized binaries suggests that many programmers cannot justify (either to themselves or their employers) the additional effort of correctly parallelizing their code. As a result, there will remain an opportunity for “transparent” parallelization.
Transparent parallelization can be achieved by analyzing the program’s dependences, partitioning the program into independent subsets (tasks), and inserting the necessary synchronization. For non-numeric programs, a complete dependence analysis is prohibitively difficult. As a result, parallelization of these programs is facilitated by speculating in the presence of ambiguous dependences and providing hardware support for the detection of and recovery from actions that violate the ordering dictated by a sequential execution. Many proposals for speculatively parallelizing programs have been published [1, 6, 8, 12, 20, 23]. Speculation allows the execution to ignore potential dependences that do not occur in practice, but it does little to alleviate true dependences. True inter-task data dependences can sequentialize task execution, negating much of the performance potential of speculative parallelism. In fact, because the inter-task dependences generally incur the latency of inter-processor communication, performance can be even worse than that of a uniprocessor execution. Thus, one of the main challenges remaining in speculative parallelization is handling true dependences. In this paper, we present an execution paradigm for speculative parallelization that breaks inter-task true dependences by predicting the communicated values. This approach, in itself, is not novel [1, 15, 16, 21]; what is novel is the manner in which the predictions are made. Rather than use a history-based hardware widget, our approach uses a piece of code that, when executed, predicts the values communicated between tasks. We call this code the distilled program because it is created by “distilling” the original program down to a reduced computation that will likely compute the necessary values. 
Since results computed by the distilled program are only used as predictions, there are no correctness constraints on the distilled program; all predictions will be validated before they are allowed to alter architected state. This freedom from correctness constraints bestows unlimited flexibility to the construction of the distilled program, allowing us to guide optimizations solely based on common case program behavior rather than what could be proven correct statically. As a result, the distilled program can significantly outperform, while closely paralleling in function, the original program, but it provides no correctness guarantees.
2 Overview

In this section, we introduce the concept of code approximation used to create distilled programs and provide a high-level overview and analysis of the MSSP paradigm.
2.1 Code Approximation

The MSSP paradigm predicts all of a task’s live-in values (i.e., those values that are used before they are defined). Unlike most value predictors, the one used in MSSP is a piece of software, called the distilled program, that produces predictions when executed. Like any predictor, we desire that it have a high prediction accuracy, but mistakes can be tolerated. The distilled program is constructed by approximating the program being executed, which we will refer to as the original program. Approximation exploits the fact that most programs are more general than is required by any particular execution. As a result, much of a program’s functionality is not needed in the common case. This observation has been exploited previously in feedback-directed program transformations that optimize the common case at the expense of the uncommon case. Approximation is an extreme form of such transformations that further optimizes the common case by sacrificing correctness in the uncommon case.

1. For simplicity of exposition we use the word “processor”, but the ideas are equally applicable to multithreaded processors.
In our execution paradigm there are two distinct processor1 roles. One processor, the master, is assigned to execute the distilled program. This execution orchestrates a parallel execution of the original program on the remaining slave processors. At defined places in the distilled program’s execution, the master “spawns” a task onto an idle slave processor, providing it with a starting program counter (PC) and predictions for the live-in values. The slave processors execute the unmodified original program, and only they update architected state. State updates are speculatively buffered until the task has been verified. Values produced by the master are discarded when no longer needed as live-in predictions. The relationship between master and slave is reminiscent of that between a speculative core and a DIVA checker [2], except with the purpose of detecting errors in code transformation rather than processor design. We call this execution paradigm Master/Slave Speculative Parallelization (MSSP). We first provide an overview of the opportunity for distilling programs and the MSSP paradigm in Section 2. In Section 3, we detail the automatic construction of distilled programs using profile information. In Section 4, we describe the mechanisms required by MSSP and provide a detailed example of its execution. In Section 5, we present results from an initial simulated implementation of MSSP. This paradigm has been heavily influenced by prior work, which is discussed throughout the paper.
Figure 1. The control flow graph (CFG) for the function spec_getc from the benchmark bzip2: a) initially, b) as transformed by approximation, c) as transformed by superblock formation.
Approximation is accomplished by removing behaviors from the program that appear not to be frequently exercised. This process is most easily illustrated through an example: Figure 1a shows a control flow graph (CFG) of a small function annotated with an edge profile. In approximating the function, non-dominant control flow edges are removed. As this function has a single dominant path, after approximation the function is a single block (Figure 1b). In general, this will not be the case; unbiased branches along important paths will remain. Approximation transformations are not limited to control-flow manipulations; others are described in Section 3.3. Approximation transformations are like speculative compiler transformations but without the checks that ensure the expected behavior is present. The dominant path in this example can also be exploited by superblock formation [13] resulting in the CFG shown in Figure 1c. While superblock formation removes side entrances from the trace (by replicating code), side exits must be left intact, as they are required to detect and handle the correct execution of non-dominant paths. In the approximate code, these checks (the branches) are removed, resulting in an incorrect computation when a non-dominant path execution is required. These approximation transformations create new opportunities to apply traditional (safe) optimizations. By eliminating dead code (mostly the branch predicate computations) and avoiding stack allocation, saves and restores (the approximate function has lower register pressure and is now a leaf function), the dominant path through the function in Figure 1 is reduced from 51 to 12 instructions. This small static size facilitates inlining, which can create additional opportunities for optimization. Approximation has many similarities to the instruction removal performed in Slipstream [22]; unlike Slipstream, we create a distinct executable to enable traditional optimizations to be applied as well. 
A process akin to approximation has been used to create predictors for branches and memory addresses [7, 18, 25]. The particular form of approximate code used in the MSSP paradigm—what we call a distilled program—also encodes the information necessary to manage the parallel execution of the slave processors. We discuss the construction of distilled programs in Section 3, after an overview of the MSSP paradigm.

2.2 MSSP Overview

We now describe the MSSP paradigm at a high level; a description of the required mechanisms and a more detailed example are presented in Section 4. Like other speculative parallelization paradigms, MSSP executes coarse-grain units of work, called tasks, generally consisting of hundreds of instructions. Task boundaries are selected during the construction of the distilled program and encoded within it. An MSSP execution consists of three separate components, each with a different purpose: 1) the master processor executes the distilled program to compute predicted checkpoints of state at task boundaries, 2) using these checkpoints, slave processors execute from one task boundary to the next in the original program constructing a summary of the task, and 3) a reuse-like [19] mechanism verifies the correctness of the task summaries and commits their results in order. These roles are demonstrated by the example four-processor MSSP execution shown in Figure 2a.

One processor (P0) is assigned to be the master processor and executes the distilled program. As the master executes, it performs register and memory writes, but these writes are not committed to architected state. Instead, when a write is “retired” by the master, it is held in a special buffer and tagged with the current task number. The writes for a given task form the change of state (the difference or diff) effected by the task. A predicted future checkpoint of state can be constructed by overlaying the differences associated with in-flight tasks (in order from oldest to youngest) on top of the architected register and memory state, as shown in Figure 2b. Checkpoints can be constructed piece-wise and on demand, so the amount of storage required is a function of task working set size and not the size of the program’s memory image.

The remaining processors (P1, P2, and P3) are slaves that are initially idle. As directed by the distilled program, the master periodically spawns tasks for the slave processors to execute. To spawn, the master increments its task number, selects an idle slave processor, and provides the slave three inputs: 1) the master’s current task number, 2) a starting program counter (PC), and 3) access to the predicted checkpoint state. At marker ➊ in Figure 2a, the master processor spawns Task A onto processor P3. In parallel with the spawning process, P0 continues execution of the distilled program, entering the distilled program segment that corresponds to Task A, which we refer to as A´. P3 begins execution (➋) some time later as the spawn involves inter-processor communication. As the slave executes Task A, it retrieves any necessary live-in values (i.e., registers and memory locations that are read before they are written by the task) from the checkpoint. Since the values retrieved from the checkpoint are predictions, writes by the slave are not directly committed to archi-
tected state. Instead, the live-ins and live-outs (the last write to each location written by the task) are collected (both name and value) and buffered as a task summary. When the slave completes its task (➌), the task summary is sent to the verification/commit unit. This unit tries to commit the summaries in program order, as indicated by the associated task number. The verification/commit process is much like the reuse test [19] or memoization. As a precondition for committing the task, all of the live-in values used in the task’s execution must match the value held in architected state. If this condition holds, the live-out values can be committed to architected state (➍). When a task is committed the corresponding checkpoint differences and task summary can be deallocated. If one or more of the live-in values do not match with architected state (because a checkpoint was computed incorrectly (➎)), a task misspeculation is signalled. On detection of the misspeculation (➏), the master is squashed, as are all other in-flight tasks. At this time, the master is restarted at Task C´ (➐) using the current architected state committed by Task B. In parallel, execution of the corresponding task in the original program (Task C) begins (➑). The recovery process is expensive, but it is infrequent when using well-constructed distilled programs. The verification/commit process is shown in Figure 2c. Note that the checkpoints themselves are never verified directly. Only the task live-ins (i.e., those values that could corrupt architected state) are checked. Distilled program construction exploits this fact to avoid computing any value that has a low likelihood of being a task live-in.

Figure 2. MSSP execution example: a) The master, executing the distilled program on processor P0, forks tasks, providing them live-in values in the form of checkpoints. The slave processors execute the original program to construct task summaries, which are verified before they are committed. Misspeculations, resulting from incorrectly computed checkpoints, cause the master to be restarted with the architected state. b) Writes by the master are buffered as checkpoint differences, which together with architected state can be used to construct predicted checkpoints of future state. c) Task execution is summarized by its set of live-in and live-out values; summaries are verified by comparing the live-ins to architected state. If the live-ins match (Scenario 1), the architected state is updated by committing the live-outs. Otherwise (Scenario 2), a task misspeculation is signalled.
2.3 Analysis of MSSP Execution

Several characteristics distinguish MSSP from other speculative parallelization paradigms. In programs with significant amounts of true inter-task dependences (e.g., most non-numeric programs), the critical path of a traditional speculative parallel execution follows sequentially-dependent communications to forward values (Figure 3a). Long inter-processor communication latencies reduce the degree to which task executions are overlapped. In contrast, in an MSSP execution, values are not communicated between slave tasks. The production of values to satisfy inter-task dependences is decoupled from the production of values to update architected state. All inter-task true dependences are predicted from a central source, the master. As a result, inter-processor communication for different tasks can be performed in parallel, and this communication appears on the critical path only on a value misprediction. Trace processors [17] similarly proposed using value prediction to tolerate communication latency between processing elements. If mispredictions can be made rare, then the critical path will be through the slower of two paths: the master’s execution and the verification/commit unit (Figure 3b). Neither
path sequentializes inter-processor communication, providing MSSP with a degree of latency tolerance. Increasing communication latency only increases the occupancy of slave processors and verification latency (not throughput), which is only exposed on task misspeculations.

Figure 3. In a traditional speculative multithreading execution model, the critical path of the execution typically goes through values that are forwarded from one task to the next (a). MSSP has two parallel potential critical paths, the execution of the distilled program and verification/commitment (b).

To summarize, the first requirement of an effective MSSP execution is that the distilled program accurately predict task live-in values. If that is achieved, the master will determine the execution rate of the whole execution, so the distilled program should execute much faster than the original program. Finally, so as not to become the bottleneck, the throughput of the verification/commit unit must exceed the rate at which tasks are spawned by the master. These first two issues are a function of the construction of the distilled program (discussed in the next section). The verification/commit mechanism is described in Section 4.
3 Distilled Programs

While in the previous section we indicated that the distilled program should have a high prediction accuracy and be fast, there are absolutely no requirements on the distilled program necessary to ensure a correct execution. It is only a predictor, and, if it is behaving improperly, the execution reverts to a traditional uniprocessor execution. Nevertheless, MSSP’s performance depends on the distilled program performing the following two tasks correctly most of the time: 1) specifying a division of the program’s execution into tasks by denoting task boundaries, and 2) accurately computing the set of live-in values at these task boundaries. In this section, we describe a technique to construct distilled programs. The resulting distilled program will be located in a separate region of memory from the original program. Distillation consists of the seven steps listed in Figure 4:

1. collect profile information
2. build internal representation (IR)
3. select task boundaries
4. perform liveness analysis
5. apply approximation transformations
6. apply traditional optimizations
7. layout and generate code

Figure 4. Logical steps to build a distilled program.

3.1 Task Selection

Our sensitivity analysis results (in [24]) indicate our current automatic distiller prototype is largely insensitive to the exact task boundaries selected (performance varies less than 5% for most benchmarks) provided the tasks are large enough (greater than 100 instructions on average) to amortize the communication overheads. As a result, we only highlight a few important issues here; a complete description of our task selector can be found in [24]. Tasks are selected by identifying instructions in the original program to be the beginning of new tasks. A fork instruction is inserted into the distilled program at the task

for (i = 0; i < regset_size; i++) {
    register REGSET_ELT_TYPE diff = live[i] & ~maxlive[i];      /* A */
    if (diff) {
        register int regno;
        maxlive[i] |= diff;                                     /* B */
        for (regno = 0; diff && regno < REGSET_ELT_BITS; regno++) {
            if (diff & ((REGSET_ELT_TYPE) 1                     /* C */