
Compiler-Directed Scratch Pad Memory Hierarchy Design and Management

M. Kandemir
Microsystems Design Lab
Pennsylvania State University
University Park, PA 16802, USA
[email protected]

A. Choudhary
ECE Department
Northwestern University
Evanston, IL 60208, USA
[email protected]

ABSTRACT
One of the primary challenges in embedded system design is designing the memory hierarchy and restructuring the application to take advantage of it. This task is particularly important for embedded image and video processing applications that make heavy use of large multidimensional arrays of signals and nested loops. In this paper, we show that a simple reuse vector/matrix abstraction can provide the compiler with useful information in a concise form. Using this information, the compiler can either adapt the application to an existing memory hierarchy or derive a suitable memory hierarchy for the application. Our initial results indicate that the compiler is very successful both in optimizing code for a given memory hierarchy and in designing a hierarchy with a reasonable performance/size ratio.

Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; D.3.4 [Programming Languages]: Processors—Compilers; Optimization

General Terms
Design, Experimentation, Performance

Keywords
Data Reuse, Scratch Pad Memory, Memory Hierarchy

1. INTRODUCTION

Embedded system design has undergone a major renaissance in the last five years. An important characteristic of this change is that software is playing an ever-increasing role in system design. Consequently, automatic compiler support for optimizing embedded software is of critical importance. Classical compiler methods alone, however, may not be sufficient for attaining the highest levels of performance from an embedded platform. Instead, embedded system-specific issues should also be accounted for. Software-managed memories (called scratch-pad memories [9]) are an important part of the design process, particularly in image and video processing applications that make heavy use of large multi-dimensional arrays of signals and nested loops. The compiler can restructure a given code to adapt it to a given memory hierarchy; it can also help the designer come up with a memory hierarchy for a given application. This hierarchy can then be used as a starting point for a more sophisticated memory hierarchy optimization.

While it might be possible (for some applications) to adopt a plain (single-level) scratch-pad memory architecture and obtain large savings in performance and energy [9, 7], many access patterns encountered in embedded image and video applications exhibit data reuse at multiple loop levels, thereby necessitating a multi-level memory hierarchy if we are to obtain the best results. Therefore, we believe that as embedded codes get more and more complex, memory architectures that contain multi-level scratch-pad memories will be more widely employed. This will also make it extremely important to provide compiler support for optimizing applications on such hierarchies.

In this paper, we present a compiler-directed memory hierarchy design and code optimization strategy. More specifically, we make the following major contributions:
- We show how an optimizing compiler can manage the flow of data to/from a scratch-pad memory (SPM) hierarchy using a mathematical abstraction based on reuse vectors and reuse matrices.
- We present an algorithm that, given an input code, tries to come up with an SPM hierarchy. This hierarchy might be used as an input for a more detailed systematic analysis and optimization framework.
- We present an algorithm that enforces a memory hierarchy for a given code. For example, using this algorithm, it might be possible to transform a given code in such a way that a majority of array references in the code would work best with a specific depth of memory hierarchy.
- We report experimental data showing the effectiveness of our approach. Our results indicate that the compiler is very successful both in optimizing code for a given hierarchy and in designing a hierarchy with a reasonable performance/size ratio.

Our approach is different from many previous studies on software-directed memory management. First, in contrast to the studies in [9, 10, 7, 2], where the main focus is single-level memory management, we focus on a memory hierarchy. We address both designing a memory hierarchy and exploiting a given hierarchy for the current data access pattern. Compared to the design methodology in [3], our approach is simpler, as it uses a reuse vector/matrix abstraction (which can be extracted by an optimizing compiler during data access pattern/data dependence analysis). Using this abstraction, we can easily decide the flow of data through the memory hierarchy. In addition, this abstraction also allows us to come up with an initial hierarchy design very quickly.

The rest of this paper is organized as follows. Section 2 presents background material and gives details of our optimization strategy. Section 3 discusses our implementation, introduces our experimental framework, and presents performance and energy consumption data. Section 4 summarizes our major results.

This work is supported in part by NSF Career Award #0093082 and by a grant from PDG.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2002, June 10-14, 2002, New Orleans, Louisiana, USA Copyright 2002 ACM 0-58113-461-4/02/0006 ...$5.00.



2. OUR APPROACH

2.1 Background

The iteration space $I$ of a loop nest contains one point for each iteration. An iteration vector can be used to label each such point in $I$. We use $\bar{I} = (i_1, i_2, \ldots, i_n)^T$ to represent the iteration vector for a loop nest of depth $n$, where each $i_l$ corresponds to a loop index (or its value at a specific point in time), starting with $i_1$ for the outermost loop index. In fact, an iteration space can be viewed as a polyhedron bounded by the loop limits. The subscript function for a reference to an array $X$ is a mapping from the iteration space $I$ to the data space $D$. The data space can also be viewed as a polyhedron, bounded by the array bounds. A subscript function defined this way maps an iteration vector to an array element. In this paper, we assume that subscript mappings are affine functions of the enclosing loop indices and symbolic constants. Many array references found in embedded image and video processing codes fall into this category. Under this assumption, a reference to an array $X$ can be represented as $X[L\bar{I} + \bar{o}]$, where $L$ is a linear transformation matrix called the access (reference) matrix, $\bar{o}$ is the offset vector, and $\bar{I}$ is the iteration vector [12]. If the loop nest is $n$-deep and the array $X$ is $m$-dimensional, $L$ is $m \times n$ and $\bar{o}$ has $m$ elements. For example, with the iteration vector $\bar{I} = (i, j, k)^T$ of the nest in Figure 1(a), the reference $V[k][j]$ can be written as:

$$V\!\left[ \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right].$$
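To make the access-matrix abstraction concrete, here is a minimal sketch in C (ours, not the paper's implementation; the struct, function, and array names are placeholders) that stores an affine reference as a pair $(L, \bar{o})$ and evaluates the subscript function for a given iteration vector:

#include <stdio.h>

/* A 2-dimensional array reference inside a 3-deep loop nest, stored as
   X[L*I + o]: L is the 2x3 access matrix, o is the 2-element offset vector. */
typedef struct {
    int L[2][3];
    int o[2];
} AffineRef;

/* Evaluate the subscript function: out = L * iter + o. */
static void eval_ref(const AffineRef *r, const int iter[3], int out[2]) {
    for (int row = 0; row < 2; row++) {
        out[row] = r->o[row];
        for (int col = 0; col < 3; col++)
            out[row] += r->L[row][col] * iter[col];
    }
}

int main(void) {
    /* The reference V[k][j] over the iteration vector (i, j, k)^T. */
    AffineRef v_kj = { .L = { {0, 0, 1}, {0, 1, 0} }, .o = {0, 0} };
    int iter[3] = { 2, 5, 7 };   /* i = 2, j = 5, k = 7 */
    int idx[2];
    eval_ref(&v_kj, iter, idx);
    printf("V[%d][%d]\n", idx[0], idx[1]);   /* prints V[7][5] */
    return 0;
}

For the reference $V[k][j]$, the sketch reproduces exactly the matrix-vector product shown in the displayed equation above.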

According to Li [8], if $\bar{I}_1$ and $\bar{I}_2$ are two iteration vectors that access the same memory location, $\bar{r} = \bar{I}_2 - \bar{I}_1$ is called a reuse vector (assuming that $\bar{I}_2$ is lexicographically greater than $\bar{I}_1$). For a loop nest of depth $n$, the reuse vector is $n$-dimensional. The loop corresponding to the first non-zero element from the top is the one that carries the corresponding reuse. If there is more than one reuse vector in a given direction, they are represented by the lexicographically smallest one. The reuse vector gives us useful information about how array elements are used by the different loops in a given nest. In this paper, we mainly focus on self reuses (that is, the reuses originating from individual references), since only in very rare circumstances does group reuse (that is, the reuse between different references to the same array) bring additional benefits that cannot be exploited by self reuse [8]. There is a temporal reuse due to a reference $X[L\bar{I} + \bar{o}]$ if and only if there exist two different iteration vectors $\bar{I}_1$ and $\bar{I}_2$ such that $L\bar{I}_1 + \bar{o} = L\bar{I}_2 + \bar{o}$; that is, $L(\bar{I}_2 - \bar{I}_1) = L\bar{r} = \bar{0}$. This last expression indicates that the temporal reuse vector belongs to the kernel set (null set) of $L$ [12].

As an example, let us focus on the matrix multiply nest shown in Figure 1(a), written here with loops $i$, $j$, $k$ (outermost to innermost) and references $W[i][j]$, $U[i][k]$, and $V[k][j]$. The reuse vector due to $U[i][k]$ is $(0, 1, 0)^T$. Informally, what this means is that, for fixed values of loops $i$ and $k$, the successive iterations of the $j$ loop access the same element from this array. That is, the $j$ loop carries (exhibits) temporal reuse for this reference. The reuse vectors coming from the individual references make up a reuse matrix, $M$. Considering the matrix multiply code again, we find that

$$M = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix},$$

whose columns are the reuse vectors of $W[i][j]$, $U[i][k]$, and $V[k][j]$, respectively.

When the nest is restructured using a linear loop transformation represented by a non-singular matrix $T$, two things happen. First, each array reference $X[L\bar{I} + \bar{o}]$ is transformed to $X[LT^{-1}\bar{I}' + \bar{o}]$, where $\bar{I}'$ is the iteration vector of the transformed nest. Second, the loop bounds are transformed accordingly; finding the new loop bounds can necessitate the use of Fourier-Motzkin elimination, details of which can be found in [12]. It can be shown that if $T$ is a linear transformation matrix, $\bar{r}$ is a reuse vector, and $M$ is a reuse matrix, then $T\bar{r}$ and $TM$ are the transformed reuse vector and the transformed reuse matrix, respectively [8].

In Sections 2.2 through 2.5, we assume the existence of an SPM hierarchy and focus on how to manage the flow of data through this hierarchy. In Section 2.6, we show how to design a memory hierarchy based on data reuse. Until Section 2.4, we exclusively focus on a two-level memory hierarchy which contains a large (slow and energy-consuming) memory and a small (fast and less energy-consuming) memory. The problem is then to decide what data to bring into the fast memory at what time, and how to decide when data in the fast memory are no longer useful. We use the terms (software-controlled) memory hierarchy and SPM hierarchy interchangeably.
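To make the temporal-reuse test and the effect of a loop transformation concrete, here is a small sketch in C (ours, not the paper's implementation; the access matrices follow the W/U/V naming adopted above) that checks whether a candidate reuse vector lies in the null space of an access matrix and applies a loop-interchange matrix to a reuse vector:

#include <stdio.h>

/* Temporal reuse test: a reference with 2x3 access matrix L has temporal
   reuse along direction r exactly when L * r == 0. */
static int carries_temporal_reuse(const int L[2][3], const int r[3]) {
    for (int row = 0; row < 2; row++) {
        int dot = 0;
        for (int col = 0; col < 3; col++)
            dot += L[row][col] * r[col];
        if (dot != 0)
            return 0;
    }
    return 1;
}

/* Transform a reuse vector r into T * r for a 3x3 loop transformation T. */
static void transform_reuse(const int T[3][3], const int r[3], int out[3]) {
    for (int row = 0; row < 3; row++) {
        out[row] = 0;
        for (int col = 0; col < 3; col++)
            out[row] += T[row][col] * r[col];
    }
}

int main(void) {
    /* Access matrices of the matrix-multiply references over (i, j, k). */
    const int L_W[2][3] = { {1, 0, 0}, {0, 1, 0} };   /* W[i][j] */
    const int L_U[2][3] = { {1, 0, 0}, {0, 0, 1} };   /* U[i][k] */
    const int L_V[2][3] = { {0, 0, 1}, {0, 1, 0} };   /* V[k][j] */
    const int e_i[3] = {1, 0, 0}, e_j[3] = {0, 1, 0}, e_k[3] = {0, 0, 1};

    printf("W[i][j] reused along k: %d\n", carries_temporal_reuse(L_W, e_k)); /* 1 */
    printf("U[i][k] reused along j: %d\n", carries_temporal_reuse(L_U, e_j)); /* 1 */
    printf("V[k][j] reused along i: %d\n", carries_temporal_reuse(L_V, e_i)); /* 1 */

    /* Interchanging the j and k loops maps the reuse vector (0,1,0)^T of
       U[i][k] to (0,0,1)^T: after the interchange, the new innermost loop
       carries the reuse. */
    const int T_jk[3][3] = { {1, 0, 0}, {0, 0, 1}, {0, 1, 0} };
    int r_new[3];
    transform_reuse(T_jk, e_j, r_new);
    printf("T * (0,1,0)^T = (%d,%d,%d)^T\n", r_new[0], r_new[1], r_new[2]);
    return 0;
}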

2.2 Reuse-Based Data Transfers

Recall that the reference $U[i][k]$ in Figure 1(a) leads to a reuse vector of $(0, 1, 0)^T$. Suppose that we have a section of array $U$ in Figure 1(a), $U[i][1:N]$ (the $i$th row), residing in the fast memory. Since all loop iterations that share the same value of $i$ touch only this row of $U$ (through the reference $U[i][k]$), the row can remain in the fast memory for the entire duration of that $i$ iteration; a new row needs to be transferred from the slow memory only when the value of $i$ changes.
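The following is a minimal sketch in C (ours, with placeholder array names and sizes, and with memcpy standing in for whatever SPM transfer primitive the target provides) of how a compiler could materialize this reuse-based transfer for a matrix-multiply style nest: the $i$th row of $U$ is staged into a small fast buffer once per $i$ iteration and served from there inside the $j$ and $k$ loops.

#include <string.h>

#define N 64

/* Placeholder arrays for a nest of the form W[i][j] += U[i][k] * V[k][j]. */
static double U[N][N], V[N][N], W[N][N];

void matmul_with_spm_row(void) {
    /* Models a row-sized buffer allocated in the fast (scratch-pad) memory. */
    static double spm_row[N];

    for (int i = 0; i < N; i++) {
        /* Bring U[i][0:N-1] into the fast memory once per i iteration;
           memcpy stands in for a DMA/SPM copy primitive. */
        memcpy(spm_row, U[i], sizeof(spm_row));

        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                W[i][j] += spm_row[k] * V[k][j];   /* spm_row[k] == U[i][k] */
    }
}

Each element of the staged row is reused N times (once per $j$ iteration), which is exactly the reuse captured by the $(0, 1, 0)^T$ vector.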
[The Figure 1(d) code fragment and the pseudo-code body of the algorithm in Figure 3 are not legible in this copy.]
Figure 3: Memory hierarchy design algorithm.
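Because the algorithm body above is not legible in this copy, the sketch below is not the paper's Figure 3 algorithm; it only illustrates, under our own assumptions, the kind of reuse-driven reasoning the surrounding text describes: group references by the loop level that carries their reuse, and let each occupied level suggest one SPM level whose capacity must cover the data kept resident across that loop.

#include <stdio.h>

#define DEPTH 3

/* One reference: its reuse vector and the footprint (in bytes) of the data
   kept resident while the reuse-carrying loop executes. */
typedef struct {
    const char *name;
    int reuse[DEPTH];
    long footprint_bytes;
} RefInfo;

/* Index (0-based, from outermost) of the loop carrying the reuse, i.e., the
   first non-zero entry of the reuse vector; -1 if there is no temporal reuse. */
static int reuse_level(const RefInfo *r) {
    for (int l = 0; l < DEPTH; l++)
        if (r->reuse[l] != 0)
            return l;
    return -1;
}

int main(void) {
    /* Illustrative matrix-multiply references with N = 64 and 8-byte
       elements: one element of W, one row of U, the whole of V. */
    RefInfo refs[] = {
        { "W[i][j]", {0, 0, 1}, 8 },            /* reuse carried by k */
        { "U[i][k]", {0, 1, 0}, 64 * 8 },       /* reuse carried by j */
        { "V[k][j]", {1, 0, 0}, 64 * 64 * 8 }   /* reuse carried by i */
    };
    long capacity[DEPTH] = {0};

    for (unsigned r = 0; r < sizeof(refs) / sizeof(refs[0]); r++) {
        int lvl = reuse_level(&refs[r]);
        if (lvl >= 0 && refs[r].footprint_bytes > capacity[lvl])
            capacity[lvl] = refs[r].footprint_bytes;
    }
    for (int l = 0; l < DEPTH; l++)
        if (capacity[l] > 0)
            printf("loop level %d -> candidate SPM level, >= %ld bytes\n",
                   l, capacity[l]);
    return 0;
}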

Our approach also tries to minimize data transfers by overlapping or combining them with each other. Consider, for example, the code fragment shown in Figure 1(g). Let us focus on a two-level memory hierarchy and on the data transfers due to array $U$. Since both of the sub-nests in this code contain the same reference $U[i][k]$, the data transfer due to this reference can be performed just before the first $j$ loop; there is no need to perform a data transfer (due to this reference) before the second $j$ loop. The second sub-nest has another reference to array $U$: $U[k][i]$. The data transfer due to this reference can be performed along with that due to $U[i][k]$, before the first $j$ loop. After the execution of the second $j$ loop, the elements of array $U$ that reside in the fast memory can be discarded (as their reuse ends).

Our current implementation handles perfectly-nested and imperfectly-nested loops of any depth. However, it does not attempt to optimize the hierarchical SPM usage across multiple, independent imperfectly-nested loops; that is, it does not try to take advantage of inter-nest data reuse. It should also be noted that handling imperfectly-nested loops means that our approach can be extended to operate on an entire program, since the entire code can be thought of as an imperfectly-nested loop enclosed by an imaginary outermost loop that iterates only once.
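A minimal sketch in C (ours; the loop bodies and array names are placeholders, not the actual Figure 1(g) code) of how the two transfers for $U[i][k]$ and $U[k][i]$ can be combined into a single staging step before the first $j$ loop and discarded after the second one:

#include <string.h>

#define N 64

static double U[N][N], A[N][N], B[N][N];

void combined_transfers(void) {
    static double u_row[N], u_col[N];   /* scratch-pad staging buffers */

    for (int i = 0; i < N; i++) {
        /* One combined transfer before the first j loop: the row serves the
           U[i][k] references in both sub-nests, the column serves U[k][i]. */
        memcpy(u_row, U[i], sizeof(u_row));
        for (int k = 0; k < N; k++)
            u_col[k] = U[k][i];

        for (int j = 0; j < N; j++)              /* first sub-nest  */
            for (int k = 0; k < N; k++)
                A[i][j] += u_row[k];             /* uses U[i][k]          */

        for (int j = 0; j < N; j++)              /* second sub-nest */
            for (int k = 0; k < N; k++)
                B[i][j] += u_row[k] * u_col[k];  /* uses U[i][k], U[k][i] */

        /* After the second j loop, the staged copies of U are dead and the
           fast-memory space can be reclaimed. */
    }
}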

[Figure 1 code fragments (e), (f), and (g) are not legible in this copy.]