39 citations found.
M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Computer Sciences Technical Report #1196, University of Wisconsin-Madison, November 1993.


This paper is cited in the following contexts:


Improving Dynamic Cluster Assignment for Clustered Trace.. - Bhargava, John (2003)   (4 citations)  (Correct)

....strategy. Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as for media benchmarks. 1. Introduction A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long latency communication [4, 5, 6, 8, 12, 20]. The execution resources are partitioned into smaller units. Within a cluster, communication is fast, but inter cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture is assigning instructions to clusters in a way that limits data ....

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Speculation-Based Techniques for Lockfree Execution of Lock-Based .. - Rajwar (2002)   (Correct)

....and Moss [66] used the same mechanism for implementing Transactional Memory. Gharachorloo et al. [45] used cache coherence protocols for detecting violations to memory ordering. Franklin proposed the use of the address resolution buffer for detecting data races in shared memory multiprocessors [40]. We have presented concepts key for understanding the thesis and have provided a background into related work in the area of synchronization, concurrency control, and speculative execution. We use concepts developed in database concurrency control and we use much of the hardware support proposed ....

....and Moss [66] used the same mechanism for implementing transactional memory. Gharachorloo et al. [45] used cache coherence protocols for detecting violations to memory ordering. Franklin proposed the use of the address resolution buffer for detecting data races in shared memory multiprocessors [40]. Speculative buffering and retirement. Prior work exists in microarchitectural support for speculative retirement [48, 143] and buffering speculative data in caches [42, 52]. Our work can leverage these techniques and coexist with them. However, none of these earlier techniques dynamically ....

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, WI, 1993.


Cluster Assignment Strategies for a Clustered Trace Cache.. - Bhargava, John (2003)   (Correct)

....strategy (1.9 ) Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as media benchmarks. 1 Introduction A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long latency communication [2, 3, 5, 7, 11, 21]. The execution resources and register file are partitioned into smaller and simpler units. Within a cluster, communication is fast while inter cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture is assigning instructions to clusters in a ....

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Dynamic Parallel Media Processing Using Speculative Broadcast.. - Fritts, Wolf (2001)   (3 citations)  (Correct)

....dependence in concert with execution, these methods all provide a recovery mechanism that enables restoration of the old processor state in the event a dependence conflict occurs during execution. Three methods exist that support both fully and partially parallel loops, the Multiscalar project [8], Thread-Level Data Speculation (TLDS) [9] and Thread Level Speculation (TLS) [10]. With each of these, there are mechanisms in either hardware and/or software for storing speculative processor state, restoring the old processor state on a misspeculation, and checking for dependence conflicts ....

....More information on these methods can be found in Fritts [11]. 3. Speculative Broadcast Loop We propose the Speculative Broadcast Loop (SBL) method for the speculative execution of parallel loop iterations. This new vector like run time method is a simplified version of the multiscalar [8] and multithreaded [9], [10] methods. It combines SIMD parallelism with large scale speculative execution for supporting data parallelism in multimedia. Unlike the multiscalar and multithreaded architectures, which provide independent control streams for separate processing units, the SBL method ....

[Article contains additional citation context not shown here]

Manoj Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Department of Computer Science, University of Wisconsin at Madison, 1993.


A Study of Control Independence with a Single Flow of Control - Rotenberg, Jacobson, Smith   (Correct)

....instructions. Treating control mispredictions as a total dependence barrier may mean lost opportunities for exploiting instruction level parallelism. Only a subset of subsequent instructions may be truly control dependent on the misprediction. The other instructions are control independent [4,5,6] and do not necessarily have to be squashed and re executed. A limit study on control independence [6] showed that substantial performance improvements may be possible. However, as a limit study, most implementation constraints were not considered. It is our objective in this paper to consider ....

....may require buffering speculative state for thousands of instructions. Other research in control independence has focused on specific microarchitectures and microarchitecture mechanisms that can exploit control independence. Multiscalar processors [4,5] exploit control independence by pursuing multiple flows of control. This is done with multiple physical program counters, but only one logical program counter. The compiler partitions the program into tasks, or subgraphs of the CFG. Arbitrary control flow may exist within a task, and the compiler ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


Evaluating the XMT Parallel Programming Model - Naishlos, Nuzman, Tseng, Vishkin (2001)   (Correct)

....shared memory and message passing programming models on multiprocessor systems. Our work attempts to examine parallel programming with respect to the different assumptions implied by an on chip environment. Various other projects explore on chip parallel architectures: CMP [HNO97], Multiscalar [Franklin93], SMT [TLE 99] and Raw [WTS97]. The current paper is targeted toward exploring shared memory parallel algorithms as applied to scalable on chip architecture. 6. Conclusion This paper presented features of XMT, a parallel programming model designed for exploiting on chip parallelism. With ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


A Dynamic Multithreading Processor - Akkary, Driscoll   (72 citations)  (Correct)

....a simultaneous multithreading pipeline to increase processor utilization, except that the threads are created dynamically from the same program. Although the DMT processor is organized around dynamic simultaneous multiple threads, the execution model draws a lot from the multiscalar architecture [4,5]. The multiscalar implements mechanisms for multiple flows of control to avoid instruction fetch stalls and exploit control independence. It breaks up a program into tasks that execute concurrently on identical processing elements connected as a ring. Since the tasks are not independent, ....

M. Franklin. The Multiscalar Architecture. Ph.D. Thesis, University of Wisconsin, Nov 93.


Towards a First Vertical Prototyping of an Extremely .. - Naishlos, Nuzman.. (2001)   (1 citation)  (Correct)

....for a CMP is the Stanford Hydra architecture [HNO97]. Research in this area has tended to focus on multiprogramming, and on speculative execution to extract threads from a single program. Other proposed multi threaded architectures, such as Simultaneous Multithreading (SMT) [TEL95] or Multiscalar [Franklin93] also feature multiple program counters and make useful points of comparison. Recent work on SMT [TLE 99] has proposed light weight synchronization methods for multithreading. In fact the Acquire primitive is very similar to the suspend primitive presented here. The two instructions share ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Evaluating the XMT Parallel Programming Model - Dorit Naishlos Joseph (2001)   (Correct)

....the shared memory and message passing programming models on multiprocessor systems. Our work attempts to examine parallel programming with respect to the different assumptions implied by an on chip environment. Various other projects explore on chip parallel architectures: CMP [10], Multiscalar [8], SMT [13]. The current paper is targeted toward exploring shared memory parallel algorithms as applied to scalable on chip architecture. [Figure residue: Radix 16384 speedups versus number of TCUs (1, 4, 16), XMT and SPLASH.] Note that XMT, with the parallel prefix sum for example, aspires to scale ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Dynamic Parallel Media Processing Using Speculative Broadcast.. - Fritts, Wolf (2001)   (3 citations)  (Correct)

....loop execution we propose is Speculative Broadcast Loop (SBL) execution. 2 Speculative Broadcast Loop We propose the Speculative Broadcast Loop (SBL) method for the speculative execution of parallel loop iterations. This new vector like run time method is a simplified version of the multiscalar [8] and multithreaded [9], [10] speculative methods that combines SIMD parallelism with large scale speculative execution for supporting data parallelism in multimedia. The SBL run time technique uses profiling and register dependence analysis (i.e. memory profiling and register dependence analysis are ....

Manoj Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Department of Computer Science, University of Wisconsin at Madison, 1993.


Evaluating Multi-threading in the Prototype XMT Environment - Naishlos, Nuzman, Tseng.. (2000)   (1 citation)  (Correct)

....parallelism is that occupied by chip multiprocessors (CMP) [HNO97]. Research in this area has tended to focus on multiprogramming, rather than fine grained multithreading of a single task. Other proposed multi threaded architectures, such as Simultaneous Multithreading (SMT) or Multiscalar [Franklin93], also feature multiple program counters and make useful points of comparison. Recent work on SMT [TLE 99] has proposed light weight synchronization methods for multithreading. In fact the Acquire primitive is very similar to the suspend primitive presented here. The two instructions share ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Architecture of the Atlas Chip-Multiprocessor.. - Codrescu, Wills, Meindl (1999)   (9 citations)  (Correct)

....to thread parallelism averages 3.4 on 8 processors. The contribution of this paper is to present and to evaluate the architecture of a chip multiprocessor that dynamically parallelizes sequential binary applications. 1.1. Related work Speculative multithreading was introduced by the Multiscalar [11][33] architecture. This design uses the compiler to divide the program into threads and schedule inter thread register communication. Hardware is responsible for thread control predictions, speculative buffering, memory disambiguation, synchronizing register communication, and misspeculation ....

M. Franklin, "The Multiscalar Architecture" Ph.D thesis, University of Wisconsin -- Madison, 1993


Profiling for Input Predictable Threads - Codrescu, al. (1998)   (Correct)

....the instruction level, such as limits on ILP [26], [11], [1], using value predictors to increase ILP [13], [9], [8], locating value predictable instructions through profiling [4], etc. Recently, there has been interest at studying single program speculative execution at the thread level. The Multiscalar [7][20] work introduced and popularized this idea. The Multiscalar processor favors a hardware centric approach and synchronizes register flow between tasks. The XIMD [27], M Machine [6], Simultaneous Multithreading [23], SPSM [5], Hydra [17], Stampede [21], Raw [25], Impact [10] and Superthreading ....

M. Franklin, "The Multiscalar Architecture" PhD thesis, University of Wisconsin -- Madison, 1993


Hardware Techniques To Improve The Performance Of The.. - Burger (1998)   (10 citations)  (Correct)

....At the register level, clustered architectures, such as the Alpha 21264 (or proposed MultiCluster architecture [37]) distribute the register interface to multiple banks of functional units, thus achieving high, yet cost effective, bandwidth out of the global register files. Multiscalar processors [41, 114] increase instruction fetch bandwidth by distributing the instruction fetch (at the L1 I cache interface) as well as the register banks. The Multiscalar work assumed centralized L1 data caches, although more recent proposals distribute the L1 data caches as well [53]. To our knowledge, the ....

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, WI, December 1993.


A Study of Control Independence in Superscalar Processors - Eric Rotenberg Quinn (1999)   (10 citations)  (Correct)

....depend on the branch outcome. These instructions are control dependent on the branch. Other instructions deeper in the window may be control independent of the mispredicted branch: they will be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re executed [9, 10]. This can be illustrated with a simple example. [Figure 1: An example of control independence.] Figure 1 shows a control flow graph (CFG) containing four basic blocks. Basic blocks are used for simplicity and may be substituted with arbitrary control flow. The branch ....

....important aspects of programs themselves were not modeled; in particular, a significant subset of data dependences were ignored due to the trace driven nature of the study. Several microarchitecture implementations have since been proposed that incorporate control independence in some form [10,12 19]. In these studies, however, either the impact of control independence is not isolated, or insight into the reported performance gains is limited and obscured by artifacts of the particular design. In this paper we have three primary objectives and contributions. The first objective is to ....

[Article contains additional citation context not shown here]

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisc., Nov 1993.


Control Independence in Trace Processors - Rotenberg (1999)   (6 citations)  (Correct)

....on the order of 30 . The proposed mechanisms are complex due to the non hierarchical superscalar organization, and there is a reliance on the compiler to provide complete control dependence information. Nonetheless, the study is useful for understanding control independence. Multiscalar processors (Franklin, 1993; Sohi et al. 1995), Dynamic Multithreading (Akkary & Driscoll, 1998) and other multithreaded architectures (Oplinger et al. 1997; Steffan & Mowry, 1998; Dubey et al. 1995; Tsai & Yew, 1996) exploit control independence by pursuing multiple flows of control. Either the compiler or hardware ....

Franklin, M. (1993). The multiscalar architecture. Ph.D. thesis, Computer Sciences Department, University of Wisconsin - Madison.


AR-SMT: Coarse-Grain Time Redundancy for High Performance.. - Eric Rotenberg   (Correct)

....time implications) adding an extra register rename map, and designing the control logic and datapaths for SMT. All of this is in addition to the Delay Buffer storage. 4.0 Trace processors as a platform for AR SMT In this paper we use a new processor microarchitecture called trace processors [18,19,20,21] as a platform for AR SMT. A trace is a long, dynamic sequence of instructions captured and stored by hardware. It may contain any number of control transfer instructions. The primary constraint on a trace is a hardware-determined maximum length, but there may be any number of other ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


A Study of Control Independence in Superscalar Processors - Rotenberg, Jacobson, Smith (1999)   (10 citations)  (Correct)

....depend on the branch outcome. These instructions are control dependent on the branch. Other instructions deeper in the window may be control independent of the mispredicted branch: they will be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re executed [10, 11]. This can be illustrated with a simple example. Figure 1 shows a control flow graph (CFG) containing four basic blocks. Basic blocks are used for simplicity and, in general, may be substituted with arbitrary control flow. The conditional branch terminating block 1 is mispredicted, with dashed ....

....important aspects of programs themselves were not modeled; in particular, a significant subset of data dependences were ignored due to the trace driven nature of the study. Several microarchitecture implementations have since been proposed that incorporate control independence in some form [11, 13, 14, 15, 16, 17, 18, 1]. In these studies, however, either the impact of control independence is not isolated, or insight into the reported performance gains is limited and obscured by artifacts of the particular design. In this paper we have three primary objectives and contributions. The first objective is to ....

[Article contains additional citation context not shown here]

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


Improving Superscalar Instruction Dispatch and Issue by.. - Vajapeyam, Mitra (1997)   (51 citations)  (Correct)

....The key differences of the dependence based architecture are (i) the register file is not separated into local and global files, and (ii) register renaming information is not reused, thus having the same instruction dispatch rate as traditional superscalar processors. The Multiscalar architecture [Fra92b, Fra93a] takes a different approach to window partitioning and achieving large windows. Here code segments identified at compile time are executed in parallel on multiple PEs organized as a circular chain. The program's instruction window consists of the combined instruction windows of all the PEs. ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, University of Wisconsin-Madison, 1993.


The PEWs Microarchitecture: Reducing Complexity Through.. - Ranganathan, Franklin   Self-citation (Franklin)   (Correct)

....stream. Thus, instructions that are control-dependent on a conditional branch tend to be assigned to the hardware window to which the branch has been allocated. In this case, instructions wait near where their control dependences will be resolved. Examples for this approach are the multiscalar [5] [12] 5 PS M [1] superthreading [14] and trace processors [11], [15] where each instruction group is a task or a trace depending on whether the processor pursues multiple (independent) flows of control or not. Control dependence based decentralization fits well with the control driven program ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


Block-Level Prediction for Wide-Issue Superscalar Processors - Dutta, Franklin (1995)   (1 citation)  Self-citation (Franklin)   (Correct)

....length, and number of targets. Therefore, the instructions in a subgraph are most likely fetched sequentially. Such a control flow prediction scheme, with unrestricted subgraph structures, is more apt for execution models that pursue multiple flows of control, such as the multiscalar processor [2], and is not the subject of this paper. For processors that follow the superscalar model of execution, we need to restrict the subgraph structure, and identify a path within the subgraph to fetch instructions from. The approach considered in [13] is to use tree like subgraphs, and select a path ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


Branch Prediction in Multi-Threaded Processors - Gummaraju, Franklin   (5 citations)  Self-citation (Franklin)   (Correct)

....on a multiscalar simulator indicates that these techniques, especially a hybrid of extrapolation and correlation, can substantially lower the branch misprediction ratios. 1 Introduction There has been a growing interest in the use of multithreading to speed up the execution of a single program [1], [2], [6], [9], [11], [12]. The compiler or the hardware extracts threads from a sequential program, and the hardware executes multiple threads in parallel, most likely with the help of multiple processing elements (PEs). Whereas a single threaded processor can only extract parallelism from a group of ....

....less likely to cause interference. 2.2.4 Intra Thread Control Flow and Thread Execution Style The exact nature of threads and their execution style have a strong bearing on branch history and branch prediction. For instance, if threads are initiated speculatively (as in the multiscalar processor [1], [11] and the superthreaded processor [12]), then some of the active threads may get squashed because of incorrect thread-level control speculation. When using a shared branch predictor, if the updates are done at branch prediction time, then thread level misprediction requires setting back some of ....

[Article contains additional citation context not shown here]

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


High-Performance Frontends for Trace Processors - Jacobson (1999)   (Correct)

No context found.

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Computer Sciences Technical Report #1196, University of Wisconsin-Madison, November 1993.


Instruction History Management for High-Performance Microprocessors - Bhargava (2003)   (Correct)

No context found.

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Running Parallel Applications on an MP With.. - Krishnan, Zhang..   (Correct)

No context found.

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, 1993.
