M. Franklin, " The Multiscalar Architecture," Ph.D. Thesis, Computer Sciences Technical Report #1196, University of Wisconsin-Madison, November 1993.


This paper is cited in the following contexts:
Improving Dynamic Cluster Assignment for Clustered Trace.. - Bhargava, John (2003)   (4 citations)  (Correct)

....strategy. Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as for media benchmarks. 1. Introduction A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long latency communication [4, 5, 6, 8, 12, 20]. The execution resources are partitioned into smaller units. Within a cluster, communication is fast, but inter cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture is assigning instructions to clusters in a way that limits data ....

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Speculation-Based Techniques for Lockfree Execution of Lock-Based .. - Rajwar (2002)   (Correct)

....and Moss [66] used the same mechanism for implementing Transactional Memory. Gharachorloo et al. [45] used cache coherence protocols for detecting violations to memory ordering. Franklin proposed the use of the address resolution buffer for detecting data races in shared memory multiprocessors [40]. We have presented concepts key for understanding the thesis and have provided a background into related work in the area of synchronization, concurrency control, and speculative execution. We use concepts developed in database concurrency control and we use much of the hardware support proposed ....

....and Moss [66] used the same mechanism for implementing transactional memory. Gharachorloo et al. [45] used cache coherence protocols for detecting violations to memory ordering. Franklin proposed the use of the address resolution buffer for detecting data races in shared memory multiprocessors [40]. Speculative buffering and retirement. Prior work exists in microarchitectural support for speculative retirement [48, 143] and buffering speculative data in caches [42, 52]. Our work can leverage these techniques and coexist with them. However, none of these earlier techniques dynamically ....

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, WI, 1993.


Cluster Assignment Strategies for a Clustered Trace Cache.. - Bhargava, John (2003)   (Correct)

....strategy (1.9 ) Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as media benchmarks. 1 Introduction A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long latency communication [2, 3, 5, 7, 11, 21]. The execution resources and register file are partitioned into smaller and simpler units. Within a cluster, communication is fast while inter cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture is assigning instructions to clusters in a ....

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Dynamic Parallel Media Processing Using Speculative Broadcast.. - Fritts, Wolf (2001)   (3 citations)  (Correct)

....dependence in concert with execution, these methods all provide a recovery mechanism that enables restoration of the old processor state in the event a dependence conflict occurs during execution. Three methods exist that support both fully and partially parallel loops: the Multiscalar project [8], Thread-Level Data Speculation (TLDS) [9], and Thread-Level Speculation (TLS) [10]. With each of these, there are mechanisms in either hardware and/or software for storing speculative processor state, restoring the old processor state on a misspeculation, and checking for dependence conflicts ....

....More information on these methods can be found in Fritts [11]. 3. Speculative Broadcast Loop. We propose the Speculative Broadcast Loop (SBL) method for the speculative execution of parallel loop iterations. This new vector-like run-time method is a simplified version of the multiscalar [8] and multithreaded [9][10] methods. It combines SIMD parallelism with large-scale speculative execution for supporting data parallelism in multimedia. Unlike the multiscalar and multithreaded architectures, which provide independent control streams for separate processing units, the SBL method ....

[Article contains additional citation context not shown here]

Manoj Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Department of Computer Science, University of Wisconsin at Madison, 1993.


A Study of Control Independence with a Single Flow of Control - Rotenberg, Jacobson, Smith   (Correct)

....instructions. Treating control mispredictions as a total dependence barrier may mean lost opportunities for exploiting instruction level parallelism. Only a subset of subsequent instructions may be truly control dependent on the misprediction. The other instructions are control independent [4,5,6] and do not necessarily have to be squashed and re executed. A limit study on control independence [6] showed that substantial performance improvements may be possible. However, as a limit study, most implementation constraints were not considered. It is our objective in this paper to consider ....

....may require buffering speculative state for thousands of instructions. Other research in control independence has focused on specific microarchitectures and microarchitecture mechanisms that can exploit control independence. Multiscalar processors [4,5] exploit control independence by pursuing multiple flows of control. This is done with multiple physical program counters, but only one logical program counter. The compiler partitions the program into tasks, or subgraphs of the CFG. Arbitrary control flow may exist within a task, and the compiler ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


Evaluating the XMT Parallel Programming Model - Naishlos, Nuzman, Tseng, Vishkin (2001)   (Correct)

....shared memory and message passing programming models on multiprocessor systems. Our work attempts to examine parallel programming with respect to the different assumptions implied by an on-chip environment. Various other projects explore on-chip parallel architectures: CMP [HNO97], Multiscalar [Franklin93], SMT [TLE 99], and Raw [WTS97]. The current paper is targeted toward exploring shared memory parallel algorithms as applied to scalable on-chip architecture. 6. Conclusion. This paper presented features of XMT, a parallel programming model designed for exploiting on-chip parallelism. With ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


A Dynamic Multithreading Processor - Akkary, Driscoll   (72 citations)  (Correct)

....a simultaneous multithreading pipeline to increase processor utilization, except that the threads are created dynamically from the same program. Although the DMT processor is organized around dynamic simultaneous multiple threads, the execution model draws a lot from the multiscalar architecture [4,5]. The multiscalar implements mechanisms for multiple flows of control to avoid instruction fetch stalls and exploit control independence. It breaks up a program into tasks that execute concurrently on identical processing elements connected as a ring. Since the tasks are not independent, ....

M. Franklin. The Multiscalar Architecture. Ph.D. Thesis, University of Wisconsin, Nov 93.


Towards a First Vertical Prototyping of an Extremely .. - Naishlos, Nuzman.. (2001)   (1 citation)  (Correct)

....for a CMP is the Stanford Hydra architecture [HNO97]. Research in this area has tended to focus on multiprogramming, and on speculative execution to extract threads from a single program. Other proposed multi-threaded architectures, such as Simultaneous Multithreading (SMT) [TEL95] or Multiscalar [Franklin93], also feature multiple program counters and make useful points of comparison. Recent work on SMT [TLE 99] has proposed light-weight synchronization methods for multithreading. In fact the Acquire primitive is very similar to the suspend primitive presented here. The two instructions share ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Evaluating the XMT Parallel Programming Model - Dorit Naishlos Joseph (2001)   (Correct)

....the shared memory and message passing programming models on multiprocessor systems. Our work attempts to examine parallel programming with respect to the different assumptions implied by an on-chip environment. Various other projects explore on-chip parallel architectures: CMP [10], Multiscalar [8], SMT [13]. The current paper is targeted toward exploring shared memory parallel algorithms as applied to scalable on-chip architecture. [Figure: Radix 16384, speedups versus number of TCUs; series: xmt, splash] Note that XMT, with the parallel prefix sum for example, aspires to scale ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Dynamic Parallel Media Processing Using Speculative Broadcast.. - Fritts, Wolf (2001)   (3 citations)  (Correct)

....loop execution we propose is Speculative Broadcast Loop (SBL) execution. 2 Speculative Broadcast Loop. We propose the Speculative Broadcast Loop (SBL) method for the speculative execution of parallel loop iterations. This new vector-like run-time method is a simplified version of the multiscalar [8] and multithreaded [9][10] speculative methods that combines SIMD parallelism with large-scale speculative execution for supporting data parallelism in multimedia. The SBL run-time technique uses profiling and register dependence analysis (i.e. memory profiling and register dependence analysis are ....

Manoj Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Department of Computer Science, University of Wisconsin at Madison, 1993.


Evaluating Multi-threading in the Prototype XMT Environment - Naishlos, Nuzman, Tseng.. (2000)   (1 citation)  (Correct)

....parallelism is that occupied by chip multiprocessors (CMP) [HNO97]. Research in this area has tended to focus on multiprogramming, rather than fine-grained multithreading of a single task. Other proposed multi-threaded architectures, such as Simultaneous Multithreading (SMT) or Multiscalar [Franklin93], also feature multiple program counters and make useful points of comparison. Recent work on SMT [TLE 99] has proposed light-weight synchronization methods for multithreading. In fact the Acquire primitive is very similar to the suspend primitive presented here. The two instructions share ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


Architecture of the Atlas Chip-Multiprocessor.. - Codrescu, Wills, Meindl (1999)   (9 citations)  (Correct)

....to thread parallelism averages 3.4 on 8 processors. The contribution of this paper is to present and to evaluate the architecture of a chip multiprocessor that dynamically parallelizes sequential binary applications. 1.1. Related work Speculative multithreading was introduced by the Multiscalar [11][33] architecture. This design uses the compiler to divide the program into threads and schedule inter thread register communication. Hardware is responsible for thread control predictions, speculative buffering, memory disambiguation, synchronizing register communication, and misspeculation ....

M. Franklin, "The Multiscalar Architecture" Ph.D thesis, University of Wisconsin -- Madison, 1993


Profiling for Input Predictable Threads - Codrescu, al. (1998)   (Correct)

....the instruction level, such as limits on ILP [26][11][1], using value predictors to increase ILP [13][9][8], locating value-predictable instructions through profiling [4], etc. Recently, there has been interest in studying single-program speculative execution at the thread level. The Multiscalar [7][20] work introduced and popularized this idea. The Multiscalar processor favors a hardware-centric approach and synchronizes register flow between tasks. The XIMD [27], M-Machine [6], Simultaneous Multithreading [23], SPSM [5], Hydra [17], Stampede [21], Raw [25], Impact [10], and Superthreading ....

M. Franklin, "The Multiscalar Architecture" PhD thesis, University of Wisconsin -- Madison, 1993


Hardware Techniques To Improve The Performance Of The.. - Burger (1998)   (10 citations)  (Correct)

....At the register level, clustered architectures, such as the Alpha 21264 (or proposed MultiCluster architecture [37]), distribute the register interface to multiple banks of functional units, thus achieving high, yet cost-effective, bandwidth out of the global register files. Multiscalar processors [41, 114] increase instruction fetch bandwidth by distributing the instruction fetch (at the L1 I-cache interface) as well as the register banks. The Multiscalar work assumed centralized L1 data caches, although more recent proposals distribute the L1 data caches as well [53]. To our knowledge, the ....

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, WI, December 1993.


A Study of Control Independence in Superscalar Processors - Eric Rotenberg Quinn (1999)   (10 citations)  (Correct)

....depend on the branch outcome. These instructions are control dependent on the branch. Other instructions deeper in the window may be control independent of the mispredicted branch: they will be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re-executed [9, 10]. This can be illustrated with a simple example. [Figure 1: An example of control independence.] Figure 1 shows a control flow graph (CFG) containing four basic blocks. Basic blocks are used for simplicity and may be substituted with arbitrary control flow. The branch ....

....important aspects of programs themselves were not modeled; in particular, a significant subset of data dependences were ignored due to the trace driven nature of the study. Several microarchitecture implementations have since been proposed that incorporate control independence in some form [10,12 19]. In these studies, however, either the impact of control independence is not isolated, or insight into the reported performance gains is limited and obscured by artifacts of the particular design. In this paper we have three primary objectives and contributions. The first objective is to ....

[Article contains additional citation context not shown here]

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisc., Nov 1993.


Control Independence in Trace Processors - Rotenberg (1999)   (6 citations)  (Correct)

....on the order of 30 . The proposed mechanisms are complex due to the non-hierarchical superscalar organization, and there is a reliance on the compiler to provide complete control dependence information. Nonetheless, the study is useful for understanding control independence. Multiscalar processors (Franklin, 1993; Sohi et al., 1995), Dynamic Multithreading (Akkary & Driscoll, 1998), and other multithreaded architectures (Oplinger et al., 1997; Steffan & Mowry, 1998; Dubey et al., 1995; Tsai & Yew, 1996) exploit control independence by pursuing multiple flows of control. Either the compiler or hardware ....

Franklin, M. (1993). The multiscalar architecture. Ph.D. thesis, Computer Sciences Department, University of Wisconsin - Madison.


AR-SMT: Coarse-Grain Time Redundancy for High Performance.. - Eric Rotenberg   (Correct)

....time implications) adding an extra register rename map, and designing the control logic and datapaths for SMT. All of this is in addition to the Delay Buffer storage. 4.0 Trace processors as a platform for AR-SMT. In this paper we use a new processor microarchitecture called trace processors [18,19,20,21] as a platform for AR-SMT. A trace is a long, dynamic sequence of instructions captured and stored by hardware. It may contain any number of control transfer instructions. The primary constraint on a trace is a hardware-determined maximum length, but there may be any number of other ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


A Study of Control Independence in Superscalar Processors - Rotenberg, Jacobson, Smith (1999)   (10 citations)  (Correct)

....depend on the branch outcome. These instructions are control dependent on the branch. Other instructions deeper in the window may be control independent of the mispredicted branch: they will be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re executed [10, 11]. This can be illustrated with a simple example. Figure 1 shows a control flow graph (CFG) containing four basic blocks. Basic blocks are used for simplicity and, in general, may be substituted with arbitrary control flow. The conditional branch terminating block 1 is mispredicted, with dashed ....

....important aspects of programs themselves were not modeled; in particular, a significant subset of data dependences were ignored due to the trace driven nature of the study. Several microarchitecture implementations have since been proposed that incorporate control independence in some form [11, 13, 14, 15, 16, 17, 18, 1]. In these studies, however, either the impact of control independence is not isolated, or insight into the reported performance gains is limited and obscured by artifacts of the particular design. In this paper we have three primary objectives and contributions. The first objective is to ....

[Article contains additional citation context not shown here]

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


Improving Superscalar Instruction Dispatch and Issue by.. - Vajapeyam, Mitra (1997)   (51 citations)  (Correct)

....The key differences of the dependence-based architecture are (i) the register file is not separated into local and global files, and (ii) register renaming information is not reused, thus having the same instruction dispatch rate as traditional superscalar processors. The Multiscalar architecture [Fra92b, Fra93a] takes a different approach to window partitioning and achieving large windows. Here code segments identified at compile time are executed in parallel on multiple PEs organized as a circular chain. The program's instruction window consists of the combined instruction windows of all the PEs. ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, University of Wisconsin-Madison, 1993.


Control Independence in Trace Processors - Rotenberg, Smith (1999)   (6 citations)  (Correct)

....on the order of 30 . The proposed mechanisms are complex due to the non hierarchical superscalar organization, and there is a reliance on the compiler to provide complete control dependence information. Nonetheless, the study is useful for understanding control independence. Multiscalar processors [16,2], Dynamic Multithreading [11] and other multithreaded architectures [17, 18, 19, 20] exploit control independence by pursuing multiple flows of control. Either the compiler or hardware partitions the program into tasks threads, or subgraphs of the CFG, which may contain arbitrary control flow. ....

M. Franklin. The Multiscalar Architecture. Ph.D. thesis, University of Wisconsin, Nov 1993.


Dynamic Cluster Assignment Mechanisms - Canal, Parcerisa, González (2000)   (12 citations)  (Correct)

....also be applied to VLIW architectures [7][13]. In this case the partitioning is done at compile time. Other authors have proposed clustered microarchitectures in which the partitioning scheme focuses on reducing the control dependence penalties. Examples of such architectures are the Multiscalar [8][18], SPSM [4], Superthreaded [19], Trace Processors [16][20], PEWs [10], Speculative Multithreaded [11], and Dynamic Multithreaded [1]. In such architectures, each cluster executes a different thread of control, all except one being speculative. Partitioning to reduce branch penalties and data ....

M. Franklin, "The Multiscalar Architecture", Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, Univ. of Wisconsin-Madison, 1993.


Dynamic Vectorization: A Mechanism for Exploiting.. - Vajapeyam, Joseph, Mitra (1999)   (2 citations)  (Correct)

....than 10K instructions are often needed to expose large-scale ILP. Considerable recent research on high-performance processors has focused on fetching and dispatching multiple basic blocks per cycle to build larger instruction windows (for example, trace processors [13] and multiscalar processors [2, 11]). While the proposed approaches enhance the instruction window size beyond that of current superscalar processors, they either fall short of the window sizes suggested by ILP limit studies or rely on considerable compiler support. In this paper, we propose dynamic vectorization as a mechanism ....

....a large static body but much smaller dynamic size. Second, and more important, the CONDEL architecture typically does not fetch instructions from beyond the loop during loop execution. The dynamic vectorization scheme overlaps post-loop code's execution with the loop. The multiscalar architecture [2, 11] attempts to build a large instruction window by statically partitioning the program into multiblocks and having different PEs fetch and execute different multiblocks at runtime. A key problem faced by the multiscalar architecture is appropriate static choices of multiblocks, affected by things ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, University of Wisconsin-Madison, 1993.


Prédiction de l'adresse des lectures pour tolérer la.. - Hai, Rochange.. (1997)   (Correct)

....concerned in memory are not the same and that there are no dependences, of course. Techniques have been proposed to execute loads speculatively, betting that there will be no alias with pending stores whose addresses are not yet known [5][6][8]. This technique is called memory disambiguation. Loads are therefore executed even if the addresses of preceding stores are not known, but the loaded data waits in the reservation station until the addresses of all preceding stores are known before ....

....Often, several stores are waiting in reservation stations for their operands to be calculated and thus cannot be bypassed, since their addresses are not known. Some work has been done on techniques for speculating on the absence of aliases with previous stores in order to execute loads earlier [5][6][8]; this is referred to as memory disambiguation. Loads are executed even if the addresses of some preceding stores are not known, but the loaded data waits in the reservation station until the addresses of all preceding stores are known. This scheme gives a little performance gain and has been considered in our baseline ....

M. Franklin, The Multiscalar Architecture, PhD thesis, University of Wisconsin, Madison, 1993.


Superconductor Multithreaded Subsystem for Petaflops Scale.. - Mikhail Dorojevets   (Correct)

....this technique was first implemented in the peripheral processors of CDC 6600. Several high-performance multithreaded computers, including HEP [15], MARS M [6], and Tera [1], have been built, and different forms of multithreading have been investigated and implemented in many research projects [2,5,7,8,11]. Multithreading and prefetching are the key techniques of latency tolerance in the concept of thread percolation proposed for the HTMT computer system [10]. After starting a program, PIMs look for ready-to-execute program entities (threads) and allocate their activation records in CRAM. When a ....

Franklin, M. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin, Madison, 1993.


DataScalar Architectures and the SPSD Execution Model - Burger, Kaxiras, Goodman (1996)   (Correct)

....parallel execution. Many of our ideas were inspired by the Massive Memory Machine proposal, from which we obtained the concept of ESP [15]. Other research efforts are examining the running of uniprocessor programs much faster by using multiple program counters; the Multiscalar group at Wisconsin [13, 29] is one example. This is a complementary project, however, since we focus on the part of the system that is external to the processor (faster processors simply make our case stronger). Other projects are looking at processor memory integration, such as the IRAM project at Berkeley [24], the PPRAM ....

Manoj Franklin. The Multiscalar Architecture. Ph.D. thesis, University of Wisconsin, December 1993.


Trace Processors - Rotenberg, Jacobson, Sazeides, Smith (1997)   (103 citations)  (Correct)

....paper draws from significant bodies of work that either efficiently exploit ILP via distribution and hierarchy, expose ILP via aggressive speculation, or do both. For the most part, this body of research focuses on hardware-intensive approaches to ILP. Work in the area of multiscalar processors [8][9] first recognized the complexity of implementing wide instruction issue in the context of centralized resources. The result is an interesting combination of compiler and hardware. The compiler divides a sequential program into tasks, each task containing arbitrary control flow. Tasks, like ....

....flows of control not only avoid instruction fetch and dispatch complexity, but also exploit control independence. Because tasks are neither scheduled by the compiler nor guaranteed to be parallel, these processors demonstrate aggressive control speculation [10] and memory dependence speculation [8][11]. More recently, other microarchitectures have been proposed that address the complexity of superscalar processors. The trace window organization proposed in [4] is the basis for the microarchitecture presented here. Conceivably, other register file and memory organizations could be ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Nov 1993.


Improving Single-Process Performance with Multithreaded.. - Farcy, Temam (1996)   (3 citations)  (Correct)

....and also serve synchronization means. Single process performance improvement is briefly discussed in this article: a 5.8 speedup with respect to a conventional RISC processor is reported using 8 threads and two load store units (note that a perfect cache is used). Sohi [27] and Franklin [7] also proposed a more novel architecture called multiscalar processor, between a multiprocessor concept and a multithreaded processor. A multiscalar processor allows parallel execution of tasks, each on a distinct execution unit. Those units share a common memory and communicate through a ring to ....

.... these studies, the possibility to improve single process performance is evaluated [27] showing speedups up to 6 with an 8 unit multiscalar processor over a conventional processor on SPECint92, and the data cache issues of processors running multiple shared context threads are examined in details [7]. Cache Architecture When multiple threads are run simultaneously, the number of cache accesses per cycle can be very high. With respect to instruction caches, the most simple solution is to use one cache per thread. The same solution can be considered for data caches. However, if a single ....

[Article contains additional citation context not shown here]

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, 1993.


Improving the Parallelism and Concurrency in Decoupled.. - K.J., Naresh.C   (Correct)

....driving computer architects and designers to exploit as much parallelism as possible from the instruction stream. Exploitation of instruction level parallelism has led to the emergence of superscalar [13][45], VLIW [8][31], decoupled [38][42][12][41][33][34][23][49][9], multiscalar [10][11], and other fine-grain parallel architectures. Decoupled access execute (DAE) computer architectures [38][42][12][41][33][34][23][49] attempt to separate access and execute processes in a job, provide concurrency between them, and yield high performance and increased flexibility. DAE ....

M. Franklin, "The Multiscalar Architecture", Ph.D. thesis, Computer Science Department, Univ of Wisconsin-Madison, 1993.


Simultaneous Multithreading: Maximizing On-Chip Parallelism - Tullsen, Eggers, Levy (1995)   (210 citations)  (Correct)

....[7] each processor cluster schedules LIW instructions onto execution units on a cycle-by-cycle basis similar to the Tera scheme. There is no simultaneous issue of instructions from multiple threads to functional units in the same cycle on individual clusters. Franklin's Multiscalar architecture [13, 12] assigns fine-grain threads to processors, so competition for execution resources (processors in this case) is at the level of a task rather than an individual instruction. Hirata, et al. [16] present an architecture for a multithreaded superscalar processor and simulate its performance on a ....

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, 1993.


Multiscalar Processors - Sohi (1995)   (249 citations)  (Correct)

....unit to the next [1]. The data cache banks and the associated interconnect (between the data cache banks and the units) are straightforward (except for the scale). Updates of the data cache are not performed speculatively. Instead, additional hardware, known as an Address Resolution Buffer or ARB [3 5], is provided to hold speculative memory operations, detect violations of memory dependences, and initiate corrective action as needed. The ARB may be viewed as a collection of the speculative memory operations of the active tasks. The values corresponding to these operations reside in the ....

M. Franklin, "The Multiscalar Architecture," Ph. D. Thesis, Computer Sciences Technical Report #1196, University of Wisconsin-Madison, Madison, WI 53706, November 1993.


The PEWs Microarchitecture: Reducing Complexity Through.. - Ranganathan, Franklin   Self-citation (Franklin)   (Correct)

....stream. Thus, instructions that are control-dependent on a conditional branch tend to be assigned to the hardware window to which the branch has been allocated. In this case, instructions wait near where their control dependences will be resolved. Examples for this approach are the multiscalar [5][12], SPSM [1], superthreading [14], and trace processors [11][15], where each instruction group is a task or a trace depending on whether the processor pursues multiple (independent) flows of control or not. Control dependence based decentralization fits well with the control-driven program ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


Block-Level Prediction for Wide-Issue Superscalar Processors - Dutta, Franklin (1995)   (1 citation)  Self-citation (Franklin)   (Correct)

....length, and number of targets. Therefore, the instructions in a subgraph are most likely fetched sequentially. Such a control flow prediction scheme, with unrestricted subgraph structures, is more apt for execution models that pursue multiple flows of control, such as the multiscalar processor [2], and is not the subject of this paper. For processors that follow the superscalar model of execution, we need to restrict the subgraph structure, and identify a path within the subgraph to fetch instructions from. The approach considered in [13] is to use tree-like subgraphs, and select a path ....

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


Branch Prediction in Multi-Threaded Processors - Gummaraju, Franklin   (5 citations)  Self-citation (Franklin)

....on a multiscalar simulator indicates that these techniques, especially a hybrid of extrapolation and correlation, can substantially lower the branch misprediction ratios. 1 Introduction There has been a growing interest in the use of multithreading to speed up the execution of a single program [1] [2] [6] [9] [11] [12]. The compiler or the hardware extracts threads from a sequential program, and the hardware executes multiple threads in parallel, most likely with the help of multiple processing elements (PEs). Whereas a single threaded processor can only extract parallelism from a group of ....

....less likely to cause interference. 2.2.4 Intra Thread Control Flow and Thread Execution Style The exact nature of threads and their execution style have a strong bearing on branch history and branch prediction. For instance, if threads are initiated speculatively (as in the multiscalar processor [1] [11] and the superthreaded processor [12]), then some of the active threads may get squashed because of incorrect thread-level control speculation. When using a shared branch predictor, if the updates are done at branch prediction time, then thread-level misprediction requires setting back some of ....

[Article contains additional citation context not shown here]

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


XMT-M: A Scalable Decentralized Processor - Berkovich, Nuzman, Franklin.. (1999)   Self-citation (Franklin)

....[Figure 9: Effect of Cross Chip (Global) Communication Latency on XMT-M Performance. Axes: Benchmark (linkedlist, listsort, integersort, max, stream, arrcomp d, arrcomp i) vs. Cycles, for latencies 1, 4, 16; input size 250K.] The U. Wisconsin Multiscalar project [13] and the U. Washington simultaneous multi threading (SMT) project [27] with their use of multiple program counters and the computer architecture literature on multi threading (see, for instance [17]) have also been very useful; however, the way XMT proposes to attack the completion time of a ....

M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.


An Empirical Study of Decentralized ILP Execution Models - Ranganathan, Franklin (1998)   (9 citations)  Self-citation (Franklin)

....size of the hardware window. It is very important, therefore, to decentralize the dynamic scheduling hardware. The importance of decentralization is underscored in recently developed processors execution models such as the MIPS R10000 [20] and R12000, the DEC Alpha 21264 [7], the multiscalar model [4] [14], the superthreading model [17], the trace processing model [13] [15] [19], the MISC (Multiple Instruction Stream Computer) [18], the PEWs (Parallel Execution Windows) model [6] [11], and the multicluster model [3]. All of these execution models split the dynamic instruction window across ....

....same PE. Thus, instructions that are control dependent on the same conditional branch are generally assigned to the PE to which the branch has been assigned, and instructions wait near where their control dependences will be resolved. Examples for this approach are the multiscalar execution model [4] [14], the superthreading model [17], and the trace processing model [13] [15] [19]. Control dependence based decentralization fits well with the control driven program specification typically adopted in current ISAs. Because control-dependent instructions tend to be grouped together in the ....

[Article contains additional citation context not shown here]

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, 1993.


High-Performance Frontends for Trace Processors - Jacobson (1999)

No context found.

M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, Computer Sciences Technical Report #1196, University of Wisconsin-Madison, November 1993.


Instruction History Management for High-Performance Microprocessors - Bhargava (2003)

No context found.

M. Franklin. The Multiscalar Architecture. PhD thesis, Univ. of Wisconsin-Madison, 1993.


Running Parallel Applications on an MP With.. - Krishnan, Zhang..

No context found.

M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, 1993.


Smart Register Files for High-Performance Microprocessors - Postiff, Mudge (1999)

No context found.

Manoj Franklin. The Multiscalar Architecture. University of Wisconsin, Madison Tech. Report. Nov, 1993.


CiteSeer.IST - Copyright Penn State and NEC