45 citations found.
N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread scheduling for multiprogrammed multiprocessors, in ACM Symposium on Parallel Algorithms and Architectures, 1998.

This paper is cited in the following contexts:
Online Scheduling of Parallel Programs on Heterogeneous.. - Bender, Rabin (2002)

....it puts a dollar in the work bucket. Observation 1. At the end of the computation there are a total of exactly W1 dollars in the work bucket. We now use a potential function argument to prove a bound on the number of dollars in the steal bucket. This argument is an extension of the result in [1, 8] and begins with some definitions. Definitions. For any (nonroot) node v, suppose that node u is the last of v's parents to be executed. Then we say that the execution of node u enables node v. Node u is called the designated parent of v and edge (u, v) is called the enabling edge. The graph ....

....the nodes, so that we can use these weights in a potential function argument. Let d(u) denote the depth of node u in the dag, i.e. the distance to the root node. Each node u has weight w(u) = W1 − d(u) so that nodes closer to the root have larger weight. We now present the potential function from [1, 8], which we will use. Let $R_t$ be the set of ready nodes at time t. Each node is either in some deque or assigned to and executed on some processor. For each ready node $v \in R_t$, we define its potential $\phi_t(v)$ as $\phi_t(v) = 3^{2w(v)-1}$ if v is assigned; $3^{2w(v)}$ otherwise. We let t (i) ....

[Article contains additional citation context not shown here]

N. Arora, R. Blumofe, and G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119-129, 1998.
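
The potential function quoted in the context above can be made concrete with a small sketch. It assumes the base-3 form from the analysis of Arora et al., in which an assigned node of weight w has potential 3^(2w-1) and a node waiting in a deque has potential 3^(2w). The function names and the two-child example are illustrative assumptions, not taken from either paper.

```python
def node_potential(weight, assigned):
    """Potential of one ready node of the given weight.

    Assumed form: 3^(2w - 1) if the node is assigned to a processor,
    3^(2w) if it is still waiting in a deque.
    """
    exponent = 2 * weight - 1 if assigned else 2 * weight
    return 3 ** exponent


def total_potential(ready_nodes):
    """Sum the potentials of (weight, assigned) pairs over all ready nodes."""
    return sum(node_potential(w, a) for w, a in ready_nodes)


# Hypothetical step: an assigned node of weight 3 executes and enables two
# children of weight 2; one child becomes assigned, the other enters the deque.
before = total_potential([(3, True)])
after = total_potential([(2, True), (2, False)])
print(before, after)  # 243 108
```

In this example the total potential drops from 243 to 108 when the node executes, which is the kind of decrease a potential function argument relies on.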


Atomic Instructions in Java - Hovemeyer, Pugh, Spacco (2002)   (4 citations)

.... swap instructions yields better scalability on a large multiprocessor than a queue implemented with lock based synchronization. 1 Introduction Wait free data structures and algorithms have been an active area of research in recent years [3, 6, 12, 18] and have spawned a variety of applications [4, 8, 9]. They have the desirable property that when multiple threads access a wait free data structure, stalled threads cannot prevent other threads from making progress. This avoids a variety of problems encountered with lock based (blocking) synchronization, such as priority inversion and formation of ....

....them useful for operating systems [8, 9] where high priority system threads (such as interrupt handlers) may need to access data structures that are also accessed by low priority threads. A wait free double ended queue has also been used successfully in a work stealing parallel fork join framework[4]. Ad hoc implementations of atomic instructions have also found their way into commercial implementations of the Java programming language. Several companies have determined that use of compare and swap in the implementation of java.util.Random could substantially improve performance on some ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, 1998.


Memory Management for High-Performance Applications - Berger (2002)   (2 citations)

....not provide significant performance gains, we believe that exploiting richer profiles and adapting to more complex application behavior can provide improved performance, especially on multiprocessors. Such optimizations include padding out allocations to avoid false sharing and using atomic deques [3] to manage memory between threads in producer consumer relationships. We have developed two different memory managers, Hoard and reaps, to address two aspects of memory management. Hoard provides scalable concurrent general purpose memory management, and reaps provide extra semantics for server ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, Puerto Vallarta, Mexico, June 1998.


Using Moldability to Improve the Performance of Supercomputer Jobs - Cirne (2001)

....the results of these investigations often cannot be applied to the on line problem. Likewise, much of the work in scheduling shared memory multiprocessors does not apply directly to distributed memory parallel computers. In principle, on line schedulers designed for shared memory machines [5, 13, 15, 86] can execute on distributed memory supercomputers. However, migrating a task is much cheaper in shared memory multiprocessors than on distributed memory machines. For shared memory machines, the cost of restarting a task is roughly independent of the processor assigned to the task. ....

Nimar Arora, Robert Blumofe, and C. Greg Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 28 - July 2, 1998.


Parallelizing NP-Complete Problems Using Tree Shaped Computations - Sanders (1999)

.... methods are also of central importance for parallel functional and logical programming languages (e.g. [1, 14]). Tree shaped computations can be considered a generalization of the α-splitting model used in [16]. A related model based on multithreaded computations is used in the Cilk project [4, 3, 2]. The ZRAM library [6] is another recent implementation effort. 2 The Abstract Model All the work to be done by a tree shaped computation is initially subsumed in a single root problem I_root located on a processing element (PE) numbered 0. All other PEs start idle, i.e. they only have an empty ....

.... in the sense that there are tree shaped computations which require at least as many splits [29]. The algorithm also works well if the speed of the PEs in a network of workstations varies dynamically due to external load, since the additional irregularity introduced by this is comparably small [29, 2]. We can even tolerate a complete deactivation of a worker process as long as it still answers .... (Footnote: Obviously, very regular instances with large h are possible. But in applications where this is frequently the case, one would look for a splitting function exploiting these regularities to decrease h.)

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In 10th ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, 1998.


Asynchronous Random Polling Dynamic Load Balancing - Sanders

....from the dilemma that subtrees which are not subdivided turn out to be too large for proper load balancing whereas excessive communication is necessary if the tree is shredded into too many pieces. We consider random polling dynamic load balancing [19] (also known as randomized work stealing [5, 10, 2, 11]), a simple algorithm that avoids both problems: Every processing element (PE) handles at most one piece of work (which may represent a part of a backtracking tree) at any point in time. If a PE runs out of work, it sends requests to randomly chosen PEs until a busy one is found which splits its ....

....be considered as solved. Although tree shaped computations span a remarkably wide area of applications, an important area for future research is to generalize the analysis to models that cover dependencies between subproblems. The predictable dependencies modeled by multithreaded computations [2] are one step in this direction. But in many classic search problems the main difficulty are heuristics that prune the search tree in an unpredictable way. ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In 10th ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, 1998.
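
The random polling scheme described in the first context above can be illustrated with a minimal, sequential simulation. This is a sketch under stated assumptions rather than the cited algorithm: work is modelled as a count of unit tasks, a busy PE that is polled hands over half of its remaining units (the splitting rule is hypothetical), and each idle PE polls one random PE per step. The function name and parameters are invented for illustration.

```python
import random


def random_polling(total_work, num_pes, seed=0):
    """Simulate random polling; return the number of steps until all work is done.

    All work starts on PE 0. In each step a busy PE executes one unit of work,
    and an idle PE polls one randomly chosen other PE. A polled PE holding more
    than one unit gives half of its units to the requester (assumed rule).
    """
    rng = random.Random(seed)
    load = [0] * num_pes
    load[0] = total_work
    done = 0
    steps = 0
    while done < total_work:
        steps += 1
        for pe in range(num_pes):
            if load[pe] > 0:
                load[pe] -= 1          # busy PE: execute one unit
                done += 1
            elif num_pes > 1:
                victim = rng.choice([q for q in range(num_pes) if q != pe])
                if load[victim] > 1:   # victim is busy and can split its work
                    share = load[victim] // 2
                    load[victim] -= share
                    load[pe] += share
    return steps


steps = random_polling(1000, 8, seed=1)
print(steps)
```

With a single PE the simulation takes exactly `total_work` steps; with more PEs the step count lies between `total_work / num_pes` and `total_work`.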


A Pragmatic Implementation of Non-Blocking Linked-Lists - Harris (2001)   (8 citations)

....and deletion operations. The new algorithm provides substantial benefits over previous schemes: it is conceptually simpler and our prototype operates substantially faster. 1 Introduction It is becoming evident that non blocking algorithms can deliver significant benefits to parallel systems [MP91, LaM94, GC96, ABP98, Gre99]. Such algorithms use low level atomic primitives such as compare and swap. Through careful design and by eschewing the use of locks it is possible to build systems which scale to highly parallel environments and which are resilient to scheduling decisions. Linked lists are one of the most basic ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 119-129, Puerto Vallarta, Mexico, June 28-July 2, 1998. SIGACT/SIGARCH.


Even Better DCAS-Based Concurrent Deques - Detlefs, Flood, Garthwaite.. (2000)   (4 citations)

....in pointers. In the best case (no interference) it requires only one DCAS per push and one DCAS per pop. We also sketch a proof of correctness. 1 Introduction In academic circles and in industry, it is becoming evident that non blocking algorithms can deliver significant performance benefits [3, 20, 17] and resiliency benefits [9] to parallel systems. Unfortunately, there is a growing realization that existing synchronization operations on single memory locations, such as compare and swap (CAS) are not expressive enough to support design of efficient non blocking algorithms [9, 10, 12] and ....

....algorithms based on the DCAS operation. There have recently been several proposed designs for non blocking linearizable concurrent double ended queues (deques) using the double compare and swap operation [9, 2]. Deques, as described in [15] and currently used in load balancing algorithms [3], are classic structures to examine, in that they involve all the intricacies of LIFO stacks and FIFO queues, with the added complexity of handling operations originating at both ends of the deque. Massalin and Pu [16] were the first to present a collection of DCAS based concurrent algorithms. ....

[Article contains additional citation context not shown here]

N. S. Arora, R. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. 10th ACM Symp. Parallel Algorithms and Architectures, 1998.
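
The remark that deques combine LIFO and FIFO behaviour at two ends can be shown with a short sequential sketch. It illustrates only the access pattern used in load balancing (the owner works at one end, a thief takes from the other); it is not a concurrent or non-blocking implementation and uses no CAS or DCAS. The class and method names are hypothetical.

```python
from collections import deque


class WorkDeque:
    """Single-threaded model of a work-stealing deque (no synchronization)."""

    def __init__(self):
        self._items = deque()

    def push_bottom(self, task):
        """Owner adds a new task at the bottom."""
        self._items.append(task)

    def pop_bottom(self):
        """Owner removes the most recently pushed task (LIFO), or None if empty."""
        return self._items.pop() if self._items else None

    def steal_top(self):
        """Thief removes the oldest task from the top (FIFO order), or None if empty."""
        return self._items.popleft() if self._items else None


d = WorkDeque()
for task in ["a", "b", "c", "d"]:
    d.push_bottom(task)

owner_task = d.pop_bottom()   # newest task: "d"
thief_task = d.steal_top()    # oldest task: "a"
print(owner_task, thief_task)
```

The difficulty addressed by the cited papers lies in making these operations safe when the owner and thieves run concurrently, which this sketch deliberately leaves out.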


Parallel Garbage Collection for Shared Memory Multiprocessors - Flood, Detlefs (2001)   (6 citations)

....show that this combination of static and dynamic methods leads to effective parallelization of both the semispaces and mark compact collectors. It is our belief that the effectiveness of our dynamic partitioning is the result of a finely tuned lock free work stealing algorithm based on Arora et al. [1] whose low overhead allows us to balance our work at the individual object level. In both algorithms there are parts that are not easily parallelizable. In the semispaces algorithm these included: installing forwarding pointers, allocating in parallel, and scanning the card table for references ....

Nimar S. Arora; Robert D. Blumofe; and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, 1998.


An Incremental Stuttering Refinement Proof of a Concurrent.. - Sumners (2000)

....database transactions). In some cases, efficiency is a concern and a low level concurrent implementation is needed to solve a particular problem. In those cases it is paramount that the programmer carefully documents and/or proves the correctness of his/her algorithm. Arora, Plaxton, and Blumofe [2] developed a program for maintaining a deque viewed and manipulated by an arbitrary number of concurrent processes which is used in a process scheduler based on work stealing. The optimality of the scheduler relies on the assumption that the programs manipulating the deque are wait free but make ....

....The functions owner and thief define the local step functions. Each local owner or thief step transforms the state variables depending on the current value of loc by performing the corresponding assignments and then updating the loc variable to its next value. The program steps were defined [2] to correspond to operations which could be performed atomically for a particular concurrent microarchitecture. For instance, the steps at owner loc 14 and thief loc 8 correspond to a common compare and swap operation which is often atomic. It should also be noted that the (RETURN itm) and (return ....

N. Arora, R. Blumofe, and C. Plaxton. Thread Scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1998.


nanoProtean: Scalable System Software for a Gigabit.. - Craig, Kim.. (2001)

....voluntarily yields the resource. Once packet processing is completed it is scheduled for an output interface through a nonblocking output queue. The nonblocking output buffer is similar in design to, though developed independently from, a recently presented hybrid private parallel access run queue [19]. As a rule, hardware supported atomic operations, which are substantially more time costly, are avoided unless contention mandates their use. The output queue is optimized for parallel enqueue from UNCs submitting reservations for completed packets. The timer interrupt only occurs on one CPU because a ....

N. Arora, R. Blumofe, and C. Plaxton, "Thread scheduling for multiprogrammed multiprocessors," in ACM Symposium on Parallel Algorithms and Architectures, 1998, pp. 119--129.


Low-Contention Depth-First Scheduling of Parallel Computations.. - Fatourou (2001)   (1 citation)

....dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....

....is allowed to synchronize only with its parent. Algorithms that achieve a space bound $O(P S_1)$ were also obtained by Burton [10] for a different class of parallel computations. A time efficient algorithm for general computations, which however ignores the space requirements, has been presented in [1]. The first depth first schedulers were presented in [4, 26] for languages with nested fine grained parallelism. These algorithms result in poor locality and low scheduling granularity, since threads close together in the computation graph are usually scheduled on different processors. Moreover, ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, Puerto Vallarta, Mexico, June-July 1998.


A Transparent Operating System Infrastructure for.. - Venetis..

....program should be able to both resume preempted computation, should it lose a processor, and utilize a newly granted idle processor. The literature provides a wealth of solutions for attacking the performance bottlenecks that arise from the interference between multiprocessing and multiprogramming [1, 3, 6]. However, little effort has been spent on the transparent integration of these solutions and multithreading programming models in a generic, model independent manner. Existing frameworks for efficient multiprogrammed execution either pose stringent requirements on the multithreading model, or depend ....

N. Arora, R. Blumofe, and G. Plaxton. Thread scheduling for Multiprogrammed Multiprocessors. In Proc. of the 10th ACM Symposium on Parallel Algorithms and Architectures, pages 119-129, Puerto Vallarta, Mexico, June 1998.


DCAS-Based Concurrent Deques - Agesen, Detlefs, Flood, Garthwaite.. (2000)   (3 citations)

....and is the first non blocking unbounded memory deque implementation. It too allows uninterrupted concurrent access to both ends of the deque. 1 Introduction In academic circles and in industry, it is becoming evident that non blocking algorithms can deliver significant performance [3, 23, 20] and resiliency benefits [11] to parallel systems. Unfortunately, there is a growing realization that existing synchronization operations on single memory locations, such as compare and swap (CAS), are not expressive enough to support design of efficient non blocking algorithms [11, 12, 16] and ....

....body of efficient data structures based on the DCAS operation. This paper presents two novel designs of non blocking linearizable concurrent double ended queues (deques) using the double compare and swap operation. Deques, originally described in [18] and currently used in load balancing algorithms [3], are classic structures to examine, in that they involve all the intricacies of LIFO stacks and FIFO queues, with the added complexity of handling operations originating at both ends of the deque. By being linearizable [17] and non blocking [14] our concurrent deque implementations are ....

[Article contains additional citation context not shown here]

Arora, N. S., Blumofe, R. D., and Plaxton, C. G. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (1998).


The Power of Two Random Choices: A Survey of Techniques .. - Mitzenmacher, Richa.. (2000)   (16 citations)

....the client server model. The load balancing problem can be classified according to the nature of the tasks themselves. The problem is more complex if the tasks have explicit dependencies; for instance, the tasks may represent a multi threaded computation modeled as a directed acyclic graph [BL94, ABP98] or the tasks may be generated by a backtrack search or branch and bound algorithm [KZ93]. The situation where the tasks are independent is somewhat simpler and several models for generating and consuming independent tasks are considered in the literature [RSAU91, BFM98, BFS99]. In the random ....

....excess tasks. The matching of the heavily loaded processors to lightly loaded processors must be performed efficiently and in a distributed manner. In a work stealing algorithm, a lightly loaded processor steals tasks from a suitable processor (e.g. [ELZ86b, FMM91, FM87, HZJ94, Mit98, BL94, ABP98]). A particular example of this approach is idle initiated work stealing where a processor that becomes idle seeks to obtain tasks from nonidle processors. Randomized algorithms have proven to be a critical tool in this matching process since the earliest investigations in this area. More ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, June 1998.


Scheduling Cilk Multithreaded Parallel Programs on Processors.. - Bender, Rabin (2000)   (1 citation)

....it puts a dollar in the work bucket. Observation 1. At the end of the computation there are a total of exactly W1 dollars in the work bucket. We now use a potential function argument to prove a bound on the number of dollars in the steal bucket. This argument is an extension of the result in [1, 7] and begins with some definitions. Definitions. For any (nonroot) node v, suppose that node u is the last of v's parents to be executed. Then we say that the execution of node u enables node v. Node u is called the designated parent of v and edge (u, v) is called the enabling edge. The graph ....

....supplied with these definitions, we describe the Structural Lemma of the deques. This lemma guarantees that for any deque, at all times during the execution of the work stealing algorithm, the designated parents of the nodes in the deque lie on the root to leaf path in the enabling tree. Lemma 7 ([1, 7]) Let k be the number of (ready) nodes in a given deque at any time t, and let $v_1, v_2, \ldots, v_k$ denote these nodes ordered from bottom to top. Let $v_0$ be the assigned node. In addition, for $i = 1, \ldots, k$, let $u_i$ be the designated parent of $v_i$. Then for $i = 1, \ldots, k$, node $u_i$ is an ancestor ....

[Article contains additional citation context not shown here]

N. Arora, R. Blumofe, and G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA: Annual ACM Symposium on Parallel Algorithms and Architectures, 1998.


Program Transformation and Runtime Support for Threaded MPI.. - Tang, Shen, Yang (2000)   (3 citations)

....However, their concern is how multiple threads can be invoked in each MPI node, but not how to execute each MPI node as a thread. Previous work has illustrated the importance of lock free management for reducing synchronization contention and unnecessary delay due to locks [Anderson 1990; Arora et al. 1998; Herlihy 1991; Lumetta and Culler 1998; Massalin and Pu 1991]. Lock free synchronization has also been used in the process based SGI implementation [Gropp et al. 1996]. Theoretically speaking, some concepts of SGI's design could be applied to our case after considerations for thread based ....

....is not documented and its source code is not available to the public. Also, their design uses busy waiting when a process is waiting for events [Salo 1998], which is not desirable for multiprogrammed environments [Kontothanassis et al. 1997; Ousterhout 1982]. Lock free studies in [Anderson 1990; Arora et al. 1998; Herlihy 1991; Lumetta and Culler 1998; Massalin and Pu 1991] restrict their queue models to be either FIFO or FILO. These models are not sufficient for MPI point to point communication, and sometimes too general with unnecessary overhead for MPI. A study that attempts to use lock free data ....

[Article contains additional citation context not shown here]

ARORA, N. S., BLUMOFE, R. D., AND PLAXTON, C. G. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Symposium on Parallel Algorithms and Architectures. Puerto Vallarta, Mexico, 119--29.


Scheduling Threads for Low Space Requirement and Good Locality - Narlikar (1999)   (6 citations)

....of deques from b can be either serialized and protected by a lock (for small ) or performed lazily in parallel (for large ) 4.2 Space bound We now analyze the space bound for a parallel computation executed by algorithm DFDeques . The analysis uses several ideas from previous work [2, 6, 34]. Let be the dag that represents the parallel computation being executed. Depending on the resulting parallel schedule, we classify its nodes (actions) into one of two types: heavy and light. Every time a processor performs a steal, the first node it executes from the stolen thread is called a ....

....in which the number of deques in b is less than . As with the proof of Lemma 4.2, we split type B timesteps into phases such that each phase has between and h.g steal attempts. We can then use a potential function argument similar to the dedicated machine case by Arora et al. [2]. Composing phases from only type B timesteps (ignoring type A timesteps) retains the validity of their analysis. We briefly outline the proof here. Nodes are assigned exponentially decreasing potentials starting from 10 As with the proof of Lemma 4.2, we can use the Chernoff bound here because ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM symp. Parallel Algorithms and Architectures, 1998.


Fast Synchronization on Scalable Cache-Coherent.. - Nikolopoulos..

.... not appeared in the literature until recently [10, 17]. Lock free synchronization has also attracted considerable attention due to its competitive performance compared to lock based synchronization and its robustness as a synchronization discipline in multiprogrammed shared memory multiprocessors [2, 13, 16, 17, 20]. Synchronization primitives on shared memory multiprocessors can be analyzed effectively through time decomposition of synchronization periods [5]. A generic synchronization primitive can be decomposed into at most four distinct time intervals, the acquire, the waiting, the compute and the ....

N. Arora, R. Blumofe and C. Greg Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. Proc. of the 10th ACM Symp. on Parallel Algorithms and Architectures, pp. 119--129, Puerto Vallarta (Mexico), Jun. 1998.


A Java Fork/Join Framework - Lea (2000)   (3 citations)

....this framework, the operating system should somehow be convinced to try to run other unrelated runnable processes or threads. The tools for achieving this in Java are weak, have no guarantees (see [6, 7]) but usually appear to be acceptable in practice (as do similar techniques described for Hood [3]). A thread that fails to obtain work from any other thread lowers its priority before attempting additional steals, performs Thread.yield between attempts, and registers itself as inactive in its FJTaskRunnerGroup. If all others become inactive, they all block waiting for additional main tasks. ....

Arora, Nimar, Robert D. Blumofe, and C. Greg Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 28 - July 2, 1998.


A New Scheduling Algorithm for General Strict Multithreaded .. - Fatourou, Spirakis (1999)   (1 citation)

....processors steal work from other processors. The work stealing paradigm dates back at least as far as Burton and Sleep's research [11] on parallel execution of functional programs and Halstead's implementation of Multilisp [18]. Since then a lot of work has been done in this direction (see e.g. [1, 4, 5, 6, 7, 8, 15]). Three significant performance parameters of any scheduling algorithm for multithreaded computations are the required space, their execution time and the communication cost incurred by them. The execution time is the total time needed by the algorithm to execute the instructions of all threads ....

....prove that for any ε > 0, with probability at least 1 − ε the algorithm's execution time is O(T1/P hT1 log P log(1/ε) while its communication complexity is O(P (hT1 log(1/ε) 1 n d )S max ) with probability again at least 1 − ε. Substantial research (see e.g. [1, 20, 24, 28]) has been reported in the literature concerning the scheduling of multithreaded computations, ignoring though space requirements and communication costs. Burton shows in [10] how to limit space in certain parallel computations without causing deadlock. More recently, Burton [9] has developed and ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe and C. G. Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors, " Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, Puerto Vallarta, Mexico, June--July 1998.


Dynamic Load Balancing Issues In The Earth Runtime System - Kakulavarapu (1999)

....and application grain size is very delicate. This is even more important for irregular and dynamic applications where the computation and communication patterns cannot be identified at compile time. While there has been a good understanding of load balancers behavior in distributed systems [16, 122, 21, 118, 22] , the study of dynamic load balancers for fine grain multithreaded systems is still in the early stages. Existing studies are often purely theoretical, based on queuing models or simulations. On the other hand, the results in EARTH are based on an actual multithreaded emulator built on top of ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proc. of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, Puerto Vallarta, Mexico, pages 119--129, June-July 1998.


Portable High-Performance Programs - Frigo (1999)   (1 citation)

....arise when spin locks are used extensively. For example, even if a worker is suspended by the operating system during the execution of pop, the infrequency of locking in the THE protocol means that a thief can usually complete a steal operation on the worker's deque. Recent work by Arora et al. [14] has shown that a completely nonblocking work stealing scheduler can be implemented. Using these ideas, Lisiecki and Medina [101] have

Program     Size  T1     T∞      P      c1    T8    T1/T8  TS/T8
fib         35    12.77  0.0005  25540  3.63  1.60  8.0    2.2
blockedmul  1024  29.9   0.0044  6730   1.05  4.3   7.0    6.6
notempmul   1024  29.7   0.015   ....

N. S. ARORA, R. D. BLUMOFE, AND C. G. PLAXTON, Thread scheduling for multiprogrammed multiprocessors, in Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.


Adaptive Two-level Thread Management for Fast MPI Execution.. - Shen, Tang, Yang (1999)   (1 citation)  (Correct)

....is studied in [20, 21, 28, 31] and their threads are targeted at compiler generated fine grained parallelism with simple synchronization while our threads are designed for running coarse grained MPI programs with rich synchronization semantics. The thread system studied for the Cilk language [6] addresses lock free management and adaptive mapping of userlevel threads to OS kernel threads. Our study targets at MPI which has more complicated communication and synchronization primitives. The granularity of a thread in Cilk is typically much finer than that in MPI, thus our work focuses ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, June 1998.


Program Transformation and Runtime Support for Threaded MPI.. - Tang, Shen, Yang (1999)   (3 citations)  (Correct)

....invoked in each MPI node, but not how to execute each MPI node as a thread. These studies are useful for us to relax our assumptions in the future. Previous work has also illustrated the importance of lock free management for reducing synchronization contention and unnecessary delay due to locks [5, 6, 20, 25, 26]. Lock free synchronization has also been used in the process based SGI implementation [19]. Theoretically speaking, some concepts of SGI's design could be applied to our case after considerations for thread based execution. However, as a proprietary implementation, SGI's MPI design is not ....

....low level functions and hardware support specific to the SGI architecture, which may not be general or suitable for other machines. Also, their design uses busy waiting when a process is waiting for events [31], which is not desirable for multiprogrammed environments [23, 28]. Lock free studies in [5, 6, 20, 25, 26] either restrict their queue model to be FIFO or FILO, which are not sufficient for MPI point to point communication, or are too general with unnecessary overhead for MPI. A lock free study for MPICH is conducted in a version for the NEC shared memory vector machines and Cray T3D [18, 9, 2] using ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, June 1998.


Scheduling Threads for Low Space Requirement and Good Locality - Narlikar (1999)   (6 citations)  (Correct)

....deletions of deques from R can be either serialized and protected by a lock (for small p) or performed lazily in parallel (for large p) 4.2 Space bound We now analyze the space bound for a parallel computation executed by algorithm DFDeques . The analysis uses several ideas from previous work [2, 6, 34]. Let G be the dag that represents the parallel computation being executed. Depending on the resulting parallel schedule, we classify its nodes (actions) into one of two types: heavy and light. Every time a processor performs a steal, the first node it executes from the stolen thread is called a ....

....timesteps in which the number of deques in R is less than p. As with the proof of Lemma 4.2, we split type B timesteps into phases such that each phase has between p and 2p - 1 steal attempts. We can then use a potential function argument similar to the dedicated machine case by Arora et al. [2]. Composing phases from only type B timesteps (ignoring type A timesteps) retains the validity of their analysis. We briefly outline the proof here. Nodes are assigned exponentially decreasing potentials starting from .... (Footnote 10:) As with the proof of Lemma 4.2, we can use the Chernoff bound here because ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM symp. Parallel Algorithms and Architectures, 1998.


Cilk: Efficient Multithreaded Computing - Randall (1998)   (3 citations)  (Correct)

....arise when spin locks are used extensively. For example, even if a worker is suspended by the operating system during the execution of pop, the infrequency of locking in the THE protocol means that a thief can usually complete a steal operation on the worker's deque. Recent work by Arora et al. [4] has shown that a completely nonblocking work stealing scheduler can be implemented. Using these ideas, Lisiecki and Medina [68] have modified the Cilk-5 scheduler to make it completely nonblocking. Their experience is that the THE protocol greatly simplifies a nonblocking implementation. The ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998. To appear.


Scheduling Parallel Programs on Non-Uniform Memory Architectures - Gerson Cavalheiro   (Correct)

....threads is used on each node. Each of those threads executes an infinite loop: at each iteration, it gets a task from the reserve and executes it until completion. On each node, this reserve list is seen by the high level scheduling. However, in order to avoid contention due to unnecessary mutexes [1, 2], this reserve is split into m sublists, one for each thread of the node (see details in figure 2). Preserving the depth first strategy of the high level work stealing scheduling, each of the m threads handles its sublist in a LIFO manner. When a thread tries to get a task from an empty ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of X SPAA, Puerto Vallarta, Mexico, June 1998.


Compile/Run-time Support for Threaded MPI Execution on.. - Hong Tang (1999)   (3 citations)  (Correct)

....proposed in this extended abstract are focused on efficient point to point communication primitives using lock free queue management techniques. The previous work has illustrated the importance of lock free management for reducing synchronization contention and unnecessary delay due to locks [4, 5, 12, 15, 16], and it has also been used in the process based SGI implementation [11]. Theoretically speaking, some concepts of their design could be applied to our case after certain considerations for supporting thread based execution. However, as a proprietary implementation, SGI's MPI design is not ....

....low level functions and hardware support specific to the SGI architecture, which may not be general or suitable for other machines. Also, their design uses busy waiting when a process is waiting for events [21], which is not desirable for multiprogrammed environments [13, 18]. Lock free studies in [4, 5, 12, 15, 16] either restrict their queue model to be FIFO or stack, which are not sufficient for MPI point to point communication, or are too general with unnecessary overhead for MPI. Thus our second goal is to design an efficient communication protocol for MPI threads by using a new lock free queue management ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, June 1998.


The Implementation of the Cilk-5 Multithreaded Language - Frigo, Leiserson, Randall (1998)   (70 citations)  (Correct)

....arise when spin locks are used extensively. For example, even if a worker is suspended by the operating system during the execution of pop, the infrequency of locking in the THE protocol means that a thief can usually complete a steal operation on the worker's deque. Recent work by Arora et al. [2] has shown that a completely nonblocking work stealing scheduler can be implemented. Using these ideas, Lisiecki and Medina [21] have modified the Cilk-5 scheduler to make it completely nonblocking. Their experience is that the THE protocol greatly simplifies a nonblocking implementation. The ....

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998. To appear.


Hood: A User-Level Thread Library for Multiprogramming.. - Papadopoulos (1998)   (Correct)

....accurately the performance of parallel applications that use the Hood implementation of the non blocking work stealer. In fact, this performance model is based on an analytical bound that has been proven to hold in a model where the kernel level scheduling is actually performed by an adversary [9]. Thus, our model is extraordinarily robust. Moreover, we have developed a collection of prototype applications to prove the efficiency of our library. All of our applications have been written in C on top of Hood and vary from matrix computations to n body simulations and ray tracing. We show ....

....The work stealing algorithm dynamically assigns threads to processes for execution in a provably efficient manner [14, 15]. In this chapter, we review the work stealing algorithm, and we state the proven performance bounds. In addition, we describe the non blocking implementation of this algorithm [9] that is used in Hood. In the rest of this thesis, we experiment with applications that are coded with Hood. We have also implemented and studied some alternative implementations of the work stealing algorithm. These alternative implementations perform poorly and reveal the importance of some ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.


The Performance of Work Stealing in Multiprogrammed Environments - Extend Ed   (Correct)

....display poor performance in such multiprogrammed environments [2]. As an alternative to coscheduling or process control, we investigate the use of dynamic, user level, thread scheduling. In particular, we show that a non blocking [3] implementation of the work stealing thread scheduling algorithm [1] achieves efficient performance even when the number of available processors grows and shrinks over time. All of the experiments in this paper were run on a Sun Ultra Enterprise 5000 with eight 167 MHz UltraSPARC processors running Solaris 2.5.1, with no modifications. The work stealing thread ....

....algorithm in which the recursive subproblems can be solved in parallel, a separate thread is created for each recursive call. The threads are scheduled onto the P processes using the non blocking work stealer, a non blocking [3] implementation of the work stealing algorithm, as described in [1]. The idea is that each process maintains a pool of ready threads and executes threads from its pool in last in first out (LIFO) order. If a process finds that its pool is empty, it steals the first in thread from a process chosen at random. The thread pools are implemented with non blocking ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998. To appear.
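
The scheduling rule quoted in the context above (each process runs threads from its own pool in LIFO order, and an idle process steals the first-in thread from a randomly chosen process) can be sketched as a small sequential simulation. This is an illustrative sketch only, not the non-blocking implementation the excerpt refers to: the function name `run_work_stealing`, the round-based loop, and the choice to start all work in process 0's pool are assumptions made for the example, and threads here are plain labels that spawn no further work.

```python
import random
from collections import deque


def run_work_stealing(initial_tasks, num_processes, seed=0):
    """Simulate work stealing in rounds (hypothetical helper, not from the source).

    Returns (executed, steals): the tasks each process ran, in order,
    and the number of successful steals.
    """
    rng = random.Random(seed)
    pools = [deque() for _ in range(num_processes)]
    pools[0].extend(initial_tasks)  # assumption: all work starts in process 0's pool
    executed = [[] for _ in range(num_processes)]
    steals = 0

    while any(pools):  # a non-empty deque is truthy
        for p in range(num_processes):
            if pools[p]:
                # Owner takes the most recently added thread (LIFO).
                task = pools[p].pop()
            else:
                # Idle process picks a victim uniformly at random.
                victim = rng.randrange(num_processes)
                if victim == p or not pools[victim]:
                    continue  # failed steal attempt; try again next round
                # Thief takes the first-in (oldest) thread of the victim.
                task = pools[victim].popleft()
                steals += 1
            executed[p].append(task)
    return executed, steals
```

With a single process no steals can occur, so `run_work_stealing(["a", "b", "c"], 1)` runs the tasks in reverse insertion order. With several processes the per-process orders depend on the random victim choices, but every task is executed exactly once.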


Compile/Run-time Support for Threaded MPI Execution on.. - Hong Tang (1999)   (3 citations)  (Correct)

....invoked in each MPI node, but not how to execute each MPI node as a thread. These studies are useful for us to relax our assumptions in the future. Previous work has also illustrated the importance of lock free management for reducing synchronization contention and unnecessary delay due to locks [4, 5, 18, 21, 22]. Lock free synchronization has also been used in the process based SGI implementation [17]. Theoretically speaking, some concepts of SGI's design could be applied to our case after considerations for thread based execution. However, as a proprietary implementation, SGI's MPI design is not ....

....low level functions and hardware support specific to the SGI architecture, which may not be general or suitable for other machines. Also, their design uses busy waiting when a process is waiting for events [27], which is not desirable for multiprogrammed environments [19, 24]. Lock free studies in [4, 5, 18, 21, 22] either restrict their queue model to be FIFO or FILO, which are not sufficient for MPI point to point communication, or are too general with unnecessary overhead for MPI. A lock free study for MPICH is conducted in a version for the NEC shared memory vector machines and Cray T3D [16, 8, 2] using ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, June 1998.


Scheduling Threads for Low Space Requirement and Good Locality - Girija Narlikar (1999)   (6 citations)  (Correct)

....The zeroing then requires a minimum depth of Theta(log m); it can be performed in parallel by forking a tree of height Theta(log m). 4.2 Space bound. We now analyze the space bound for a parallel computation executed by algorithm DFDeques. The analysis uses several ideas from previous work [2, 6, 36]. Due to space limitations, we only present the outline of the proofs; detailed analysis can be found elsewhere [34]. Let G be the dag that represents the parallel computation being executed. Depending on the resulting parallel schedule, we classify its nodes (actions) into one of two types: heavy ....

....Type B: n_i < p. We now consider timesteps in which the number of deques in R is less than p. We split type B timesteps into phases such that each phase has between p and 2p - 1 steal attempts. We can then use a potential function argument similar to the dedicated machine case by Arora et al. [2]. Composing phases from only type B timesteps (ignoring type A timesteps) retains the validity of their analysis. We briefly outline the proof here. Nodes are assigned exponentially decreasing potentials starting from the root downwards. Thus a node at a depth of d is assigned a potential of 3 ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM symp. Parallel Algorithms and Architectures, 1998.


Athapascan-1: A multithreaded execution model based on data flow - Gerson Cavalheiro   (Correct)

.... a sequential execution [4]. Such a strategy has been implemented in the Cilk parallel language where the lexicographic order is used both to bound the memory space and to provide efficient sequential execution for dynamic series parallel graphs; it has led to very good experimental performance [1] on parallel architectures with uniform memory access. Similarly to Cilk, the high level scheduling implements a work stealing algorithm. On each node of the architecture, a list of ready tasks is managed. Each node uses it to store tasks that become ready. When a node ....

....loss of parallelism. Experiments performed to evaluate the overhead of this schedule with respect to a pure sequential code show that it could be used to amortize the cost of task creation. Future work then consists in implementing a work stealing algorithm similar to the one developed in Cilk [1]. The difference here is that the schedule will be based on the reference sequential order to execute efficiently programs with neither restrictions on the pattern of the synchronizations nor migration of running closures. ....

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. of X SPAA, Puerto Vallarta, Mexico, June 1998.


Compile/Run-time Support for Threaded MPI Execution on.. - Tang, Shen, Yang (1999)   (3 citations)  (Correct)

....invoked in each MPI node, but not how to execute each MPI node as a thread. These studies are useful for us to relax our assumptions in the future. Previous work has also illustrated the importance of lock free management for reducing synchronization contention and unnecessary delay due to locks [4, 5, 18, 21, 22]. Lock free synchronization has also been used in the process based SGI implementation [17]. Theoretically speaking, some concepts of SGI's design could be applied to our case after considerations for thread based execution. However, as a proprietary implementation, SGI's MPI design is not ....

....low level functions and hardware support specific to the SGI architecture, which may not be general or suitable for other machines. Also, their design uses busy waiting when a process is waiting for events [27], which is not desirable for multiprogrammed environments [19, 24]. Lock free studies in [4, 5, 18, 21, 22] either restrict their queue model to be FIFO or FILO, which are not sufficient for MPI point to point communication, or are too general with unnecessary overhead for MPI. A lock free study for MPICH is conducted in a version for the NEC shared memory vector machines and Cray T3D [16, 8, 2] using ....

[Article contains additional citation context not shown here]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, June 1998.


The Data Locality of Work Stealing - Acar, Blelloch, Blumofe (2000)   (2 citations)  Self-citation (Blumofe)   (Correct)

....as race free computations. In this paper we consider only race free computations. The work stealing algorithm is a thread scheduling algorithm for multithreaded computations. The idea of work stealing dates back to the research of Burton and Sleep [11] and has been studied extensively since then [2, 9, 19, 20, 24, 36, 37]. In the work stealing algorithm, each process maintains a pool of ready threads and obtains work from its pool. When a process spawns a new thread the process adds the thread into its pool. When a process runs out of work and finds its pool empty, it chooses a random process as its victim and ....

....Follows from Theorem 3 and Lemma 11. 6 An Analysis of Nonblocking Work Stealing. The non blocking implementation of the work stealing algorithm delivers provably good performance under traditional and multiprogrammed workloads. A description of the implementation and its analysis is presented in [2]; an experimental evaluation is given in [10]. In this section, we extend the analysis of the non blocking work stealing algorithm for classical workloads and bound the execution time of a nested parallel computation with a work stealer to include the number of cache misses, the cache miss ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.


Verification of a Concurrent Deque Implementation - Blumofe, Plaxton, Ray (1999)   (1 citation)  Self-citation (Blumofe Plaxton)   (Correct)

....invocations appear as if they are executed atomically in some serial order, synchronizability allows some invocations to appear as if they are executed atomically at exactly the same time. 1 Introduction. In this paper we prove the correctness of the concurrent deque implementation given in [1] as a component of the work stealing thread scheduling algorithm. This implementation is nonblocking, meaning that slow or preempted processes cannot prevent other processes from making progress [2]. No mutual exclusion is used. This nonblocking property makes this implementation ideal for use in ....

....methods as pushBottom, popBottom, pushTop, and popTop. One or more processes manipulate the deque by invoking these methods. A nonblocking concurrent deque allows the execution of two or more method invocations to be arbitrarily interleaved. The nonblocking concurrent deque implementation of [1] does not provide a true concurrent deque as defined in the preceding paragraph, as it only specifies methods for three of the four deque operations, and it restricts the set of processes allowed to invoke each of these methods. Specifically, this concurrent deque implementation is subject to the ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, Puerto Vallarta, Mexico, June 1998.
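
The restricted interface described in the context above (three of the four deque operations, with limits on which processes may call each) can be sketched as follows. The excerpt is truncated before it states the restriction, so this sketch assumes the omitted operation is pushTop and that the bottom-end methods are reserved for the owning process. The class name `OwnerThiefDeque` is hypothetical, and a lock is used purely as a stand-in to keep the example short: unlike the cited implementation, this sketch is NOT nonblocking.

```python
import threading
from collections import deque


class OwnerThiefDeque:
    """Illustrative three-operation deque (hypothetical; lock-based stand-in)."""

    def __init__(self):
        self._items = deque()
        # Stand-in only: the cited implementation uses no mutual exclusion.
        self._lock = threading.Lock()
        # Assumption: the creating thread is the owner of the bottom end.
        self._owner = threading.get_ident()

    def _check_owner(self):
        if threading.get_ident() != self._owner:
            raise RuntimeError("only the owning thread may use the bottom end")

    def push_bottom(self, item):
        """Owner adds an item at the bottom."""
        self._check_owner()
        with self._lock:
            self._items.append(item)

    def pop_bottom(self):
        """Owner removes the most recently pushed item, or gets None if empty."""
        self._check_owner()
        with self._lock:
            return self._items.pop() if self._items else None

    def pop_top(self):
        """Any thread (a thief) removes the oldest item, or gets None if empty."""
        with self._lock:
            return self._items.popleft() if self._items else None
```

An owner pushes and pops at the bottom while other threads take from the top; there is deliberately no `push_top` method, mirroring the three-operation interface.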


Hood: A User-Level Threads Library for Multiprogrammed.. - Blumofe, Papadopoulos (1998)   (1 citation)  Self-citation (Blumofe)   (Correct)

....based on work and critical path length that characterizes accurately the performance of parallel applications that use Hood. This performance model is based on an analytical bound that we have proven to hold in a model where the kernel level scheduling is actually performed by an adversary [4]. 1.1 The problem with static load balancing. Before considering Hood, we first review a well known performance anomaly that occurs when parallel programs use static load balancing [20, pages 284-285]. In the simplest case, when such a program executes, it creates some number P of processes, ....

....threads are created and synchronized at user level, only a small amount of work per thread is needed to amortize the cost of creating and synchronizing the myriad threads. To eliminate the performance cliff, Hood schedules threads using a non blocking implementation of the work stealing algorithm [4, 9]. This implementation employs non blocking synchronization [18] for the concurrent data structures and judicious use of yield system calls. Effectively, Hood's non blocking work stealer automatically adapts to the kernel's allocation of processes to processors. If we have PA P , then some ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, Puerto Vallarta, Mexico, June 1998.


The Performance of Work Stealing in Multiprogrammed.. - Blumofe, Papadopoulos (1998)   (5 citations)  Self-citation (Blumofe)   (Correct)

....length that characterizes accurately the performance of parallel applications that use this non blocking work stealer. In fact, this performance model is based on an analytical bound that we have proven to hold in a model where the kernel level scheduling is actually performed by an adversary [9]. Thus, our model is extraordinarily robust. We shall restrict attention to shared memory multiprocessors, and all experiments are performed on a Sun Ultra Enterprise 5000 with eight 167 MHz UltraSPARC processors running Solaris 2.5.1. We shall use the word process to denote a kernel scheduled ....

....The work stealing algorithm dynamically assigns threads to processes for execution in a provably efficient manner [14, 15]. In this section, we review the work stealing algorithm, and we state the proven performance bounds. In addition, we describe the non blocking implementation of this algorithm [9]. In the next few sections, we experiment with applications that are coded to use this non blocking work stealer. 3.1 The work stealing algorithm. In the work stealing algorithm, each process maintains its own pool of ready threads from which it obtains work, and when a process finds that its ....

[Article contains additional citation context not shown here]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.


Lock-Free and Practical Deques using Single-Word.. - Sundell, Tsigas (2004)   (Correct)

No context found.

N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread scheduling for multiprogrammed multiprocessors, in ACM Symposium on Parallel Algorithms and Architectures, 1998.


On-the-Fly Maintenance of Series-Parallel.. - Bender, Fineman.. (2004)   (Correct)

No context found.

N. Arora, R. Blumofe, and G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 119--129, 1998.


Two-Handed Emulation: How to build non-blocking implementations.. - Greenwald (2002)   (3 citations)  (Correct)

No context found.

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Symposium on Parallel Algorithms and Architectures, pages 119-129, June 28 - July 2, 1998. Puerto Vallarta, Mexico.

CiteSeer.IST - Copyright Penn State and NEC