31 citations found.
J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, "Thread Scheduling for Cache Locality," The Proceedings of Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 60--71, Oct. 1996.

This paper is cited in the following contexts:

Locality-Aware Predictive Scheduling of Network Processors - Wolf, Franklin (2001)   (1 citation)  (Correct)

.... to the network processor environment, it does not consider the reuse of instruction cache state for different threads that use the same instruction code (as it is done with packets that use the same application). An example for scheduling that uses hints about the processing requirement is [10]. In this work, the compiler provides information about thread requirements that are used by the scheduler to determine a thread execution schedule with high cache locality. Salehi et al. show the effect of affinity based scheduling on network processing in [11]. While this also considers the ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In Proc. of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, Oct. 1996.


The Data Locality of Work Stealing - Acar, Blelloch, Blumofe (2000)   (2 citations)  (Correct)

.... One class of such techniques is based on software controlled distribution of data among the local memories of a distributed shared memory system [15, 22, 26]. Another class of techniques is based on hints supplied by the programmer so that similar tasks might be executed on the same processor [15, 31, 34]. Both these classes of techniques rely on the programmer or compiler to determine the data access patterns in the program, which may be very difficult when the program has complicated data access patterns. Perhaps the earliest class of techniques was to attempt to execute threads that are close ....

.... distributed among the nodes of a distributed shared memory system by the programmer and a thread in the computation is scheduled on the node that holds the data that the thread accesses [15, 22, 26]. In the second class, data locality hints supplied by the programmer are used in thread scheduling [15, 31, 34]. Techniques from both classes are employed in distributed shared memory systems such as COOL and Illinois Concert [15, 22] and also used to improve the data locality of sequential programs [31]. However, the first class of techniques do not apply directly to HSMSs, because HSMSs do not allow ....

[Article contains additional citation context not shown here]

James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 60--71, Cambridge, Massachusetts, October 1996.


Per-Node Multi-Threading and Remote Latency - Kritchalach Thitikamol And   (Correct)

.... a lightweight, fine grained threading package with adaptive load balancing [9]. Lightweight thread packages [10] can be fine grained enough that it is possible to load balance through thread migration, and to minimize unhealthy interactions with the underlying DSM by bin scheduling of threads [11]. However, such systems usually do not allow threads to be blocked, i.e. all threads are run to completion. The challenge is to build a lightweight threading system without changing the programming model in this way. The primary goal of this paper was to evaluate the effect of MT on DSM ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, "Thread Scheduling for Cache Locality," in Proceedings of the 7th International Conference on Architectural Supports for Programming Languages and Operating Systems, 1996.


Thread Scheduling for Out-of-Core Applications with Memory .. - Zhou, Wang, Clark, Li   (Correct)

....by improving applications data locality. The third limitation is that no results have been shown with the memory server model for message passing parallel applications. A recent study has proposed a method to improve the cache locality of sequential programs by scheduling fine-grained threads [17]. The scheduling algorithm relies upon hints provided at the time of thread creation to determine thread execution order to improve data locality. This method can effectively reduce second level cache misses and consequently improve the performance of some untiled sequential applications. But their ....

....outperforms the traditional virtual memory disk paging by more than an order of magnitude for sequential applications and a factor of 3 to 6 for parallel applications. 2 Fine grained Thread Scheduling The fine grained thread scheduling was originally proposed to improve data locality for caches [17]. Their results show that this method can significantly improve performance by reducing second level cache misses. However, the proposed method is limited to threads that are independent of each other. In this paper, we extend the thread scheduling approach to handle dependent threads, and apply ....

[Article contains additional citation context not shown here]

James Philbin et al. Thread Scheduling for Cache Locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60--71, Cambridge, Massachusetts, 1--5 October 1996. ACM Press.


Optimizing Overall Loop Schedules using Prefetching and.. - Chen, O'Neil, Sha   (Correct)

....by Philbin et al. They do so to improve the inter thread cache locality of sequential programs which contain fine grained threads. Their experiments show that the thread scheduling method along with the partitioning idea can improve program performance by reducing second level cache misses [11]. While Philbin's paper shows the effectiveness of the partitioning technique in improving the cache locality of multi thread programs, our research also shows the successful usage of the partitioning idea in improving the loop schedules. To the authors' knowledge, this paper presents the first ....

J. Philbin, J. Edler, O.J. Anshus, C.C. Douglas, and K. Li. Thread scheduling for cache locality. In Computer Architecture News v 24 n Special Issue, pages 60--71, Oct. 1996.


Pthreads for Dynamic and Irregular Parallelism - Narlikar, Blelloch (1998)   (4 citations)  (Correct)

....lightweight threads packages written for shared memory machines. In particular, we are interested in implementing a scheduler that efficiently supports dynamic and irregular parallelism. 2.1 Scheduling lightweight threads A variety of lightweight, user level threads systems have been developed [6, 11, 14, 15, 25, 29, 33, 37, 40, 45, 53], including mechanisms to provide coordination between the kernel and the user level threads library [2, 49, 31]. Although the main goal of the threads schedulers in previous systems has been to achieve good load balancing and/or locality, a large body of work has also focused on developing ....

....as the one described elsewhere [34] would be required to ensure further scalability. The effectiveness of our scheduler has been demonstrated on one SMP; future work involves studying its applicability to a scalable, NUMA multiprocessor by combining it with locality based scheduling techniques [6, 30, 40]. For example, to schedule threads on a hardware coherent cluster of SMPs, our scheduling algorithm could be used to maintain one shared queue on each SMP, and threads would be moved between SMPs only when required. We have shown that our space efficient scheduler is well suited for programs with ....

James Philbin, Jan Edler, Otto J. Anshus, and Craig C. Douglas. Thread scheduling for cache locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60--71, Cambridge, Massachusetts, 1--5 October 1996. ACM Press.


An Automatic Object Inlining Optimization and its Evaluation - Dolby, Chien (2000)   (16 citations)  (Correct)

....optimized ones. General data structure libraries such as the Java standard library and NIHCL [24] are examples of this phenomenon. Such overheads have been addressed before: allocations by sophisticated memory management [4, 21] and by storage analyses [7, 42] and dereferences by prefetching [33] and by redundant load and store elimination [18] which uses sophisticated alias analyses [23, 17]. We are exploring the problem of increased object indirection by developing automatic object inlining optimizations. These techniques preserve the programming model's modularity and semantics while ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 60-71, 1996.


A Memory-layout Oriented Run-time Technique for Locality.. - Yan, Zhang, Zhang   (Correct)

....caches, maximizing data reuses in the cache, and trading off locality and the other performance factors. 1.2. Our solution and contributions Because most data reuses of an application occur in loop structures [15] and the parallel loop is a major program structure in scientific applications [13, 14, 16, 22], we propose a run time technique to improve the memory performance of parallel loops with dynamic data access patterns. In our run time technique, the memory access patterns of parallel tasks in a program are captured at run time using a multi dimensional memory access space based on simple ....

....1, it is difficult for a user to specify affinity. Our proposed technique uses a simple programming interface for a user or compiler to specify simple information about data, not about complicated affinity relations. Regarding the run time locality optimization of sequential programs, reference [16] proposes a memory layout oriented method. It reorganizes the computation of a loop based on some simple hints about the memory reference patterns of loops and cache architectural information. Compared with a uniprocessor system, a cache coherent shared memory system has more complicated factors ....

[Article contains additional citation context not shown here]

J. E. Philbin, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. Proceedings of ASPLOS'96, pages 60--71, Oct. 1996.


Scheduling Threads for Low Space Requirement and Good Locality - Narlikar (1999)   (6 citations)  (Correct)

....the load. This technique effectively increases scheduling granularity, and therefore provides good locality [7] and low scheduling contention. Another approach for obtaining good locality is to allow the user to supply hints to the scheduler regarding the data access patterns of the threads [12, 28, 37, 45]. However, such hints can be cumbersome for the user to provide in complex programs, and are often specific to a certain language or library interface. Therefore, our DFDeques algorithm instead uses the heuristic of scheduling threads close in the dag on the same processor to obtain good locality. ....

J. Philbin, J. E., O. J. Anshus, and C. C. Douglas. Thread scheduling for cache locality. In Intl. Conf. Architectural Support for Programming Languages and Operating Systems, pages 60--71, 1996.


An Evaluation of Automatic Object Inline Allocation Techniques - Dolby, Chien (1998)   (16 citations)  (Correct)

....Pointer dereference (called pointer chasing) overhead not only incurs additional memory traffic, but given performance sensitivity to data locality, typically reduces cache efficiency. This topic has been studied by many researchers, using both runtime techniques (e.g. fine grained multi threading [22]) and compile time approaches (e.g. representations that explicate dependencies thru pointers [19]). But it remains a challenging open problem. Object inlining coalesces objects by inline allocating child objects within their container objects. This attacks pointer chasing by eliding the pointers ....

James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 60--71, 1996.


Dynamic Pointer Alignment: Tiling and Communication.. - Zhang, Chien (1997)   (4 citations)  (Correct)

....Figure 15. The figure shows total computation time in terms of idle time, communication overhead, and local computation, with the speedup shown on top of each bar. The Base bars of both codes use DPA for .... (footnote 7) Although reordering computation may also improve sequential performance (see Section 6 and [21]), because of the small L1 cache and the lack of an L2 cache, we believe this is not the case on a single T3D node. ....

Code         Static strip size   Max. no. of suspended threads   Max. elements of data space
Barnes Hut   50                  4,749                           495
Barnes Hut   300                 27,722                          833
FMM          50                  2,386                           1,198
FMM          300                 5,472                           1,714

Table 3: Barnes Hut and ....

....goal of DPA is to generalize loop and array oriented tiling [1, 4, 24, 32] and communication optimizations [20] to pointer-based computations. Although developed independently, our use of non blocking threads labeled by pointers is similar to the recent cache optimization study by Philbin, et al. [21]. In [21] the programmer manually extracts threads and can supply multiple scalar hints as thread labels. DPA is an automatic approach relying on the compiler for thread extraction and integrates communication optimizations, not addressed in [21], to tolerate latency and reduce communication ....

[Article contains additional citation context not shown here]

James Philbin, et al. Thread scheduling for cache locality. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996.


CS270 Course Project - Resource Constrained Cache-Affinity   Self-citation (Li)   (Correct)

No context found.

Philbin, J., Edler, J., Anshus, O. J., Douglas, C. C., and Li, K. Thread scheduling for cache locality. In Architectural Support for Programming Languages and Operating Systems (1996), pp. 60--71.


A Note on Cache Memory Methods for Multigrid in Three Dimensions - Douglas, Thorne (1998)   Self-citation (Douglas)   (Correct)

No context found.

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, Thread scheduling for cache locality, Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA), ACM, 1996, pp. 60-73.


Cache Based Multigrid On Unstructured Two Dimensional Grids - Douglas, Hu, Rüde.. (1998)   (1 citation)  Self-citation (Douglas)   (Correct)

....during each iteration. Rüde and Stals [10, 11] and Douglas [3], however, use a simple change so that data passes through cache only once. They assume a rectangular grid and a 5 or 9 point stencil. Tiling is a software technique which is often used to maintain data in cache for as long as possible [9]. Tiling is not able to handle even the red black case on a rectangular grid, much less dynamically changing data structures, such as those encountered in adaptively chosen, unstructured grids. Our goal is to develop a variant of the Gauss-Seidel method for second order elliptic partial ....

....of the much smaller L2 cache. We assume that half of the L2 cache is available for program execution and that other processes (e.g. the operating system ones in particular) use the rest. This assumption came from a study of many different processors and operating systems while the research for [9] was being done. 3. Motivation from Structured Grids. Now assume a 5 point discretization on a rectangular mesh. The same ideas apply to a 9 point stencil. Suppose that an m × n block of nodes fits in cache. All nodes in cache get one update. We then shrink the block we are computing on by ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, Thread scheduling for cache locality, in Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, 1996, ACM, pp. 60--73.


Maximizing Cache Memory Usage for Multigrid Algorithms - Douglas, Hu, Iskandarani, .. (1999)   (3 citations)  Self-citation (Douglas)   (Correct)

....interfere with compiler optimizations. Due to the requirements about loop variable values at any given moment in the computation, compilers are not allowed to fuse nested loops into a single loop. In part, it is due to coding styles that make very high level code optimization (nearly) impossible [11]. Before transforming the standard Gauss Seidel algorithms into cache aware versions, let us define two operations for updating the approximate solution on either one or two rows of a grid: Update( row, color [ direction] and UpdateRedBlack( row, direction] We implicitly assume that ....

....subdomains ( ij must be further refined in order to use as large of subdomains as possible each iteration of the relaxation algorithm. The last iteration of the relaxation algorithm must be treated differently due to the projection or interpolation steps that must be done. As is noted in [11], only 50-60% of the cache is actually available for use by a given program. This is a side effect of multitasking operating systems. Determining the sizes of the ( ij s per iteration can be done as a preprocessing step and is inexpensive. In order to efficiently do loop unrolling and/or ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, pages 60-73, Cambridge, MA, 1996. ACM.


Cache Optimization for Structured and Unstructured.. - Douglas, Hu.. (1999)   (19 citations)  Self-citation (Douglas)   (Correct)

....interfere with compiler optimizations. Due to the requirements about loop variable values at any given moment in the computation, compilers are not allowed to fuse nested loops into a single loop. In part, it is due to coding styles that make very high level code optimization (nearly) impossible [12]. We note that both the memory bandwidth (the maximum speed that blocks of data can be moved in a sustained manner) as well as memory latency (the time it takes to move the first word(s) of data) contribute to the inability of codes to achieve anything close to peak performance. If latency were the ....

J. Philbin and J. Edler and O. J. Anshus and C. C. Douglas and K. Li, Thread scheduling for cache locality, in Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, 1996, ACM, pp. 60--73.


A Guide To Designing Cache Aware Multigrid Algorithms - Douglas, Rüde, al. (1998)   (3 citations)  Self-citation (Douglas)   (Correct)

....with compiler optimizations. Due to the requirements about loop variable values at any given moment in the computation, compilers are not always free to fuse nested loops into a single loop. In part, it is due to coding styles that make very high level code optimization (nearly) impossible [10]. The easiest way to hand tile a multigrid algorithm would be to use a domain decomposition preconditioner with local multigrid solvers to form a block method. We do not do this in this paper since, when done naively, the convergence rate drops to a standard domain decomposition one rather than ....

....through cache once. Red black ordered Gauss-Seidel implementations have certain properties. In the standard implementation, data passes through cache 2m times. Further, no compiler on the market automatically tiles even Laplace's equation on a square with a uniform mesh and a five point operator [10]. In the tiled implementation, data passes through cache once. We now define a tiled version of the naturally ordered Gauss-Seidel. We have to assume that m − 1 rows of an N × N grid G fit entirely into cache simultaneously and that m . Do it = 0; m − 1 Do i = 1; − it ....

PHILBIN, J., EDLER, J., ANSHUS, O.J., DOUGLAS, C.C., and LI, K.: "Thread scheduling for cache locality", in Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, 1996, ACM, pp. 60--73.


Minimizing Memory Cache Usage For Multigrid Algorithms In Two.. - Douglas (1997)   Self-citation (Douglas)   (Correct)

.... [Fig. 7. Five point discretization, point relaxation] .... large of subdomains as possible each iteration of the relaxation algorithm. The last iteration of the relaxation algorithm must be treated differently due to the projection or interpolation steps that must be done. As is noted in [7], only 50-60% of the cache is actually available for use by a given program. This is a side effect of multitasking operating systems. Determining the sizes of the Omega ( ij s per iteration can be done as a preprocessing step and is inexpensive. Three steps occur during the pre correction ....

....space preconditioner for a relaxation method. The computational subdomains Omega ( ij can certainly be chosen to utilize a block solver. Unstructured grids would appear at first to be nearly unworkable with the algorithms developed in §§2-3. Using the light weight threads package developed in [7] with some modifications, this class of problems might be conquered. The light weight threads package requires only data address hints. It automatically chooses a cache aware tiling for a decomposed set of computations. An inexpensive method (hashing) determines the cache blocks. In essence, this ....

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, Thread scheduling for cache locality, in Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, 1996, ACM, pp. 60--73.


Exploiting Cache Locality At Run-Time - Yan (1998)   (Correct)

No context found.

J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li, "Thread Scheduling for Cache Locality," The Proceedings of Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 60--71, Oct. 1996.


Data Cache Optimization in Multimedia Applications - Molnos Heijligers Cotofana   (Correct)

No context found.

J. Philbin, J. Edler, O. Anshus, C. Douglas, and K. Li. Thread scheduling for cache locality. In Proc. of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, October 1996.


Memory Management for Networked Servers - Zhou (2000)   (Correct)

No context found.

James Philbin et al. Thread scheduling for cache locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60--71, Cambridge, Massachusetts, 1--5 October 1996. ACM Press.


Design and Performance of Scalable High-Performance Programmable.. - Wolf (2002)   (5 citations)  (Correct)

No context found.

James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Proc. of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, October 1996.


Obtaining Efficient Single-Processor Performance From.. - Lowenthal, Greene (1999)   (Correct)

No context found.

James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60--71, October 1996.
