J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In ICDE, 1991.
....DeWitt [17] and Rahm and Marek [21] describe how to account for current CPU utilization, memory usage, and I/O load to perform site selection and determine degree of declustering for hash joins. None of these previous schemes consider repartitioning the join operator during execution. Wolf et al. [31] and Lu and Tan [15] describe techniques to repartition a traditional hash join at one point in time: between the build and probe phases of the join. These techniques are specific to an implementation of the hash join operator, and do not react to external load and memory pressures. Moreover, the ....
J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In ICDE, 1991.
....streaming scenario. Work in [17] and [21] describes how to account for current CPU utilization, memory usage, and I/O load to perform site selection and determine degree of declustering for hash joins. None of these previous schemes consider repartitioning the join operator during execution. In [30] and [15] the authors describe techniques to repartition a traditional hash join at one point in time: between the build and probe phases of the join. These techniques are specific to an implementation of the hash join operator, and do not consider continual, on-the-fly repartitioning of ....
J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In ICDE, 1991.
....of tuples after the split phase. Secondly the join product skew, in which data skew occurs on the output tuples of a join. Nearly all research efforts have been spent on the problem of the redistribution skew. Two classes of load balancing strategies have been developed: • Prevention methods [36] [37]. The two relations to be joined are read from disk or sampled in order to build local statistics. Those local statistics are then analyzed globally and a new partitioning function is constructed to distribute the tuples uniformly. This method works good for hard data skew, but has the ....
J. Wolf D. Dias P. Yu and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Kobe, Japan, April 1991.
....for having approximately twice as much to read to partition Set2 at the node with tuple placement skew. A hash function that ignores the node but uses the page and slot identifier from the pointer should produce fairly uniformly sized partitions. Alternatively, a skew resistant join technique [KITS90, WOLF90, HUA91, WALT91, DEWI92b] might be used after producing Set1 tuples. Note that the Find children algorithm must be used to allow either of these techniques if Set2 is not an explicit extent. [Figure: time in seconds (sel Set2 = 0.50); legend includes Hybrid hash node pointer ....]
Joel Wolf, Daniel Dias, Philip Yu, and John Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Technical Report RC 15510, 1990.
....2.2. RELATED PARALLELIZATION WORK Our parallelization work uses both transformations and pointer based join techniques. We discuss the related work for each in turn. 2.2.1. Related Transformation Work An enormous amount of work has been done on parallelizing relational queries (e.g. [GERB86, SCHN89, GRAE90, KITS90, WOLF90, HUA91, WALT91, DEWI92a]). Other work has been on parallelizing loops in FORTRAN (e.g. [PADU86, WOLF86, WOLF89]) and in LISP [LARU89]. All this work makes extensive use of program transformations. [HART88, HART89] discuss their parallelizing compiler for FAD, a functional DBPL. They use analysis to determine if a ....
....having approximately twice as much to read in order to partition Set2 at the node with tuple placement skew. A hash function that ignores the node but uses the page and slot identifier from the pointer should produce fairly uniformly sized partitions. Alternatively, a skew resistant join technique [KITS90, WOLF90, HUA91, WALT91, DEWI92b] might be used after producing Set1 tuples. Note that the Find children algorithm must be used to allow either of these techniques if Set2 is not an explicit extent. 6.3.7.4. Speedup and Scaleup Next, we compared speedups for the algorithms; we varied the number of nodes, but kept the same ....
Joel Wolf, Daniel Dias, Philip Yu, and John Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Technical Report RC 15510, 1990.
....and network traffic based on run time trace information, has been also disregarded. On the other hand, because the data parallelism exploited by hash based parallel join algorithms is sensitive to the skew of the input data distribution, a number of skew handling methods have been also proposed [13, 17, 10]. However, these techniques are not yet implemented nor considered in commercial products. Moreover, except for [6] almost all evaluations of the effect of data skew and skew handling in parallel hash join operations have been done by simulation, not on real database machines. This paper ....
J.L. Wolf, D.M. Dias, P.S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew, In proc. of DE, pp. 200--209, 1991.
....the access methods used to read the relation partitions to be communicated. In fact, basically these operators only carry an estimation of the communication cost. Integration of run time control mechanisms: Query processing often has to face important run time problems. Among them, load imbalance [21, 22] and run time adaption of static parallelization strategies [23, 24] are especially crucial. In order to ensure good performances, such situations must be quickly detected and corrected. This requires to introduce at compile time, in the PEP itself, control mechanisms [24]. Thus, in XPRS, a ....
J. Wolf D. Dias P. Yu and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Kobe, Japan, April 1991.
....on the where clause, redistribution skew where redistribution causes different number of tuples to be processed at each node, and join product skew where the join selectivity is different across the nodes i.e. the result sizes are different across the nodes. The algorithm of Wolf et al. [WDYT90] analyzes the base relations by doing an initial scan on them. This information is used in the actual repartitioning of the relations. Using an analytical model to compare the scheduling hash join algorithm of [WDYT90] and the hybrid hash join algorithm of Gamma [DG85, SD89, DGS 90] Walton ....
....result sizes are different across the nodes. The algorithm of Wolf et al. [WDYT90] analyzes the base relations by doing an initial scan on them. This information is used in the actual repartitioning of the relations. Using an analytical model to compare the scheduling hash join algorithm of [WDYT90] and the hybrid hash join algorithm of Gamma [DG85, SD89, DGS 90] Walton et al. conclude that scheduling hash effectively handles redistribution skew while hybrid hash degrades and eventually becomes worse than scheduling hash as redistribution skew increases. However, unless the join is ....
Joel L. Wolf, Daniel M. Dias, Philip S. Yu, and John J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Tech Report RC 15510, 1990.
....The problem of data skew: During query processing, the data of a same relation can be non uniformly distributed over the processors (data skew). In presence of such a phenomenon, ensuring good performances requires triggering explicit strategies of load balancing. Indeed experiments [YT91] have shown that data skew could make computation times explode. A lot of strategies of parallel join processing in presence of data skew have been proposed (see the survey guished : left linear trees (based on a data parallel strategy) and right linear trees (more pipelined oriented). For more ....
....on these algorithms can suffer from two kinds of data skew: redistribution skew, in which the processors receive different numbers of tuples after the split phase; join product skew, in which data skew occurs on the output tuples of a join. If the redistribution skew has been intensively studied [YT91, LT94], only DeWitt et al. considered the join product skew [NS92]. They showed that it might have the same slow down effects as the redistribution skew. In order to smooth join product skew situations, they generate in the split phase more partitions than processors. Even attenuated, join product skew ....
[Article contains additional citation context not shown here]
J. Wolf D. Dias P. Yu and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Kobe, Japan, April 1991.
....index. The Tandem group [EGKS90] state that their NonStop SQL system uses a variant of parallel nested loops with index algorithm if the appropriate indices exist and one of the relations is small, and hashing followed by sort merge otherwise, but did not compare the two algorithms. Wolf et al. [WDYT90, WDY90] consider the performance of parallel hashing and sort merge algorithms in the presence of skew, but do not consider parallel nested loop algorithms. In early work, Blasgen and Eswaran [BE77] compared sort merge with nested loops with index on uniprocessors, and concluded that sort merge ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Tech Report RC 15510, 1990.
....also ventured a guess that relations in the future would be relatively small compared to the memory available (i.e. larger by only a small multiple). It is also noteworthy that the models he used did not include overlapped I/O and processor time. Wolf, Dias, Yu, and Turek have published a paper [23] extending their complex iterative algorithm of balancing the load of skewed joins in shared nothing environments from sort merge joins to hash based joins. DeWitt et al. [7] addressed the issue of joins where almost the entire relations fit into memory. They were surprised to find that hash based ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek, "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew", Proc. Seventh International Conference on Data Engineering, Kobe, Japan, 1991.
....relations can result in significant imbalance in the work assigned to the different processors. This imbalance can cause significant degradation in the overall response time for the join operation and, as a result, there has been a lot of interest in skew handling algorithms for parallel joins [8, 9, 16, 17]. One can try to handle this skew by appropriately balancing the distribution of the skewed values among the join processors. But in cases of severe skew, no single join processor can be assigned the joining subtask corresponding to a single value and splitting, i.e. the participation of more ....
....[5] The problem can be generalized to allow any task to be split into multiple parts. In our work, we do not impose any constraints on the size or the total number of fragments. This is a realistic scenario for the join load balancing problem. The work on load balancing algorithms by Wolf et al. [16, 17] is relevant to our work. Their method uses only part of the histogram for the input relations initially but requires the complete histogram to make progressive refinements. Their algorithms are centralized in nature. Load balancing for the hash based join operation is addressed by Roy and ....
J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In Proceedings of IEEE Data Engineering Conference, pages 200-- 209. IEEE Computer Society, 1991.
....from our analytical cost model while varying several parameter values in section 6. Finally, we discuss our conclusions in section 7. 2 Previous work Because the join is the most frequently used and the most time consuming operation in a relational database system, there have been a lot of works [2, 3, 4, 7, 8, 9] dealing with parallel join algorithms in order to develop efficient join algorithms under several multiprocessor architectures. Those works can be roughly divided into two groups; one group of works deals with the basic parallel join methods and their performance, and another group of works ....
....[7] DeWitt et al. investigated the performance of four parallel hash join algorithms while implementing them on GAMMA which is a shared nothing parallel database machine. Omiecinski proposed a load balancing join algorithm for a shared memory multiprocessor and analyzed its performance in [4]. In [8, 9], Wolf et al. considered the performance of the parallel hash join and sort merge join algorithms in the presence of data skew. To the best of our knowledge, only [1, 6] deal with index based join processing in a parallel database environment. In [6] Omiecinski suggested heuristic algorithms ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, 1991.
....[1] have shown that load imbalance could make performances fall down. Load balancing has been intensively studied among relational operations (see the survey of Mishra and Eich in [7]). In the last years, join algorithms including dynamic load balancing capabilities have been proposed [8][9][10][11][12]. However for whole relational queries it is still an open research problem [2][9]. In this paper we propose an original dynamic load balancing technique based on an elastic approach, which will interfere each time data skew appears. The paper is organized as follows. Section 2 analyzes ....
.... which the tuples of the relations are hash partitioned on the available processors (split phase) before a local join operator (nested loop, hash join, or a sort merge join [24]) is applied (join phase). Two great classes of load balancing strategies have been developed: • Prevention Methods [11] [12]. The two relations to be joined are read from disk or sampled in order to build local statistics. Those local statistics are then analyzed globally and a new partitioning function is constructed to distribute the tuples uniformly. This method works good for hard data skew, but has the ....
Yu P. Wolf J., Dias D. and Turek J. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Kobe, Japan, April 1991.
....algorithm on data that is not skewed. For example, most of the previously proposed skew handling algorithms require that the relations to be joined are completely scanned before the join begins [HL91, WDYT90, KO90]. Since the time to perform a parallel hash join is a small multiple of the time required to scan the two relations being joined, this can represent a substantial overhead, which is unacceptable for anything but extremely skewed data. Since there is little or no empirical evidence that extreme ....
....has been computed, any task scheduling algorithm can be used to try to equalize the times required by the virtual processor partitions allocated to the physical processors. We used the heuristic scheduling algorithm known as LPT [Gra69]. This approach is similar to that used by Wolf et al. [WDYT90] in scheduling hash partitions, although in that paper the statistics used to schedule these partitions are gained by a complete scan of both relations rather than by sampling, and hash partitioning is used instead of range partitioning. 2.3 Algorithm Description The algorithms that we ....
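The LPT (longest processing time first) heuristic mentioned in the excerpt above can be sketched as follows. This is a minimal illustration, not the code of either cited paper: the function name lpt_schedule, the cost list, and the processor count are hypothetical, and partition costs are assumed to be known up front (from a scan or a sample).

```python
import heapq


def lpt_schedule(partition_costs, num_processors):
    """Greedy LPT: take partitions in decreasing cost order and give each
    one to the processor that currently has the smallest total load.

    partition_costs: estimated cost per partition (assumed known in advance).
    Returns (assignment, loads), where assignment maps a processor index to
    the list of partition indices placed on it.
    """
    # Min-heap of (current load, processor index).
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}

    # Largest partitions first; ties keep their original order.
    ordered = sorted(enumerate(partition_costs), key=lambda item: -item[1])
    for partition_id, cost in ordered:
        load, proc = heapq.heappop(heap)
        assignment[proc].append(partition_id)
        heapq.heappush(heap, (load + cost, proc))

    loads = [sum(partition_costs[i] for i in assignment[p])
             for p in range(num_processors)]
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical costs for six hash partitions on two processors.
    assignment, loads = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
    print(assignment, loads)
```

Creating more partitions than processors gives the heuristic more freedom to even out the loads, which matches the "virtual processor partitions" idea in the excerpt.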
[Article contains additional citation context not shown here]
Joel L. Wolf, Daniel M. Dias, Philip S. Yu, and John J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Tech Report RC 15510, 1990.
....in the dataset, and partition skew, which occurs in parallel machines when the load is not balanced between the nodes. Different kinds of partition skew can be classified as initial tuple placement skew, selectivity skew, redistribution skew, and join product skew. The algorithm of Wolf et al. [WDYT90] analyzes the base relations by doing an initial scan on them. This information is used in the actual repartitioning of the relations. Using an analytical model to compare the scheduling hash join algorithm of [WDYT90] and the hybrid hash join algorithm of Gamma [DG85, SD89, DGS 90] Walton ....
....skew, redistribution skew, and join product skew. The algorithm of Wolf et al. [WDYT90] analyzes the base relations by doing an initial scan on them. This information is used in the actual repartitioning of the relations. Using an analytical model to compare the scheduling hash join algorithm of [WDYT90] and the hybrid hash join algorithm of Gamma [DG85, SD89, DGS 90] Walton et al. conclude that scheduling hash effectively handles redistribution skew while hybrid hash degrades and eventually becomes worse than scheduling hash as redistribution skew increases. However, unless the join is ....
J. L. Wolf, D. M. Dias, et al. An effective algorithm for parallelizing hash joins in the presence of data skew. IBM T. J. Watson Research Center Tech Report RC 15510, 1990.
....algorithms on a hypercube and obtained results similar to the current MasPar implementation. The hash algorithms, besides being fast, are easily parallelized. To use hashing as a preprocessing stage for join, is a relatively new concept and as such there is much ongoing research in this area [TOY94, MS94, ME92, LY90, WDYTu91]. The future work in this area will consist of making even the hash stage parallel. One possible method would be to split the relation randomly into various processing elements. Then the hashing could be performed in parallel on these horizontal fragments. This concept incorporates the idea of ....
J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew, Proceedings of the 2nd International Conference on Data Engineering, 200-209, 1991.
....each processor is performing essentially the same task over different data, it is important that the division of data among the processors be as close to even as possible. Overcoming data skew has been studied extensively and the various approaches are well documented in the literature [WDJ91] [WDY91] [DNS92] [HLY93]. [RM95] proposes dynamic load balancing schemes for parallel join processing and compares them to static algorithms in a simulated multi user environment. The dynamic load balancing techniques adjust the degree of join parallelism based on several factors, such as disk I/O, CPU ....
J.L. Wolf, D.M. Dias, P.S. Yu, J. Turek, "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew," Proc. 7th IEEE Data Engineering Conf. (1991), pp. 200-209.
No context found.
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. Proceedings of the 7th International Conference on Data Engineering, pages 200--209, April 1991.
....all tuples in a relation and that the values of one attribute are independent of those in another. Thus, the cardinalities of resulting relations of joins can be estimated according to the formula used in prior work [7] that is given in the Appendix for reference (see footnote 3). In the presence of data skew [33], we only have to modify the corresponding formula accordingly [10]. 3 Using Hash Filters for a Bushy Execution Tree In this section, we shall first evaluate the effect of hash filters and then propose a scheme to derive hash filters for a bushy execution tree. Footnote 3: Note that this formula offers ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. Proceedings of the 7th International Conference on Data Engineering, pages 200--209, April 1991.
....are independent of those in another. Thus, when the heuristics derived in Section 3.2 are applied, the cardinalities of resulting relations of joins can be estimated according to the formula used in prior work [3] that is given in the Appendix for reference (see footnote 2). In the presence of data skew [25], we only have to modify the corresponding formula accordingly [8]. For ease of exposing the concept of segmented right deep trees, we assume the aggregate memory in the system can accommodate a few entire relations for pipelining. Note that in the case that the aggregate memory is not .... Footnote 2: This ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. Proceedings of the 7th Intern'l Conf. on Data Engineering, pages 200-- 209, April 1991.
....1 Introduction: There has been a good deal of progress recently towards the efficient parallelization of the component phases of single queries in multiprocessor database systems. For example, [1, 2] have designed algorithms for handling the merge phase of a sort query in parallel. Similarly, [3, 4, 5, 6, 7, 8] have designed algorithms for effectively parallelizing the join phase of either a sort merge join or a hash join, even in the presence of skewed data [9, 10]. Other queries, such as scans, and other phases of sort queries and join queries are actually easier to parallelize effectively. The above ....
....probabilities and have found similar results to those presented here in all cases. Our database of randomly generated query trees contains over 2000 entries. We model the task execution time functions as follows: Given the work of [1, 2] on parallelizing the merge phase of a sort query and that of [3, 4, 5, 6, 7] on parallelizing the join phase of either a sort merge or hash join, it is not unreasonable to expect speedups which are close to linear. This would argue for a hyperbolic task execution time function of the form t(p) = A/p (and a scheduling problem which is trivial). However, it is realistic to ....
J. Wolf, D. Dias, P. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. In Proceedings of the 7th International Conference on Data Engineering, Kobe, Japan, pages 200--209, April 1991.
....of attributes are uniformly distributed over all tuples in a relation and that the values of one attribute are independent of those in another. Thus, the cardinalities of resulting relations of joins can be estimated according to the formula used in prior work [4]. In the presence of data skew [26], we only have to modify the corresponding formula accordingly [9]. 3 Using Hash Filters for a Bushy Tree In this section, we shall first evaluate the effect of hash filters and then propose a scheme to derive hash filters for a bushy execution tree. 3.1 The Effect of Hash Filters Let HFR i ....
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. Proceedings of the 7th International Conference on Data Engineering, pages 200--209, April 1991.
No context found.
J. L. Wolf, D. M. Dias, P. S. Yu, J. Turek, "An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew", IEEE Int. Conf. on Data Engineering, Kobe, Japan, April 1991.