| S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In 10th World Wide Web, pages 396--406, 2001. |
....where Bloom filters are used to decrease the cost of intersecting lists of postings over the network, though this only improves results by a constant factor. We investigate in Subsection 4.1 how recent results on top-k queries in the database literature [18] can be applied to our scenario to .... Footnotes: These two organizations are also sometimes referred to as vertical and horizontal index partitioning [32]; we avoid these terms here as they tend to lead to confusion with standard database terminology. This problem is also known as the database selection problem in the meta search community [33].
.... execution in IR and search engines, we refer to [3, 5, 51], and for issues in parallel search engine architecture we refer to [7, 8, 28, 41]. Discussions and comparisons of local and global index partitioning schemes and the resulting query performance on parallel architectures are given, e.g., in [4, 12, 25, 31, 32, 48]. There has been a lot of recent interest in the pruning techniques of Fagin et al. [17, 19]; see also [18] for a survey and [13] for early related ideas. Most of the interest has been focused on multimedia and meta search scenarios, and we are not aware of previous applications in a peer-to-peer ....
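The Bloom filter idea mentioned in the excerpt above can be illustrated with a minimal sketch. It is not taken from the cited papers: the class `BloomFilter`, the function `intersect_via_bloom`, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration, and the two "nodes" are simulated inside one process. One node summarizes its postings list in a small bit vector, the other node uses it to discard most non-matching document IDs before anything larger crosses the network, and a final exact check removes false positives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over hashable items (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def intersect_via_bloom(postings_a, postings_b, num_bits=1024, num_hashes=3):
    """Intersect two postings lists held by two (simulated) nodes."""
    # Node A: summarize its list in a Bloom filter and "send" the filter.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in postings_a:
        bloom.add(doc_id)
    # Node B: keep only doc IDs that may be in A's list (false positives possible).
    candidates = [d for d in postings_b if d in bloom]
    # Node A: an exact membership check removes any false positives.
    set_a = set(postings_a)
    return sorted(d for d in candidates if d in set_a)
```

Only the filter and the candidate list would travel between nodes in this sketch, which is where the saving comes from; the final exact check keeps the result correct regardless of the filter's false-positive rate.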
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
....representation is constructed. This forward index is then compared with the forward index of the old version stored by the system. Note how the location IDs of the words change. .... word in the collection has to be scanned to construct the inverted file. A distributed rebuilding technique such as [17] parallelizes (and pipelines) the process, but does not eliminate the need to scan every word in every document. Scanning and re-indexing the words in documents that did not change is very wasteful. Rebuilding is preferable only if (1) there are very few common documents ( f [ is small) ....
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. Proceedings of the 10th Intl. WWW Conf., 2001.
....node builds a complete index on its own subset of documents (used by AltaVista and Inktomi) or (b) a global index organization where each node contains complete inverted lists for a subset of the words. Each scheme has advantages and disadvantages that we do not have space to discuss here; see [4, 28, 37]. In this paper, we assume a local index organization, although some of the ideas also apply to a global index. Thus, we have a number of machines, in our case , each containing an index on a subset of the documents. Another machine acts as a query integrator that receives queries from the ....
.... For background on indexing and query execution in IR and search engines, we refer to [3, 5, 40], and for basics of parallel search engine architecture we refer to [7, 8, 26, 34]. Discussions and comparisons of local and global index partitioning schemes and their performance are given, e.g., in [4, 12, 23, 28, 37]. A large amount of recent work has focused on link-based ranking and analysis schemes; see [6, 22, 24, 25, 31, 33] for a small sample. Previous work on pruning techniques for top-k queries can be divided into two fairly disjoint sets of literature. In the IR community, researchers have studied pruning ....
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
....Internet Archive [31] and the private repositories of commercial search engines like Google and Altavista [22, 4]. Web repositories maintain several indexes over the pages in their collection. Typically, this would include an index over the textual content of the pages (such as an inverted index [18]), indexes over simple properties of pages such as length, natural language, or page topic, and an index over the link graph structure of the Web (hereafter referred to as the Web graph). Using the pages and these basic indexes, repositories often build more advanced application-specific indexes ....
....one page in S1 and one page in S2. Rank each page in R by the number of incoming links from S1 and S2 and output R in descending order by rank. Intersection of out-neighborhoods of two different sets of pages. [Table 3: Some of the queries used in the experiments] .... Stanford WebBase repository [18, 9]. However, unlike the graph representations which were stored on local disks, these indexes were available on other machines and had to be accessed remotely. As a result, execution times for the entire query would be influenced by factors such as network latency, organization and size of these ....
Sergey Melnik et al. Building a distributed full-text index for the Web. In Proc. of the 10th Intl. World Wide Web Conf. (WWW10), pages 396--406, May 2001.
....in-memory lists, which are periodically merged with the lists on disk based on certain policies. 2. Storing the posting lists in the leaf nodes in the vocabulary index. This is more suitable in the case of dynamic databases. One example of using this approach is presented by Melnik et al. [15], in the context of a distributed indexing system based on inverted files. Similar to our approach, they also store postings in chunks in a B-tree index based on Berkeley DB. Combinations of these alternatives are also possible. In general, the vocabulary index is a multi tree variant (usually ....
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of WWW10, 2001.
....Design possibilities and tradeoffs for a repository of web pages are covered by Hirai, et al. [13]. Bharat, et al. [4] describe their experiences in building a fast web page linkage connectivity server. Different architectures for distributed inverted indexing schemes are discussed by Melnik, et al. [24] and Ribeiro Neto, et al. [31]. In contrast, this paper primarily focuses on design and implementation details and considerations of a comprehensive and extensible search engine prototype that implements analogs or derivatives of many individual functions discussed in the mentioned papers, as well ....
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the Web. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001.
....in-memory lists, which periodically, based on certain policies, are merged with the lists on disk. 2. Storing the posting lists in the leaf nodes in the vocabulary index. This is more suitable in the case of dynamic databases. One example of using this approach is presented by Melnik et al. [15], who describe a distributed indexing system based on inverted files. Similar to our approach, they also store postings in chunks in a B-tree index based on Berkeley DB. Combinations of these alternatives are also possible. In general, the vocabulary index is a multi tree variant (usually ....
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of WWW10, 2001.
....Work. There has been a substantial amount of recent research on the design and efficient implementation of various features of web search engines. In particular, Ribeiro Neto, et al. [20] described an inverted index construction scheme carefully optimized for clustered execution. Melnik, et al. [18] proposed a distributed index construction method, which applied an explicit disk, CPU, and network communication pipeline. Hirai, et al. [14] analyzed different design aspects of a repository system for generic web page storage, update, and retrieval. Haveliwala [12] described a disk-efficient ....
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the Web. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001.
....IMS of another type. The main advantage of this approach is that the top level IMS can leverage the facilities of the underlying IMS (e.g. concurrency control, recovery, caching, index structures, etc.) without significant .... Figure 1: Common integration architectures. Layering: [8, 15, 41, 46, 9, 47, 20, 38, 39, 57, 19, 42, 23, 56, 60, 12, 29, 54, 27]. Loose Coupling (or) Middleware Integration: [40, 62, 25, 26, 16, 4, 45, 13]. Extension: [44, 31, 33, 18, 17, 35, 20, 22, 58, 51, 21, 65, 53, 61, 37, 35, 32, 30, 24, 5, 36, 28, 14, 55, 43, 11]. Text Retrieval ....
....object database systems [9, 19, 57]. Since the object data model natively supports nesting, in addition to collection types and sets, we expect that systems for content-based retrieval of structured documents could be effectively implemented on top of OODB systems. More recently, Melnik et al. [47] demonstrate the use of an embedded database system (such as Berkeley DB [50]) to store and manage inverted indexes. The performance results in [47] indicate that for application scenarios when a full-fledged heavyweight client-server database is not necessary, an embedded DBMS can act as an ....
[Article contains additional citation context not shown here]
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of the 10th Intl. World Wide Web Conf. (WWW10), pages 396--406, May 2001.
....Section 1) periodic crawling and rebuilding of the index is necessary to maintain freshness. Index rebuilds become necessary because most incremental index update techniques perform poorly when confronted with the huge wholesale changes commonly observed between successive crawls of the Web [46]. Finally, storage formats for the inverted index must be carefully designed. A small compressed index improves query performance by allowing large portions of the index to be cached in memory. However, there is a tradeoff between this performance gain and the corresponding decompression overhead ....
....related statistics, and updates the statistics as summaries for a term are received. At the end of the processing phase, the statistician sorts the statistics in memory and sends them back to the indexers. Figure 12 illustrates the FL strategy for collecting page frequency statistics. Reference [46] presents and analyzes experiments that compare the relative overhead of the two strategies for different collection sizes. Their studies show that the relative overheads of both strategies are acceptably small (less than 5% for a 2 million page collection) and exhibit sub-linear growth with ....
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of the Tenth International World-Wide Web Conference, 2001.
....A and B, A could store the inverted lists for all index terms that begin with characters in the ranges [a-q] whereas B could store the inverted lists for the remaining index terms. Therefore, a search query that asks for pages containing the term process would only involve A. Reference [45] describes certain important characteristics of the IF L strategy, such as resilience to node failures and reduced network load, that make this organization ideal for the Web search environment. Performance studies in [56] also indicate that IF L organization uses system resources effectively and ....
....processing, and flushing that transform some set of pages into a sorted run. Let $B_i$, $i = 1 \ldots N$, be the buffer sizes used during these $N$ executions. The sum $\sum_i B_i = B_{total}$ is fixed for a given amount of input and represents the total size of all the postings extracted from the pages. In [45], we show that for an indexer with a single resource of each type (single CPU, single disk, and a single network connection over which to receive the pages), optimal speedup of the pipeline is achieved when the buffer sizes are identical in all executions of the pipeline, i.e. $B = B_1 = \ldots = B_N =$ ....
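The difference between the two index organizations described in the excerpts on this page can be sketched in a few lines. The sketch rests on assumptions: the node names, the `ranges` mapping, and both function names are hypothetical, and terms are routed purely by first letter as in the [a-q] example above. With term-range (global) partitioning, a query touches only the nodes that own its terms; with a local organization, every node indexes its own documents, so every node must be asked.

```python
def nodes_for_global_index(query_terms, ranges):
    """Nodes holding the inverted lists for the query terms.

    ranges maps a node name to an inclusive (low, high) first-letter range.
    """
    nodes = set()
    for term in query_terms:
        first = term[0].lower()
        for node, (low, high) in ranges.items():
            if low <= first <= high:
                nodes.add(node)
    return nodes


def nodes_for_local_index(query_terms, all_nodes):
    """With a local index, each node may hold matching documents."""
    return set(all_nodes)


# Hypothetical two-node layout matching the [a-q] example.
ranges = {"A": ("a", "q"), "B": ("r", "z")}
print(nodes_for_global_index(["process"], ranges))
print(nodes_for_local_index(["process"], ranges.keys()))
```

In this layout the single-term query for "process" is routed to node A alone under term-range partitioning, while the local organization contacts both nodes and merges their partial results.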
[Article contains additional citation context not shown here]
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the web. Technical Report SIDL-WP-2000-0140, Stanford Digital Library Project, Computer Science Department, Stanford University, July 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.
....loading, flushing, and processing phases for the $i$-th execution of the pipeline. For large $N$, the overall indexing time is determined by the scarcest resource (the CPU, in Figure 3) and can be approximated by $T_p = \max\{\sum_{i=1}^{N} l_i, \sum_{i=1}^{N} p_i, \sum_{i=1}^{N} f_i\}$. It can be shown (see [13]) that $T_p$ is minimized when all $N$ pipeline executions use the same buffer size $B$, where $B = B_1 = \ldots = B_N = B_{total}/N$. Let $l = \lambda B$, $f = \varphi B$, and $p = \delta B + \sigma B \log B$ be the durations of the loading, flushing, and processing phases, respectively. We must choose a value of $B$ that maximizes the speedup ....
....ratio to 3:3:2, so that the speedup is $2\tfrac{2}{3} \approx 2.67$. In general, setting $\delta B + \sigma B \log B = \max\{\lambda B, \varphi B\}$, we obtain $B = 2^{(\max\{\lambda, \varphi\} - \delta)/\sigma}$ (1). This expression represents the size of the postings buffer that must be used to maximize the pipeline speedup, on an indexer with a single resource of each type. In [13] we generalize equation 1 to handle indexers with multiple CPUs and disks. If we use a buffer of size less than the one specified by equation 1, loading or flushing (depending on the relative magnitudes of $\lambda$ and $\varphi$) will be the bottleneck and the processing phase will be forced to periodically wait ....
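The buffer-size expression in the excerpt above can be checked numerically with a short sketch. It assumes the cost model reads as loading time proportional to the buffer size B, flushing time proportional to B, and processing time of the form (linear term + B log B term), with a base-2 logarithm. The coefficient names `lam`, `phi`, `delta`, and `sigma` and all function names are illustrative assumptions, and the numeric values are made up so that the phases come out in the 3:3:2 ratio mentioned in the excerpt.

```python
import math


def phase_times(B, lam, phi, delta, sigma):
    """Return (load, process, flush) durations for a buffer of size B."""
    load = lam * B
    process = delta * B + sigma * B * math.log2(B)
    flush = phi * B
    return load, process, flush


def optimal_buffer_size(lam, phi, delta, sigma):
    """Buffer size at which processing takes as long as the slower of
    loading and flushing (equation 1 under the assumed cost model)."""
    return 2 ** ((max(lam, phi) - delta) / sigma)


def pipeline_speedup(load, process, flush):
    """Sequential time divided by the time of a fully overlapped pipeline,
    which for many executions is bounded by the slowest phase."""
    return (load + process + flush) / max(load, process, flush)


# Made-up coefficients: loading 3 units per posting, flushing 2,
# processing 1 + 0.25 * log2(B).
B = optimal_buffer_size(lam=3.0, phi=2.0, delta=1.0, sigma=0.25)
load, process, flush = phase_times(B, 3.0, 2.0, 1.0, 0.25)
print(B, load, process, flush, pipeline_speedup(load, process, flush))
```

With these coefficients the balanced buffer holds 256 postings, loading and processing each take 768 units and flushing 512, which is the 3:3:2 ratio and a speedup of 8/3. A smaller buffer leaves the CPU waiting on loading; a larger one makes processing the bottleneck.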
[Article contains additional citation context not shown here]
S. Melnik et al. Building a distributed full-text index for the web. Technical report, Stanford Digital Library Project, July
....the loading, flushing, and processing phases for the $i$-th execution of the pipeline. For large $N$, the overall indexing time is determined by the scarcest resource (the CPU, in Figure 3) and can be approximated by $T_p = \max\{\sum_{i=1}^{N} l_i, \sum_{i=1}^{N} p_i, \sum_{i=1}^{N} f_i\}$. It can be shown (see [13]) that $T_p$ is minimized when all $N$ pipeline executions use the same buffer size $B$, where $B = B_1 = \ldots = B_N = B_{total}/N$. Let $l = \lambda B$, $f = \varphi B$, and $p = \delta B + \sigma B \log B$ be the durations of the loading, flushing, and processing phases, respectively. We must choose a value of $B$ that maximizes the speedup ....
....the ratio to 3:3:2, so that the speedup is $2\tfrac{2}{3} \approx 2.67$. In general, setting $\delta B + \sigma B \log B = \max\{\lambda B, \varphi B\}$, we obtain $B = 2^{(\max\{\lambda, \varphi\} - \delta)/\sigma}$ (1). This expression represents the size of the postings buffer that must be used to maximize the pipeline speedup, on an indexer with a single resource of each type. In [13] we generalize equation 1 to handle indexers with multiple CPUs and disks. If we use a buffer of size less than the one specified by equation 1, loading or flushing (depending on the relative magnitudes of $\lambda$ and $\varphi$) will be the bottleneck and the processing phase will be forced to periodically wait ....
[Article contains additional citation context not shown here]
S. Melnik et al. Building a distributed full-text index for the web. Technical report, Stanford Digital Library Project, July 2000. Available at www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In 10th World Wide Web, pages 396--406, 2001.
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In 10th World Wide Web, pages 396--406, 2001.
No context found.
Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19 (2001) 217--241
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a Distributed Full-Text Index for the Web. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, pages 396--406, 2001.
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
No context found.
Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19 (2001) 217--241
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. of the 10th Int. World Wide Web Conference, May 2000.
No context found.
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the web. pages 396--406, 2001.
No context found.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proceedings of WWW10, pages 396--406, 2001.
No context found.
Melnik, S., Raghavan, S., Yang, B., and Garcia-Molina, H., "Building a distributed full-text index for the Web," Proc. 10 Intl. World Wide Web Conf., 396-406, 2001. Available at http://dbpubs.stanford.edu/pub/2000-29
No context found.
Melnik S., Raghavan S., Yang B., Garcia-Molina H. "Building a Distributed Full-Text Index for the Web". 10th World Wide Web Conference, Hong Kong, 2001.
No context found.
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the web. Proceedings of the 10th Intl. WWW Conf., 2001.