49 citations found.
Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proc. ACM SIGMOD 99, 1999.

This paper is cited in the following contexts:
Extending SDARTS: Extracting Metadata from Web Databases .. - Ipeirotis, Barry.. (2002)

....contents are not HTML documents). To circumvent these problems and still be able to automatically build good quality content summaries, we resort to document sampling. A good quality content summary of a collection can be derived from a small, representative document sample from the collection [3]. Earlier research has shown that we can extract such a document sample with a relatively small number of query probes [3, 4, 12]. An approximate content summary can then be built from the documents that best match each query probe at the collection in question. Interestingly, the effectiveness of ....

....summaries, we resort to document sampling. A good quality content summary of a collection can be derived from a small, representative document sample from the collection [3]. Earlier research has shown that we can extract such a document sample with a relatively small number of query probes [3, 4, 12]. An approximate content summary can then be built from the documents that best match each query probe at the collection in question. Interestingly, the effectiveness of the best database selection algorithms does not suffer significantly from using approximate content summaries extracted in this ....

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999.


Query- vs. Crawling-based Classification of Searchable.. - Gravano, Ipeirotis.. (2002)

....web databases. The average number of queries sent to each database was 182, and no documents needed to be retrieved from the databases. Furthermore, the number of words per query ranged between just one and four words. Further details of our algorithm and evaluation are described in [8]. See [2, 3, 10, 6, 7] [Figure 2 plot: Specificity (0 to 1) across the categories Arts, Computers, Health, Science, and Sports for CNN Sports Illustrated, Johns Hopkins AIDS Service, Tom's Hardware Guide, Office of Scientific and Technical Information, and Duke University Rare Books] Figure 2: Distribution of documents in the ....

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999.


Query- vs. Crawling-based Classification - Of Searchable Web (2002)

....[Figure 2 plot labels: Johns Hopkins AIDS Service, Tom's Hardware Guide, Office of Scientific and Technical Information, Duke University Rare Books; axis: Specificity; categories: Arts, Computers, Health, Science, Sports] Figure 2: Distribution of documents in the top level categories for five searchable web databases. ....in [8]. See [2, 3, 10, 6, 7] for other related work relevant to database classification. As we discussed in the introduction, our technique can be also applied to the classification of any database that offers a search interface for its contents, no matter if its contents are hidden or not. 3.2 Crawling based ....

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999.


Automatic Classification of Text Databases through Query.. - Ipeirotis, Gravano, Sahami (2000)

....Manually constructed query probes have been used in [4] for the classification of text databases. Query probes were used in [7] to rank databases by similarity to a given query. This algorithm assumes that the query interface can handle differently normal queries and query probes. Reference [1] probes text databases with queries to determine an approximation of their vocabulary and associated statistics. This technique requires retrieving the documents in the query results for further analysis. Finally, guided query probing has been used in [13] to determine sources of heterogeneity in ....

....queries in such a way that we can use the inclusion exclusion principle to calculate the number of results that would have been returned for the original queries. A significant advantage of our probing approach is that we do not need to retrieve documents to analyze the contents of a database [1]. Instead, we count only the number of matches for these queries. Thus, in our approach we only require a database to report the number of matches for a given query. It is common for a database to return something like X documents found before returning the actual results. 3.3 Using Probing ....

[Article contains additional citation context not shown here]

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In SIGMOD 1999.


Distributed Search over the Hidden Web: Hierarchical.. - Ipeirotis, Gravano (2002)   (16 citations)

....hope that such a protocol will be adopted soon. Hence, other solutions are needed to automate the construction of content summaries from databases that cannot or are not willing to export such information. We review one such approach next. 2.2 Uniform Probing for Content Summary Callan et al. [4, 3] presented pioneer work on automatic extraction of document frequency statistics from uncooperative text databases that do not export such metadata. Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample, SampleDF ....

....or on metadata directly exported by the databases. Unfortunately, web accessible databases rarely export such metadata. Recently, Etzioni and Sugiura [31] proposed the QPilot technique, which uses query expansion to route queries to the appropriate search engines (Section 5.2). Callan et al. [3, 4] suggested using query probes to extract document samples from databases for content summary construction. Craswell et al. [7] compared the performance of flat database selection algorithms in the presence of such content summaries. Hawking and Thistlewaite [16] used query probing at query time to ....

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD'99, 1999.


SDLIP + STARTS = SDARTS A Protocol and Toolkit for.. - Green, Ipeirotis, Gravano (2001)

....administrator to write a metadata file (the meta attributes.xml file) with the information specified by STARTS XML. In the future, we could automatically generate at least an approximation of the content summaries by using the results of research on metadata extraction from uncooperative sources [4, 15]. We decided that the best way to make this wrapper configurable without additional Java coding was through the use of XSLT stylesheets and the starts intermediate format. We extended it to be able to describe CGI invocations HTTPRequest Apache Xalan XSL Processor STARTS XML www query ....

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD


Mining the Web to Create Minority Language Corpora - Ghani, Jones, Mladenic (2001)   (1 citation)

....the precision of our corpus. In general on the web we cannot assess coverage, though we could measure the rate at which we find new Slovenian documents as our experiments progress and a decreasing rate would give us a bound on the number of documents we can find using our methods. Callan et al. [3] and Ghani and Jones [7] use the measures percent vocabulary coverage and cumulative term frequency to evaluate the coverage of their language models, as they have access to the entire experimental corpus. Although our task is not to construct a language model for Slovenian and we do not have the ....

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the


Learning a Monolingual Language Model from a Multilingual Text .. - Ghani, Jones (2000)

....and access the database using the model. Various studies have been performed on the optimal choice of features that a good language model should contain, but in general a language model describes the words that occur in a database, and frequency information indicating how often each term occurs [2]. In natural language tasks, a language model is usually formulated as a probability distribution p(s) over strings s that attempts to reflect how frequently a string occurs in a language. The most widely used language models are n gram models. We construct unigram language models which assume ....

....to the WWW, and access through a search engine is time intensive, and every page on the WWW does not come labeled with the language the document was written in, we cannot apply traditional language modelling techniques to our database. Instead, we use the approach introduced by Callan et al. [2] which uses query based sampling to acquire language models from multiple databases. They are motivated by the fact that word occurrences follow a highly skewed distribution, with a few words occurring very often, and most words occurring rarely. In the light of evidence suggesting that the ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the


Obtaining Language Models of Web Collections Using.. - Monroe, French, Powell (2002)   (2 citations)

....to search by identifying those collections that are most likely to satisfy the information need [2,3,4,6,8,11,12,13,14]. Collection selection algorithms require information about the contents of the collections among which they are selecting. We will use the terminology of Callan et al. [1] and refer to the summary content information as a language model of the collection. For our purposes, a language model is simply a list of the words that occur in a collection and their frequency of occurrence. Because a completely accurate language model must include a representation of all of ....

....the need for cooperation from the other party. One such method is called query based sampling and is the technique used in the research presented in this paper. Query based sampling is a sampling technique in which metadata is inferred by interacting with each collection and observing the outcomes [1]. Previous research has been done on query based sampling to investigate the generality and behavior of the technique under a variety of conditions. Prior research by [1] demonstrated the technique's effectiveness at learning accurate language models for several research testbeds of varying ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. In Proc. ACM SIGMOD International Conference on Management of Data, pages 479--490, 1999.


MIND: An architecture for multimedia information retrieval.. - Nottelmann, Fuhr (2001)   (4 citations)

....# V (e.g. colour histograms) over a continuous domain $V$. It is not possible to store the sums for all feature vectors. Instead, vectors are clustered. Each cluster $V_j \subseteq V$ is described by its centroid $\bar{v}_j$, the radius and the number of vectors $|V_j|$ in it. Furthermore, let $f : V \times V \to [0, 1]$ define a retrieval metric for feature vectors. Then, the indexing weight sum can be estimated by $\sum_j |V_j| \cdot f(\bar{v}_j, \mathrm{value}(c_i))$ for condition $c_i$ (computed at runtime). All library specific information needed to compute the expected costs (except the function $f$ which will be coded in the ....

....also stored in the resource description. 3 Acquisition of resource descriptions The proxies cannot simply request the resource description from the non co operating libraries. For an environment in which the libraries only provide the query interface, query based sampling has been proposed in [1] as a solution to estimate document frequencies and indexing weights in text libraries. We will extend this technique for other media types, where feature vectors have to be extracted and clustered. For query transformation, schema mappings are required. We want MIND to learn schema mappings from ....

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the


Mining the Web to Create Minority Language Corpora - Ghani, Jones, Mladenic (2001)   (1 citation)

....the precision of our corpus. In general on the web we cannot assess coverage, though we could measure the rate at which we find new Slovenian documents as our experiments progress and a decreasing rate would give us a bound on the number of documents we can find using our methods. Callan et al. [3] and Ghani and Jones [7] use the measures percent vocabulary coverage and cumulative term frequency to evaluate the coverage of their language models, as they have access to the entire experimental corpus. Although our task is not to construct a language model for Slovenian and we do not have the ....

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the


Probe, Count, and Classify: Categorizing Hidden-Web Databases - Ipeirotis, Gravano, Sahami (2001)   (2 citations)

....procedure requires. Table 1 shows a sample of five databases from the Web set. 4.2 Techniques for Comparison We tested variations of our probing technique, which we refer to as Probe and Count, against two alternative strategies. The first one is an adaptation of the technique described in [2], which we refer to as Document Sampling. The second one is a method described in [29] that was specifically designed for database classification. We will refer to this method as Title based Querying. The methods are described in detail below. Probe and Count (PnC) This is our technique, ....

.... parameters that can be varied in our database classification technique are thresholds τ_ec (for coverage) and τ_es (for specificity). Different values for these thresholds result in different approximations Approximate(D) of the ideal classification Ideal(D). Document Sampling (DS): Callan et al. [2] use query probing to automatically construct a language model of a text database (i.e. to extract the vocabulary and associated word frequency statistics). Queries are sent to the database to retrieve a representative random document sample. The documents retrieved are analyzed to extract the ....

[Article contains additional citation context not shown here]

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, pages 479--490, 1999.


The Effects of Query-Based Sampling on Automatic.. - Callan, French.. (2000)   (4 citations)

....each database contains. This information is often derived from a unigram language model, which lists the words that occur in the database and their frequencies of occurrence. Two methods have been proposed for acquiring such metadata automatically: the STARTS protocol [8] and query based sampling [2]. STARTS is a cooperative protocol, in which all parties are trusted to exchange accurate metadata upon request. Query based sampling is a sampling technique in which metadata is inferred by interacting with each database and observing the outcomes. Query based sampling is a relatively new ....

....the outcomes. Query based sampling is a relatively new technique, hence little is known about its generality and behavior under a variety of conditions. Prior research demonstrated its effectiveness at learning accurate metadata for several research testbeds of varying size and heterogeneity [2]. Later research demonstrated that learned metadata resulted in relatively accurate database selection [1]. The early results were encouraging, but they studied query based sampling under a set of relatively narrow conditions. Retrieval results were obtained with just one database selection ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 479--490. ACM, 1999.


Using the Web to Create Minority Language Corpora - Ghani, Jones, Mladenic (2001)   (2 citations)

....of Web pages found by using each of the three documents as seeds for our experiments. We can also measure the rate at which we find new Slovenian documents as our experiments progress and a decreasing rate would give us a bound on the number of documents we can find using our methods. Callan et al. [3] and Ghani and Jones [8] use various measures like percent vocabulary coverage and ctf to evaluate the coverage of their language models. Although our task is not to construct a language model for Slovenian and we do not have the true model to compare against, we can still use these measures and ....

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia, 1999. ACM.


Towards a Highly-Scalable and Effective Metasearch Engine - Wu, Meng, Yu, Li (2001)   (3 citations)

....each local search engine to provide the statistical information needed for database selection. Clearly, there will be cases where the documents of a database cannot be independently obtained and a local search engine is uncooperative. In these cases, a technique known as query sampling [5] could be adopted to estimate the needed statistics. For the rest of this paper, we assume that the adjusted maximum normalized weights have already been obtained. If we follow the example of existing approaches, we would create a separate database representative for each database. In this ....

....of correct documents are identified. The improvements over not considering the new terms vary from 5.3 to 8.8 percentage points for cor iden db and from 5.2 to 8.9 percentage points for cor iden doc. One of the issues we are currently studying is how to adopt the query sampling technique proposed in [5] to estimate the adjusted maximum normalized weight of a term from an uncooperative search engine. A pilot study has been carried out to estimate a related statistic (i.e. the maximum normalized weight) and preliminary results indicate that the technique is promising [18]. ....

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD, 1999.


A Description of the LAMB Web-Derived Language Model Builder - Edward Neil James (2000)

....Builder. Edward K. O'Neil, James C. French. Department of Computer Science, University of Virginia. Technical Report CS 2000 31, May 16, 2000. Abstract: This paper describes the language modeling script constructed for the Spring 2000 Information Retrieval seminar. The script builds on Callan's [1] work of automatically generating language models for text databases by creating language models for internet search engines. An overview of the work, principles and observed properties of language models, the language modeling script (LAMB) and its shortcomings are described. 1 Introduction ....

....of different indexing, stopping, or stemming strategies at each site, but the sites cooperate with the server to provide information about their contents. Obviously, this method is not effective for sites that do not voluntarily provide a language model representing their contents. Callan et al. [1] present the query based sampling method as a way to subvert uncooperative collections by automatically learning the language model of a given site [1]. This process proceeds by approximate random sampling at the source via a query interface and construction of the language model from the documents ....

[Article contains additional citation context not shown here]

Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the ACM--SIGMOD International Conference on Management of Data, pages 479--490, 1999.


Determining Stopping Criteria in the Generation of.. - Monroe, Mikesell, French (2000)

....need [GGM95, CLC95, FPV98, FPC99b, FPC99a, XC99]. Database selection algorithms require information about the contents of those databases over which they are selecting. A number of ways have been proposed for defining what information is required and how to acquire it, cf. [GCGMP97, HT99, CCD99]. Following Callan et al. [CCD99] we refer to this information as a language model. Our language model of a database lists the words that occur in the database and their frequency of occurrence and, perhaps, other information. Since a completely accurate language model must include a ....

.... 99b, FPC99a, XC99]. Database selection algorithms require information about the contents of those databases over which they are selecting. A number of ways have been proposed for defining what information is required and how to acquire it, cf. [GCGMP97, HT99, CCD99]. Following Callan et al. [CCD99] we refer to this information as a language model. Our language model of a database lists the words that occur in the database and their frequency of occurrence and, perhaps, other information. Since a completely accurate language model must include a representation of all of the words from all ....

[Article contains additional citation context not shown here]

Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the ACM--SIGMOD International Conference on Management of Data, pages 479--490, 1999.


Server Selection on the World Wide Web - Craswell, Bailey, al. (2000)   (29 citations)

....normally do not export any information about the documents they index, such as term occurrence statistics, for use in selection. For this reason, information needs to be extracted using the lowest common denominator: the ability of a search server to return search results. Callan, Connell and Du [2] suggested query based sampling of documents, by sending a series of probe queries to all servers (see Figure 2) before query time, downloading the documents returned and extracting term occurrence statistics from those documents. Although this provides a non random sample of a server's documents, ....

....query based sampling of documents, by sending a series of probe queries to all servers (see Figure 2) before query time, downloading the documents returned and extracting term occurrence statistics from those documents. Although this provides a non random sample of a server's documents, the study [2] found extracted statistics to be representative of full server statistics. The approach is also appealing because: 1) it can be applied to any search server whose returned documents are available for download, and 2) a broker can address a new server at any time simply by sending it the probe ....

[Article contains additional citation context not shown here]

Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM International Conference on Management of Data (SIGMOD 99), New York, 1999. ACM.


Discovery of Similarity Computations of Search Engines - Liu, Meng, Yu (2000)   (1 citation)

....• providing techniques to determine the constants embedded in the formulas for computing the weight of a term; and • providing experimental results to illustrate how our techniques can be utilized to discover the similarity computation of the WebCrawler search engine. In a recent paper [3], the problem of discovering the language model for a text database is addressed. A language model describes the words or indexing terms that occur in the database and frequency information indicating how often each term occurs [3]. A method is presented that uses a query based sampling approach. ....

....computation of the WebCrawler search engine. In a recent paper [3], the problem of discovering the language model for a text database is addressed. A language model describes the words or indexing terms that occur in the database and frequency information indicating how often each term occurs [3]. A method is presented that uses a query based sampling approach. It is shown that a database selection service (global search engine) can learn the language model of a (uncooperative) database by sampling the contents of the database via the process of running carefully selected queries and ....

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of ACM SIGMOD, 1999.


Building Efficient and Effective Metasearch Engines - Meng, Yu, Liu (2002)   (11 citations)

....representatives may not contain certain information desired by a particular database selector. For non cooperative search engines that do not follow any standard, their representatives may be extracted from past retrieval experiences (e.g. SavvySearch [14]) or from sampled documents (e.g. [15]). But sampling may cause inaccuracies. There are two major challenges in developing good database selection algorithms. One is to identify appropriate database representatives. A good representative should permit fast and accurate estimation of database usefulness. At the same time, a good ....

....engine but also the detection of major upgrades or changes of existing component systems. Some preliminary work in this area has started to be reported. Using sampling technique to discover the terms in a component database and some statistical information about these terms is reported in [15]. In [42] a technique is proposed to discover how term weights are assigned in component search engines. New techniques need to be developed to discover knowledges about component search engines more accurately and more efficiently. 4. There are two extreme ways to building a metasearch engine. ....

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Conference, 1999.


Towards a Highly-Scalable and Effective Metasearch Engine - Wu (2001)   (3 citations)

....each local search engine to provide the statistical information needed for database selection. Clearly, there will be cases where the documents of a database cannot be independently obtained and a local search engine is uncooperative. In these cases, a technique known as query sampling [5] could be adopted to estimate the needed statistics. For the rest of this paper, we assume that the adjusted maximum normalized weights have already been obtained. If we follow the example of existing approaches, we would create a separate database representative for each database. In this ....

....that very good retrieval accuracy can be achieved by the proposed solution. A prototype system based on the proposed method has been implemented (see http: slate.cs. binghamton.edu:8080 CSams ) One of the issues we are currently studying is how to adopt the query sampling technique proposed in [5] to estimate the adjusted maximum normalized weight of a term from an un cooperative search engine. A pilot study has been carried out to estimate a related statistic (i.e. the maximum normalized weight) and preliminary results indicate that the technique is promising [18] Acknowledgement: This ....

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD, 1999.


Automatically Building a Corpus for a Minority Language from.. - Jones, Ghani (2000)

....have complete access to the WWW, and access through a search engine is time intensive, and not every page on the WWW comes labeled with the language the document was written in, we cannot apply traditional language modeling techniques to our database. Instead, we use the approach introduced by Callan et al. (1999) which uses query based sampling to acquire monolingual language models from multiple databases. They are motivated by the fact that word occurrences follow a highly skewed distribution, with a few words occurring very often and most words occurring rarely. In the light of evidence suggesting that ....

....in M, the language model should cover more of the terms found in the true vocabulary. Percentage of vocabulary learned gives equal importance to all the terms in the vocabulary and thus is not a good match for text data because of the skewed distribution of terms in a corpus. According to Callan et al. (1999), about 75% of the vocabulary of a text database is words that occur 3 times or less. 3.3 Cumulative Term Frequency (ctf) Ratio Another measure for the quality of the learned vocabulary is the ctf ratio which gives a weight to each term that is proportional to its frequency in the corpus. Ctf ....

J. Callan, M. Connell, and A. Du. 1999. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia.


Learning a Monolingual Language Model from a Multilingual Text .. - Ghani, Jones (2000)

....and access the database using the model. Various studies have been performed on the optimal choice of features that a good language model should contain, but in general a language model describes the words that occur in a database, and frequency information indicating how often each term occurs [2]. In natural language tasks, a language model is usually formulated as a probability distribution p(s) over strings of words s that attempts to reflect how frequently a string occurs in a language. The most widely used language models are n gram models. In this paper, we construct unigram language ....

....to the WWW, and access through a search engine is time intensive, and every page on the WWW does not come labeled with the language the document was written in, we cannot apply traditional language modelling techniques to our database. Instead, we use the approach introduced by Callan et al. [2] which uses query based sampling to acquire language models from multiple databases. Query based sampling is motivated by the fact that word occurrences follow a highly skewed distribution, with a few words occurring very often, and most words occurring rarely. In the light of evidence ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia, 1999. ACM.


Automatically Building a Corpus for a Minority Language from.. - Jones, Ghani (2000)

....have complete access to the WWW, and access through a search engine is time intensive, and not every page on the WWW comes labeled with the language the document was written in, we cannot apply traditional language modeling techniques to our database. Instead, we use the approach introduced by Callan et al. (1999) which uses query based sampling to acquire monolingual language models from multiple databases. They are motivated by the fact that word occurrences follow a highly skewed distribution, with a few words occurring very often and most words occurring rarely. In the light of evidence suggesting that ....

....in M, the language model should cover more of the terms found in the true vocabulary. Percentage of vocabulary learned gives equal importance to all the terms in the vocabulary and thus is not a good match for text data because of the skewed distribution of terms in a corpus. According to Callan et al. (1999), about 75% of the vocabulary of a text database is words that occur 3 times or less. 3.3 Cumulative Term Frequency (ctf) Ratio Another measure for the quality of the learned vocabulary is the ctf ratio which gives a weight to each term that is proportional to its frequency in the corpus. Ctf ....

J. Callan, M. Connell, and A. Du. 1999. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia.
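
The contrast drawn in the second excerpt above, between the percentage of vocabulary learned and the ctf ratio, can be illustrated with a minimal sketch. It assumes the ctf ratio is the share of all term occurrences in the corpus that belong to terms in the learned vocabulary. The function names `ctf_ratio` and `vocabulary_fraction` and the toy corpus are hypothetical and do not come from the cited papers.

```python
from collections import Counter


def ctf_ratio(learned_vocab, corpus_tokens):
    """Share of all term occurrences in the corpus covered by the learned vocabulary.

    Assumption: each term is weighted by its collection term frequency (ctf),
    so frequent terms count for more than rare ones.
    """
    ctf = Counter(corpus_tokens)
    total = sum(ctf.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for terms that never occur in the corpus.
    covered = sum(ctf[term] for term in set(learned_vocab))
    return covered / total


def vocabulary_fraction(learned_vocab, corpus_tokens):
    """Unweighted share of distinct corpus terms present in the learned vocabulary."""
    actual = set(corpus_tokens)
    if not actual:
        return 0.0
    return len(set(learned_vocab) & actual) / len(actual)


# Toy corpus with a skewed distribution: one frequent term, two rare ones.
tokens = ["the"] * 8 + ["cat", "dog"]
learned = {"the"}
print(vocabulary_fraction(learned, tokens))  # one of three distinct terms
print(ctf_ratio(learned, tokens))            # eight of ten occurrences
```

On a skewed corpus like this one, learning only the frequent term gives a low vocabulary fraction but a high ctf ratio, which is the mismatch the excerpt describes.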


Automatic Classification of Text Databases through Query.. - Ipeirotis, Gravano, Sahami (2000)   (Correct)

....Manually constructed query probes have been used in [4] for the classification of text databases. Query probes were used in [7] to rank databases by similarity to a given query. This algorithm assumes that the query interface can handle differently normal queries and query probes. Reference [1] probes text databases with queries to determine an approximation of their vocabulary and associated statistics. This technique requires retrieving the documents in the query results for further analysis. Finally, guided query probing has been used in [13] to determine sources of heterogeneity in ....

....queries in such a way that we can use the inclusion-exclusion principle to calculate the number of results that would have been returned for the original queries. A significant advantage of our probing approach is that we do not need to retrieve documents to analyze the contents of a database [1]. Instead, we count only the number of matches for these queries. Thus, in our approach we only require a database to report the number of matches for a given query. It is common for a database to return something like "X documents found" before returning the actual results. 3.3 Using Probing ....

[Article contains additional citation context not shown here]

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In SIGMOD 1999, Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania, USA, pages 479-490. ACM Press, 1999.


Personalized Information Environments: An Architecture for.. - French, Viles (1999)   (1 citation)  (Correct)

....existence to some registry that a PIE knows about. We assume that such registries exist. 10 with. Some providers will be willing to give up such information, others will do so only for a fee. Random sampling of the resource is an indirect way to determine a resource's statistical properties [4]. In addition to RDs, PeCs will contain significant information related to user manipulations. Minimally, this would include desktop organization information and user annotations, but, in the case of shared PeCs, would also include access control information and anonymized PeC usage patterns. ....

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. In To appear: Proceedings of SIGMOD'99, Philadelphia, PA, 1999.


Discovery of Similarity Computations of Search Engines - Liu, Meng, Yu (2000)   (1 citation)  (Correct)

....• providing techniques to determine the constants embedded in the formulas for computing the weight of a term; and • providing experimental results to illustrate how our techniques can be utilized to discover the similarity computation of the WebCrawler search engine. In a recent paper [7], the problem of discovering the language model for a text databases is addressed. A language model describes the words or indexing terms that occur in the database and frequency information indicating how often each term occurs [7]. A method is presented that uses a query based sampling approach. ....

....computation of the WebCrawler search engine. In a recent paper [7], the problem of discovering the language model for a text databases is addressed. A language model describes the words or indexing terms that occur in the database and frequency information indicating how often each term occurs [7]. A method is presented that uses a query based sampling approach. It is shown that a database selection service (global search engine) can learn the language model of a (uncooperative) database by sampling the contents of the database via the process of running carefully selected queries and ....

J. Callan, M. Connell and A. Du. "Automatic Discovery of Language Models for Text Databases". ACM SIGMOD, 1999.


A Statistical Method for Estimating the Usefulness of Text Databases - Liu (1998)   (1 citation)  (Correct)

....estimation method is quite robust with respect to the inaccuracy as a 4 bit approximation of each value can still produce reasonably accurate usefulness estimation. Furthermore, a recent study indicates that using sampling queries is capable of generating decent statistical information for terms [4]. 5 Experimental Results Three databases, D1, D2 and D3, and a collection of 6,234 queries are used in the experiment. D1, containing 761 documents, is the largest among the 53 databases that are collected at Stanford University for testing the gGlOSS system. The 53 databases are snapshots of 53 ....

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Conference, 1999.


Detection of Heterogeneities in a Multiple Text Database Environment - Meng   (4 citations)  (Correct)

....37] there have been relatively few studies for text database systems. Second, we present techniques to detect specific heterogeneities among multiple text retrieval systems. Applying probe queries to discover knowledge about a search engine is a new research area. Most recently, the authors of [7] used probe queries to discover the terms in a local database and some statistical information about these terms. The rest of the paper is organized as follows. In Section 2, we identify major heterogeneities among local search engines and analyze their impact on building an effective and ....

....engine and use the experiences to predict the usefulness of the search engine for future queries. SavvySearch is a metasearch engine that uses this solution [10]. The second solution is to submit probe queries to the search engine and extract a database representative from the retrieved documents [7]. 3. Due to both autonomy and heterogeneity, different types of database representatives for different search engines may be available to the metasearch engine. First, we may have representatives extracted from past experiences or retrieved documents for search engines that do not want to provide ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Conference, 1999.


Query-Based Sampling of Text Databases - Callan, Connell (1999)   (23 citations)  Self-citation (Callan Connell)   (Correct)

....an unsuitable solution for environments where resources are controlled by many parties. In these environments, a different solution is required. Query based sampling is a recently developed method of acquiring resource descriptions that does not require explicit cooperation from resource providers [5]. Instead, resource descriptions are created by running queries and examining the documents that are returned. Resource descriptions can be guaranteed to be compatible because they are created under the control of the sampling process, not each individual resource provider. Preliminary experiments ....

....suggested that query based sampling is an effective and efficient method of acquiring resource descriptions. The preliminary experiments studied how closely a resource description created by sampling (a learned resource description) matched the actual resource description for a text database [5]. The results were encouraging but inconclusive, in part due to a flawed experimental methodology. This paper reproduces the earlier experiments using an improved experimental methodology. It also extends the prior research by investigating the effects of learned resource descriptions on a resource ....

[Article contains additional citation context not shown here]

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 479-490. ACM, 1999.
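
The first excerpt above describes resource descriptions as "created by running queries and examining the documents that are returned". A rough sketch of that loop follows. It makes several assumptions not stated in the excerpts: the database is reached through a caller-supplied `search` function returning `(doc_id, text)` pairs, each new query is a random term drawn from the vocabulary learned so far, and sampling stops after `max_docs` distinct documents. All names and parameter values are hypothetical.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, max_docs=300, docs_per_query=4, rng=None):
    """Build an approximate term-frequency description of a database by sampling.

    search: callable taking a one-term query and returning a list of
            (doc_id, text) pairs (a hypothetical interface).
    Returns (term_counts, seen_ids).
    """
    rng = rng or random.Random(0)
    seen_ids = set()         # documents already examined
    term_counts = Counter()  # the learned resource description
    tried = set()            # query terms already issued
    query = seed_term
    while query is not None and len(seen_ids) < max_docs:
        tried.add(query)
        # Examine only the top few documents returned for this query.
        for doc_id, text in search(query)[:docs_per_query]:
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                term_counts.update(text.lower().split())
        # Assumption: the next query term comes from the vocabulary learned so far.
        candidates = sorted(set(term_counts) - tried)
        query = rng.choice(candidates) if candidates else None
    return term_counts, seen_ids


# Toy database standing in for an uncooperative search engine.
docs = {1: "apple banana", 2: "banana cherry", 3: "cherry date"}

def toy_search(term):
    return [(i, t) for i, t in sorted(docs.items()) if term in t.split()]

counts, ids = query_based_sample(toy_search, "apple", max_docs=10)
print(sorted(ids), counts.most_common())
```

Because the sampler only issues queries and reads results, it needs no cooperation from the database beyond its ordinary search interface.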


Collection Selection and Results Merging with Topically.. - Larkey, Connell, Callan (2000)   (7 citations)  Self-citation (Callan Connell)   (Correct)

....like term frequencies. These statistics, which are used to select or rank the available collections relevance to a query, are usually assumed to be available from cooperative providers. Alternatively, statistics can be approximated by sampling uncooperative providers with a set of queries [2]. In the present study we compare two of these approaches, CORI and topic modeling. The distributed patent system uses the CORI net (collection retrieval information network) approach in INQUERY [1] described in more detail in section 3.3.1, because this method has been shown successful in ....

Callan, J., Connell, M., and Du, A. Automatic Discovery of language models for text databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 479-490, 1999.


Distributed Information Retrieval - Callan (2000)   (15 citations)  Self-citation (Callan)   (Correct)

....and publication date. One representative example was the testbed created for TREC 5 (Harman, 1997) in which data on TREC CDs 2 and 4 was partitioned into 98 databases, each about 20 megabytes in size. Testbeds of about 100 databases each were also created based on TREC CDs 1 and 2 (Xu and Callan, 1998), TREC CDs 2 and 3 (Lu et al. 1996a; Xu and Callan, 1998), and TREC CDs 1, 2, and 3 (French et al. 1999; Callan, 1999a). A testbed of 921 databases was created by dividing the 20 gigabyte TREC Very Large Corpus (VLC) data into smaller databases (Callan, ....

Callan, J., Connell, M., and Du, A. (1999a). Automatic discovery of language models for text databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia. ACM.


Server Selection Methods in Hybrid Portal Search - David Hawking Csiro (2005)   (Correct)

No context found.

Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proc. ACM SIGMOD 99, 1999.


A Thesis by John King, B.I.T. - Deep Web Collection   (Correct)

No context found.

Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 479--490. ACM Press, 1999.


Downloading Hidden Web Content - Ntoulas, Zerfos, Cho (2004)   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD, 1999.


Frontiers in Web Data Management - Junghoo John Cho   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD Conference, pages 479--490, 1999.


Distributed Search-Based Advertising on the Web - Schmidt, Patel (2004)   (Correct)

No context found.

CALLAN, J., CONNELL, M., and DU, A., Automatic discovery of language models for text databases, in: Proceedings of the ACM SIGMOD 1999.


Clustering Structured Web Sources: a Schema-based.. - He, Tao, Chang (2004)   (Correct)

No context found.

Callan, J.P., Connell, M., Du, A.: Automatic discovery of language models for text databases. In: SIGMOD Conference. (1999)


Knocking the Door to the Deep Web: Integrating Web Query.. - He, Zhang, Chang (2004)   (Correct)

No context found.

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD Conference, 1999.


MetaQuerier over the Deep Web: Shallow Integration across.. - Chang, He, Zhang (2004)   (Correct)

No context found.

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD Conference, 1999.


Structured Databases on the Web: Observations and.. - Chang, He, Li, Patel, Zhang (2004)   (4 citations)  (Correct)

No context found.

James P. Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia, Pennsylvania, USA, June 1999. ACM Press.


Organizing Structured Web Sources by Query Schemas: A.. - Bin He Tao (2004)   (Correct)

No context found.

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 479--490, Philadelphia, Pennsylvania, USA, June 1999. ACM Press.


Using Generic Corpora to Learn Domain-Specific Terminology - David Vogel (2003)   (Correct)

No context found.

Callan, J., Connell, M., and Du, A. Automatic Discovery of Language Models for Text Databases. SIGMOD '99, 479-490, 1999.


Discovering and Ranking Data Intensive Web Services: Source-Biased Approach - James (2003)   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD '99.


When one Sample is not Enough: Improving Text Database.. - Ipeirotis, Gravano (2004)   (Correct)

No context found.

J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD'99, 1999.


Probe, Cluster, and Discover: Focused Extraction of.. - Caverlee, Liu, Buttler (2004)   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD '99.


Text Database Selection for Longer Queries - Wu   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD, 1999.


Towards a Highly-Scalable Metasearch Engine - Meng, Yu, Wu   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Conference, 1999.


Challenges and Solutions for Building an Efficient and . . . - Meng, al. (1999)   (Correct)

No context found.

J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. ACM SIGMOD Conference, 1999.


CiteSeer.IST - Copyright Penn State and NEC