Augmenting Naive Bayes Classifiers with Statistical Language Models
, 2003
Cited by 65 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
A graph-based recommender system for digital library
- In Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries
, 2002
Cited by 42 (5 self)
Research shows that recommendations comprise a valuable service for users of a digital library [11]. While most existing recommender systems rely either on a content-based approach or a collaborative approach to make recommendations, there is potential to improve recommendation quality by using a combination of both approaches (a hybrid approach). In this paper, we report how we tested the idea of using a graph-based recommender system that naturally combines the content-based and collaborative approaches. Due to the similarity between our problem and a concept retrieval task, a Hopfield net algorithm was used to exploit high-degree book-book, user-user and book-user associations. Sample hold-out testing and preliminary subject testing were conducted to evaluate the system. These tests showed that combining the content-based and collaborative approaches improved both precision and recall. However, no significant improvement was observed by exploiting high-degree associations.
Accessor variety criteria for Chinese word extraction
- Computational Linguistics
, 2004
Cited by 39 (0 self)
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example,, ‘percent’, and, ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
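The counting step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_len` cutoff, the single boundary markers `<s>` and `</s>`, and the use of the smaller of the two counts as a score are all assumptions made for the example.

```python
from collections import defaultdict


def accessor_counts(sentences, max_len=4):
    """Count distinct predecessors and successors for every substring.

    Returns {substring: (n_distinct_predecessors, n_distinct_successors)}.
    Assumption: a sentence boundary counts as one accessor, "<s>" or "</s>".
    """
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for sentence in sentences:
        n = len(sentence)
        for i in range(n):
            # Candidate strings of length 1..max_len starting at position i.
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                predecessors[s].add(sentence[i - 1] if i > 0 else "<s>")
                successors[s].add(sentence[j] if j < n else "</s>")
    return {s: (len(predecessors[s]), len(successors[s])) for s in predecessors}


def variety_score(counts):
    """Hypothetical score: the smaller of the two accessor counts."""
    return min(counts)


if __name__ == "__main__":
    # Toy corpus in Latin letters so the contexts are easy to read.
    counts = accessor_counts(["abcd", "xbcy", "bc"])
    print(counts["bc"], variety_score(counts["bc"]))
```

A string that appears in many different left and right contexts gets a high score and is a stronger word candidate under this reading; a string that always appears inside the same longer string gets a low one.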
The Effects of Fitness Functions on Genetic Programming-Based Ranking Discovery for Web Search
- Journal of the American Society for Information Science and Technology
, 2004
Cited by 35 (15 self)
Genetic-based evolutionary learning algorithms, such as genetic algorithms (GAs) and genetic programming (GP), have been applied to information retrieval (IR) since the 1980s. Recently, GP has been applied to a new IR task --- discovery of ranking functions for web search --- and has achieved very promising results. However, in our prior research, only one fitness function has been used for GP-based learning. It is unclear how other fitness functions may impact ranking function discovery for web search, especially since it is well known that choosing a proper fitness function is very important for the effectiveness and efficiency of evolutionary algorithms. In this paper, we report our experience in contrasting different fitness function designs on GP-based learning using a very large web corpus. Our results indicate that the design of fitness functions is instrumental in performance improvement. We also give recommendations on the design of fitness functions for genetic-based information retrieval experiments.
On the Use of Words and N-grams for Chinese Information Retrieval
- In Fifth International Workshop on Information Retrieval with Asian Languages, IRAL2000, Hong Kong
, 2000
Cited by 32 (6 self)
In the processing of Chinese documents and queries in information retrieval (IR), one has to
Updateable PAT-Tree Approach to Chinese Key Phrase Extraction Using Mutual . . .
, 1999
Cited by 30 (13 self)
There has been renewed research interest in using the statistical approach to extraction of key phrases from Chinese documents because existing approaches do not allow online frequency updates after phrases have been extracted. This consequently results in inaccurate, partial extraction. In this paper, we present an updateable PAT-tree approach. In our experiment, we compared our approach with that of Lee-Feng Chien and observed an improvement in recall from 0.19 to 0.43 and in precision from 0.52 to 0.70. This paper also reviews the requirements for a data structure that facilitates implementation of any statistical approach to key-phrase extraction, including PAT-tree, PAT-array and suffix array with semi-infinite strings.
TREC-6 English and Chinese Retrieval Experiments using PIRCS
, 1996
Cited by 22 (4 self)
For TREC-6 ad-hoc experiments, we continue to use two-stage retrieval with pseudo-feedback from top-ranked unjudged documents for both Chinese and English. We perform three types of retrieval characterized by queries formed using title only, description only and all sections of the given topics. For short queries mainly derived from the title or description section, query terms are weighted by average term frequency avtf introduced previously. For Chinese, we employ a strategy that combines representations (character, bigram and short-word), returning the highest average non-interpolated precision that is even better than some manual approaches. In English ad-hoc, we try a document re-ranking strategy for the first stage retrieval based on occurrence of selected query term pairs, so as to obtain better results in the second stage. Performance for English ad-hoc is also highly competitive for both very short and long queries. In routing, a strategy of combining different methods of query format...
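As a rough illustration of two of the representations named in the abstract above (character and bigram), the sketch below turns a Chinese string into single-character terms and overlapping two-character terms. It is not the PIRCS implementation: the function names are hypothetical, and the short-word representation is left out because it would need a lexicon or segmenter that the abstract does not describe.

```python
def char_terms(text):
    """Single-character index terms, with whitespace dropped."""
    return [c for c in text if not c.isspace()]


def bigram_terms(text):
    """Overlapping two-character index terms."""
    chars = char_terms(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def combined_terms(text):
    """Assumed combination: simply pool both kinds of terms for indexing."""
    return char_terms(text) + bigram_terms(text)


if __name__ == "__main__":
    print(combined_terms("信息检索"))
```

Pooling the terms is only one way to combine representations; merging separately ranked result lists is another, and the abstract does not say which was used.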
Internet Searching and Browsing in a Multilingual World: An Experiment on the Chinese Business Intelligence Portal (CBizPort)
, 2004
Cited by 17 (7 self)
In this paper, we propose a generic and integrated approach to searching and browsing the Internet in a multilingual world. Based on this approach, we have developed the Chinese Business Intelligence Portal (CBizPort), a meta-search engine that searches for business information from mainland China, Taiwan, and Hong Kong. Additional functions provided by CBizPort include encoding conversion (between Simplified Chinese and Traditional Chinese), summarization, and categorization. Experimental results of our user evaluation study show that the searching and browsing performance of CBizPort was comparable to that of regional Chinese search engines, and CBizPort could significantly augment these search engines. Subjects' verbal comments indicate that CBizPort performed best in terms of analysis functions, cross-regional searching, and user-friendliness, whereas regional search engines were more efficient and more popular. Subjects especially liked CBizPort's summarizer and categorizer, which helped in understanding search results. These encouraging results suggest a promising future for our approach to Internet searching and browsing in a multilingual world.
Combination and boundary detection approaches on Chinese indexing
- Journal of the American Society for Information Science
, 2000
Cited by 12 (6 self)
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching. The better the indexing, the better the search results. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to
Using Genetic Algorithm to Improve Information Retrieval Systems
Cited by 8 (0 self)
This study investigates the use of genetic algorithms in information retrieval. The method is shown to be applicable to three well-known document collections, where more relevant documents are presented to users in the genetic modification. In this paper we present a new fitness function for approximate information retrieval that is faster and more flexible than the cosine similarity fitness function.
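For reference, the baseline named in the abstract above, cosine similarity between a query and a document, can be sketched as below. This shows only the standard measure over sparse term-weight vectors. The paper's new fitness function is not given in the abstract and is not reproduced here, and the dictionary representation and function name are assumptions for the example.

```python
import math


def cosine_similarity(query, doc):
    """Cosine of the angle between two sparse term-weight vectors.

    Both arguments map term -> weight. Returns 0.0 if either vector is empty
    or all-zero (an assumed convention to avoid division by zero).
    """
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


if __name__ == "__main__":
    query = {"genetic": 1.0, "retrieval": 1.0}
    doc = {"genetic": 2.0, "algorithm": 1.0, "retrieval": 1.0}
    print(cosine_similarity(query, doc))
```

In a genetic-algorithm setting, a score like this would be evaluated for each candidate query representation against the documents, which is why the cost of the fitness function matters.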