Results 11 - 20 of 22
Webaffix: Discovering Morphological Links on the WWW
This paper presents a new language-independent method for finding morphological links between newly appeared words (i.e. words absent from reference word lists). Using the WWW as a corpus, the Webaffix tool detects occurrences of new derived lexemes based on a given suffix, proposes a base lexeme following a standard scheme (such as noun-verb), and then performs a compatibility test on the resulting word pairs, using the Web again, this time as a source of cooccurrences. The resulting pairs of words are used to build generic morphological databases useful for a number of NLP tasks. We develop and comment on an example use of Webaffix to find new noun/verb pairs in French.
1. Overview
We present a new language-independent method of finding morphological links between newly appeared words (i.e. words absent from reference word lists). Our reflection originates from the observation that only a few languages have easily available morphological databases. The only widely distributed one is CELEX (Baayen et al., 1995) for Dutch, English and German. In
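The three steps named in the abstract (detect new suffixed words, propose a base, test the pair by cooccurrence) can be sketched roughly as follows. This is a minimal illustration, not the Webaffix implementation: the "-ation" to "-er" rewriting rule, the function names, the threshold, and the small in-memory document list standing in for the Web are all assumptions made for the example.

```python
import re

def propose_base(noun, suffix="ation", verb_ending="er"):
    """Strip the suffix and attach a verb ending (hypothetical noun-verb scheme)."""
    if not noun.endswith(suffix) or len(noun) <= len(suffix):
        return None
    return noun[: -len(suffix)] + verb_ending

def cooccur(word_a, word_b, documents):
    """Count the documents that contain both words as whole tokens."""
    count = 0
    for doc in documents:
        tokens = set(re.findall(r"\w+", doc.lower()))
        if word_a in tokens and word_b in tokens:
            count += 1
    return count

def find_pairs(candidates, known_words, documents, suffix="ation", min_cooccurrence=1):
    """Keep (noun, verb) pairs for unknown nouns whose proposed base cooccurs with them."""
    pairs = []
    for noun in candidates:
        if noun in known_words:
            continue  # only words absent from the reference list are of interest
        verb = propose_base(noun, suffix)
        if verb and cooccur(noun, verb, documents) >= min_cooccurrence:
            pairs.append((noun, verb))
    return pairs

# Toy data: the documents stand in for Web pages (an assumption of this sketch).
documents = [
    "La googlisation du monde : faut-il googliser tout ?",
    "Une station de ski.",
]
pairs = find_pairs(["googlisation", "station"], {"station"}, documents)
```

The sketch only checks the literal proposed form; a real compatibility test would presumably also have to account for inflected forms of the base.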
Using Ontology to build News Network
One of the main activities in knowledge sharing is searching for and retrieving textual documents. Traditional search methods use user-specified keywords to find documents. The common problem with this approach is that the retrieved documents are often not the ones the user is actually looking for, even though the search is based on user-defined keywords. This work proposes building a well-defined domain in which semantic relationships can be defined among the text documents in the repository, to enhance search and retrieval performance. Reuters news is chosen as the domain, and an ontology defining these relationships is established to address the synonymy and polysemy problems. The ontology uses keywords to quantify the relationship strengths and labels the qualitative semantics. The ontology structure is a network of documents arranged hierarchically. This paper discusses the implementation of the document ontology as applied to the Reuters news corpus, where retrieval performance is measured by recall and precision.
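For reference, recall and precision as used for retrieval evaluation can be computed as in the short sketch below. The function name and the document identifiers are made up for illustration; the abstract does not describe the evaluation setup beyond naming the two measures.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 documents actually relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).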
Author manuscript, published in "Fifth Workshop on Web As Corpus, San-Sebastian: Spain (2009)"
Looking for French deverbal nouns in an evolving Web (a short history of WAC)
, 2009
This paper describes an 8-year-long research effort for automatically collecting new French deverbal nouns on the Web. The goal has remained the same: building an extensive and cumulative list of noun-verb pairs where the noun denotes the action expressed by the verb (e.g. production - produce). This list is used both for linguistic research and for NLP applications. The initial method took advantage of the former Altavista search engine, which allowed direct access to unknown word forms. The second technique led us to develop a specific crawler, which raised a number of technical difficulties. In the third experiment, we use a collection of web pages made available to us by a commercial search engine. Through all these stages, the general method has remained the same, and the results are similar and cumulative, although the technical environment has greatly evolved.
THE REGNET PROJECT A Review of Academic Research on Information Retrieval
, 2002
Acknowledgement and Disclaimer
This report is intended to review current academic research into information retrieval for unstructured multimedia content. This review has been performed as part of the Regnet Project, which is funded by the National Science Foundation under Grant No. EIA-0085998. Any opinions, findings, and conclusions or recommendations expressed in this report are those of the author and do not necessarily reflect the views of the National Science Foundation.
Contractual Delivery Date: 31/08/2003 Actual Delivery Date: 31/08/2003 Responsible Partner: EPFL
Dissertation Proposal
, 2002
Contents
1. Introduction
1.1 Motivation
1.2 Objectives
1.3 Scope and Approach
2. Related Work
2.1 Building the Repository
2.2 Analyzing the Documents
3. Building the Repository
Feature selection based on word–sentence relation
Feature selection has been shown to improve both the speed and the quality of classification. Methods such as mutual information, information gain or chi-square are all based on the joint distribution of classes and words; only a few methods exploit contextual information for feature selection. We introduce an algorithm based on word and word pair frequencies that reduces both vocabulary and total word size prior to classification. We measure the effectiveness of our algorithm by clustering Ken Lang’s 20 newsgroups corpus and obtain significantly better size reduction than the state-of-the-art methods. We perform keyword selection in three steps: identifying correlated word pairs within sentences; measuring how strongly a word in a given document takes part in such pairs; and finally selecting those keywords that take part in several such pairs in several documents.
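The keyword selection steps described above can be illustrated with the rough sketch below. It is not the authors' algorithm: the abstract does not say how pair correlation is measured, so a plain sentence-level cooccurrence count is used as a stand-in, and the function names and the three thresholds are assumptions chosen for the example.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

def sentences(doc):
    """Split a document into sentences, each represented as a set of lowercase words."""
    return [set(re.findall(r"[a-z]+", s.lower()))
            for s in re.split(r"[.!?]+", doc) if s.strip()]

def select_keywords(documents, min_pair_count=2, min_pairs=2, min_docs=2):
    doc_sentences = [sentences(d) for d in documents]

    # Step 1: word pairs that share a sentence often enough count as "correlated"
    # (a simple frequency threshold stands in for a real correlation measure).
    pair_counts = Counter()
    for sents in doc_sentences:
        for words in sents:
            pair_counts.update(combinations(sorted(words), 2))
    correlated = {p for p, c in pair_counts.items() if c >= min_pair_count}

    # Step 2: per document, collect the correlated pairs each word takes part in.
    docs_per_word = Counter()
    for sents in doc_sentences:
        pairs_in_doc = defaultdict(set)
        for words in sents:
            for pair in combinations(sorted(words), 2):
                if pair in correlated:
                    pairs_in_doc[pair[0]].add(pair)
                    pairs_in_doc[pair[1]].add(pair)
        for word, pairs in pairs_in_doc.items():
            if len(pairs) >= min_pairs:
                docs_per_word[word] += 1

    # Step 3: keep words that take part in several pairs in several documents.
    return {w for w, n in docs_per_word.items() if n >= min_docs}

docs = [
    "Space shuttle launch. Shuttle launch delayed.",
    "The space shuttle launch was televised.",
    "Cats sleep.",
]
keywords = select_keywords(docs)
```

On this toy input, only "space", "shuttle" and "launch" form pairs that recur across sentences and documents, so they are the words retained.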
By
, 2006
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistical, knowledge-based, and machine learning approaches for Chinese unknown word resolution, including the identification, part-of-speech (POS) tagging, and sense tagging of Chinese unknown words. What makes Chinese unknown word resolution hard is the limited information available for predicting the properties of unknown words, and for this reason it is crucial to make optimal use of the information that is available. To this end, this research explores two central ideas and aims to achieve two major goals. First, the morphological, syntactic, and semantic information of the component characters or morphemes of an unknown word provides useful insights into its structural and semantic properties. The first goal of this work is to develop novel algorithms that