Results 1 - 10 of 17
Phrase-based Document Categorization revisited
"... This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization (aka classification) of documents. Until now, there was found little or no evidence that document categorization benefits from the application o ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
(Show Context)
This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization (a.k.a. classification) of documents. Until now, little or no evidence has been found that document categorization benefits from the application of linguistic techniques. Classification algorithms using the most cleverly designed linguistic representations typically do no better than those that simply use the bag-of-words representation. Shallow linguistic techniques are used routinely, but their positive effect on accuracy is small at best. We have investigated the use of dependency triples as terms in document categorization, derived according to a dependency model based on the notion of aboutness. The documents are syntactically analyzed by a parser and transduced into dependency graphs, which in turn are unnested into dependency triples following the aboutness-based model. In the process, various normalizing transformations are applied to enhance recall. We describe a sequence of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a significant increase in the accuracy of document categorization.
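To make the representation concrete, here is a minimal, hypothetical sketch (not the authors' parser or aboutness model) of how a dependency analysis might be unnested into triple terms and merged with plain words into a single bag of terms; the toy sentence, edge list, and term format are invented for illustration.

```python
# Illustrative sketch only: unnesting a toy dependency graph into
# head-relation-dependent triple terms and merging them with word terms
# into one bag-of-terms representation with frequencies.
from collections import Counter

# A parsed sentence, hard-coded here as (head, relation, dependent) edges.
# A real system would obtain these from a dependency parser.
dependency_edges = [
    ("categorize", "subj", "system"),
    ("categorize", "obj", "document"),
    ("document", "mod", "patent"),
]

words = ["system", "categorize", "patent", "document"]

def unnest_to_triples(edges):
    """Flatten dependency edges into string-valued triple terms."""
    return [f"{head}#{rel}#{dep}" for head, rel, dep in edges]

# Bag of terms = word terms plus dependency-triple terms.
bag_of_terms = Counter(words) + Counter(unnest_to_triples(dependency_edges))
print(bag_of_terms)
```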
Patent classification experiments with the Linguistic Classification System LCS
"... Abstract. In the context of the CLEF-IP 2010 classification task, we conducted a series of experiments with the Linguistic Classification System (LCS). We compared two document representations for patent abstracts: a bag-of-words representation and a syntactic/semantic representation containing both ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
In the context of the CLEF-IP 2010 classification task, we conducted a series of experiments with the Linguistic Classification System (LCS). We compared two document representations for patent abstracts: a bag-of-words representation and a syntactic/semantic representation containing both words and dependency triples. We evaluated two types of output: using a fixed cut-off on the ranking of the classes and using a flexible cut-off based on a threshold on the classification scores. Using the Winnow classifier, we obtained an improvement in classification scores when triples are added to the bag of words. However, our results are remarkably better on a held-out subset of the target data than on the 2,000-topic test set. The main findings of this paper are: (1) adding dependency triples to words has a positive effect on classification accuracy, and (2) selecting classes by a threshold on the classification scores instead of returning a fixed number of classes per document improves classification scores while at the same time lowering the number of classes that need to be judged manually by the professionals at the patent office.
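The two cut-off strategies compared above can be illustrated with a small sketch; the class labels, scores, and threshold below are invented, and the real LCS thresholding is more involved.

```python
# Sketch of the two class-selection strategies: a fixed rank cut-off
# versus a threshold on the classification scores.

def fixed_cutoff(scores, k):
    """Return the k highest-scoring classes, regardless of their scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cls for cls, _ in ranked[:k]]

def threshold_cutoff(scores, threshold):
    """Return every class whose score reaches the threshold (possibly none)."""
    return [cls for cls, score in scores.items() if score >= threshold]

# Made-up scores for three IPC classes of one document.
scores = {"A61K": 2.3, "C07D": 1.1, "G06F": 0.4}
print(fixed_cutoff(scores, k=2))        # always returns two classes
print(threshold_cutoff(scores, 1.0))    # only confidently assigned classes
```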
Improving on-line learning
- Department of Computer Science, Rutgers University
, 2007
"... and approved by ..."
Text Representations for Patent Classification
, 2013
"... With the increasing rate of patent application filings, automated patent classification is of rising economic importance. This article investigates how patent classification can be improved by using different representations of the patent documents. Using the Linguistic Classification System (LCS), ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
With the increasing rate of patent application filings, automated patent classification is of rising economic importance. This article investigates how patent classification can be improved by using different representations of the patent documents. Using the Linguistic Classification System (LCS), we compare the impact of adding statistical phrases (in the form of bigrams) and linguistic phrases (in two different dependency formats) to the standard bag-of-words text representation on a subset of 532,264 English abstracts from the CLEF-IP 2010 corpus. In contrast to previous findings on classification with phrases in the Reuters-21578 data set, for patent classification the addition of phrases results in significant improvements over the unigram baseline. The best results were achieved by combining all four representations, and the second best by combining unigrams and lemmatized bigrams. This article includes extensive analyses of the class models (a.k.a. class profiles) created by the classifiers in the LCS framework, to examine which types of phrases are most informative for patent classification. It appears that bigrams contribute most to improvements in classification accuracy. Similar experiments were performed on subsets of French and German abstracts to investigate the generalizability of these findings.
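As an illustration of the representations being compared, the following sketch builds a combined unigram-plus-lemmatized-bigram term list; the toy lookup-table lemmatizer and the underscore term format are assumptions, not the LCS pipeline.

```python
# Minimal sketch of combining unigram terms with lemmatized bigram terms,
# one of the text representations compared in the article.

# Toy lemma lookup; a real pipeline would use a proper lemmatizer.
TOY_LEMMAS = {"devices": "device", "comprising": "comprise", "sensors": "sensor"}

def lemmatize(token):
    return TOY_LEMMAS.get(token.lower(), token.lower())

def unigrams_and_bigrams(tokens):
    unigrams = [lemmatize(t) for t in tokens]
    bigrams = [f"{a}_{b}" for a, b in zip(unigrams, unigrams[1:])]
    return unigrams + bigrams

print(unigrams_and_bigrams(["Devices", "comprising", "optical", "sensors"]))
# ['device', 'comprise', 'optical', 'sensor',
#  'device_comprise', 'comprise_optical', 'optical_sensor']
```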
Automatic Discovery of Technology Trends from Patent Text
"... Patent text is a rich source to discover technological progresses, useful to understand the trend and forecast upcoming advances. For the importance in mind, several researchers have attempted textual-data mining from patent documents. However, previous mining methods are limited in terms of readabi ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Patent text is a rich source for discovering technological progress, useful for understanding trends and forecasting upcoming advances. With this importance in mind, several researchers have attempted textual-data mining from patent documents. However, previous mining methods are limited in terms of readability, domain expertise, and adaptability. In this paper, we first formulate the task of technological trend discovery and propose a method for discovering such trends. We complement a probabilistic approach by adopting linguistic clues and propose an unsupervised procedure to discover technological trends. Based on the experiments, our method is promising not only in its accuracy, 77% in R-precision, but also in its functionality and the novelty of discovering meaningful technological trends.
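Since the abstract reports accuracy as R-precision, a small worked example of that measure may help; the ranked trends and relevance judgements below are invented for illustration.

```python
# R-precision: precision at rank R, where R is the number of relevant
# items for the query.

def r_precision(ranked, relevant):
    r = len(relevant)
    retrieved = ranked[:r]                       # top R results
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / r

ranked_trends = ["trend_a", "trend_b", "trend_c", "trend_d"]
relevant_trends = {"trend_a", "trend_c", "trend_d"}   # R = 3
print(r_precision(ranked_trends, relevant_trends))    # 2/3 ≈ 0.67
```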
A framework of automatic subject term assignment for text categorization: An indexing conception-based approach
- Journal of the American Society for Information Science and Technology
, 2010
"... The purpose of this study is to examine whether the understandings of subject-indexing processes con-ducted by human indexers have a positive impact on the effectiveness of automatic subject term assign-ment through text categorization (TC). More specifically, human indexers ’ subject-indexing appro ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The purpose of this study is to examine whether the understandings of subject-indexing processes conducted by human indexers have a positive impact on the effectiveness of automatic subject term assignment through text categorization (TC). More specifically, human indexers' subject-indexing approaches, or conceptions, in conjunction with semantic sources were explored in the context of a typical scientific journal article dataset. Based on the premise that subject-indexing approaches or conceptions with semantic sources are important for automatic subject term assignment through TC, this study proposed an indexing conception-based framework. For the purpose of this study, two research questions were explored: To what extent are semantic sources effective? To what extent are indexing conceptions ...
Phrase-based Document Categorization
"... takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. In Information Retrieval, until now there was found little or no evidence th ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
This work takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. In Information Retrieval, little or no evidence has so far been found that document categorization benefits from the application of linguistic techniques. Classification algorithms using the most cleverly designed linguistic representations typically did not perform better than those that simply use the bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
Signalling Events in Text Streams
"... Summary. With the rise of Web 2.0 vast amounts of textual data are generated every day. Micro-blogging streams and forum posts are ideally suited for signalling suspicious events. We are investigating the use of classification techniques for recognition of suspicious passages. We present CBSSearch, ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
With the rise of Web 2.0, vast amounts of textual data are generated every day. Micro-blogging streams and forum posts are ideally suited for signalling suspicious events. We are investigating the use of classification techniques for the recognition of suspicious passages. We present CBSSearch, an experimental environment for text-stream analysis. Our aim is to develop an end-to-end solution for creating models of events and their application within forensic analysis of text streams.
LCI-INSA Linguistic Experiment for CLEF-IP Classification Track
"... Abstract. We present the experiment the LCI group has performed to prepare our submission to CLEF-IP Classification Track. In this preliminary experiment we used a part of the available target documents as test set and the rest as train set. We describe the systems AGFL used for extracting these tri ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
We present the experiment the LCI group performed to prepare our submission to the CLEF-IP Classification Track. In this preliminary experiment we used a part of the available target documents as the test set and the rest as the training set. We describe the AGFL system used for extracting dependency triples and the LCS system used for classification with the Winnow algorithm. We show that the use of linguistic triples in place of bags of words improves the accuracy, as does the use of the names and addresses of the applicants. We found that using the complete descriptions as bags of words does not really perform better than using only abstracts and titles. Some simple mathematics shows that the official measures are redundant and that ...
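For readers unfamiliar with Winnow, the sketch below shows the classic mistake-driven multiplicative update rule in its simplest positive, binary form; the LCS uses a balanced, multi-class variant with its own parameter settings, so this is an illustration rather than the system's implementation.

```python
# Generic Winnow sketch: multiplicative weight updates on active features,
# applied only when the classifier makes a mistake.

class Winnow:
    def __init__(self, features, threshold=1.0, promotion=2.0, demotion=0.5):
        self.w = {f: 1.0 for f in features}   # weights start at 1
        self.theta = threshold
        self.alpha = promotion                # factor for missed positives
        self.beta = demotion                  # factor for false positives

    def score(self, doc_features):
        return sum(self.w.get(f, 0.0) for f in doc_features)

    def predict(self, doc_features):
        return self.score(doc_features) >= self.theta

    def update(self, doc_features, label):
        """Promote or demote the active features after a mistake."""
        if self.predict(doc_features) == label:
            return
        factor = self.alpha if label else self.beta
        for f in doc_features:
            if f in self.w:
                self.w[f] *= factor
```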