Supervisors: Second Examiner:
, 2012
Every year, the patent filing rates at the different patent offices around the world increase, and the patent examiners are struggling to catch up. As such, reliably categorizing patents to aid the examination process is of rising economic importance. This paper investigates whether capturing multiword expressions (specifically: institutionalized phrases) is an important factor for improving automated patent pre-classification. To do so we describe a novel text representation based on filtering by combinations of Part-of-Speech tags: typed skipgrams. We then compare the performance of different text representations (unigrams, bigrams, skipgrams, typed skipgrams, and unigrams in combination with any of the others) when classifying a subset of the CLEF-IP 2010 corpus. We examine whether there is a link between classification accuracy and the ability to capture multiword expressions. We furthermore carry out additional experiments and analyses to investigate the influence of specific combinations of Parts-of-Speech on the overall result. We find that typed skipgrams in combination with unigrams perform significantly better (difference in F1 value: 0.7%) than the unigrams+bigrams baseline. We also find that typed skipgrams succeed in capturing multiword expressions and that typed skipgrams consisting of noun-noun and noun-adjective combinations are the most important factor for the overall success. Finally, we conclude that capturing multiword expressions is the crucial mechanism behind the improvement in classification scores. This research provides a potential means to further the state-of-the-art in patent classification when combined with additional optimizations. It also gives directions for future research, highlighting typed skipgrams and filtering for multiword expressions as viable paths. Finally, we expect our results to generalize to every kind of text in which multiword expressions play an important role. Examples are scientific abstracts and, more generally, technical texts.
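The abstract above describes typed skipgrams only as skipgrams filtered by combinations of Part-of-Speech tags. A minimal sketch of that idea might look as follows. The function name `typed_skipgrams`, the tag labels (`NOUN`, `ADJ`, `ADP`), the restriction to word pairs, and the `max_skip` window are all assumptions made for illustration, not details taken from the thesis.

```python
from typing import List, Set, Tuple

Tagged = Tuple[str, str]  # (token, POS tag); the tag names are illustrative


def typed_skipgrams(
    tagged: List[Tagged],
    allowed: Set[Tuple[str, str]],
    max_skip: int = 2,
) -> List[str]:
    """Return word pairs at most `max_skip` tokens apart whose
    (first tag, second tag) combination is in `allowed`."""
    grams = []
    for i, (w1, t1) in enumerate(tagged):
        # j ranges over the next 1 + max_skip positions after i
        last = min(i + 2 + max_skip, len(tagged))
        for j in range(i + 1, last):
            w2, t2 = tagged[j]
            if (t1, t2) in allowed:
                grams.append(f"{w1}_{w2}")
    return grams


# Hypothetical pre-tagged input; any POS tagger could supply the tags.
sentence = [
    ("optical", "ADJ"), ("fiber", "NOUN"), ("of", "ADP"),
    ("glass", "NOUN"), ("cable", "NOUN"),
]
pairs = {("NOUN", "NOUN"), ("ADJ", "NOUN")}
print(typed_skipgrams(sentence, pairs, max_skip=1))
```

Filtering on the tag pair is what distinguishes this from plain skipgrams: pairs involving the preposition are dropped, while noun-noun pairs that straddle it are kept.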
Comparative Analysis of Balanced Winnow and SVM in Large Scale Patent Categorization
This study investigates the effect of training different categorization algorithms on a corpus that is significantly larger than those reported in experiments in the literature. By means of machine learning techniques, a collection of 1.2 million patent applications is used to build a classifier that is able to classify documents with varyingly large feature spaces into the International Patent Classification (IPC) system at Subclass level. The two algorithms that are compared are Balanced Winnow and Support Vector Machines (SVMs). Unlike SVM, Balanced Winnow is frequently applied in today’s patent categorization systems. Results show that SVM outperforms Winnow considerably on all four document representations that were tested. While Winnow results on the smallest sub-corpus do not necessarily hold for the full corpus, SVM results are more robust: they show smaller fluctuations in accuracy when smaller or larger feature spaces are used. The parameter tuning that was carried out for both algorithms confirms this result. Although SVM experiments must be tuned to optimize either recall or precision (whereas both can be optimized together when Winnow is used), effective parameter settings obtained on a small corpus can be used for training on a larger corpus. Categories and Subject Descriptors: H.3.3 [Information storage and retrieval]: Information search and retrieval (clustering, information filtering, retrieval models, search process, selection process)
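For readers unfamiliar with Balanced Winnow, the following is a minimal sketch of the textbook binary form of the algorithm: two positive weight vectors updated multiplicatively on mistakes. It is not the implementation used in the study. The initial weights, the promotion and demotion factors `alpha` and `beta`, the threshold `theta`, and the toy data are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

Example = Tuple[Set[int], int]  # (indices of active binary features, label +1/-1)


def score(active: Set[int], pos: List[float], neg: List[float], theta: float) -> float:
    """Net weight of the active features minus the threshold."""
    return sum(pos[i] - neg[i] for i in active) - theta


def train_balanced_winnow(
    examples: List[Example],
    n_features: int,
    alpha: float = 1.5,   # promotion factor (> 1), assumed value
    beta: float = 0.5,    # demotion factor (between 0 and 1), assumed value
    theta: float = 1.0,   # decision threshold, assumed value
    epochs: int = 10,
) -> Tuple[List[float], List[float]]:
    pos = [2.0] * n_features  # positive weight vector
    neg = [1.0] * n_features  # negative weight vector
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            predicted = 1 if score(active, pos, neg, theta) > 0 else -1
            if predicted == label:
                continue  # mistake-driven: update only on errors
            mistakes += 1
            # Missed positive: promote pos, demote neg. Missed negative: the reverse.
            up, down = (alpha, beta) if label == 1 else (beta, alpha)
            for i in active:
                pos[i] *= up
                neg[i] *= down
        if mistakes == 0:
            break
    return pos, neg


# Toy data: feature 0 marks the positive class, feature 1 the negative class,
# feature 2 occurs in every document.
data = [({0, 2}, 1), ({1, 2}, -1)]
pos_w, neg_w = train_balanced_winnow(data, n_features=3)
print(score({0, 2}, pos_w, neg_w, 1.0) > 0, score({1, 2}, pos_w, neg_w, 1.0) > 0)
```

Because updates touch only the features active in a misclassified document, each update costs time proportional to the document length rather than to the size of the feature space.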
Patent Mining: A Survey
Patent documents are important intellectual resources for protecting the interests of individuals, organizations and companies. Unlike general web documents, patent documents have a well-defined format including front page, description, claims, and figures. However, they are lengthy and rich in technical terms, which requires enormous human effort to analyze. Hence, a new research area, called patent mining, has emerged in recent years, aiming to assist patent analysts in investigating, processing, and analyzing patent documents. Despite the recent advances in patent mining, it is still far from being well explored in research communities. To help patent analysts and interested readers obtain a big picture of patent mining, we thus provide a systematic summary of existing research efforts along this direction. In this survey, we first present an overview of the technical trend in patent mining. We then investigate multiple research questions related to patent documents, including patent retrieval, patent classification, and patent visualization, and provide summaries and highlights for each question by delving into the corresponding research efforts.
Information Foraging Lab
We report the results of a series of classification experiments with the Linguistic Classification System (LCS) in the context of CLEF-IP 2011. We participated in the main classification task: classifying documents on the subclass level. We investigated (1) the use of different sections (abstract, description, metadata) from the patent documents; (2) adding dependency triples to the bag-of-words representation; (3) adding the WIPO corpus to the EPO training data; (4) the use of patent citations in the test data for reranking the classes; and (5) the threshold on the class scores for class selection. We found that adding full descriptions to abstracts gives a clear improvement; adding the first 400 words of the description also improves classification, but to a lesser degree. Adding metadata (applicants, inventors and address) did not improve classification. Adding dependency triples to words gives a much higher recall at the cost of a lower precision, but this effect is largely due to the class selection threshold. We did not find an effect from adding the WIPO corpus, nor from reranking with patent citations. In future work, we plan to investigate whether there are other methods for reranking with patent citations that do give an improvement, because we feel that the citations may still give valuable information. Our most important finding, however, is the importance of the threshold on the class selection. For the current work, we only compared two values for the threshold, and the results are much better for 1.0 than for 0.5. The 0.5 threshold gives higher recall in all runs, which was the original motivation for submitting runs with a lower threshold. However, because of the much lower precision, the F-scores are lower. We think that there is still some improvement to be gained from proper tuning of the class selection threshold, and the use of a flexible threshold (also taking into account the different text representations). This is part of our future work.
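The precision/recall trade-off that the abstract attributes to the class selection threshold can be illustrated with a small sketch. The class scores and gold labels below are invented toy values, not results from the paper, and the rule "select every class whose score is at least the threshold" is an assumption about how class selection works.

```python
from typing import Dict, Set, Tuple


def select_classes(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep every class whose score reaches the threshold (assumed selection rule)."""
    return {label for label, s in scores.items() if s >= threshold}


def precision_recall_f1(selected: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    true_pos = len(selected & gold)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented scores for one document; the labels are merely IPC-style subclass codes.
toy_scores = {
    "A61K": 1.6, "C07D": 1.2, "A61P": 0.7,               # correct classes
    "G06F": 0.9, "H04L": 0.8, "B01J": 0.6, "C12N": 0.55,  # incorrect classes
}
toy_gold = {"A61K", "C07D", "A61P"}

for threshold in (1.0, 0.5):
    p, r, f = precision_recall_f1(select_classes(toy_scores, threshold), toy_gold)
    print(f"threshold={threshold}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy case the lower threshold recovers the third correct class but also admits four incorrect ones, so recall rises while precision and F1 fall, which is the pattern the abstract reports.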
Combining document representations for prior-art retrieval
In this paper we report on our participation in the CLEF-IP 2011 prior art retrieval task. We investigated whether adding syntactic information in the form of dependency triples to a bag-of-words representation could lead to improvements in patent retrieval. In our experiments, we investigated this effect on the title, abstract and first 400 words of the description section. The experiments were conducted in the Spinque framework, with which we tried to optimize the combinations of text representation and document sections. We found that adding triples did not improve overall MAP scores compared to the baseline bag-of-words approach, but did result in slightly higher set recall scores. In future work we will extend our experiments to use all the text sections of the patent documents and fine-tune the mixture weights.
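The abstract does not say how the Spinque framework combines the two representations. One common reading of "mixture weights" is a linear interpolation of per-document scores, sketched below under that assumption. The function name `mix_scores`, the single `weight` parameter, and the document scores are hypothetical.

```python
from typing import Dict, List


def mix_scores(bow: Dict[str, float], triples: Dict[str, float], weight: float) -> List[str]:
    """Rank documents by (1 - weight) * bag-of-words score + weight * triple score.

    Documents missing from one representation contribute 0.0 for it.
    Ties are broken by document id so the ranking is deterministic.
    """
    docs = set(bow) | set(triples)
    combined = {
        d: (1.0 - weight) * bow.get(d, 0.0) + weight * triples.get(d, 0.0)
        for d in docs
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


# Hypothetical retrieval scores for three candidate prior-art documents.
bow_scores = {"d1": 0.8, "d2": 0.5}
triple_scores = {"d2": 1.0, "d3": 0.4}

print(mix_scores(bow_scores, triple_scores, weight=0.0))  # bag-of-words only
print(mix_scores(bow_scores, triple_scores, weight=0.5))  # equal mixture
```

Setting `weight` to 0.0 reproduces the bag-of-words baseline, so sweeping it over a held-out set is one simple way to tune the mixture.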
Empirical Study on Citation Network-based Patent Classification
Knowledge management is essential to modern organizations. Due to the information overload problem, managers face critical challenges in utilizing the data in their organizations. Although several automated tools have been applied, previous applications often treat knowledge items as independent and use only their content, which may limit their analytical abilities. This study focuses on the process of knowledge evolution and proposes to incorporate this perspective into knowledge management tasks. Using a patent classification task as an example, we represent knowledge evolution processes with patent citations and introduce a labeled citation graph kernel to classify patents under a kernel-based machine learning framework. In the experimental study, our proposed approach shows more than 30 percent improvement in classification accuracy compared to traditional content-based methods. The approach can potentially affect the existing patent management procedures. Moreover, this research lends strong support to considering knowledge evolution processes in other knowledge management tasks.
Examiner
, 2011
complies with the regulations of this University and meets the accepted standards with respect to originality and quality.