Quantifying the Challenges in Parsing Patent Claims
- In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), 2010
Cited by 9 (5 self)
In this paper, we aim to verify and quantify the challenges of patent claim processing that have been identified in the literature. We focus on the following three challenges that, judging from the number of mentions in papers concerning patent analysis and patent retrieval, are central to patent claim processing: (1) Sentences are much longer than in general language use; (2) Patent claims introduce many novel terms that are difficult to understand; (3) The syntactic structure of patent claims is complex. We find that the challenges related to syntactic structure are much more problematic than the challenges at the vocabulary level. The sentence length issue only causes problems indirectly, by resulting in more structural ambiguities for longer noun phrases.
Patent classification experiments with the Linguistic Classification System LCS
Cited by 6 (1 self)
In the context of the CLEF-IP 2010 classification task, we conducted a series of experiments with the Linguistic Classification System (LCS). We compared two document representations for patent abstracts: a bag-of-words representation and a syntactic/semantic representation containing both words and dependency triples. We evaluated two types of output: using a fixed cut-off on the ranking of the classes and using a flexible cut-off based on a threshold on the classification scores. Using the Winnow classifier, we obtained an improvement in classification scores when triples are added to the bag of words. However, our results are remarkably better on a held-out subset of the target data than on the 2000-topic test set. The main findings of this paper are: (1) adding dependency triples to words has a positive effect on classification accuracy and (2) selecting classes by using a threshold on the classification scores instead of returning a fixed number of classes per document improves classification scores while at the same time lowering the number of classes that need to be judged manually by the professionals at the patent office.
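The two output types compared in this abstract can be illustrated with a minimal sketch. It is not the LCS implementation: the function names, the class labels, and the score values are all hypothetical, and the scores are assumed to be "higher is better".

```python
from typing import Dict, List


def select_fixed(scores: Dict[str, float], k: int) -> List[str]:
    """Fixed cut-off: return the k highest-scoring classes."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return ranked[:k]


def select_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Flexible cut-off: return every class whose score reaches the threshold."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return [c for c in ranked if scores[c] >= threshold]


# Hypothetical classifier scores for one patent abstract.
scores = {"A61K": 1.8, "C07D": 1.2, "G06F": 0.3, "H04L": 0.1}

print(select_fixed(scores, 3))        # always three classes per document
print(select_threshold(scores, 1.0))  # only the classes scoring at least 1.0
```

With these made-up scores the threshold variant returns fewer classes than the fixed cut-off, which is the kind of reduction in manual judging effort the abstract refers to.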
A resource-light approach to phrase extraction for English and German documents from the patent domain and user generated content
- In: Eighth International Conference on Language Resources and Evaluation (LREC), 2012
Cited by 2 (1 self)
In order to extract meaningful phrases from corpora (e.g. in an information retrieval context), intensive knowledge of the domain in question and the respective documents is generally needed. When moving to a new domain or language, the underlying knowledge bases and models need to be adapted, which is often time-consuming and labor-intensive. This paper addresses the described challenge of phrase extraction from documents in different domains and languages and proposes an approach that does not use comprehensive lexica and can therefore be easily transferred to new domains and languages. The effectiveness of the proposed approach is evaluated on user generated content and documents from the patent domain in English and German. Keywords: multilingual phrase extraction, shallow parsing, cross-language information retrieval, opinion mining
Phrase-based Document Categorization
Cited by 2 (1 self)
takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. In Information Retrieval, little or no evidence has been found until now that document categorization benefits from the application of linguistic techniques. Classification algorithms using the most cleverly designed linguistic representations typically did not perform better than those using simply the bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
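A minimal sketch of what adding triples to the words in a bag-of-terms representation could look like is given here. It assumes the dependency triples have already been produced by a parser. The (head, relation, modifier) layout, the relation labels, and the `bag_of_terms` helper are illustrative assumptions, not the format used by the authors.

```python
from collections import Counter
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # hypothetical layout: (head, relation, modifier)


def bag_of_terms(words: Iterable[str], triples: Iterable[Triple]) -> Counter:
    """Count plain words and dependency triples in a single term bag."""
    terms = Counter(w.lower() for w in words)
    # Each triple becomes one composite term, e.g. "pressure|ATTR|air".
    terms.update("|".join(t) for t in triples)
    return terms


# Hand-written example input; a real system would obtain the triples from a parser.
words = ["The", "device", "measures", "air", "pressure"]
triples = [("device", "SUBJ", "measure"), ("pressure", "ATTR", "air")]

bag = bag_of_terms(words, triples)
print(bag)
```

Keeping the triples as extra terms next to the words, rather than replacing the words, mirrors the combined representation the abstract reports as beneficial.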
Genre and Domain in Patent Texts
Cited by 2 (0 self)
In this paper we investigate the variation in language use within the very broad patent domain. We find that language use (represented by syntactic phrases) not only differs from one patent class to the next, but is also a characteristic that sets apart the four sections of a patent (viz. Title, Abstract, Description and Claims). This lends support to the claim that these sections can be viewed as different text genres. For the development of a syntactic parser that is trained on patent texts, we quantify the domain and genre differences in terms of the amounts of text needed to train domain-dependent versions of the parser. Our quantified and exemplified findings on the domain variation in patent data are of interest for the patent retrieval and analysis communities.
Phrases or Terms? The Impact of Different Query Types
Cited by 2 (1 self)
At CLEF 2010, the University of Hildesheim took part in the Intellectual Property Track, which for the first time provided two separate tasks: the prior art candidate search and the classification task. We focused on the first one, whose aim was to identify patent documents that state prior art of an invention. The University of Hildesheim submitted four monolingual English runs using term as well as phrase queries. Each of the experiments made use of the small topic set. With the help of the aforementioned runs and additional post runs, we tried to investigate the impact of phrase queries in contrast to simple terms. Compared to last year's results, there seemed to be some improvements, especially in the P@5 values, which could be an effect of the implemented Okapi algorithm.
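For readers unfamiliar with the P@5 measure mentioned in this abstract, the following sketch computes precision at rank k for one topic. The document identifiers and relevance judgments are invented for illustration and are not taken from the CLEF-IP data.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Hypothetical ranking for one topic and its hypothetical prior-art documents.
ranked = ["EP1", "EP2", "EP3", "EP4", "EP5", "EP6"]
relevant = {"EP2", "EP5", "EP9"}

p_at_5 = precision_at_k(ranked, relevant, 5)
print(p_at_5)
```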
LCI-INSA Linguistic Experiment for CLEF-IP Classification Track
Cited by 1 (0 self)
We present the experiment the LCI group has performed to prepare our submission to the CLEF-IP Classification Track. In this preliminary experiment we used a part of the available target documents as test set and the rest as train set. We describe the two systems involved: AGFL, used for extracting linguistic triples, and the LCS, used for classification by the Winnow algorithm. We show that the use of linguistic triples in place of bags of words improves the accuracy, as does using the names and addresses of the applicants. We found that using the complete descriptions as bags of words does not really perform better than using only abstracts and titles. Some simple mathematics show that the official measures are redundant and that
Query Phrase Expansion Using Wikipedia in Patent Class Search
Cited by 1 (1 self)
Relevance Feedback methods generally suffer from topic drift caused by word ambiguity and synonymous uses of words. As a way to alleviate this inherent problem, we propose a novel query phrase expansion approach utilizing semantic annotations in Wikipedia pages, trying to enrich queries with context-disambiguating phrases. Focusing on the patent domain, especially on patent search where patents are classified into a hierarchy of categories, we attempt to understand the roles of phrases and words in query expansion in determining the relevance of documents and examine their contributions to alleviating the query drift problem. Our approach is compared against Relevance Model, a state-of-the-art method, to show its superiority in terms of MAP on all levels of the classification hierarchy.
Insight to Hyponymy Lexical Relation Extraction in the Patent Genre Versus Other Text Genres
Cited by 1 (0 self)
Due to the large amount of available patent data, it is no longer feasible for industry actors to manually create their own terminology lists and ontologies. Furthermore, domain-specific thesauruses are rarely accessible to the research community. In this paper we present extraction of hyponymy lexical relations conducted on patent text using lexico-syntactic patterns. We explore the lexico-syntactic patterns. Since this kind of extraction involves Natural Language Processing, we also compare the extractions made with and without domain adaptation of the extraction pipeline. We also deployed our modified extraction method to other text genres in order to demonstrate the method's portability to other text domains. From our study we conclude that the lexico-syntactic patterns are portable to domain-specific text genres such as the patent genre. We observed that general Natural Language Processing tools, when not adapted to the patent genre, reduce the amount of correct hyponymy lexical relation extractions and increase the number of incomplete extractions. This was also observed in other domain-specific text genres.
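To make the idea of a lexico-syntactic pattern concrete, here is a minimal sketch built on one classic pattern, "X such as Y, Z and W". The abstract does not say which patterns the authors used, so the pattern choice, the regular expressions, and the `extract_hyponyms` helper are assumptions. The sketch also simplifies heavily by treating the hypernym as a single word and by working on raw text without any parsing.

```python
import re
from typing import List, Tuple

# One hypernym word, an optional comma, "such as", then a list up to the next "." or ";".
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")
# List items are separated by commas, "and", or "or".
SEPARATOR = re.compile(r",\s*|\s+and\s+|\s+or\s+")


def extract_hyponyms(sentence: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for item in SEPARATOR.split(match.group(2)):
            item = item.strip()
            if item:
                pairs.append((item, hypernym))
    return pairs


# Invented example sentence in a patent-like style.
sentence = "The housing is made of metals such as copper, aluminium and steel."
pairs = extract_hyponyms(sentence)
print(pairs)
```

A surface pattern like this has no notion of noun phrase boundaries, so long patent noun phrases would easily yield the incomplete extractions the abstract mentions.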