Old and new challenges in automatic plagiarism detection
- National Plagiarism Advisory Service, 2003; http://ir.shef.ac.uk/cloughie/index.html
, 2003
Cited by 11 (0 self)
Automatic methods of measuring similarity between program code and natural language text pairs have been used for many years to assist humans in detecting plagiarism. For example, over the past thirty years or so, a vast number of approaches have been proposed for detecting likely plagiarism between programs written by Computer Science students. More recently, work has turned to identifying similarities between natural language texts, but given the greater ambiguity and complexity of natural languages compared with programming languages, this task is very difficult. Automatic detection is gaining further interest from both the academic and commercial worlds given the ease with which texts can now be found, copied and rewritten. Following the recent increase in the popularity of on-line services offering plagiarism detection and the increased publicity surrounding cases of plagiarism in academia and industry, this paper explores the nature of the plagiarism problem and, in particular, summarises the approaches used so far for its detection. I focus on plagiarism detection in natural language, and discuss a number of methods I have used to measure text reuse. I end with a number of recommendations for further work in the field of automatic plagiarism detection.
A Comparative Study of Language Models for Book and Author Recognition
- In Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05)
, 2005
Cited by 8 (0 self)
Linguistic information can help improve evaluation of similarity between documents; however, the kind of linguistic information to be used depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and accurately identify individual books more than 76% of the time. In comparison, baseline features, e.g., tfidf-weighted keywords, function words, etc., give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task; syntactic structures vary even among the works of the same author whereas features such as function words are distributed more similarly among the works of an author and can more effectively capture authorship.
Using empirical methods for evaluating expression and content similarity
- In 37th Hawaiian International Conference on System Sciences (HICSS-37). IEEE Computer Society
, 2004
Cited by 6 (3 self)
Despite the lack of any significant quantifiable similarities between documents, people can intuitively compare documents and evaluate their similarity. To understand how people evaluate text similarity, we queried subjects about the level of content similarity and expression similarity of pairs of documents. Using these similarity judgments as ground truth, we automated the evaluation of similarity. Our main application for automatic evaluation of text similarity is copyright infringement detection. United States copyright law protects expression but not the underlying facts and ideas being expressed. Therefore, we focus on recognizing similarity of expression. We envision a scenario where authors present the system with a document and the system replies with documents that share the same expressive characteristics. We hypothesize that, since content and expression are not independent of each other, accurate recognition of expression similarity will also help recognition of content similarity. The experiments presented in this paper evaluate two sets of features, unigrams and style features, with respect to their ability to recognize similarities in content and expression of documents using the ground truth obtained from user experiments. Our results show that, on our data set of short news articles, stylistic features predict similarity of expression more accurately than tf*idf weighted unigrams. While unigrams can identify high-level content similarities between documents about the same people, topics and events, they are less effective than style features in evaluating finer grained content similarities.
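For readers unfamiliar with the unigram baseline mentioned in this abstract, the following is a minimal sketch of comparing documents by tf*idf weighted unigrams and cosine similarity. It is not the authors' implementation: the tokenizer, the smoothed idf formula, and the names `tokenize`, `tfidf_vectors` and `cosine` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed idf (an assumption; other variants exist), always positive.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "The senate passed the budget bill on Tuesday.",
        "On Tuesday the budget bill was passed by the senate.",
        "Heavy rain fell on the coast all night.",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]))  # two reports of the same event
    print(cosine(vecs[0], vecs[2]))  # unrelated story
```

A measure like this reflects shared vocabulary only, which is consistent with the abstract's observation that unigrams capture high-level content overlap better than similarity of expression.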
Segmenting a Document By Stylistic Character
- In Workshop on computational
, 2003
Cited by 6 (0 self)
As part of a larger project to develop an aid for writers that would help to eliminate stylistic inconsistencies within a document, we experimented with neural networks to find the points in a text at which its stylistic character changes. Our best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences. Low-level and vocabulary-based features were not found to be useful.
Content and Expression-Based Copy Recognition for Intellectual Property Protection
- In the Proceedings of the 3rd ACM Workshop on Digital Rights Management (DRM'03)
, 2003
Cited by 5 (3 self)
Protection of the copyrights and revenues of content owners in the digital world has been gaining importance in recent years. This paper presents a way of fingerprinting text documents that can be used to identify content and expression similarities in documents, as a way of facilitating the tracking of digital copies of works, to ensure proper compensation to content owners.
Shallow Text Analysis and Machine Learning for Authorship Attribution
Cited by 3 (0 self)
Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, token-based (e.g., sentence length) and lexical features (e.g., vocabulary richness) can be kept roughly constant over the different authors. This allows us to focus on the use of syntax-based features as possible predictors of an author’s style, as well as on those token-based features that are predictive of author style more than of topic or register. These style characteristics are not under the author’s conscious control and are therefore good clues for Authorship Attribution. Machine Learning methods (TiMBL and the WEKA software package) are used to select informative combinations of syntactic, token-based and lexical features and to predict authorship of unseen documents. The combination of these features can be considered an implicit profile that characterizes the style of an author.
Automatic Detection of Authorship Changes within Single Documents
, 2000
Cited by 2 (0 self)
One of the most difficult tasks facing anyone who must compile or maintain any large, collaboratively-written document is to foster a consistent style throughout. In this thesis, we explore whether it is possible to identify stylistic inconsistencies within documents even in principle, given our understanding of how style can be captured statistically. We carry out
A Classifier System for Author Recognition Using Synonym-Based Features
Cited by 2 (0 self)
The writing style of an author is a phenomenon that computer scientists and stylometrists have modeled in the past with some success. However, due to the complexity and variability of writing styles, simple models often break down when faced with real world data. Thus, current trends in stylometry often employ hundreds of features in building classifier systems. In this paper, we present a novel set of synonym-based features for author recognition. We outline a basic model of how synonyms relate to an author’s identity and then build an additional two models refined to meet real world needs. Experiments show strong correlation between the presented metric and the writing style of four authors, with the second of the three models outperforming the others. As modern stylometric classifier systems demand increasingly large feature sets, this new set of synonym-based features will serve to fill this ever-increasing need.
Subjectivity in Stylistic Assessment
- Text Technology
, 2000
Cited by 1 (1 self)
This paper was presented as addressing issues related to computational stylistics and its application to the design of writing tools intended to eliminate problems with style in collaboratively written texts. The results of the experiment with respect to the effect of authorship suggest that there are some limitations in using authorial stylostatistical tests to predict a reader's impression of a text's style. Additionally, sweeping predictive statements about a text's stylistic effect on a reader audience should be made cautiously, since a group of readers might not share homogeneous stylistic judgements. Although the stylistic assessments of our subjects were found to be similar, they varied enough to show that subjectivity does exist. The subjectivity itself is an interesting property of both the text and the readers.

