Damashek, M., "Gauging Similarity with n-grams: Language-Independent Categorization of Text", Science, Vol. 267, pp. 843-848, 1995.
....features is still too high for most learning algorithms. For a further reduction, statistically motivated domain-independent methods can be used (e.g. frequency counts [1], mutual information [1, 10], or statistics [19]). An alternative to the classical word-based representations are n-grams [2, 4, 7]. The advantage of this technique is its language independence. Classifier construction. Many different learning models have been proposed for the classification of documents, including decision trees [20, 11], rule learning [1, 3, 16], Bayesian classifiers [14, 10, 11, 8], nearest-neighbor methods ....
M. Damashek. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, 267:843--848, February 1995.
....see [11] has been in use for many years, mainly in the field of speech processing. Fairly recently, this notion has attracted even more interest in other fields of natural language processing, as illustrated by the works of Greffenstette [9] on language identification and that of Damashek [7] on the processing of written text. Amongst other things, these researchers have shown that the use of N-grams instead of words as the basic unit of information does not lead to information loss. Examples of recent applications of N-grams include the work of Mayfield and McNamee [14] on indexation, ....
Damashek M. (1995). "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, 267, 843-848.
....domain independence should remain as the main goal, so the adjustments to the parameters should be done through the collection of statistics. The record matching algorithms need to be compared with other approximate string matching algorithms such as those proposed by Buss and Yianilos (1995) and Damashek (1995). In addition, other tools such as agrep [Wu and Manber, 1992] for approximate pattern matching and diff [Simon, 1989] for finding differences in files are relevant work which will be compared to the algorithms stated here. VI.A.2 Detecting approximate duplicate records. Future work on detecting ....
M. Damashek. Gauging similarity with n-grams: language independent categorization of text. Science, 267(5199):843--848, 1995.
.... cross over. 1.2 Background: Duplicate Detection. Broder et al. [BGMZ97] sought to detect copies of documents in a single language. They propose two document similarity scores, resemblance (r) and containment (c). In order to compute these scores, a document D is viewed as a set of shingles [Dam95] (a shingling), where a shingle is an n-gram type (i.e. a contiguous subsequence of length n) contained in D. For example, the trigram shingling of the next sentence is {denote the shingling, the shingling of, shingling of D, of D as, D as S(D)}. Denote the shingling of D as S(D) ....
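As a rough illustration of the shingling described in this excerpt, the Python sketch below builds the word-trigram shingle set for the example sentence. The function names `shingling`, `resemblance`, and `containment` are hypothetical, and the tokenization (lowercasing, splitting on whitespace) is an assumption. The two score formulas use the common set-overlap definitions (intersection over union, and intersection over the size of the first document's shingling), which the excerpt itself does not spell out.

```python
def shingling(text, n=3):
    """Return the set of word n-gram types (shingles) found in text."""
    words = text.lower().split()  # assumed tokenization
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a, b, n=3):
    """Assumed definition: |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingling(a, n), shingling(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, n=3):
    """Assumed definition: |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingling(a, n), shingling(b, n)
    return len(sa & sb) / len(sa) if sa else 0.0


example = shingling("Denote the shingling of D as S(D)")
```

Here `example` holds five trigram shingles, matching the five listed in the excerpt (up to case).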
Damashek, Marc (1995). Gauging Similarity with n-Grams: Language-Independent Categorization of Text. In Science 267, pp. 843--848.
....of how likely a medical history is. One approach to using context to determine the probability of a sequence is to use N-grams, as has been done in language modeling for speech recognition (Lee, 1989; Maltese and Mancini, 1992) and for searching large databases of text documents (Damashek, 1992; Damashek, 1995). For example, in order to determine the probability of the symbol sequence [B, G, Q, D], an algorithm might use stored knowledge about the probability of G following B, Q following G, and D following Q. These probabilities are called bigram probabilities because they use stored information about ....
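The bigram idea in this excerpt can be sketched as follows. This is a minimal illustration, not the cited authors' method: the function names and the toy training sequences are hypothetical, probabilities are plain maximum-likelihood estimates without smoothing, and the probability of the first symbol is ignored.

```python
from collections import Counter


def train_bigrams(sequences):
    """Count how often each symbol is followed by each other symbol."""
    context_counts = Counter()  # times a symbol appears with a successor
    bigram_counts = Counter()   # times a specific (prev, next) pair appears
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return context_counts, bigram_counts


def sequence_probability(seq, context_counts, bigram_counts):
    """Multiply P(next | prev) over adjacent pairs; 0.0 if a context is unseen."""
    prob = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, nxt)] / context_counts[prev]
    return prob


# Hypothetical training data, for illustration only.
training = [["B", "G", "Q", "D"], ["B", "G", "D"], ["B", "Q", "D"]]
contexts, bigrams = train_bigrams(training)
p = sequence_probability(["B", "G", "Q", "D"], contexts, bigrams)
```

With this toy data, P(G|B) = 2/3, P(Q|G) = 1/2, and P(D|Q) = 1, so `p` is their product, 1/3.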
Damashek, M. (1995). Gauging similarity with n-grams: Language-independent categorization of text. Science, 267, 843-848.
....We hypothesize that images from similar categories have similar texture features. In particular, we present a method for categorizing images using a new texture feature termed N×M-gram. This method is based on the N-gram technique that is used for determining similarity of text documents [2]. Our approach for categorizing images is based on extending the definition of N-grams to 2D. Intuitively, an N×M-gram is a small patch or pattern in an image. The hypothesis that we examine in this paper is that two images that have the same recurring patterns are likely to belong to the ....
.... $\frac{\sum_{j=0}^{T} f(j;A)\, f(j;B)}{\sqrt{\sum_{j=0}^{T} f(j;A)^2 \, \sum_{j=0}^{T} f(j;B)^2}}$ (1) $S_{dp}(A;B)$ is essentially the cosine of the angle between the N×M-gram vectors of images A and B; its range of values lies between -1 and 1, with 1 given for an exact match (i.e. identical images). In work on text documents [2], it was observed that the similarity can be improved by subtracting the average N-gram vector for a corpus of documents from the N-gram vector of the query and database documents. The average N-gram vector is simply the average frequency of each N-gram over all of the documents that are stored in ....
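The similarity in Equation 1 and the mean-subtraction step mentioned in this excerpt can be sketched in Python as below. The function names are hypothetical, and vectors are assumed to be equal-length lists of frequencies indexed by the same n-gram order; this is an illustration of the cosine measure, not the cited authors' code.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centered_cosine(u, v, mean):
    """Cosine after subtracting an average (corpus) vector from both inputs."""
    cu = [a - m for a, m in zip(u, mean)]
    cv = [b - m for b, m in zip(v, mean)]
    return cosine(cu, cv)
```

With non-negative frequencies the plain cosine cannot be negative; after the average vector is subtracted, components can change sign, so the score can fall anywhere between -1 and 1.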
M. Damashek. Gauging similarity with n-Grams: Language-independent categorization of text. Science, 267:843--848, February 1995.
....are different words but share two 5-grams, compu and omput. Uses of n-grams. N-grams were investigated for tasks related to information retrieval at least as early as 1979 (Suen [21]). Since then they have been investigated in such tasks as language identification (Damashek 1995 [6]; Sibun and Reynar 1996 [19]), spelling correction (Zamora et al. 1981 [23]; Salton 1989 [20]), document categorization (Huffman and Damashek 1994 [13]; Labrou and Finin 1999 [14]), document comparison (Damashek 1995 [6]), robust handling of noisy (misspelled, OCRed, etc.) texts (Grossman et al. ....
.... have been investigated in such tasks as language identification (Damashek 1995 [6]; Sibun and Reynar 1996 [19]), spelling correction (Zamora et al. 1981 [23]; Salton 1989 [20]), document categorization (Huffman and Damashek 1994 [13]; Labrou and Finin 1999 [14]), document comparison (Damashek 1995 [6]), robust handling of noisy (misspelled, OCRed, etc.) texts (Grossman et al. 1995 [11]; Pearce and Nicholas 1996 [18]; Pearce and Miller 1997 [5]), topic highlighting (Cohen 1995 [3]), document space visualization (Fox et al. 1999 [9]; Huffman 1995 [12]; Charoenkitkarn et al. 1994 [2]), spoken 3 ....
[Article contains additional citation context not shown here]
Damashek, M., Gauging similarity with n-grams: Language-independent categorization of text, Science, 267 (1995), pp. 843--848.
....dimensions and view documents and terms graphically in the feature space [5]. 2.3 N-gram processing. N-grams are overlapping n-character sequences of text in a document. In a typical document processing system using n-grams, a document is processed by sliding an n-character window across its text [12, 25]. During this process, all alphabetic characters are converted to lower case, non-alphabetic characters are converted to spaces, and multiple spaces are collapsed into a single space. For example, the first 5-grams in this sentence would consist of "for e", "or ex", "r exa", and so on. This ....
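The sliding-window procedure in this excerpt can be sketched in a few lines of Python. The function name `char_ngrams` is hypothetical, and "alphabetic" is assumed here to mean the ASCII letters a-z; any other character is turned into a space.

```python
import re


def char_ngrams(text, n=5):
    """Lowercase, map non-letters to spaces, collapse spaces, slide an n-char window."""
    normalized = re.sub(r"[^a-z]", " ", text.lower())  # non-alphabetic -> space
    normalized = re.sub(r" +", " ", normalized)        # collapse runs of spaces
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


first = char_ngrams("For example, the first 5 grams in this sentence")[:3]
```

For the example sentence, `first` is `["for e", "or ex", "r exa"]`, the same 5-grams the excerpt lists.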
....of the vector is calculated by $d_{i,k} = c_{i,k} / m_i$, where $c_{i,k}$ is the count of n-gram k in document i and $m_i = \sum_k c_{i,k}$ is the total number of n-grams in document i. While there are $27^n$ possible unique English n-grams, experimentation has shown that relatively few of them occur in any corpus [12]. For example, a 40 MB collection of articles from the Wall Street Journal has about 270 000 unique 5-grams (out of a possible total of 7.5 × 10^18) and this number increases very slowly as the corpus increases in size. N-gram processing tools are somewhat language independent because they do ....
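The normalization in this excerpt (each n-gram count divided by the document's total n-gram count) can be sketched as below. The function name is hypothetical, and a sparse dictionary is assumed as the vector representation, since only n-grams that actually occur need an entry.

```python
from collections import Counter


def ngram_frequency_vector(ngrams):
    """Map each n-gram k to c_k / m, where m is the total number of n-grams."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}


# Toy input, for illustration only.
vector = ngram_frequency_vector(["ab", "bc", "ab", "cd"])
```

The entries of `vector` sum to 1, and `vector["ab"]` is 0.5 because "ab" accounts for two of the four n-grams.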
Damashek, M.: Gauging similarity with N-Grams: Language-independent categorization of text. Science 267:843--848, 1995
....On the other hand, the validity of the Zipf law implies that the observation of a Zipf behavior is necessary in a natural text. Moreover, the analysis of the frequencies of the n-gram substrings of a text allows a language-independent categorization of topical similarity in unrestricted texts [14], so that a Zipf analysis may be useful for practical purposes. Another example of the usefulness of Zipf's approach is given in [15], where it is suggested that the distance between two Zipf plots of two different texts is shorter when the texts are written by the same author than when ....
M. Damashek, Gauging Similarity with n-Grams: Language-Independent Categorization of Text, Science 267, 843 (1995).
....methods. Frequency-based methods model the frequency distributions of various events. For the system call application, the events are occurrences of each pattern of system calls in a sequence. One example of a frequency-based method is the n-gram vector used to classify text documents [3]. Each document is represented by a vector that is a histogram of sequence frequencies. Each element corresponds to one sequence of length n (called an n-gram) and the value of the element is the normalized frequency with which the n-gram occurs in the document. Each histogram vector then ....
....until the program has terminated. It is also difficult to determine what size vector to use; the space of all possible sequences is much too large, and we cannot guarantee that the subset of sequences observed in traces of normal behavior is complete. Finally, the coarse clustering of documents in [3] does not suggest sufficient precision to discriminate between normal and intrusive traces of the same program. Other frequency-based methods examine sequences individually, making them suitable for on-line use. Determination of whether a sequence is likely to be anomalous is based on empirically ....
M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267:843--848, February 1995.
.... (Henrich, 1989; Ziegler, 1991; Souter et al., 1994) and particularly shaped words from images (Nakayama and Spitz, 1993; Sibun and Spitz, 1994). The frequency of character n-grams was used by Beesley (1988), Henrich (1989), Cavnar and Trenkle (1994), Dunning (1994), Souter et al. (1994), and Damashek (1995). A number of analytic techniques have been employed, ranging from completely manual (Ingle, 1976; Newman, 1987) to semiautomatic (Kulikowski, 1991) to fully automatic. Batchelder (1992) trained a neural network to distinguish languages. Both Henrich (1989) and Ziegler (1991) incorporated a ....
....for cryptanalysis. Markov models were used by Dunning (1994). One of the methods developed by Souter and his colleagues (1994) tested for the presence of unique character sequences. Henrich (1989), Cavnar and Trenkle (1994), and Souter et al. (1994) built task-specific statistical models. Damashek (1995) used a model that computed dot products of frequency vectors. We address the question of the appropriateness of some of these models below. Whereas human-oriented techniques exploit the full range of character encodings, automatic methods are limited to standard character sets. Most systems only ....
[Article contains additional citation context not shown here]
Damashek, Marc. "Gauging Similarity with n-Grams: Language-Independent Categorization of Text." Science, Vol. 267, 10 February, 1995.
....We hypothesize that images from similar categories have similar texture features. In particular, we present a method for categorizing images using a new texture feature termed N×M-gram that is based on the N-gram technique commonly used for determining similarity of text documents [1]. We define the notion of N×M-grams and show how to compute an image profile in terms of its N×M-grams. We propose three similarity measures for comparing images based on their N×M-gram profile. We present results of experiments that compare categorization using N×M ....
....binary images and 3×3-grams) and let $f_{j;I}$ denote the frequency of N×M-gram j in image I. The similarity between two images using dot product, $S_{dp}$, is given in Equation 1. $S_{dp}(A;B) = \frac{\sum_{j=0}^{T} f_{j;A}\, f_{j;B}}{\sqrt{\sum_{j=0}^{T} f_{j;A}^2 \, \sum_{j=0}^{T} f_{j;B}^2}}$ (1) In work on text documents [1], it was observed that the similarity can be improved by subtracting the average N-gram vector for a corpus of documents from the N-gram vector of the query and database documents. We have adopted this idea for images. Let $af_j$ denote the average frequency of N×M-gram j over all images ....
M. Damashek. Gauging similarity with n-Grams: Language-independent categorization of text. Science, 267:843--848, February 1995.
....[6] The LikeIt approach, used in effect by Friendly Finder [10] is to build an optimal weighted matching of the letters and multigraphs in the query, and those in each database record. Words as such receive no special treatment. In this sense it is related to the document retrieval approach of [3, 5]. An alternative approach to string comparison computes edit distance [4, 12] i.e. the minimum cost transformation of one string into another via some set of elementary operations. Most commonly, weighted insertion, deletion, and substitution operations are used, and the edit distance ....
M. Damashek, Gauging similarity with n-grams: Language-independent categorization of text, Science, 267 (1995), pp. 843--848.
....research. 2 Text Indexing Using Telltale. Telltale [16] is a dynamic hypertext environment that provides text indexing via a hypertext-style user interface for text corpora. This indexing is done using statistical techniques based on n-grams (n-character sequences of text) to create links [9] associating documents that are similar. The only inputs to Telltale's algorithms are the documents to be indexed and the list of ASCII characters that make up words; thus, Telltale works well with text in languages other than English because it does not need stop word lists or other ....
Marc Damashek. Gauging similarity with NGrams: Language-independent categorization of text. Science, 267:843--848, 10 February 1995.
No context found.
Damashek, M., "Gauging Similarity with n-grams: Language-Independent Categorization of Text", Science, Vol. 267, pp. 843-848, 1995.
No context found.
M. Damashek. Gauging similarity with n-grams: language independent categorization of text. Science, 267(5199):843--848, 1995.
No context found.
Damashek, M. (1995) Gauging similarity with N-grams: Language-independent categorization of text. Science, 267(5199):843--848.
No context found.
Damashek, M. Gauging similarity with n-grams: language independent categorization of text. Science, 267(5199):843--848, 1995.
No context found.
Damashek, M.. Gauging similarity with n-grams: language independent categorization of text. Science, 267(5199):843--848, 1995.
No context found.
M. Damashek, "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, Vol. 267, 843-848, Feb. 1995.
No context found.
M. DAMASHEK. 1995. Gauging Similarity With n-grams: Language Independent Categorization of Text. In Science, 267(5199), pp. 843-848.
No context found.
M. Damashek, Gauging similarity with n-grams: Language-independent categorization of text, Science, 267 (1995), pp. 843--848.
No context found.
Damashek, M. Gauging similarity with n-grams: Language-independent categorization of text. In Science (1995), vol. 267, pp. 843--848.
No context found.
Damashek, M. Gauging similarity with n-grams: language independent categorization of text. Science, 267(5199):843--848, 1995.
No context found.
Damashek, Marc. 1995. Gauging Similarity with n-grams: Language-Independent Categorization of Text. Science, Vol. 267, 10 February, 843-848.
CiteSeer.IST - Copyright Penn State and NEC