M. Goldszmidt and M. Sahami. A probabilistic approach to full-text document clustering. In SRI Technical Report ITAD-433-MS-98-044, 1997.
....in order to increase potential web traffic, and thirdly it does not address the problem of the (at least) 2 billion web pages that have already been created. Therefore researchers in this field are turning to autonomous or semi-autonomous methods for web document categorization ([2], [3], [4], [5], [6], [7], [8], [9], [10]). There are many other potential applications and benefits that will accrue from being able to reliably and automatically cluster and categorize corpora of documents. Much of this document clustering work is based on using either supervised or unsupervised learning ....
Goldszmidt, M., Sahami, M. (1998) A Probabilistic Approach to Full-Text Document Clustering, Technical Report ITAD-433-MS-98-044, SRI International,
....[9] introduced another important manufacturing domain in which reinforcement learning appears likely to make a significant mark: the control of autonomous guided vehicles (AGVs) on factory floors. Riedmiller (Univ. of Karlsruhe) applied reinforcement learning to heating control of liquid tanks [6]. Finally, Schulte (CMU) innovatively used reinforcement learning to control lighting conditions in an intelligent workplace [8]. A warning note raised in the workshop is that machine learning studies (which are based on standard feature vector datasets) are often idealized and far removed from ....
....J. Boyan, and M. S. Lee. Q2: Memory based active learning for optimizing noisy continuous functions. In Jude Shavlik, editor, International Conference on Machine Learning (to appear), 1998. [5] M. Puterman. Markov Decision Processes: Discrete Dynamic Stochastic Programming. John Wiley, 1994. [6] M. Riedmiller. High Quality Thermostat Control by Reinforcement Learning: A Case Study. In Conference on Automated Learning and Discovery, 1998. [7] J. G. Schneider, J. A. Boyan, and A. W. Moore. Value Function Based Production Scheduling. In Jude Shavlik, editor, International Conference on ....
[Article contains additional citation context not shown here]
M. Goldszmidt and M. Sahami. A probabilistic approach to full-text document clustering.
....is the use of a similarity query, or query by example: the user provides a sample document that is relevant, and expects to get back other documents discussing the same subject matter. Various similarity measures over documents have been defined and used in applications of Information Retrieval [8, 15, 17, 22, 23, 24]. We review some of this work in Section 5. However, most existing work does not pay much attention to explaining what it is that makes the retrieved documents similar. Moreover, in many cases the similarity of the retrieved documents is based on terms that are not necessarily central to the ....
....in the database. Given a collection of themes, $T_1, \ldots, T_k$, with respective theme distributions $p_{T_j}, q_{T_j}$, there is some probability for any document, $d$, to be in any theme, $T_j$. Unlike most existing work, which deals with complete classification of the database topics [3, 8, 9], we concentrate on finding documents for one particular theme. The next section describes our algorithm for finding a theme. 3. Probabilistic theme finding algorithm. Given a kernel document, our task is to find a model R as described above, such that the probability of the model given the ....
[Article contains additional citation context not shown here]
M. Goldszmidt and M. Sahami, A Probabilistic Approach to Full-Text Document Clustering, Tech. Report ITAD-433-MS-98-044, SRI International, 1998.
....is simple, it has been criticized for its tendency to produce large clusters early in the clustering process. This is because the clusters generated by the SPC technique depend on the order in which bibliographic records are processed. Reallocation Clustering (RC) Reallocation clustering [2] operates by selecting an initial set of clusters followed by some iterations of reassigning bibliographic records to the most similar clusters. Through the iterations, the cohesiveness among records in a cluster is improved. In reallocation clustering, it is difficult to decide how many ....
M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998.
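The reallocation procedure summarized in the excerpt above (pick an initial set of clusters, then repeatedly reassign each record to its most similar cluster) might be sketched roughly as follows. This is only an illustration under stated assumptions: records are taken to be numeric term vectors, similarity is taken to be cosine similarity to the cluster centroid, and the names `reallocate`, `cosine`, and `centroid` are hypothetical. None of these choices is confirmed by the excerpt, which does not show the citing paper's actual algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reallocate(records, seeds, iterations=10):
    """Assign each record to its most similar cluster, then repeat.

    records: list of term vectors (assumed representation).
    seeds: initial cluster representatives, one per cluster.
    iterations: upper bound on the number of reallocation passes.
    Returns a list with the cluster index of each record.
    """
    centers = [list(s) for s in seeds]
    assignment = [None] * len(records)
    for _ in range(iterations):
        # Reassign every record to the cluster with the most similar center.
        new_assignment = [
            max(range(len(centers)), key=lambda c: cosine(r, centers[c]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record moved, so further passes change nothing
        assignment = new_assignment
        # Recompute each center from its current members.
        for c in range(len(centers)):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment
```

The number of clusters is fixed by the number of seeds, which reflects the difficulty the excerpt mentions: the caller has to decide it in advance.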
....clustering process. This situation also appears in our experiment using the single pass clustering technique. This is because the clusters generated by the SPC technique depend on the order in which bibliographic records are processed. 4.2.2 Reallocation Clustering (RC) Reallocation clustering [4, 6] operates by selecting an initial set of clusters followed by some iterations of reassigning bibliographic records to the most similar clusters. Through the iterations, the cohesiveness among records in a cluster is improved. The following algorithm describes the steps required by the ....
....next iteration of reallocation (i.e. Step (2) is performed again) until a specified number of iterations are completed. 4.3 Discussions Several relevant issues regarding our proposed database clustering techniques for bibliographic databases are discussed as follows. 4.3.1 Outliers Outliers [6] are records that are dissimilar to almost all other records. It is difficult to fit them into even the most similar cluster, i.e. the distance from the bibliographic record to its most similar cluster is much larger than the distance between any pair of records in that cluster. In this case, we ....
M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998.
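The outlier criterion quoted in the excerpt above (a record whose distance to its most similar cluster is much larger than the distance between any pair of records in that cluster) could be checked along these lines. This is a hedged sketch: Euclidean distance, the use of the centroid to measure distance to a cluster, and the `factor` threshold standing in for "much larger" are all assumptions, and `is_outlier` is a hypothetical name.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_outlier(record, clusters, factor=2.0):
    """Return True if the record lies far from even its nearest cluster.

    clusters: list of non-empty lists of vectors.
    factor: assumed multiplier that stands in for "much larger".
    """
    # The nearest cluster is the one whose centroid is closest to the record.
    nearest = min(clusters, key=lambda c: euclidean(record, centroid(c)))
    distance_to_cluster = euclidean(record, centroid(nearest))
    # Largest distance between any pair of records inside that cluster
    # (0.0 for a single-record cluster).
    diameter = max(
        (euclidean(a, b) for a, b in combinations(nearest, 2)),
        default=0.0,
    )
    return distance_to_cluster > factor * diameter
```

With a single-record cluster the diameter is zero, so any record at a nonzero distance is flagged. A real system would likely need a separate rule for that case.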
....learning methods applied to text processing tasks. The presented work involved a wide array of learning approaches, including finite state machine induction [HD, MMK], neural networks that can accept advice from users [SER], relational learning methods [Moo, SC], statistical clustering algorithms [GS, Hof, LV, YPC], boosting methods [ADW], algorithms for learning with hierarchical classes [Hof, MG], and active learning methods [LT, NM]. A principal limitation of many of these approaches is that they do not directly reflect attempts to develop formal models of the text phenomenon of interest. ....
.... LS, Moo, MMK], information finding [SER], information integration from Web sources [MMK], automatic citation indexing [BLG, KP], event detection in text streams [YPC], document routing [ADW] and classification [GWI, Moo], organization and presentation of documents in information retrieval systems [GS, Hof], collaborative filtering [dVN], lexicon learning [GBGH], query reformulation [KK], text generation [Rad], and analysis of the statistical properties of text [MA]. In short, the state of the art in learning from text and the web is that a broad range of methods are currently being applied to ....
[Article contains additional citation context not shown here]
M. Goldszmidt and M. Sahami. A probabilistic approach to full-text document clustering.
....the advantage of simplicity, one might find that large clusters will often be generated by this clustering technique. Moreover, the performance of SPC C technique depends on the order in which bibliographic records are processed. Reallocation Clustering (RC C ) Reallocation clustering [FBY92, GS98] operates by selecting an initial set of clusters followed by a series of iterations of reassigning bibliographic records to the most similar clusters. Through the iterations, the cohesiveness among records in a cluster is improved. The following algorithm describes the steps required by the ....
....Discussions Several relevant issues regarding our proposed database clustering techniques for bibliographic databases are discussed as follows. [Figure 17: Outliers (labels: record, cluster, outlier)] Outliers Outliers are records that are dissimilar to almost all other records, as shown in Figure 17 [GS98]. It is difficult to fit them into even the nearest cluster, i.e. the distance from the bibliographic record to its most similar cluster is much larger than the distance between any pair of records in that cluster. In this case, we have to decide whether outliers should be included into the most ....
M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998.
....the absolute performance of the resulting clustering still leaves something to be desired. The poor performance of mixture modeling suggests the need of a better model for document clustering altogether. Addressing this need is the focus of Chapter 6, which describes joint work with Goldszmidt [63]. Here we construct a model for clustering documents based on a novel probabilistic similarity measure that captures the expected overlap in words between documents. This score prompts the investigation of different methods for estimating the probability of a word appearing in a document. ....
Goldszmidt, M., and Sahami, M. A probabilistic approach to full-text document clustering. Tech. Rep. ITAD-433-MS-98-044, SRI International, 1998.
....clustering method can be used at this stage. We have recently conducted comparisons with a number of different clustering algorithms including hierarchical agglomerative clustering [18] and iterative clustering methods, such as KMeans [15] using different measures of similarity between documents [7]. Currently, we have chosen to use a two step approach to clustering. First, group average hierarchical agglomerative clustering is used to form an initial set of clusters which is then further optimized with an iterative method. Both of these methods rely on the definition of a similarity score ....
....between two documents than a match on a more common term. To compute the probabilities in Eq. 2, namely $P(Y_i = w \mid d_i)$, we use a novel normalized geometric mean (NGM) smoothing estimate. A justification of this estimate is beyond the scope of this paper (we refer the interested reader to [7] for further details) but we have found it to work quite well in practice and present a brief overview of these results shortly. The hierarchical agglomerative clustering method proceeds by initially placing each document in a separate cluster. The similarity between each pair of clusters c and ....
[Article contains additional citation context not shown here]
Moises Goldszmidt and Mehran Sahami. A probabilistic approach to full-text document clustering. Technical Report ITAD-433-MS-98-044, SRI International, 1998.
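The first step described in the excerpt above, group average hierarchical agglomerative clustering that starts with each document in its own cluster, can be illustrated with a minimal sketch. It makes several assumptions: the excerpt's probabilistic similarity score (Eq. 2 with NGM smoothing) is not shown, so a plain Jaccard word overlap is swapped in as a stand-in; "group average" is taken to mean the mean pairwise similarity between members of two clusters, which is one common definition; and the names `hac`, `group_average`, and `jaccard` are hypothetical. The second, iterative optimization step is not covered.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in similarity: word overlap between two token lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_average(c1, c2, docs, sim):
    """Mean pairwise similarity between members of two clusters."""
    total = sum(sim(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(docs, k, sim=jaccard):
    """Merge the most similar pair of clusters until k clusters remain.

    docs: list of token lists; k: target number of clusters.
    Returns clusters as lists of document indices.
    """
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > max(k, 1):
        # Pick the pair of clusters with the highest group-average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], docs, sim),
        )
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This naive version recomputes every pairwise score at each merge, which is fine for a handful of documents but too slow for a real corpus.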
No context found.
M. Goldszmidt and M. Sahami. A probabilistic approach to full-text document clustering. In SRI Technical Report ITAD-433-MS-98-044, 1997.