Webpage Genre Identification Using Variable-length Character n-grams
- In Proc. of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence, v.2, 2007
Cited by 10 (1 self)
An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve the best reported results in both cases.
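As a rough illustration of the representation described in this abstract, the sketch below collects character n-grams of several lengths and keeps the most frequent ones. The length range (3 to 5), the cutoff `k`, and the function names are assumptions made for the example; the abstract does not say which existing approach the authors use to select variable-length n-grams, so plain frequency ranking stands in for it here.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Yield every character n-gram of length n_min..n_max found in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def top_ngrams(pages, k=1000, n_min=3, n_max=5):
    """Return the k most frequent n-grams over all pages (assumed selection rule)."""
    counts = Counter()
    for page in pages:
        counts.update(char_ngrams(page, n_min, n_max))
    return [gram for gram, _ in counts.most_common(k)]
```

Frequencies of HTML tags could be appended to the same feature list in the same way, by counting tag names instead of character windows.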
Learning to recognize webpage genres
- Information Processing and Management, 2009
Cited by 9 (0 self)
Webpages are mainly distinguished by their topic (e.g., politics, sports, etc.) and genre (e.g., blogs, homepages, e-shops, etc.). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to focus on the requirements of the user's information need. In this paper, we present an approach to webpage genre detection based on a fully automated extraction of the feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of the still-evolving web genres and the noisy environment of the web. Experiments based on two publicly available corpora show that the performance of the proposed approach is superior to previously reported results. It is also shown that character n-grams are better features than words when the dimensionality increases, while the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training using one genre palette and testing using a different genre palette, as well as using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
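The difference between the binary and term-frequency representations mentioned in this abstract can be sketched as follows. The function names are hypothetical, and raw occurrence counts are assumed for the term-frequency variant; the paper may normalize them differently.

```python
def count_occurrences(text, gram):
    """Count possibly overlapping occurrences of gram in text."""
    return sum(
        1
        for i in range(len(text) - len(gram) + 1)
        if text.startswith(gram, i)
    )


def vectorize(text, features, binary=True):
    """Map text to a vector over the given features.

    binary=True  -> 1 if the feature occurs at all, else 0
    binary=False -> raw occurrence count (assumed term-frequency variant)
    """
    vector = []
    for gram in features:
        count = count_occurrences(text, gram)
        vector.append((1 if count > 0 else 0) if binary else count)
    return vector
```

For the text `"banana"` and the features `["an", "na", "xy"]`, the binary vector records only presence, while the term-frequency vector records that `"an"` and `"na"` each occur twice.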
Retrieval models for genre classification
- Scandinavian Journal of Information Systems, 2008
Cited by 2 (1 self)
Abstract. Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means for genre classification is an open and controversially discussed issue. This paper gives an overview and presents new results related to automatic genre classification of text documents. We present a comprehensive survey which contrasts the genre retrieval models that have been developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies the paper provides an original contribution related to computational aspects and classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing utilization of vocabulary-centered, genre-revealing features and open new possibilities for the construction of genre classifiers that operate in real time. Key words: genre analysis, retrieval models, analysis and evaluation.
Design of Web Agents Inspired by Brain Research
Cited by 1 (0 self)
Abstract. The paper presents an approach to combine knowledge from memory and brain sciences with information retrieval research in the design of Web agents. An information retrieval agent for classification of Web pages based on genre features is used. In developing the agent to adapt to users' search preferences, a neuro-cognitive model of human episodic memory is employed. Our studies show that neuro-realistic models, capable of abstraction of meaningful fragments of knowledge, rather than snapshots of the retrieved Web pages, are closer to the human way of interacting with the Web and can be used for optimization of agent performance.
Exploiting Link Structure for Web Page Genre Identification
As the World Wide Web grows at an unprecedented pace, web page genre identification has recently attracted increasing attention because of its importance in web search. A common approach for genre identification is to utilize textual features that can be extracted directly from the web page itself, i.e., On-Page features. The extracted features are subsequently given to a machine learning algorithm that will perform classification. However, these approaches may not be effective when the web page contains limited textual information (e.g., full of images). In this paper, we tackle the genre identification of web pages in such situations. We propose a framework that not only uses On-Page features, but also takes into account information in neighboring pages, i.e., the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve genre identification performance. The experiments are conducted on well-known corpora, and the favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information.
A New Centroid-based Approach for Genre Categorization of Web Pages
- 2009
In this paper we propose a new centroid-based approach for genre categorization of web pages. Our approach constructs genre centroids using a set of genre-labeled web pages, called training web pages. The obtained centroids are then used to classify new web pages. The aim of our approach is to provide a flexible, incremental, refined and combined categorization, which is more suitable for automatic web genre identification. Our approach is flexible because it assigns a web page to all predefined genres with a confidence score; it is incremental because it classifies web pages one by one; it is refined because each web page either refines the centroids or is discarded as a noisy page; finally, our approach combines three different feature sets, i.e., URL addresses, logical structure and hypertext structure. The experiments conducted on two known corpora show that our approach is very fast and outperforms other approaches.
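The flexible, incremental and refined behaviour described in this abstract can be sketched roughly as below. Cosine similarity as the confidence score, the running-mean centroid update, and the `threshold` used to discard noisy pages are all assumptions chosen for illustration; the abstract does not specify these details.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    def __init__(self, centroids, counts, threshold=0.5):
        # centroids: genre -> vector built from the training pages
        # counts: genre -> number of pages already averaged into that centroid
        self.centroids = {g: list(v) for g, v in centroids.items()}
        self.counts = dict(counts)
        self.threshold = threshold  # hypothetical noise cutoff

    def classify_and_refine(self, page):
        """Score the page against every genre, then refine the best centroid or discard."""
        scores = {g: cosine(page, c) for g, c in self.centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= self.threshold:
            n = self.counts[best]
            # Running mean: fold the new page into the winning centroid.
            self.centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[best], page)
            ]
            self.counts[best] = n + 1
        # Below the threshold the page is treated as noise and changes nothing.
        return scores
```

Returning a score for every genre, instead of a single label, is what makes the categorization flexible in the sense used by the abstract.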
Combining classifiers for flexible genre categorization of web pages
With the increase in the number of web pages, it is very difficult to find wanted information easily and quickly among the thousands of web pages retrieved by a search engine. To solve this problem, many researchers propose to classify documents according to their genre, which is another criterion for classifying documents, different from the topic. Most of these works assign a document to only one genre. In this paper we propose a new flexible approach for document genre categorization. Flexibility means that our approach assigns a document to all predefined genres with different weights. The proposed approach is based on the combination of two homogeneous classifiers: contextual and structural classifiers. The contextual classifier uses the URL, while the structural classifier uses the document structure. Both contextual and structural classifiers are centroid-based classifiers. Experiments yield a micro-averaged break-even point (BEP) of more than 85%, which is better than those obtained by other categorization approaches.
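One simple way to combine the outputs of two such classifiers into per-genre weights is a weighted average, sketched below. The mixing weight `w` and the normalization step are assumptions for illustration; the abstract does not state the combination rule actually used.

```python
def combine(contextual, structural, w=0.5):
    """Blend two genre -> score maps and normalize the result to sum to 1.

    w is the (hypothetical) weight given to the contextual (URL-based) scores;
    the structural scores receive 1 - w.
    """
    genres = set(contextual) | set(structural)
    blended = {
        g: w * contextual.get(g, 0.0) + (1 - w) * structural.get(g, 0.0)
        for g in genres
    }
    total = sum(blended.values())
    if total == 0:
        return blended  # no evidence for any genre
    return {g: score / total for g, score in blended.items()}
```

Every predefined genre keeps a weight in the result, so a page that is partly a blog and partly a shop is reported as such instead of being forced into one class.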
Adjectives and Adverbs as Indicators of Affective Language for Automatic Genre Detection
Abstract. We report the results of a systematic study of the feasibility of automatically classifying documents by genre using adjectives and adverbs as indicators of affective language. In addition to the class of adjectives and adverbs, we focus on two specific subsets of adjectives and adverbs: (1) trait adjectives, used by psychologists to assess human personality traits, and (2) speaker-oriented adverbs, studied by linguists as markers of narrator attitude. We report the results of our machine learning experiments using Accuracy Gain, a measure more rigorous than the standard measure of Accuracy. We find that it is possible to classify documents automatically by genre using only these subsets of adjectives and adverbs as discriminating features. In many cases results are superior to using the count of (a) nouns, verbs, or punctuation, or (b) adjectives and adverbs in general. In addition, we find that relatively few speaker-oriented adverbs are needed in the discriminant models. We conclude that at least in these two cases, the psychological and linguistic literature leads to identification of features that are quite useful for genre detection and for other applications in which identification of style and other non-topical characteristics of documents is important.
A New Approach for Flexible Document Categorization
Abstract. In this paper we propose a new approach for flexible document categorization according to the document type or genre instead of topic. Our approach implements two homogeneous classifiers: a contextual classifier and a logical classifier. The contextual classifier is based on the document URL, whereas the logical classifier uses the logical structure of the document to perform the categorization. The final categorization is obtained by combining the contextual and logical categorizations. In our approach, each document is assigned to all predefined categories with different membership degrees. Our experiments demonstrate that our approach is better than other genre categorization approaches. Keywords: Categorization, combination, flexible, logical