Towards automatic web genre identification – a corpus-based approach in the domain of academia by example of the academic’s personal homepage (2002)

by G Rehm
Venue: In Proc. of the Hawaii Internat. Conf. on System Sciences

Results 1 - 8 of 8

Conceptualizing documentation on the Web: an evaluation of different heuristic-based models for counting links between university web sites

by Mike Thelwall - Journal of the American Society for Information Science and Technology, 2002
Cited by 36 (19 self)

Towards Logical Hypertext Structure - A Graph-Theoretic Perspective

by Alexander Mehler, Matthias Dehmer, Rüdiger Gleim - Proc. of I2CS’04, Guadalajara/Mexico, Lecture Notes in Computer Science, Berlin-New, 2004
Abstract - Cited by 11 (8 self)
Facing the retrieval problem according to the overwhelming set of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized just as HTML tags and link structures.

Getting work done on the web: Supporting transactional queries

by Yunyao Li, Rajasekar Krishnamurthy, Shivakumar Vaithyanathan, H. V. Jagadish - In SIGIR, 2006
Abstract - Cited by 8 (3 self)
Many searches on the web have a transactional intent. In this paper we argue that pages satisfying transactional needs can be distinguished from the more common pages that have some information and links, but cannot be used to execute a transaction. Based on this hypothesis, we provide a recipe for constructing a transaction annotator. By constructing an annotator with one corpus and then demonstrating its classification performance on another, we establish its robustness. Finally, we show experimentally that a search procedure that exploits such pre-annotation greatly outperforms traditional search for transactional searches.

Web Genre Benchmark Under Construction

by Marina Santini, Serge Sharoff
Abstract - Cited by 2 (1 self)
The project presented in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We suggest focusing on the following key points: 1) propose a characterisation of genre suitable for digital environments and empirical approaches shared by a number of genre experts working in automatic genre identification; 2) define the criteria for the construction of web genre benchmarks and draw up annotation guidelines; 3) create web genre benchmarks in several languages; 4) validate the methodology and evaluate the results. We describe work in progress and our plans for future development. Since it is sometimes difficult to anticipate the difficulties that will arise when developing a large resource,

Language engineering techniques for web archiving

by José Coch, Julien Masanès, 2004
Abstract - Cited by 1 (0 self)
Advanced information processing can enable automatic location of content on the Web and decision making on its suitability for archiving, and can thus dramatically improve the accuracy and efficiency of building large-scale Web archives. This paper presents preliminary results from a research project (WATSON) aiming at adapting various Language Engineering technologies to facilitate large-scale Web archiving as well as Web archive mining. The former includes pre-filtering and categorization of sites to define, based on criteria, a focus subset of the Web to be continuously crawled, and site categorization to facilitate manual selection of important deep Web sites. The results achieved in pre-filtering of commercial Web sites are 100% precision at 70% recall. A workstation prototype aggregating useful information for professionals is presented. The latter encompasses collection mining with emphasis on the study of content evolution and the analysis of political discourse. We present results applied to the 2002 French election collection made by BnF.

Centre for Translation Studies

by Serge Sharoff
Abstract
Classifying Web corpora into domain and genre using automatic feature identification

unknown title

by Marina Santini, Georg Rehm, Serge Sharoff, Alexander Mehler
Abstract
In recent years, a multitude of most interesting research has been carried out in linguistics, psycholinguistics, computational linguistics and information retrieval with regard to the topic of web genres. Despite the increasing interest in this novel and innovative field within different communities, there is still a significant lack in literature, especially concerning edited collections and journal issues that provide an overview of recent research. The aim of this special issue of the Journal for Language Technology and Computational Linguistics is to contribute to filling this gap. More specifically, this issue is dedicated to automatic genre identification. Genres are categories that subsume texts which have multiple features in common, most importantly, a shared communicative purpose. Genres form and evolve within specific discourse communities, are instantiated as well as enforced and most often also given a name by members of their respective discourse communities. An important characteristic is that their users are able to recognise certain genres, such as, for example, an invoice, a business letter, a shopping list, or a menu very quickly based on genre-specific properties (such as, for example, a conventionalized text structure,

On the Notion of Genre in Digital Preservation

by Fiorella Foscarini, Yunhyong Kim, Christopher A. Lee, Alexander Mehler, Gillian Oliver, Seamus Ross
Abstract
In this paper, we discuss the notion of genre as a basis for addressing the problem of context representation in digital preservation. We outline several reference points for the notion of genre. This includes a review of diplomatic principles that can support and enhance the power of genre as a key to capture information about context relations. Further, we discuss the impact of open genre models and open topic models in information retrieval and finally present a list of research questions concerning future research in automation of digital preservation.