J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems CHI'97, 1997.
....be and hence the higher the b value. The higher the b value, the lower the Zipfian exponent, which explains the distributions seen in Figures 2. 8. Related Work There is related work in the area of evolution of the web as well as in web graph analysis. In the area of web evolution, Pitkow et al. [17] presented a model that explains some factors in the survival and change dynamics of documents. Cho et al. [9] computed the lifespan of pages in five different domains, namely .gov, .net, .org, .edu, and .com, and showed that it varies widely. Smaller studies on how often web pages change were ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proc. ACM SIGCHI, 1997.
....for proxy caches [11] and identified the difficulty of evaluating prefetching proxies under current schemes. A preloading proxy cache typically chooses what to preload based on either user retrieval history (e.g. [27, 20]) or the content of the current or recently retrieved pages (e.g. [7, 6, 17, 29, 14, 13]). Neither is modeled well (if at all) in artificial workloads and captured traces, which are typically used in evaluation. In the case where the workload is a captured trace from real users, the content is typically unavailable (at least not in the form that the users saw). Additionally, since a ....
J. E. Pitkow and P. L. Pirolli. Life, death, and lawfulness on the electronic frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, Mar. 1997.
....many of the resources being retrieved contain textual content, which can be analyzed, and Web pages contain hyperlinks to other resources. This link structure can carry significant information implicitly, such as approval, recommendation, similarity, and relevance. Pitkow and Pirolli [PP97] have observed that hyperlinks, when employed in a nonrandom format, provide semantic linkages between objects, much in the same manner that citations link documents to other related documents. This observation can be restated as most Web pages are linked to others with related content. This ....
.... Lyc02b], focused crawlers (e.g. [CGMP98, CvdBD99, BSHJ 99, Men97, MB00, Lie97, RM99]), linkage analyzers (e.g. [BFJ96, DH99, KRRT99, FLGC02, CDR 99, BH98b, BP98, DGK IBM00, Kle98]), and intelligent Web agents (e.g. [AFJM95, JFM97, Mla96, Lie95, Lie97, MB00, BS97, LaM96, LaM97, Lie97, PP97, Dav99a]), as we will describe below. While in this dissertation we are concerned with extracting information from Web content to guide user action prediction, the utility of Web content is much broader. Thus in this chapter we consider extensively where and how long-held intuitions about the ....
James E. Pitkow and Peter L. Pirolli. Life, death, and lawfulness on the electronic frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, March 1997.
....importance scores for Web pages [16]. In Section 2 we discuss how PageRank and other iterative algorithms relate to our work. and q. These methods have been applied to cluster scientific papers according to topic [18, 21]. More recently, the co-citation method has been used to cluster Web pages [12, 17]. As discussed in Section 1, our algorithm can be thought of as a generalization of co-citation where the similarity of citing documents is also considered, recursively. In terms of graph structure, co-citation scores between any two nodes are computed only from their immediate neighbors, whereas ....
James Pitkow and Peter Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems, Atlanta, Georgia, 1997.
....individual communities, however, have not concerned the relationship between communities. To build the web community chart, we use the notion of authorities and hubs not only to identify communities, but also to deduce their relationships. Recent document clustering approaches on the Web, such as [11, 3], have also exploited link analysis for clustering web pages, and [11] also considered relationships between clusters. Pitkow and Pirolli [11] proposed clustering algorithms using co-citation analysis. They performed hierarchical clustering to show relationships between clusters by a hierarchy. ....
....communities. To build the web community chart, we use the notion of authorities and hubs not only to identify communities, but also to deduce their relationships. Recent document clustering approaches on the Web, such as [11, 3], have also exploited link analysis for clustering web pages, and [11] also considered relationships between clusters. Pitkow and Pirolli [11] proposed clustering algorithms using co-citation analysis. They performed hierarchical clustering to show relationships between clusters by a hierarchy. Rather, we create a graph of communities, since, in our experiments, the ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of International Conference on Computer and Human Interaction, 1997.
....of terabytes. The growth rate of the Web is even more dramatic. According to [41, 42], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Aside from these newly created pages, the existing pages are continuously updated [52, 58, 24, 17]. For example, in our own study of over half a million pages over 4 months [17], we found that about 23% of pages changed daily. In the .com domain 40% of the pages changed daily, and the half-life of pages is about 10 days (in 10 days half of the pages are gone, i.e. their URLs are no longer ....
James Pitkow and Peter Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems CHI'97, 1997.
....what to prefetch. Typical approaches have used Markovian techniques (e.g. [28, 55]) on the history of Web page references to recognize patterns of activity. Others prefetch bookmarked pages and often-requested objects (e.g. [50]). Still others prefetch links from the currently requested page [46, 13, 10, 57, 38, 42, 37, 51, 26, 34]. However, prefetching all of the links of the current page is not a viable option, given that the number of links per page can be quite large, and that a prefetching system typically has only a limited amount of time to prefetch before the user makes a new selection. While likely heavy-tailed ....
J. E. Pitkow and P. L. Pirolli. Life, Death, and Lawfulness on the Electronic Frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, Mar. 1997.
....beginning.
Procedure
  while (true)
    url ← selectToCrawl(AllUrls)
    page ← crawl(url)
    if (url ∈ CollUrls) then
      update(url, page)
    else
      tmpurl ← selectToDiscard(CollUrls)
      discard(tmpurl)
      save(url, page)
      CollUrls ← (CollUrls − {tmpurl}) ∪ {url}
    newurls ← extractUrls(page)
    AllUrls ← AllUrls ∪ newurls
Figure 11: Conceptual operational model of an incremental crawler
the introduction, and the right hand side corresponds to the periodic crawler. In the next section, we discuss how we can implement an effective incremental ....
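The loop in Figure 11 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the four callables (`select_to_crawl`, `select_to_discard`, `crawl`, `extract_urls`) are hypothetical placeholders for policies the excerpt does not specify, `max_steps` replaces the figure's infinite loop so the example terminates, and the collection is assumed to start non-empty so that there is always a page to discard.

```python
from typing import Callable, Dict, Iterable, Set


def incremental_crawl(
    all_urls: Set[str],
    collection: Dict[str, str],
    select_to_crawl: Callable[[Set[str]], str],
    select_to_discard: Callable[[Dict[str, str]], str],
    crawl: Callable[[str], str],
    extract_urls: Callable[[str], Iterable[str]],
    max_steps: int,
) -> None:
    """Run max_steps iterations of the Figure 11 loop, mutating both arguments.

    all_urls plays the role of AllUrls; collection maps each URL in
    CollUrls to its stored page. All callables are caller-supplied
    (hypothetical) policies.
    """
    for _ in range(max_steps):  # the figure loops forever
        url = select_to_crawl(all_urls)
        page = crawl(url)
        if url in collection:
            collection[url] = page  # update(url, page)
        else:
            victim = select_to_discard(collection)
            del collection[victim]  # discard(tmpurl)
            collection[url] = page  # save(url, page)
        # Newly discovered links become candidates for later crawls.
        all_urls.update(extract_urls(page))
```

Because a page is discarded whenever a new one is saved, the collection size stays constant, which is the property that distinguishes this loop from a periodic crawler that rebuilds the collection from scratch.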
....engines report numbers similar to this. tion. We believe these references are complementary to our work, because we present an incremental crawler architecture, which can use any of the algorithms in these papers. References [13] and [6] experimentally study how often web pages change. Reference [11] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different from ours. Reference [13] investigates page changes to improve web caching ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of International Conference on Computer and Human Interaction, 1997.
.... the list, based on graphics elements (for a detailed review, see [7]). One of the UTECDSR model's components uses references between documents to determine the subject of the document [2, 3]. Creating groups of documents that share a common subject has been presented by Pitkow and Pirolli [5]. According to their method, if two different documents (A and B) are cited by a third document (C), it can be assumed that the two documents are interconnected. The more references made to a group of documents, the greater the extent of their interconnection, in which case they can be grouped ....
....construction of citation indexing [4] 2 2. PRESENTING THE IDEA Current search engines generally produce lists of hundreds and sometimes thousands of documents that match the search criteria. Since various studies have found that an average user only scans the first 10 to 20 documents in a list [5], the ranking of the documents in the list is critical. There are several ranking methods which sort results based on the number of appearances of the search terms, or on the age of the document, or on the use of common keywords, etc. The central idea in this study is to create a ranking within ....
Pitkow, J., Pirolli, P. (1997). Life, death, and lawfulness on the electronic frontier, Proceedings of CHI '97, ACM, March 1997, 383-390.
.... with the continued growth of the Web and the small amount of ontologist surfing time available per node to augment submissions, purely manual approaches cannot find high quality pages about a topic as effectively as a high quality tool that makes use of implicit judgments in the form of hyperlinks [3, 4, 2, 11, 12]. We consider the problem of generating high quality, relevant, links for topics in a taxonomy tree, which can then be presented to a human ontologist for vetting and annotation. Our system is designed to be used in a two phase taxonomy construction and maintenance process: 1) The ontologist ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. Proc. ACM SIGCHI, 1997.
....driving applications that motivate (and are motivated by) a better understanding of the neighborhood structure on the web. In particular, the second generation of data service applications on the web, including advanced search applications [16, 17, 10], browsing and information foraging [14, 39, 15, 40, 19], community extraction [28], and taxonomy construction [30, 29], have all taken tremendous advantage of knowledge about the hyperlink structure of the web. As just one example, let us mention the community extraction algorithm of [28]. In this algorithm, a characterization of degree sequences ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. Proc. ACM SIGCHI, 1997.
....to improve freshness of the collection. We believe these references are complementary to our work, because we present an incremental crawler architecture, which can use any of the algorithms in these papers. References [WM99] and [DFK99] experimentally study how often web pages change. Reference [PP97] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different from ours. Reference [WM99] investigates page changes to improve web ....
....between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different from ours. Reference [WM99] investigates page changes to improve web caching policies, and reference [PP97] studies how page changes are related to access patterns. 7 Conclusion In this paper we have studied how to build an effective incremental crawler. To understand how the web evolves over time, we first described a comprehensive experiment, conducted on 720,000 web pages from 270 web sites over 4 ....
James Pitkow and Peter Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of International Conference on Computer and Human Interaction, 1997.
....WebBook and WebForager [6] allow users to define, visualize, and manipulate groups of related web pages. More relevant to the concerns of this paper are techniques that analyze link structure to rank and group items. Pitkow and Pirolli developed clustering algorithms based on co-citation analysis [12] and categorization algorithms that utilized hyperlink structure [10]. Kleinberg formalized the quality of documents within a hyperlinked collection using the concept of authority [8]. At first pass, an authoritative document is one that many other documents link to. However, this notion can be ....
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....techniques to compute similarities between paths and to make recommendations on this basis for example, to recommend pages to you that others browsed in close proximity to pages you browsed. Other techniques extract information from multiple sources. For example, Pirolli, Pitkow, and Rao [33, 34] combined web links with web usage data and text similarity to categorize and cluster web pages. Other work has focused on extracting information from online conversations, such as Usenet. PHOAKS [18] mines messages in Usenet newsgroups looking for mentions of web pages. It categorizes and ....
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....the structure of links between sites. This approach builds on the intuition that when the author of one site chooses to link to another, this often implies both that the sites have similar content and that the author is endorsing the content of the linked-to site. Pirolli, Pitkow and colleagues [12, 13] experimented with link-based algorithms for clustering and categorizing web pages. Kleinberg's HITS algorithm [8] defines authoritative and hub pages within a hypertext collection. Authorities and hubs are mutually dependent: a good authority is a page that is linked to by many hubs, and a good ....
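The mutual dependence between hubs and authorities can be illustrated with a short iterative sketch. It assumes the standard formulation of HITS, in which a page's authority score is the sum of the hub scores of pages linking to it and its hub score is the sum of the authority scores of the pages it links to; the function names, the fixed iteration count, and the Euclidean normalization are illustrative choices, not details taken from the excerpt.

```python
import math
from typing import Dict, List, Tuple


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return scores if norm == 0 else {k: v / norm for k, v in scores.items()}


def hits(
    edges: List[Tuple[str, str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed link list of (source, target)."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the (new) authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth
```

For example, with links `h1 -> a`, `h2 -> a`, and `h1 -> b`, page `a` ends up with a higher authority score than `b`, and `h1` with a higher hub score than `h2`.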
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....the pages in a bullseye layout, a series of concentric circles each containing pages of equal degree. Pirolli, Pitkow, and colleagues at Xerox PARC have done a great deal of work that analyzes web structure and usage data, attempting to categorize and cluster web pages. Pitkow and Pirolli [10] describe clustering algorithms based on co-citation analysis [3]. The intuition is that if two documents, say A and B, are both cited by a third document, this is evidence that A and B are related. The more often a pair of documents is co-cited, the stronger the relationship. They applied two ....
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....to Small [1973]. For a pair of documents p and q, the former quantity is equal to the number of documents cited by both p and q, and the latter quantity is the number of documents that cite both p and q. Co-citation has been used as a measure of the similarity of www pages by Larson [1996] and by Pitkow and Pirolli [1997]. Weiss et al. [1996] define link-based similarity measures for pages in a hypertext environment that generalize co-citation and bibliographic coupling to allow for arbitrarily long chains of links. Several methods have been proposed in this context to produce clusters from a set of nodes ....
....context to produce clusters from a set of nodes annotated with such similarity information. Small and Griffith [1974] use breadth-first search to compute the connected components of the undirected graph in which two nodes are joined by an edge if and only if they have a positive co-citation value. Pitkow and Pirolli [1997] apply this algorithm to study the link-based relationships among a collection of www pages. One can also use principal components analysis [Hotelling 1933; Jolliffe 1986] and related dimension reduction techniques such as multidimensional scaling to cluster a collection of nodes. In this ....
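The two similarity counts and the connected-components clustering described in these excerpts can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the `Links` mapping (citing document to the set of documents it cites) and all function names are assumptions made for the example, and documents that are never co-cited with anything are simply left out of the clustering.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set, Tuple

Links = Dict[str, Set[str]]  # citing document -> documents it cites


def cocitation_counts(links: Links) -> Dict[Tuple[str, str], int]:
    """For each pair (p, q) with p < q, count the documents that cite both."""
    counts: Dict[Tuple[str, str], int] = {}
    for cited in links.values():
        for p, q in combinations(sorted(cited), 2):
            counts[(p, q)] = counts.get((p, q), 0) + 1
    return counts


def bibliographic_coupling(links: Links, p: str, q: str) -> int:
    """Count the documents cited by both p and q."""
    return len(links.get(p, set()) & links.get(q, set()))


def cocitation_clusters(links: Links) -> List[Set[str]]:
    """Connected components of the graph joining pairs with positive co-citation."""
    neighbors: Dict[str, Set[str]] = {}
    for p, q in cocitation_counts(links):
        neighbors.setdefault(p, set()).add(q)
        neighbors.setdefault(q, set()).add(p)
    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                component.add(nxt)
                queue.append(nxt)
        clusters.append(component)
    return clusters
```

With `links = {"c1": {"A", "B"}, "c2": {"A", "B", "C"}, "c3": {"X", "Y"}}`, documents A and B are co-cited twice, and the clustering yields the two groups {A, B, C} and {X, Y}.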
PITKOW, J., AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '97) (Atlanta, Ga., Mar. 22--27). ACM, New York, pp. 383--390.
....were considered). Raghupathi and Nerur [10] analyzed 155 authors in the field of artificial intelligence with data extracted from the Science Citation Index for the period of 1980-1995. They used a similar technique to [1], finding 14 factors that were labelled manually. Pitkow and Pirolli [9] used the method of [13] applied towards sets of hypertext documents on the World Wide Web, transferring the concept of scientific publication citations to hypertext links on the web. 5,582 HTML and 15,139 non-HTML documents were considered and clustered using complete linkage hierarchical ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of Human Factors in Computing Systems, pages 383--390, 1997.
....browse the web. Many of those that are content based depend on the contents of a page and/or the text contained in or around anchors to help determine what to suggest to the user (e.g. [AFJM95, JFM97, Mla96, Lie95, Lie97, MB00, BS97, LaM96, LaM97]) or to prefetch links for the user (e.g. [Lie97, PP97, Dav99]). By comparing the text of neighboring pages, we can estimate the relevance for pages neighboring the current one. We also find out how well anchor text describes the targeted page. 3 Experimental Method 3.1 Data Set 3.1.1 Initial Data Set Ideally, when characterizing the pages of ....
....similarity measures (but neither does it hurt). Finally, we have shown that titles, descriptions, and anchor text all have relatively high mean term probabilities (and high mean TFIDF scores), implying that these page proxies represent at least part of the target page well. Pitkow and Pirolli [PP97] have observed that hyperlinks, when employed in a non-random format, provide semantic linkages between objects, much in the same manner that citations link .... [Figure residue: panel (a) Mean TFIDF Score; series: anchor with text before, anchor with text after, anchor with text both] ....
James E. Pitkow and Peter L. Pirolli. Life, Death, and Lawfulness on the Electronic Frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, March 1997.
....the collection they are connected with). Pirolli, Pitkow, and Rao [10] developed a categorization algorithm that used hyperlink structure, text similarity, and user access data to categorize web pages into various functional roles, such as head, index, and content. Later Pitkow and Pirolli [12] experimented with clustering algorithms based on co-citation analysis [5], in which pairs of documents were clustered based on the number of times they were both cited by a third document. While our work shares many goals of previous work, it differs in several respects. First, as discussed ....
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....challenges. RELATED WORK: RESEARCH INTO EXTRACTING AND VISUALIZING HIGH LEVEL STRUCTURES FROM THE WEB Many researchers have sought to define useful, higher level structures that can be extracted from the web (or, more generally, any hyperlinked collection of documents), such as collections [26], localities [24], patches or books [6]. In terms of the dimensions we have introduced, this work has concentrated mostly on means of organizing collections of items, in particular on algorithms to extract useful structures. Less attention has been paid to the composition of collections ....
....links between web pages to find pages related to a given set of pages and to infer the topic and function of pages. Marchiori [21] developed an algorithm that used information about links (and the contents of linked-to pages) to reorder the results returned by search engines. Pitkow and Pirolli [26] report cluster algorithms based on co-citation analysis [11]. The intuition is that if two documents, say A and B, are both cited by a third document, this is evidence that A and B are related. The more often a pair of documents is co-cited, the stronger the relationship. They applied two ....
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
....been a variety of agents proposed to help people browse the web. Many of those that are content based depend on the contents of a page and/or the text contained in or around anchors to help determine what to suggest to the user (e.g. [19, 27, 24, 26, 3]) or to prefetch links for the user (e.g. [24, 13, 28]). By comparing the text of neighboring pages, we can estimate the relevance for pages neighboring the current one. We also find out how well anchor text describes the targeted page. 3 Experimental Method 3.1 Data Set 3.1.1 Initial Data Set Ideally, when characterizing the pages of the WWW, ....
....similarity measures (but neither does it hurt). Finally, we have shown that titles, descriptions, and anchor text all have relatively high mean term probabilities (and high mean TFIDF scores), implying that these page proxies represent at least part of the target page well. Pitkow and Pirolli [28] have observed that hyperlinks, when employed in a non-random format, provide semantic linkages between objects, much in the same manner that citations link documents to other related documents. We have demonstrated that this semantic linkage, as approximated by textual similarity, is measurably ....
J. E. Pitkow and P. L. Pirolli. Life, Death, and Lawfulness on the Electronic Frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, Mar. 1997.
....study options for defining boundaries of replicated collections. We carefully design these measures so that in addition to being good measures, we can efficiently compute them over hundreds of gigabytes of data on disk. Our work is in contrast to recent work in the Information Retrieval domain [PP97], where the emphasis is on accurately comparing link structures of document collections, when the document collections are small. We then develop efficient heuristic algorithms to identify replicated sets of pages and collections. Improved crawling: We discuss how we use replication information ....
James Pitkow and Peter Pirolli. Life, death, and lawfulness on the electronic frontier. In International conference on Computer and Human Interaction (CHI'97), 1997.
....and news organizations on the web. Of course, in contrast to search engines, our approach requires that the user has already found a page of interest. Recent work in information retrieval on the web has recognized that the hyperlink structure can be very valuable for locating information [18, 3, 7, 23, 19, 25, 24, 6, 17, 5]. This assumes that if there is a link from page v to w, then the author of v recommends page w, and links often connect related pages. In this paper, we describe the Companion and Cocitation algorithms, two algorithms which use only the hyperlink structure of the web to identify related web ....
....correlated, since URLs which have a large number of siblings to consider in the cocitation algorithm also generally produce a large neighborhood graph for processing in the companion algorithm. 5 Related Work Many researchers have proposed schemes for using the hyperlink structure of the web [18, 3, 7, 23, 19, 25, 24, 6, 17, 5]. For the most part, this work does not discuss the finding of related pages, with four exceptions discussed below. We know of only one previous work that exploits the order of links: Chakrabarti et al. [9] use the links and their order to categorize web pages and they show that the links that are ....
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 97), pages 383--390, March 1997.
Pitkow, J. and Pirolli, P. (1997) Life, death, and lawfulness on the electronic frontier. Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (pp. 383-390).
....of approximations for analyzing and predicting information scent. These techniques are based on psychological models [6] which are closely related to standard information retrieval techniques, and Web data mining techniques based on the analysis of Content, Usage, and hyperlink Topology (CUT, [3,10]) For more details, see [1] 2 Furnas referred to such intermediate information as residue [4] Reverse Scent Flow to Identify Information Need A well traveled path may indicate a group of users who have very similar information goals and are guided by the scent of the environment. ....
.... [Figure 1: Data State Model for Web Scent Visualization] .... values reflect how users voted with their clicks in finding relevant information. co-citation graph [10], reflects the frequency that two nodes were linked to by the same page. The edge values provide an indication of the authoritative relevance of pages to one another. Spreading Activation Assessments of Scent We use a spreading activation algorithm [7] on the various graphs to compute relevance ....
Pitkow, J. and Pirolli, P. (1997). Life, death, and lawfulness on the electronic frontier. Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (pp. 383-390).
....To the extent that surfing predictions improve text based search results, we would expect that a more informed approach would yield better improvements than a random walk model. 2.2 Recommendation of Related Pages Recently, tools have become available for suggesting related pages to surfers [10, 16, 18]. The What's Related tool button on the Netscape browser, developed by Alexa, provides recommendations based on content, link structure, and usage patterns. Similar tools for specific repositories of WWW content are also provided by Autonomy. One can think of these tools as making the prediction ....
Pitkow, J. and Pirolli, P. (1997). Life, death, and lawfulness on the electronic frontier. Proceedings of the Conference on Human Factors in Computing Systems, (CHI '97) Atlanta, GA.
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems CHI'97, 1997.
James Pitkow and Peter Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems, Atlanta, Georgia, 1997.
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems, Atlanta, Georgia, Mar. 1997.
James E. Pitkow and Peter L. Pirolli. Life, death, and lawfulness on the electronic frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, March 1997.
Pitkow, J., Pirolli, P. (1997): Life, Death, and Lawfulness on the Electronic Frontier. Proceedings of the Conference on Human Factors in Computing Systems (CHI'97). pp. 383--390
Pitkow, J. and Pirolli, P. Life, death, and lawfulness on the electronic frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
Pitkow, J., and Pirolli, P. Life, Death, and Lawfulness on the Electronic Frontier, in Proceedings of CHI'97 (Atlanta GA, March 1997), ACM Press, 383-390.
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems CHI'97, 1997.
J. Pitkow and P. Pirolli: Life, Death, and Lawfulness on the Electronic Frontier, Proceedings of the Conference on Human Factors in Computing Systems (CHI 97) (1997), 383-390
Pitkow, J., Pirolli, P. Life, death, and lawfulness on the electronic frontier. Proceedings of CHI '97, (Atlanta, Georgia, USA, April 1997), ACM Press, 383-390.