Statistical properties of community structure in large social and information networks
Cited by 65 (6 self)
A large body of work has been devoted to identifying community structure in networks. A community is often thought of as a set of nodes that has more connections between its members than to the remainder of the network. In this paper, we characterize as a function of size the statistical and structural properties of such sets of nodes. We define the network community profile plot, which characterizes the “best” possible community, according to the conductance measure, over a wide range of size scales, and we study over 70 large sparse real-world networks taken from a wide range of application domains. Our results suggest a significantly more refined picture of community structure in large real-world networks than has been appreciated previously. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually “blend in” with the rest of the network and thus become less “community-like.” This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network community structure similar to our observations.
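For readers unfamiliar with the conductance measure named in this abstract, the following is a minimal sketch of the standard definition: the number of edges leaving a node set, divided by the smaller of the total degree inside the set and the total degree outside it. Lower values indicate a more "community-like" set. The function name `conductance` and the edge-list input format are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
from collections import defaultdict


def conductance(edges, subset):
    """Conductance of `subset` in an undirected graph given as (u, v) pairs.

    Hypothetical helper for illustration: cut edges divided by the smaller
    of the two volumes (sums of degrees) on either side of the cut.
    """
    subset = set(subset)
    degree = defaultdict(int)
    cut = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        # An edge is cut when exactly one endpoint lies in the subset.
        if (u in subset) != (v in subset):
            cut += 1
    vol_in = sum(d for node, d in degree.items() if node in subset)
    vol_out = sum(d for node, d in degree.items() if node not in subset)
    smaller = min(vol_in, vol_out)
    # An empty side has no meaningful conductance; report infinity.
    return cut / smaller if smaller else float("inf")


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(conductance(edges, {0, 1, 2}))  # one triangle: 1 cut edge over volume 7
```

A network community profile plot, as described above, would record the minimum such value found over sets of each size; finding that minimum is the hard part and is not attempted here.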
Communication networks from the Enron email corpus: “It’s always about the people. Enron is no different”
- Computational & Mathematical Organization Theory
, 2005
Cited by 41 (6 self)
The Enron email corpus is appealing to researchers because it is a) a large-scale email collection from b) a real organization c) over a period of 3.5 years. In this paper we contribute to the initial investigation of the Enron email dataset from a social network analytic perspective. We report on how we enhanced and refined the Enron corpus with respect to relational data and how we extracted communication networks from it. We apply various network analytic techniques in order to explore structural properties of the networks in Enron and to identify key players across time. Our initial results indicate that during the Enron crisis the network had been denser, more centralized and more connected than during normal times. Our data also suggest that during the crisis the communication among Enron’s employees had been more diverse with respect to people’s formal positions, and that top executives had formed a tight clique with mutual support and highly brokered interactions with the rest of the organization. The insights gained with the analyses we perform and propose are of potential further benefit for modeling the development of crisis scenarios in organizations and the investigation of indicators of failure.
Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters
- CoRR
Cited by 34 (3 self)
A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community or a cluster should be thought of as a set of nodes that has more and/or better connections between its members than to the remainder of the network. In this paper, we explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. Rather than defining a procedure to extract sets of nodes from a graph and then attempting to interpret these sets as “real” communities, we employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the “best” possible community, according to the conductance measure, over a wide range of size scales. We study over 100 large real-world networks, ranging from traditional and on-line social networks, to technological and information networks and
Spam corpus creation for TREC
- Stanford University
, 2005
Cited by 32 (7 self)
2005) introduces a standard testing framework that is designed to model a spam filter’s usage as closely as possible, to measure quantities that reflect the filter’s effectiveness for its intended purpose, and to yield repeatable (i.e. controlled and statistically valid) results. The TREC Spam Filter Evaluation Toolkit is free software that, given a corpus and a filter, automatically runs the filter on each message in the corpus, compares the result to the gold standard for the corpus, and reports effectiveness measures with 95% confidence limits. The corpus consists of a chronological sequence of email messages, and a gold standard judgement for each message. We are concerned here with the creation of appropriate corpora for use with the toolkit. It is a simple matter to capture all the email delivered to a recipient or a set of recipients. Using this captured email in a public corpus, as for the other TREC tasks, is not so simple. Few individuals are willing to publish their email, because doing so would compromise their privacy and the privacy of their correspondents. So we are left with the choice between using an artificial public collection of messages and using a more realistic collection that must be kept private. Artificial collections (spamassassin.org, 2003; Androutsopoulos et al., 2000; Michelakis et al., 2004) may be created by using mailing list messages as opposed to personal email, by selecting non-sensitive messages from a real email collection, by mixing messages from diverse sources, or by obfuscating genuine messages (footnote: the majority of filters we have evaluated exhibit pathologies on the PU obfuscated corpora). All of these approaches conflict with our design criteria – that real filter usage be modelled as closely as possible – and may compromise the very information that filters use to discriminate ham from spam, either by removing pertinent details or by introducing extraneous information that may aid or hinder the filter.
TREC 2005 Spam Track Overview
- In the Fourteenth Text Retrieval Conference (TREC 2005) Proceedings
, 2005
Cited by 31 (9 self)
TREC's Spam Track introduces a standard testing framework that presents a chronological sequence of email messages, one at a time, to a spam filter for classification. The filter yields a binary judgement (spam or ham [i.e.
Extracting personal names from emails: Applying named entity recognition to informal text
- In HLT-EMNLP
, 2005
Cited by 29 (8 self)
There has been little prior work on Named Entity Recognition for “informal” documents like email. We present two methods for improving performance of person name recognizers for email: email-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents.
Name reference resolution in organizational email archives
- In SIAM
, 2006
Cited by 18 (5 self)
Online communications provide a rich resource for understanding social networks. Information about the actors, and their dynamic roles and relationships, can be inferred from both the communication content and traffic structure. A key component in the analysis of online communications such as email is the resolution of name references within the body of the message. Name reference resolution relies on the context of the message; both the content of the message and the sender and recipients’ relationships can help to resolve a reference. Here we investigate a variety of approaches which make use of the email traffic network to disambiguate email name references. The email traffic network serves as a proxy for inferring relationships. These relationships in turn help us infer likely candidates for the name references. Our initial findings suggest that simple temporal models can help us effectively resolve name references. For the class of models proposed, performance is maximized by exploiting long-term traffic statistics to rank candidates.
Helios: Heterogeneous multiprocessing with satellite kernels
- In Proceedings of the 22nd ACM Symposium on Operating Systems Principles
, 2009
Cited by 18 (0 self)
Helios is an operating system designed to simplify the task of writing, deploying, and tuning applications for heterogeneous platforms. Helios introduces satellite kernels, which export a single, uniform set of OS abstractions across CPUs of disparate architectures and performance characteristics. Access to I/O services such as file systems is made transparent via remote message passing, which extends a standard microkernel message-passing abstraction to a satellite kernel infrastructure. Helios retargets applications to available ISAs by compiling from an intermediate language. To simplify deploying and tuning application performance, Helios exposes an affinity metric to developers. Affinity provides a hint to the operating system about whether a process would benefit from executing on the same platform as a service it depends upon. We developed satellite kernels for an XScale programmable I/O card and for cache-coherent NUMA architectures. We offloaded several applications and operating system components, often by changing only a single line of metadata. We show up to a 28% performance improvement by offloading tasks to the XScale I/O card. On a mail-server benchmark, we show a 39% improvement in performance by automatically splitting the application among multiple NUMA domains.
Email thread reassembly using similarity matching
- In Proc. of CEAS
, 2006
Cited by 15 (0 self)
Email thread reassembly is the task of linking messages by parent-child relationships. In this paper, we present two approaches to address this problem. One exploits previously undocumented header information from the Microsoft Exchange Protocol. The other uses string similarity metrics and a heuristic algorithm to reassemble threads in the absence of header information. The pros and cons of both methods are discussed. The similarity matching method is evaluated using the Enron email corpus and found to perform well.
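To make the idea of header-free thread reassembly concrete, here is a rough sketch of linking each message to its most similar earlier message. It is not the paper's method: the abstract does not specify the metrics or the heuristic, so this stand-in uses `difflib.SequenceMatcher` from the Python standard library, and the message fields (`id`, `time`, `subject`, `body`), the 0.7/0.3 weighting, and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_subject(subject):
    """Drop leading reply/forward prefixes such as 'Re:' or 'FW:' and lowercase."""
    stripped = re.sub(r"^\s*(?:(?:re|fw|fwd)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def similarity(a, b):
    """String similarity in [0, 1] (stand-in metric, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def link_parents(messages, threshold=0.6):
    """Map each message id to the id of its most similar earlier message.

    Messages whose best score does not exceed `threshold` are treated as
    thread roots and map to None. Weights and threshold are assumptions.
    """
    ordered = sorted(messages, key=lambda m: m["time"])
    parents = {}
    for i, msg in enumerate(ordered):
        best_id, best_score = None, threshold
        for cand in ordered[:i]:  # only earlier messages can be parents
            subj = similarity(normalize_subject(msg["subject"]),
                              normalize_subject(cand["subject"]))
            body = similarity(msg["body"], cand["body"])
            score = 0.7 * subj + 0.3 * body
            if score > best_score:
                best_id, best_score = cand["id"], score
        parents[msg["id"]] = best_id
    return parents


msgs = [
    {"id": "m1", "time": 1, "subject": "Budget meeting", "body": "Can we meet Tuesday?"},
    {"id": "m2", "time": 2, "subject": "RE: Budget meeting", "body": "Yes. > Can we meet Tuesday?"},
    {"id": "m3", "time": 3, "subject": "Lunch", "body": "Pizza today?"},
]
print(link_parents(msgs))
```

Comparing every message against every earlier one is quadratic in the number of messages, so a corpus the size of Enron would need some form of candidate pruning (for example by participants or time window) before scoring.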
NER Systems that Suit User's Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction
- In HLT/NAACL
, 2006
Cited by 13 (4 self)
We describe a method based on “tweaking” an existing learned sequential classifier to change the recall-precision tradeoff, guided by a user-provided performance criterion. This method is evaluated on the task of recognizing personal names in email and newswire text, and proves to be both simple and effective.

