Sampling-based estimation of the number of distinct values of an attribute.
- In Proc. of the 1995 Intl. Conf. on Very Large Data Bases
, 1995
"... ..."
Adaptive cleaning for RFID data streams
, 2006
"... ABSTRACT To compensate for the inherent unreliability of RFID data streams, most RFID middleware systems employ a "smoothing filter", a sliding-window aggregate that interpolates for lost readings. In this paper, we propose SMURF, the first declarative, adaptive smoothing filter for RFID ..."
Abstract - Cited by 101 (0 self)
To compensate for the inherent unreliability of RFID data streams, most RFID middleware systems employ a "smoothing filter", a sliding-window aggregate that interpolates for lost readings. In this paper, we propose SMURF, the first declarative, adaptive smoothing filter for RFID data cleaning. SMURF models the unreliability of RFID readings by viewing RFID streams as a statistical sample of tags in the physical world, and exploits techniques grounded in sampling theory to drive its cleaning processes. Through the use of tools such as binomial sampling and π-estimators, SMURF continuously adapts the smoothing window size in a principled manner to provide accurate RFID data to applications.
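As a rough illustration of the binomial-sampling arithmetic this abstract alludes to (a sketch only: the function names, the 1 - delta completeness target, and the per-epoch read probabilities are assumptions for illustration, not SMURF's actual interface):

import math

def required_window(p_read, delta=0.05):
    # Smallest window (in epochs) such that a tag read with per-epoch
    # probability p_read (0 < p_read < 1) is seen at least once with
    # probability >= 1 - delta: P(no read in w epochs) = (1 - p_read)**w.
    return math.ceil(math.log(delta) / math.log(1.0 - p_read))

def pi_estimate(inclusion_probs):
    # Horvitz-Thompson (pi-) estimator of the tag population size: each tag
    # actually observed in the window contributes 1/pi_i, where pi_i is its
    # chance of being read at least once, e.g. pi_i = 1 - (1 - p_i)**w.
    return sum(1.0 / pi for pi in inclusion_probs if pi > 0)

For example, required_window(0.4) returns 6: a tag read 40% of the time per epoch needs a six-epoch window for 95% completeness. An adaptive filter would recompute this as observed read rates drift.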
Obtaining information while preserving privacy: A Markov perturbation method for tabular data
- In Joint Statistical Meetings
, 1997
"... Preserving privacy appears to conflict with providing information. Statistical information can, however, be provided while preserving a specified level of confidentiality protection. The general approach is to provide disclosure-limited data that maximizes its statistical utility subject to confide ..."
Abstract - Cited by 39 (11 self)
Preserving privacy appears to conflict with providing information. Statistical information can, however, be provided while preserving a specified level of confidentiality protection. The general approach is to provide disclosure-limited data that maximizes its statistical utility subject to confidentiality constraints. Disclosure limitation based on Markov chain methods that respect the underlying uncertainty in real data is examined. For use with categorical data tables a method called Markov perturbation is proposed as an extension of the PRAM method of Kooiman, Willenborg, and Gouweleeuw (1997). Markov perturbation allows cross-classified marginal totals to be maintained and promises to provide more information than the commonly used cell suppression technique.
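The PRAM step this extends is easy to state: each reported category is drawn from a transition distribution conditioned on the record's true category. A minimal sketch of that base step in Python (the 3x3 matrix and category coding are hypothetical, and the marginal-preserving machinery of Markov perturbation itself is not shown):

import random

# Hypothetical transition matrix for a 3-category attribute: row i gives the
# probabilities of releasing category j given true category i; rows sum to 1.
# A dominant diagonal keeps most records truthful while giving every record
# plausible deniability.
P = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]

def pram(values, P, rng=random.Random(0)):
    # Apply PRAM-style perturbation to a list of category indices.
    return [rng.choices(range(len(P[v])), weights=P[v])[0] for v in values]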
Satisfying disclosure restrictions with synthetic data sets
- Journal of Official Statistics
, 2002
"... To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In t ..."
Abstract - Cited by 36 (12 self)
To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
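The inference side of this can be summarized by the usual combining rules for fully synthetic data (a sketch of the formulas commonly attributed to Raghunathan, Reiter, and Rubin, not the article's own code):

from statistics import mean, variance

def combine_synthetic(point_estimates, within_variances):
    # Combine estimates computed separately on each of m >= 2 synthetic data
    # sets: q_bar is the average point estimate, b the between-set variance,
    # u_bar the average within-set variance. For fully synthetic data the
    # variance estimate is T = (1 + 1/m) * b - u_bar, truncated at zero.
    m = len(point_estimates)
    q_bar = mean(point_estimates)
    b = variance(point_estimates)
    u_bar = mean(within_variances)
    T = max((1.0 + 1.0 / m) * b - u_bar, 0.0)
    return q_bar, T

Larger m shrinks both the (1 + 1/m) inflation and the Monte Carlo noise in b, which is the trade-off behind the guidance on the number and size of the released data sets.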
To model or not to model? Competing modes of inference for finite population sampling
, 2003
"... ..."
Profiling network performance for multi-tier data center applications
- In Proc. NSDI
, 2011
"... Network performance problems are notoriously tricky to diagnose, and this is magnified when applications are often split into multiple tiers of application components spread across thousands of servers in a data center. Problems often arise in the communication between the tiers, where either the ap ..."
Abstract - Cited by 33 (9 self)
Network performance problems are notoriously tricky to diagnose, and this is magnified when applications are split into multiple tiers of application components spread across thousands of servers in a data center. Problems often arise in the communication between the tiers, where either the application or the network (or both!) could be to blame. In this paper, we present SNAP, a scalable network-application profiler that guides developers in identifying and fixing performance problems. SNAP passively collects TCP statistics and socket-call logs with low computation and storage overhead, and correlates across shared resources (e.g., host, link, switch) and connections to pinpoint the location of the problem (e.g., send buffer mismanagement, TCP/application conflicts, application-generated microbursts, or network congestion). Our one-week deployment of SNAP in a production data center (with over 8,000 servers and over 700 application components) has already helped developers uncover 15 major performance problems in application software, the network stack on the server, and the underlying network.
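SNAP's collector itself is not published with the paper, but the flavor of the per-connection counters involved can be seen on Linux through the TCP_INFO socket option. A sketch, assuming a Linux host and the widely used struct tcp_info field offsets (kernel layouts vary, so the offsets below are an assumption):

import socket
import struct

def tcp_stats(sock):
    # Poll a few kernel TCP counters for an open connection: retransmissions,
    # smoothed RTT and its variance (microseconds), and congestion window.
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    retrans = struct.unpack_from("I", raw, 36)[0]           # tcpi_retrans
    rtt_us, rttvar_us = struct.unpack_from("II", raw, 68)   # tcpi_rtt, tcpi_rttvar
    cwnd = struct.unpack_from("I", raw, 80)[0]              # tcpi_snd_cwnd
    return {"retrans": retrans, "rtt_us": rtt_us,
            "rttvar_us": rttvar_us, "cwnd": cwnd}

Sampling such counters on a timer, tagging each sample with its (host, link, application) labels, and correlating anomalies across connections that share a resource is the essence of the pinpointing step described above.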
Online Aggregation for Large MapReduce Jobs
"... In online aggregation, a database system processes a user’s aggregation query in an online fashion. At all times during processing, the system gives the user an estimate of the final query result, with the confidence bounds that become tighter over time. In this paper, we consider how online aggrega ..."
Abstract - Cited by 32 (0 self)
In online aggregation, a database system processes a user’s aggregation query in an online fashion. At all times during processing, the system gives the user an estimate of the final query result, with confidence bounds that become tighter over time. In this paper, we consider how online aggregation can be built into a MapReduce system for large-scale data processing. Given the MapReduce paradigm’s close relationship with cloud computing (in that one might expect a large fraction of MapReduce jobs to be run in the cloud), online aggregation is a very attractive technology. Since large-scale cloud computations are typically pay-as-you-go, a user can monitor the accuracy obtained in an online fashion, and then save money by killing the computation early once sufficient accuracy has been obtained.
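The estimate-plus-shrinking-bound contract is easy to picture with a running mean over a randomly ordered stream. A generic CLT-based sketch (not this paper's MapReduce estimator):

import math

class OnlineMean:
    # Running estimate of a population mean with a normal-approximation
    # confidence interval that tightens as more records are seen.
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def interval(self, z=1.96):
        # Returns (estimate, half_width); estimate +/- half_width is an
        # approximate 95% interval for z = 1.96.
        if self.n < 2:
            return self.mean, float("inf")
        half = z * math.sqrt(self.m2 / (self.n - 1) / self.n)
        return self.mean, half

A pay-as-you-go user would watch the half-width after each update and kill the job once it is tight enough.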
Using Calibration Weighting to Adjust for Nonresponse and Coverage Errors
- Survey Methodology 32
, 2006
"... Calibration forces the weighted estimates of certain variables to match known or alternatively estimated population totals called benchmarks. It can be used to correct for sample-survey nonresponse or for coverage error resulting from frame undercoverage or unit duplication. The quasi-randomization ..."
Abstract - Cited by 30 (9 self)
Calibration forces the weighted estimates of certain variables to match known or alternatively estimated population totals called benchmarks. It can be used to correct for sample-survey nonresponse or for coverage error resulting from frame undercoverage or unit duplication. The quasi-randomization theory supporting its use in nonresponse adjustment treats response as an additional phase of random sampling. The functional form of a quasi-random response model is assumed to be known, its parameter values estimated implicitly through the creation of calibration weights. Unfortunately, calibration depends upon known benchmark totals, while the variables in a plausible model for survey response are not necessarily the same as the benchmark variables. Moreover, it may be prudent to keep the number of explanatory variables in a response model small. We will address using calibration to adjust for nonresponse when the explanatory model variables and benchmark variables are allowed to differ as long as the number of benchmark variables is at least as great as the number of model variables. Data from the National Agricultural Statistics Service’s 2002 Census of Agriculture and simulations based upon that data will be used to illustrate alternative adjustments for nonresponse. The paper concludes with some remarks about extension of the methodology to adjustment for coverage error.
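For the simplest linear (GREG-type) distance, the calibration step solves in closed form for weight adjustments that reproduce the benchmark totals exactly. A sketch of that standard step (not the paper's nonresponse variant, where the response-model variables may differ from the benchmarks):

import numpy as np

def linear_calibration(d, X, t_x):
    # Adjust design weights d (n,) so the weighted totals of the benchmark
    # variables X (n, k) equal the known totals t_x (k,):
    # w_i = d_i * (1 + x_i'lam), with lam solving (X' diag(d) X) lam = t_x - X'd.
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = np.asarray(t_x, dtype=float)
    lam = np.linalg.solve(X.T @ (d[:, None] * X), t_x - X.T @ d)
    return d * (1.0 + X @ lam)

Treating response as an extra phase of random sampling, the same linear system can absorb a nonresponse adjustment, which is the direction the abstract develops.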