Abstract:
Many applications naturally have incomplete data on customers: data is gathered on only a fixed set of local variables. A more complete picture, however, can help build better models. The naïve solution to this problem, acquiring complete data for all customers, is often impractical because of the cost. A possible alternative is to acquire complete data for some customers and to use it to improve the models built. The data acquisition problem is to determine how many, and which, customers to acquire additional data from. In this paper we suggest using active learning based approaches for the data acquisition problem. In particular, we present initial methods for data acquisition and evaluate them experimentally on web usage data and UCI datasets. Results show that the methods perform well and indicate that active learning based methods for data acquisition can be a promising area for data mining research.