Results 1 - 8 of 8
Geotagging one hundred million Twitter accounts with total variation minimization
- In 2014 IEEE International Conference on Big Data (Big Data), 2014
"... Abstract—Geographically annotated social media is ex-tremely valuable for modern information retrieval. However, when researchers can only access publicly-visible data, one quickly finds that social media users rarely publish location information. In this work, we provide a method which can geolocat ..."
Abstract - Cited by 8 (3 self)
Geographically annotated social media is extremely valuable for modern information retrieval. However, when researchers can only access publicly-visible data, one quickly finds that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location sharing preferences, using only publicly-visible Twitter data. Our method infers an unknown user's location by examining their friends' locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user's ego network can be used as a per-user accuracy measure which is effective at removing outlying errors. Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.38 km, allowing us to geotag over 80% of public tweets.
Keywords: Social and Information Networks; Data mining; Optimization
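As a rough illustration of the friend-based inference described in this abstract, the sketch below propagates known coordinates over a friendship graph with a median-based update. The toy graph, user ids, and the coordinate-wise median (standing in for the paper's total-variation objective and distributed solver) are assumptions made only for the example.

```python
# Illustrative sketch (not the paper's distributed TV solver): propagate known
# locations over a friendship graph by repeatedly moving each unlabeled user
# toward a robust (median-based) summary of its friends' current estimates.
import numpy as np

def infer_locations(edges, known, n_users, n_iters=50):
    """edges: list of (u, v) friendship pairs; known: {user: (lat, lon)}."""
    neighbors = [[] for _ in range(n_users)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Initialize every user at the centroid of the known locations.
    est = np.tile(np.mean(list(known.values()), axis=0), (n_users, 1))
    for u, loc in known.items():
        est[u] = loc

    for _ in range(n_iters):
        new_est = est.copy()
        for u in range(n_users):
            if u in known or not neighbors[u]:
                continue  # labeled users stay fixed
            # Coordinate-wise median of friends' estimates: a crude stand-in
            # for the L1/total-variation objective described in the abstract.
            new_est[u] = np.median(est[neighbors[u]], axis=0)
        est = new_est
    return est

if __name__ == "__main__":
    edges = [(0, 2), (1, 2), (2, 3), (3, 4)]
    known = {0: (40.7, -74.0), 1: (40.8, -73.9), 4: (34.0, -118.2)}
    print(infer_locations(edges, known, n_users=5))
```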
Continuum limit of total variation on point clouds
, 2014
"... We consider point clouds obtained as random samples of a measure on a Euclidean domain. A graph representing the point cloud is obtained by assigning weights to edges based on the distance be-tween the points they connect. Our goal is to develop mathematical tools needed to study the consistency, a ..."
Abstract - Cited by 2 (0 self)
We consider point clouds obtained as random samples of a measure on a Euclidean domain. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. Our goal is to develop mathematical tools needed to study the consistency, as the number of available data points increases, of graph-based machine learning algorithms for tasks such as clustering. In particular, we study when the cut capacity, and more generally the total variation, on these graphs is a good approximation of the perimeter (total variation) in the continuum setting. We address this question in the setting of Γ-convergence. We obtain almost optimal conditions on the scaling, as the number of points increases, of the size of the neighborhood over which the points are connected by an edge for the Γ-convergence to hold. Taking the limit is enabled by a new metric which allows one to suitably compare functionals defined on different point clouds.
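A minimal sketch of the discrete object under study, assuming an ε-neighborhood graph with simple indicator weights (the paper's scaling analysis is not reproduced here): the graph total variation of a labeling u is GTV(u) = (1/2) Σ_{i,j} w_ij |u_i − u_j|.

```python
# Build an epsilon-neighborhood graph on a random point cloud and evaluate the
# (unnormalized) graph total variation of a labeling u.  The indicator-kernel
# weights and the value of eps are illustrative assumptions.
import numpy as np

def graph_total_variation(points, u, eps):
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    w = (dist < eps).astype(float)          # 1 if the points are eps-close, else 0
    np.fill_diagonal(w, 0.0)
    return 0.5 * np.sum(w * np.abs(u[:, None] - u[None, :]))

rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
labels = (pts[:, 0] > 0.5).astype(float)    # indicator of a half-plane
print(graph_total_variation(pts, labels, eps=0.15))
```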
Minimal Dirichlet energy partitions for graphs
, 2014
"... Motivated by a geometric problem, we introduce a new non-convex graph partitioning objective where the optimality criterion is given by the sum of the Dirichlet eigenvalues of the partition components. A relaxed formulation is identified and a novel rearrangement algorithm is proposed, which we show ..."
Abstract - Cited by 2 (0 self)
Motivated by a geometric problem, we introduce a new non-convex graph partitioning objective where the optimality criterion is given by the sum of the Dirichlet eigenvalues of the partition components. A relaxed formulation is identified and a novel rearrangement algorithm is proposed, which we show is strictly decreasing and converges in a finite number of iterations to a local minimum of the relaxed objective function. Our method is applied to several clustering problems on graphs constructed from synthetic data, MNIST handwritten digits, and manifold discretizations. The model has a semi-supervised extension and provides a natural representative for the clusters as well.
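The quantity being minimized can be sketched as follows, assuming a toy graph and a fixed partition: the objective is the sum, over partition components, of the smallest eigenvalue of the graph Laplacian restricted to that component (the Dirichlet restriction). The rearrangement algorithm itself is not reproduced here.

```python
# Evaluate the sum of first Dirichlet eigenvalues of a graph partition.
# The graph and the partition below are toy assumptions for illustration.
import numpy as np

def dirichlet_energy(adjacency, partition):
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    total = 0.0
    for part in partition:
        idx = np.array(sorted(part))
        sub = laplacian[np.ix_(idx, idx)]     # Dirichlet restriction to the part
        total += np.linalg.eigvalsh(sub)[0]   # its smallest eigenvalue
    return total

# Toy graph: a 6-cycle split into two arcs of three vertices each.
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
print(dirichlet_energy(A, [{0, 1, 2}, {3, 4, 5}]))
```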
Local barycentric coordinates
- ACM Trans. Graph
, 2014
"... manipulated control points are deformed, as indicated by the logarithmic color-coding of the displacement magnitude. Barycentric coordinates yield a powerful and yet simple paradigm to interpolate data values on polyhedral domains. They represent interior points of the domain as an affine combinatio ..."
Abstract - Cited by 1 (1 self)
Barycentric coordinates yield a powerful and yet simple paradigm to interpolate data values on polyhedral domains. They represent interior points of the domain as an affine combination of a set of control points, defining an interpolation scheme for any function defined on a set of control points. Numerous barycentric coordinate schemes have been proposed satisfying a large variety of properties. However, they typically define interpolation as a combination of all control points. Thus a local change in the value at a single control point will create a global change by propagation into the whole domain. In this context, we present a family of local barycentric coordinates (LBC), which select for each interior point a small set of control points and satisfy common requirements on barycentric coordinates, such as linearity, non-negativity, and smoothness. LBC are achieved through a convex optimization based on total variation, and provide a compact representation that reduces memory footprint and allows for fast deformations. Our experiments show that LBC provide more local and finer control on shape deformation than previous approaches, and lead to more intuitive deformation results.
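For background on the interpolation paradigm the abstract refers to, here is a minimal sketch of ordinary barycentric interpolation on a single triangle; it illustrates the affine-combination scheme only and is not the paper's TV-based local construction. The triangle, query point, and data values are made up for the example.

```python
# Standard barycentric coordinates on a triangle, used to interpolate values
# given at the three control points (an affine combination that sums to one).
import numpy as np

def triangle_barycentric(p, a, b, c):
    """Return (wa, wb, wc) with wa + wb + wc = 1 and wa*a + wb*b + wc*c = p."""
    T = np.column_stack((b - a, c - a))
    wb, wc = np.linalg.solve(T, p - a)
    return 1.0 - wb - wc, wb, wc

a, b, c = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
values = {"a": 10.0, "b": 20.0, "c": 40.0}   # data at the control points
wa, wb, wc = triangle_barycentric(np.array([0.25, 0.25]), a, b, c)
print(wa * values["a"] + wb * values["b"] + wc * values["c"])
```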
Consistency of Cheeger and ratio graph cuts
, 2014
"... This paper establishes the consistency of a family of graph-cut-based algorithms for clus-tering of data clouds. We consider point clouds obtained as samples of a ground-truth measure. We in-vestigate approaches to clustering based on minimizing objective functionals defined on proximity graphs of t ..."
Abstract
This paper establishes the consistency of a family of graph-cut-based algorithms for clustering of data clouds. We consider point clouds obtained as samples of a ground-truth measure. We investigate approaches to clustering based on minimizing objective functionals defined on proximity graphs of the given sample. Our focus is on functionals based on graph cuts like the Cheeger and ratio cuts. We show that minimizers of these cuts converge, as the sample size increases, to a minimizer of a corresponding continuum cut (which partitions the ground-truth measure). Moreover, we obtain sharp conditions on how the connectivity radius can be scaled with respect to the number of sample points for the consistency to hold. We provide results for two-way and for multiway cuts. Furthermore, we provide numerical experiments that illustrate the results and explore the optimality of scaling in dimension two.
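A minimal sketch of the discrete functionals whose minimizers are studied, assuming a toy weight matrix: for a two-way partition (S, S^c), the ratio cut is cut(S, S^c)(1/|S| + 1/|S^c|) and the Cheeger cut is cut(S, S^c)/min(|S|, |S^c|).

```python
# Evaluate the ratio cut and Cheeger cut of a given two-way partition of a
# weighted graph.  The weight matrix and partition are toy assumptions.
import numpy as np

def two_way_cuts(W, S):
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[list(S)] = True
    cut = W[mask][:, ~mask].sum()            # total weight crossing the partition
    s, sc = mask.sum(), (~mask).sum()
    ratio_cut = cut * (1.0 / s + 1.0 / sc)
    cheeger_cut = cut / min(s, sc)
    return ratio_cut, cheeger_cut

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(two_way_cuts(W, {0, 1, 2}))
```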
Tight Continuous Relaxation of the Balanced k-Cut Problem
"... Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the compu-tation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or heuristics which have we ..."
Abstract
Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or heuristics which have weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related, recently proposed relaxation is in most cases loose, leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.
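For context, a sketch of the standard spectral-clustering baseline the abstract contrasts with (eigenvectors of the normalized Laplacian followed by k-means); this is not the paper's tight relaxation or its sum-of-ratios algorithm, and the toy graph is an assumption.

```python
# Spectral clustering as a relaxation of the normalized cut: embed vertices via
# the k smallest eigenvectors of the normalized Laplacian, then run k-means.
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, k):
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - d_inv_sqrt @ W @ d_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(L_sym)
    embedding = vecs[:, :k]                                 # k smallest eigenvectors
    embedding /= np.linalg.norm(embedding, axis=1, keepdims=True)
    _, labels = kmeans2(embedding, k, minit="points")
    return labels

# Two triangles joined by a single edge: the two triangles should be recovered.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
print(spectral_clustering(W, 2))
```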
An Incremental Reseeding Strategy for Clustering
"... In this work we propose a simple and easily parallelizable algorithm for multiway graph par-titioning. The algorithm alternates between three basic components: diffusing seed vertices over the graph, thresholding the diffused seeds, and then randomly reseeding the thresholded clus-ters. We demonstra ..."
Abstract
In this work we propose a simple and easily parallelizable algorithm for multiway graph partitioning. The algorithm alternates between three basic components: diffusing seed vertices over the graph, thresholding the diffused seeds, and then randomly reseeding the thresholded clusters. We demonstrate experimentally that the proper combination of these ingredients leads to an algorithm that achieves state-of-the-art performance in terms of cluster purity on standard benchmark datasets. Moreover, the algorithm runs an order of magnitude faster than the other algorithms that achieve comparable results in terms of accuracy [1]. We also describe a coarsen, cluster, and refine approach similar to [2, 3] that removes an additional order of magnitude from the runtime of our algorithm while still maintaining competitive accuracy.
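A rough sketch of the three ingredients listed above, assuming a random-walk diffusion operator, a fixed number of diffusion steps, and a seed schedule that grows with the round; these choices are illustrative and not the authors' exact implementation.

```python
# Alternate between planting random seeds in each current cluster, diffusing the
# seed indicators over the graph, and thresholding to get the next partition.
import numpy as np

def incremental_reseeding(W, k, n_rounds=20, seeds_per_cluster=1, n_diffusion=5):
    rng = np.random.default_rng(0)
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)          # random-walk diffusion operator
    labels = rng.integers(k, size=n)              # arbitrary initial partition
    for r in range(1, n_rounds + 1):
        # 1. Plant r * seeds_per_cluster random seeds inside each current cluster.
        F = np.zeros((n, k))
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                continue
            seeds = rng.choice(members, size=min(r * seeds_per_cluster, len(members)),
                               replace=False)
            F[seeds, c] = 1.0
        # 2. Diffuse the seed indicators over the graph.
        for _ in range(n_diffusion):
            F = P @ F
        # 3. Threshold: assign each vertex to its largest diffused value.
        labels = F.argmax(axis=1)
    return labels

W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
print(incremental_reseeding(W, 2))
```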