### A. Appendix

#### A.1. Convergence in Log-likelihood

We experiment with all four algorithms (EM, anti-annealing, BFGS, ECG) on the three datasets described in the main paper (Section 5.1). We run each algorithm 10 times on each dataset and observe how each converges in terms of log-likelihood. The average log-likelihood values are plotted in Figure A.1. For closer examination, we also plot zoomed-in versions of each plot (bottom row). We observe that all four algorithms converge fairly quickly to values close to the optimum log-likelihood, even though they are still far from the true parameters in parameter space. The reason for this behavior is that the log-likelihood is dominated by the larger clusters, whose parameters are learned quickly. Although regular EM exhibits slow convergence for the parameters of smaller clusters, these smaller clusters have relatively little impact on the log-likelihood. The zoomed-in plots show that the deterministic anti-annealing method achieves a slightly better average log-likelihood than the other methods, because it learns the parameters of smaller clusters quickly and more accurately. The non-monotonic behavior of the log-likelihood for the deterministic anti-annealing method is due to the changes in temperature, which effectively change the objective function being optimized.

We also plot the distribution of the final log-likelihood values over the 10 repeated runs (with random initialization) for all four algorithms. The results are shown in Figure A.1. While there are only minor differences in the final log-likelihood values, we see that the deterministic anti-annealing method is more stable and consistently provides slightly better average log-likelihood values.
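The behavior described above — the overall log-likelihood plateauing quickly because large clusters dominate it, while small-cluster parameters lag — can be reproduced with a minimal EM sketch on a synthetic imbalanced 1-D mixture. This is illustrative only (the 90/10 split, parameter values, and variable names are assumptions, not the paper's datasets or implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced 1-D mixture: a large cluster (90%) and a small one (10%).
# These proportions are illustrative, not the paper's datasets.
x = np.concatenate([rng.normal(0.0, 1.0, 900),
                    rng.normal(5.0, 1.0, 100)])

def log_likelihood(x, w, mu, sigma):
    # log p(x) = sum_n log sum_k w_k N(x_n | mu_k, sigma_k^2)
    comp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma**2)
            - 0.5 * (x[:, None] - mu) ** 2 / sigma**2)
    return np.logaddexp.reduce(comp, axis=1).sum()

# Deliberately poor initialization, far from the true parameters.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([2.0, 2.0])

ll_trace = []
for _ in range(50):
    # E-step: responsibilities via log-sum-exp for stability.
    comp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma**2)
            - 0.5 * (x[:, None] - mu) ** 2 / sigma**2)
    r = np.exp(comp - np.logaddexp.reduce(comp, axis=1, keepdims=True))
    # M-step: responsibility-weighted updates.
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    ll_trace.append(log_likelihood(x, w, mu, sigma))
```

Plotting `ll_trace` shows most of the gain arrives in the first few iterations, while the small cluster's mean and weight keep drifting afterward — EM's monotone log-likelihood guarantee hides that slow parameter-space convergence.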
#### A.2. Gradients used by ECG and BFGS

Conjugate gradient and quasi-Newton (BFGS) methods are known to outperform first-order gradient-based methods in special cases where the objective function is elongated, so that the conjugate direction is a better search direction than the steepest-descent direction. These methods do not need to compute the Hessian explicitly and can work with gradient evaluations alone. We derive here the gradient functions used by both the conjugate gradient and BFGS methods. Let Q denote the complete log-likelihood function:
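The derivation of Q and its gradients is not reproduced in this extract. As a hedged illustration of how such gradients are consumed by a quasi-Newton optimizer, the sketch below uses the standard gradient of the (incomplete) GMM log-likelihood with respect to the component means only — weights and variances held fixed, and all data and names illustrative rather than the paper's — and passes it to SciPy's BFGS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
# Two well-separated unit-variance components (illustrative data).
x = np.concatenate([rng.normal(-2.0, 1.0, 500),
                    rng.normal(3.0, 1.0, 500)])
w = np.array([0.5, 0.5])   # weights and variance held fixed for simplicity
sigma2 = 1.0

def neg_ll_and_grad(mu):
    # Per-point, per-component log w_k + log N(x_n | mu_k, sigma^2).
    comp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (x[:, None] - mu) ** 2 / sigma2)
    lse = np.logaddexp.reduce(comp, axis=1)
    r = np.exp(comp - lse[:, None])          # responsibilities
    # d(-log L)/d mu_k = -sum_n r_nk (x_n - mu_k) / sigma^2
    grad = -(r * (x[:, None] - mu)).sum(axis=0) / sigma2
    return -lse.sum(), grad

# jac=True tells SciPy the objective returns (value, gradient) together.
res = minimize(neg_ll_and_grad, x0=np.array([-1.0, 1.0]),
               jac=True, method="BFGS")
```

Switching `method="BFGS"` to `method="CG"` exercises the same gradient with nonlinear conjugate gradient; the point of returning value and gradient from one function is that the E-step-like responsibility computation is shared between them.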
