Integer programming approaches to haplotype inference by pure parsimony
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006
Cited by 43 (2 self)
Abstract—In 2003, Gusfield introduced the Haplotype Inference by Pure Parsimony (HIPP) problem and presented an integer program (IP) that quickly solved many simulated instances of the problem [1]. Although it solved well on small instances, Gusfield’s IP can be of exponential size in the worst case. Several authors [2], [3] have presented polynomial-sized IPs for the problem. In this paper, we further the work on IP approaches to HIPP. We extend the existing polynomial-sized IPs by introducing several classes of valid cuts for the IP. We also present a new polynomial-sized IP formulation that is a hybrid between two existing IP formulations and inherits many of the strengths of both. Many problems that are too complex for the exponential-sized formulations can still be solved in our new formulation in a reasonable amount of time. We provide a detailed empirical comparison of these IP formulations on both simulated and real genotype sequences. Our formulation can also be extended in a variety of ways to allow errors in the input or model the structure of the population under consideration. Index Terms—Computations on discrete structures, integer programming, biology and genetics, haplotype inference.
TOWARD AN ALGEBRAIC UNDERSTANDING OF HAPLOTYPE INFERENCE BY PURE PARSIMONY
Cited by 2 (0 self)
Haplotype inference by pure parsimony (HIPP) is known to be NP-hard. Despite this, many algorithms successfully solve HIPP instances on simulated and real data. In this paper, we explore the connection between algebraic rank and the HIPP problem, to help identify easy and hard instances of the problem. The rank of the input matrix is known to be a lower bound on the size of an optimal HIPP solution. We show that this bound is almost surely tight for data generated by randomly pairing p haplotypes derived from a perfect phylogeny when the number of distinct population members is more than ((1+ε)/2) p log p (for some positive ε). Moreover, with only a constant multiple more population members, and a common mutation, we can almost surely recover an optimal set of haplotypes in polynomial time. We examine the algebraic effect of allowing recombination, and bound the effect recombination has on rank. In the process, we prove a stronger version of the standard haplotype lower bound. We also give a complete classification of the rank of a haplotype matrix derived from a galled tree. This classification identifies a set of problem instances with recombination for which the rank lower bound is also tight for the HIPP problem.
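The rank lower bound mentioned in this abstract can be illustrated with a small sketch. It assumes an additive encoding that the listing itself does not spell out: each haplotype is a 0/1 vector and each genotype is the element-wise sum of its two haplotypes, so every genotype row is a linear combination of haplotype rows and the rank of the genotype matrix cannot exceed the number of distinct haplotypes. The toy haplotypes, the pairing list, and the helper name `matrix_rank` are hypothetical.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank over the rationals, computed by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy data: three 0/1 haplotypes over four SNP sites.
haplotypes = [(0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 0, 0)]
# Each genotype pairs two haplotypes (assumed additive encoding: g = h_i + h_j).
pairs = [(0, 1), (0, 2), (1, 2), (0, 0)]
genotypes = [
    [a + b for a, b in zip(haplotypes[i], haplotypes[j])] for i, j in pairs
]

rank_lower_bound = matrix_rank(genotypes)
print(rank_lower_bound, "<=", len(haplotypes))
```

On this toy instance the rank equals the number of haplotypes used, so the bound happens to be tight; that is a property of the chosen example, not a general guarantee.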
Family Trios Phasing and Missing data recovery
Cited by 1 (0 self)
Although there exist many phasing methods for unrelated adults or pedigrees, phasing and missing data recovery for data representing family trios is lagging behind. This work is an attempt to fill this gap by considering the following problem. Given a set of genotypes partitioned into family trios, find for each trio a quartet of parent haplotypes which agree
Haplotype Tagging using Support Vector Machines
Abstract — Constructing a complete human haplotype map can help in associating complex diseases with SNPs (single nucleotide polymorphisms). Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNPs that should be sequenced to a small number of informative representatives called tag SNPs. Depending on the application, tagging can either save budget by inferring non-tag SNPs from tag SNPs or shorten the lengthy and difficult-to-handle SNP sequences obtained from the Affymetrix Map Array. Tagging should first choose which SNPs to use as tags and then predict the unknown non-tag SNPs from the known tags. In this paper we propose a new SNP prediction method using a robust tool for classification, the Support Vector Machine (SVM). For tag selection we use a fast stepwise tag selection algorithm. An extensive experimental study on various datasets, including 3 regions from HapMap, shows that tag selection based on SVM SNP prediction can reach the same prediction accuracy as the methods of Halldorson et al. [7] on the LPL using significantly fewer tags. For example, our method reaches 90% SNP prediction accuracy using only 3 tags for the Daly et al. [6] dataset with 103 SNPs. The proposed tagging method is also more accurate (but considerably slower) than the multivariate linear regression method of He et al. [12].
Phasing and Missing Data Recovery in Family
Abstract. Although there exist many phasing methods for unrelated adults or pedigrees, phasing and missing data recovery for data representing family trios is lagging behind. This paper is an attempt to fill this gap by considering the following problem. Given a set of genotypes partitioned into family trios, find for each trio a quartet of parent haplotypes which agree with all three genotypes and recover the SNP values missing from the given genotype data. Our contributions include (i) formulating the pure-parsimony trio phasing and the trio missing data recovery problems, (ii) proposing two new solution methods, one greedy and one based on integer linear programming, and (iii) extensive experimental validation of the proposed methods showing an advantage over the previously known methods.
ILP Methods for Family Trio Phasing
In population genotyping, it is common to genotype family trios consisting of two parents and their child, since this allows haplotypes to be recovered with higher confidence. Interestingly, the available software tools are primarily intended to phase only unrelated genotypes. In this section we first formulate the problem and describe the specifics of family trio phasing, and then analyze existing computational tools and discuss the pure parsimony objective. In the following section we give three integer linear program formulations and compare their runtimes on the Daly et al. [18] data. The haplotypes of children are much harder to recover than the haplotypes of parents, since we do not know which recombinations may have happened when the parents' haplotypes were inherited by a child. Therefore, for simplicity, we assume no recombinations in child chromosomes and that exactly one child chromosome is inherited from one parent and the other from the other parent. Formally, given a set of genotypes partitioned into family trios, the Trio Phasing Problem (TPP) asks to find for each trio a quartet of parent haplotypes which agree with all three genotypes. A simple logical analysis can substantially decrease the uncertainty of phasing. For example, for two SNPs in a trio with parent genotypes f = 22 and m = 02, and the child genotype k = 01, there is a unique feasible phasing of the parents: f1 = 10, f2 = 01, m1 = 01, m2 = 00 such that
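The two-SNP trio example in this abstract can be checked by brute force. The sketch below is not the authors' ILP method; it simply enumerates every phasing of both parents and keeps those where the child can inherit one haplotype from each parent without recombination. It assumes the common encoding in which 0 and 1 denote the two homozygous states and 2 denotes a heterozygous site; the function names are hypothetical.

```python
from itertools import product


def phasings(genotype):
    """All ordered haplotype pairs consistent with a genotype string.

    Assumed encoding: '0' or '1' = homozygous site, '2' = heterozygous site.
    """
    options = []
    for g in genotype:
        if g == "2":
            options.append([("0", "1"), ("1", "0")])
        else:
            options.append([(g, g)])
    pairs = []
    for combo in product(*options):
        h1 = "".join(a for a, _ in combo)
        h2 = "".join(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def genotype_of(h1, h2):
    """Genotype string produced by combining two haplotypes."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def trio_phasings(f, m, k):
    """Unordered parent phasings under which the child genotype k is feasible."""
    feasible = set()
    for f1, f2 in phasings(f):
        for m1, m2 in phasings(m):
            # The child inherits one whole haplotype from each parent.
            if any(genotype_of(fh, mh) == k for fh in (f1, f2) for mh in (m1, m2)):
                feasible.add((tuple(sorted((f1, f2))), tuple(sorted((m1, m2)))))
    return feasible


result = trio_phasings("22", "02", "01")
print(result)
```

For the genotypes f = 22, m = 02, k = 01, the only surviving phasing is father {01, 10} and mother {00, 01}, which matches the haplotypes listed in the abstract up to the order within each pair.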
An Application of Integer Linear Programming to the Haplotype Inference by Pure Parsimony Problem
I hereby declare that I am the sole author of this thesis. I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research. Ian M. Harrower I further authorize the University of Waterloo to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
DISCRETE ALGORITHMS FOR ANALYSIS OF GENOTYPE DATA
2007
The accessibility of high-throughput genotyping technology makes genome-wide association studies possible for common complex diseases. When dealing with common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered over the entire genome. Optimization formulations for searching for disease-associated risk/resistance factors and for predicting disease susceptibility in a given case-control study have been introduced. Several discrete methods for disease association search, exploiting a greedy strategy and topological properties of case-control studies, have been developed. New disease susceptibility prediction methods based on the developed search methods have been validated on datasets from case-control studies for several common diseases. In our experiments, the proposed algorithms compare favorably with existing association search and susceptibility prediction methods.