Base-calling of automated sequencer traces using phred. I. Accuracy Assessment
- Genome Res, 1998
Cited by 1653 (4 self)
The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%–50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.
The diploid genome sequence of an individual human
- PLoS Biol
Cited by 293 (6 self)
Presented here is a genome sequence of an individual human. It was produced from ~32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of
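The 22% and 74% figures in the abstract above can be roughly reproduced from the counts it quotes. The sketch below is only an illustration, not the authors' calculation: it assumes each SNP contributes exactly one variant base, takes the rounded "12.3 Mb" at face value, and leaves out segmental duplications and copy number variation regions because the abstract gives no counts for them. All variable names are hypothetical.

```python
# Event counts as quoted in the abstract. Segmental duplications and copy
# number variation regions are omitted because no counts are given for them.
event_counts = {
    "SNPs": 3_213_401,
    "block substitutions": 53_823,
    "heterozygous indels": 292_102,
    "homozygous indels": 559_473,
    "inversions": 90,
}
total_variant_bases = 12_300_000  # "12.3 Mb", a rounded figure

# Share of events that are not SNPs.
total_events = sum(event_counts.values())
non_snp_events = total_events - event_counts["SNPs"]
non_snp_event_share = non_snp_events / total_events

# Assumption: every SNP accounts for exactly one variant base, so the
# remaining variant bases are attributed to non-SNP events.
snp_bases = event_counts["SNPs"]
non_snp_base_share = (total_variant_bases - snp_bases) / total_variant_bases

print(f"{total_events:,} events, {non_snp_event_share:.1%} non-SNP")
print(f"{non_snp_base_share:.1%} of variant bases fall in non-SNP events")
```

Under these assumptions the two shares round to the 22% and 74% stated in the abstract, which is why a minority of events can dominate the variant bases: a single indel or inversion can span many bases, while a SNP spans one.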
Complete Genome Sequence of Methanobacterium thermoautotrophicum ΔH: Functional . . .
- J. Bacteriol, 1997
Cited by 199 (4 self)
the ORF-encoded polypeptides are related to sequences with unknown functions, and 496 (27%) have little or no homology to sequences in public databases. Comparisons with Eucarya-, Bacteria-, and Archaea-specific databases reveal that 1,013 of the putative gene products (54%) are most similar to polypeptide sequences described previously for other organisms in the domain Archaea. Comparisons with the Methanococcus jannaschii genome data underline the extensive divergence that has occurred between these two methanogens; only 352 (19%) of M. thermoautotrophicum ORFs encode sequences that are >50% identical to M. jannaschii polypeptides, and there is little conservation in the relative locations of orthologous genes. When the M. thermoautotrophicum ORFs are compared to sequences from only the eucaryal and bacterial domains, 786 (42%) are more similar to bacterial sequences and 241 (13%) are more similar to eucaryal sequences. The bacterial domain-like gene products include the ma
PCR mapping of integrons reveals several novel combinations of resistance genes
- Antimicrobial Agents and Chemotherapy, 1995
Cited by 191 (4 self)
ARACHNE: a whole-genome shotgun assembler
- Genome Res, 2002
Cited by 177 (7 self)
We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing ∼10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded ∼98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically, deletion of ∼1 kb in size), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory. Shotgun sequencing was introduced by Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years. The method involves obtaining random
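The N50 contig and supercontig lengths quoted in the abstract above are summary statistics of an assembly. A minimal sketch of the usual definition follows: the N50 is the length L such that contigs of length at least L together hold at least half of all assembled bases. The function name `n50` and the example lengths are hypothetical, and ARACHNE's own reporting may differ in details such as how gaps are counted.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Illustrative lengths only, not data from the paper.
example_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_contigs))
```

Because the statistic weights contigs by their length, a handful of long contigs can give a high N50 even when the assembly also contains many short fragments.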
The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific
- PLoS Biol, 2007
Cited by 151 (6 self)
The world’s oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition.
Hepatitis C virus (HCV) circulates as a population of different but closely related genomes: quasispecies nature of HCV genome distribution
- 1992
Cited by 146 (10 self)
Analysis of a complete library of putative drug transporter genes in Escherichia coli
- J Bacteriol
Cited by 122 (15 self)
Substantial biases in ultra-short read data sets from high-throughput DNA sequencing
- Nucleic Acids Res, 2008
Cited by 121 (0 self)
Novel sequencing technologies permit the rapid production of large sequence data sets. These technologies are likely to revolutionize genetics and biomedical research, but a thorough characterization of the ultra-short read output is necessary. We generated and analyzed two Illumina 1G ultra-short read data sets, i.e. 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A>C transversion being among the most frequent and C>G transversions among the least frequent substitution errors. Insertions and deletions of single bases occur at very low rates. When simulating re-sequencing we found a 20-fold sequencing coverage to be sufficient to compensate errors by correct reads. The read coverage of the sequenced regions is biased; the highest read density was found in intervals with elevated GC content. High Solexa quality scores are over-optimistic and low scores underestimate the data quality. Our results show different types of biases and ways to detect them. Such biases have implications on the use and interpretation of Solexa data, for de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, as well as for transcriptome analysis.
Genetic and functional analysis of the multiple antibiotic resistance (mar) locus in Escherichia coli
- J, 1993
Cited by 101 (19 self)