Unsupervised Named-Entity Extraction from the Web: An Experimental Study
ARTIFICIAL INTELLIGENCE, 2005
Cited by 205 (37 self)
The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL’s novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but suggested a challenge: How can we improve KNOWITALL’s recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL’s domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on named-entity extraction, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall, while maintaining high precision, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Efficient Techniques for Document Sanitization
Cited by 1 (0 self)
Sanitization (syn. redaction) of a document involves removing sensitive information from the document, in order to reduce the document’s classification level, possibly yielding an unclassified document. A document may need to be sanitized for a variety of reasons. Government departments usually need to declassify documents before making them public, for instance, in response to Freedom of Information requests. In hospitals, medical records are sanitized to remove sensitive patient information (patient identity information, diagnoses of deadly diseases, etc.). Document sanitization is also critical to companies that need to prevent mala fide or inadvertent disclosure of proprietary information while sharing data with outsourced operations. In this paper, we propose ERASE (Efficient RedAction for Securing Entities), a system for performing document sanitization automatically.

