Conceptualization to Develop Machine Learning Techniques for Information Extraction: Consistency Queries
Abstract:
Information extraction from documents is an increasingly urgent problem in enterprise knowledge management. Knowledge sources may be internal, such as text files and forms from business administration processes, or external, such as HTML pages. When the number of knowledge sources is large, substantial computer support is indispensable, and machine learning techniques play a crucial role. A prototypical development system named LExIKON has been developed which supports interactive information extraction from semi-structured documents. The central mechanism inside LExIKON is the learning of formal languages. These formal languages serve as parameters of so-called wrappers, which are synthesized programs performing the intended information extraction. The essence of the LExIKON technology and the functionality of the LExIKON development system are sketched by means of a sample session, documented and discussed using several screenshots. The automatic generation of (hypothetical) wrappers for information extraction through the invocation of machine learning techniques raises several questions. What can we expect of a generated wrapper in case it is not yet completely correct? Can we generate wrappers in a properly incremental fashion? To answer these practically relevant questions, a new formal framework of learning, learning by consistency queries, is introduced and studied. The overall scenario of learning by consistency queries for information extraction

