Extracting XML data from HTML repositories

Files in this item

File: ubc_2004-0707.pdf
Size: 3.073Mb
Format: Adobe Portable Document Format
Title: Extracting XML data from HTML repositories
Author: Zhang, Ruth Yuee
Degree: Master of Science - MSc
Program: Statistics
Copyright Date: 2004
Abstract: There is a vast amount of valuable information in HTML documents, widely distributed across the World Wide Web and across corporate intranets. Unfortunately, HTML is mainly presentation-oriented and hard to query. While XML is becoming a standard for online data representation and exchange, there is a huge amount of legacy HTML data containing potentially untapped information. We develop a system to extract desired information (records) from thousands of HTML documents, starting from a small set of examples. Duplicates in the result are automatically detected and eliminated. The result is automatically converted to XML. We propose a novel method to estimate the current coverage of results by the system, based on capture-recapture models with unequal capture probabilities. We also propose techniques for estimating the error rate of the extracted information and an interactive technique for enhancing information quality. To evaluate the method and ideas proposed in this paper, we conduct an extensive set of experiments. The experimental results validate the effectiveness and utility of our system, and demonstrate interesting tradeoffs between running time of information extraction and coverage of results.
URI: http://hdl.handle.net/2429/15823
Series/Report no.: UBC Retrospective Theses Digitization Project [http://www.library.ubc.ca/archives/retro_theses/]

All items in cIRcle are protected by copyright, with all rights reserved.

UBC Library
1961 East Mall
Vancouver, B.C.
Canada V6T 1Z1
Tel: 604-822-6375
Fax: 604-822-3893