Paper Title

Survey Based on DOM and Visual Clues for Extracting Structure data from Web

Publication Info

Volume: 1 | Issue: 1 | Pages: 1-8

Published On

December, 2015

Abstract

This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them, and put the data in a database table. This paper proposes a new method to perform the task automatically. It consists of two steps: (1) identifying individual data records in a Web page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a technique based on tree matching over Document Object Model (DOM) trees, from which noise blocks are first removed. For step 2, we propose a method that uses visual clue information to segment data records, which is more accurate than existing methods. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, and to align and extract data from them, very accurately.
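The abstract does not say which tree matching algorithm is used in step 1. As an illustration only, the sketch below assumes the classic Simple Tree Matching recurrence, which counts the largest set of matching node pairs between two ordered, labeled trees. The `Node` class and the `tree_size` and `similarity` helpers are hypothetical names introduced for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching of two trees.

    Assumption: this is the Simple Tree Matching recurrence, not
    necessarily the exact algorithm used in the paper.
    """
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can match
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def tree_size(node: Node) -> int:
    """Count all nodes in the tree rooted at node."""
    return 1 + sum(tree_size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Normalized match score in [0, 1]; 1.0 means identical structure."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))
```

Under these assumptions, two sibling subtrees whose `similarity` exceeds a chosen threshold could be treated as candidate data records of the same type. The threshold value is a tuning choice that the abstract does not specify.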
