Talks and Poster Presentations (with Proceedings-Entry):
"User-Guided Wrapping of PDF Documents Using Graph Matching Techniques";
Talk: 10th Int. Conference on Document Analysis and Recognition,
- 07-29-2009; in: "Proc. of the 10th Int. Conf. on Document Analysis and Recognition",
A. Apostolos, M. Cheriet, U. Pal (eds.);
There are a number of established products on the market for wrapping (semi-automatic navigation and extraction of data) from web pages. These solutions use the inherent structure of HTML to locate instances of the data to be wrapped. Because PDF documents lack such a structure, wrapping them has long been recognized as a challenging problem. We have developed a novel system for wrapping PDF documents, which is currently at a prototype stage. A PDF document is represented as an attributed relational graph, in which nodes represent physical items on the page and edges represent spatial and logical relationships. A wrapper is defined as a subgraph of the document with additional conditions, and a non-expert can create one quickly and intuitively using the GUI. An algorithm based on subgraph isomorphism is then used to find the data instances and extract the required data. Experiments show that our approach achieves good results with good execution time.
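The idea described in the abstract can be illustrated with a minimal sketch. It is not the GraphWrap implementation: the `Graph` class, the attribute names (`font`, `text`), the relation labels (`left_of`, `above`) and the function `find_instances` are all hypothetical, and plain backtracking for non-induced subgraph matching stands in for whatever algorithm the system actually uses. The document is modelled as an attributed relational graph, the wrapper as a small pattern graph whose node attributes act as conditions, and every embedding of the pattern in the document is one data instance.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Attributed relational graph (hypothetical, simplified)."""
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: dict = field(default_factory=dict)  # (src, dst) -> relation label

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, relation):
        self.edges[(src, dst)] = relation


def node_matches(pattern_attrs, doc_attrs):
    """A document node satisfies a pattern node if all required attributes agree."""
    return all(doc_attrs.get(k) == v for k, v in pattern_attrs.items())


def find_instances(pattern, document):
    """Return every injective mapping of pattern nodes onto document nodes
    that preserves node conditions and all pattern edges (backtracking)."""
    pattern_ids = list(pattern.nodes)
    results = []

    def consistent(mapping):
        # Every pattern edge whose endpoints are both mapped must exist
        # in the document with the same relation label.
        for (src, dst), relation in pattern.edges.items():
            if src in mapping and dst in mapping:
                if document.edges.get((mapping[src], mapping[dst])) != relation:
                    return False
        return True

    def extend(mapping):
        if len(mapping) == len(pattern_ids):
            results.append(dict(mapping))
            return
        p = pattern_ids[len(mapping)]
        for d, attrs in document.nodes.items():
            if d in mapping.values():
                continue  # keep the mapping injective
            if not node_matches(pattern.nodes[p], attrs):
                continue
            mapping[p] = d
            if consistent(mapping):
                extend(mapping)
            del mapping[p]

    extend({})
    return results


# Toy page: a bold heading and two "label value" rows (invented data).
document = Graph()
document.add_node("h", font="bold", text="Product sheet")
document.add_node("t1", font="bold", text="Price:")
document.add_node("v1", font="regular", text="12.50")
document.add_node("t2", font="bold", text="Weight:")
document.add_node("v2", font="regular", text="3 kg")
document.add_edge("t1", "v1", "left_of")
document.add_edge("t2", "v2", "left_of")
document.add_edge("t1", "t2", "above")

# Wrapper: a bold item with a regular-font item directly to its right.
wrapper = Graph()
wrapper.add_node("label", font="bold")
wrapper.add_node("value", font="regular")
wrapper.add_edge("label", "value", "left_of")

instances = [
    (document.nodes[m["label"]]["text"], document.nodes[m["value"]]["text"])
    for m in find_instances(wrapper, document)
]
print(instances)  # [('Price:', '12.50'), ('Weight:', '3 kg')]
```

The heading is bold but has no `left_of` edge to a regular-font item, so it is not reported as an instance. Backtracking of this kind is exponential in the worst case; it is used here only because it keeps the sketch short.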
Project Head Reinhard Pichler:
GraphWrap - Graph-Based Wrapping from PDF Documents
Created from the Publication Database of the Vienna University of Technology.