Publications in Scientific Journals:

R. Fayzrakhmanov:
"A blocks-based geometric model of web pages for automatic processing and information extraction";
Science and Business: Development Ways, 15 (2012), 9; 56 - 64.

English abstract:
A geometric model plays an important role in document understanding, information extraction, and document classification. In this paper, we introduce a block-based geometric model (BGM) of web pages, which is intended for the quantitative and qualitative description of a web page's visual representation, in particular its layout. We also take into account aspects of the CSS specification such as the box model and painting order. A BGM provides researchers with the information needed to describe the different spatial configurations and features of objects that should be recognized on a web page's canvas, as well as to engineer efficient methods of information extraction. The proposed model was generated, in the form of an ontological model, for 650 different web pages from the dataset. The acquired information was used in a detailed analysis of spatial relations and of ways to simplify the model; the results are discussed in the paper.

