
Deep Web Paper

Abstract

  • Shopping websites: entity-oriented

Problems addressed:

  • query generation
  • empty page filtering
  • URL deduplication
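
As a rough illustration of the URL deduplication problem, the sketch below canonicalizes result-page URLs before comparing them. It is not the paper's algorithm: the set of ignorable parameters (IRRELEVANT_PARAMS) and the function names are hypothetical, and a real crawler would have to learn which parameters leave the returned entities unchanged.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list: parameters assumed not to change which entities a page shows.
IRRELEVANT_PARAMS = {"sort", "view", "sessionid", "ref"}


def canonicalize(url: str) -> str:
    """Return a canonical form of the URL for duplicate detection."""
    parts = urlsplit(url)
    # Keep only content-relevant parameters, sorted so their order does not matter.
    params = sorted(
        (key.lower(), value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IRRELEVANT_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(params), "")
    )


def deduplicate(urls):
    """Keep the first URL seen for each canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this sketch, two URLs that differ only in host-name case, parameter order, or sort order collapse to one entry, while a URL with a different query value is kept.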

Introduction

(1) Shopping websites:

  • Product name
  • Brand
  • Price

(2) Movie websites

(3) Job-search websites

Input:

  • a list of retailers’ websites

Output:

  • High-quality products

References

In this paper, we discussed the problems users face when scraping information from the deep web, and presented solutions to these problems using our new approach.

In the future, this work can be extended by finding a more appropriate method for storing data in the repository efficiently and accessing it quickly, so that the overall efficiency of searching can be improved.

Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems, including query generation, empty page filtering and URL deduplication, in the specific context of entity-oriented deep-web sites.
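
To make the empty page filtering subproblem concrete, here is a minimal sketch of one plausible approach, not necessarily the one used in the paper: a page is treated as empty when its text closely matches a page obtained from a probe query that is expected to return no results. The function names, the Jaccard similarity measure and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def tokens(html: str) -> set:
    """Strip tags crudely and return the set of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_empty_page(page_html: str, known_empty_pages, threshold: float = 0.8) -> bool:
    """Flag a page whose tokens closely match any known empty-result page.

    known_empty_pages is assumed to hold HTML fetched with nonsense probe
    queries; the threshold is an arbitrary illustrative value.
    """
    page = tokens(page_html)
    return any(jaccard(page, tokens(empty)) >= threshold for empty in known_empty_pages)
```

In practice the threshold would need tuning per site, since result pages and empty pages often share a large amount of navigation text.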

While these techniques are shown to be useful, our experience points to a few areas that warrant future studies. For example, in template generation, our parsing approach only handles HTML “GET” forms but not “POST” forms or JavaScript forms, which reduces site coverage. In query generation, although Freebase-based entity expansion is useful, certain sites with low traffic or diverse traffic do not get matched with Freebase types effectively using query logs alone. Utilizing additional signals (e.g., entities bootstrapped from crawled pages) for entity expansion is an interesting area. Efficiently enumerating entity queries for search forms with multiple input fields is another interesting challenge.
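
The GET-only limitation can be illustrated with a small sketch built on Python's standard html.parser module. It is an assumption-laden illustration rather than the paper's template generator: the class GetFormParser and the helpers extract_get_forms and build_query_url are hypothetical names, and only a single input field is filled per query.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlencode


class GetFormParser(HTMLParser):
    """Collect the action URL and input names of forms that use method GET."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form being parsed, or None outside a GET form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A missing method attribute defaults to GET in HTML.
            method = (attrs.get("method") or "get").lower()
            # POST forms are skipped, mirroring the limitation described above.
            if method == "get":
                self._current = {"action": attrs.get("action") or "", "inputs": []}
            else:
                self._current = None
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["inputs"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            if self._current is not None:
                self.forms.append(self._current)
            self._current = None


def extract_get_forms(html: str):
    """Return a list of {'action', 'inputs'} dicts, one per GET form."""
    parser = GetFormParser()
    parser.feed(html)
    parser.close()
    return parser.forms


def build_query_url(page_url: str, form: dict, field: str, value: str) -> str:
    """Fill one field of a GET form with an entity name and build the query URL."""
    return urljoin(page_url, form["action"]) + "?" + urlencode({field: value})
```

Forms driven by JavaScript never appear as a plain form element with a usable action, which is why a parser of this kind cannot cover them.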
