- query generation
- empty page filtering
- URL deduplication
- a list of retailers’ websites
In this paper, we discussed the problems users face when scraping information from the deep web and presented our new approach as a solution to these problems.
In the future, this work can be extended by finding a more appropriate method for storing data in the repository efficiently and accessing it quickly, so that the overall efficiency of searching can be improved.
Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems, including query generation, empty page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites.
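To make the URL deduplication subproblem concrete, the following is a minimal sketch of one common approach: canonicalizing URLs before comparing them. It is not the method of the system described above, whose details are not given here. The `IGNORED_PARAMS` set and the function names `canonicalize` and `deduplicate` are hypothetical, and the assumption that the listed parameters do not affect page content would need to be verified per site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters assumed NOT to change the returned content.
# In practice this set would have to be determined for each site.
IGNORED_PARAMS = {"sessionid", "utm_source", "ref"}


def canonicalize(url: str) -> str:
    """Return a normalized form of the URL used as a deduplication key."""
    parts = urlsplit(url)
    # Keep only parameters assumed to be content-relevant.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Sort so that parameter order does not produce distinct keys.
    params.sort()
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(params),
        "",  # drop the fragment; it is not sent to the server
    ))


def deduplicate(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Under these assumptions, `http://Example.com/search?q=tv&page=1&sessionid=abc` and `http://example.com/search?page=1&q=tv` map to the same key, so only the first would be crawled. Syntactic canonicalization of this kind cannot detect URLs that differ in content-relevant parameters yet return the same entities; that case requires comparing page content.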