Short name: URL Frontier
Long name: Large scale web crawling with URL Frontier
Country: United Kingdom
Call: F4Fp-SME-COD210601 (see call details)
Proposal number: F4Fp-SME-COD210601-02
SUMMARY REMARKS & TESTBEDS
URL Frontier is an ongoing open source project funded by NLnet (https://nlnet.nl/project/URLFrontier/). The project aims to provide an API and a reference implementation for a crawl frontier that can power various web crawlers, regardless of their implementation language or scale.
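For readers unfamiliar with the term, a crawl frontier is the component that stores the URLs a crawler has discovered and decides which ones to hand out next. The sketch below is a toy in-memory illustration of that idea only. The class name `ToyFrontier`, its methods `put_urls` and `get_urls`, and the parameters `max_urls` and `max_per_host` are all hypothetical and chosen for this example; they are not the URL Frontier API or its reference implementation.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class ToyFrontier:
    """Hypothetical in-memory frontier: one FIFO queue per host, each URL kept once."""

    def __init__(self):
        self._queues = OrderedDict()  # host -> deque of pending URLs
        self._seen = set()            # every URL ever accepted

    def put_urls(self, urls):
        """Store newly discovered URLs and return how many were actually new."""
        added = 0
        for url in urls:
            host = urlsplit(url).hostname
            if host is None or url in self._seen:
                continue  # skip strings without a host and URLs already known
            self._seen.add(url)
            self._queues.setdefault(host, deque()).append(url)
            added += 1
        return added

    def get_urls(self, max_urls=10, max_per_host=1):
        """Return a batch of URLs to fetch, taking at most max_per_host from each host."""
        batch = []
        for host in list(self._queues):
            if len(batch) >= max_urls:
                break
            queue = self._queues[host]
            take = min(max_per_host, len(queue), max_urls - len(batch))
            for _ in range(take):
                batch.append(queue.popleft())
            if queue:
                # Host still has work: move it to the back so others go first next time.
                self._queues.move_to_end(host)
            else:
                del self._queues[host]
        return batch


frontier = ToyFrontier()
frontier.put_urls([
    "https://a.example/1",
    "https://a.example/2",
    "https://b.example/1",
])
first_batch = frontier.get_urls(max_urls=10, max_per_host=1)
```

Grouping URLs by host and capping how many are released per host is one common way to keep a crawler polite. A service such as URL Frontier exposes operations of this kind over the network, which is what allows crawlers written in different languages to share one frontier.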
The project has nearly reached its second milestone, which provides a scalable implementation of the service. A module for StormCrawler (a mature open source web crawler) has already been developed, so StormCrawler can be used with the URL Frontier service.
This Fed4FIRE+ Innovative Experiment proposal seeks to deploy the reference implementation of URL Frontier and run it with StormCrawler in a large-scale web crawl. The goal is to confirm that it can (1) function robustly, (2) perform at least as efficiently as existing approaches, and (3) handle large volumes of data efficiently.
- Who’s NGI? Julien Nioche with open source component for web crawlers URL Frontier (read the “Who’s NGI” blog)
- FEC 11 | URL Frontier crawl poster (download the slides)