This repository contains the source code used for the experiments presented in the paper "Document Quality Scoring for Web Crawling" by Francesca Pezzuti, Ariane Mueller, Sean MacAvaney and Nicola Tonellotto, accepted for publication at WOWS2025.
Please consider citing our paper if you use this code or a modified version of it.
You can install the requirements using pip:
pip install -r requirements.txt
Web collection:
- ClueWeb22-B (eng):
The query set is obtained by randomly sampling English queries from MSM-WS and combining them with queries from RQ (a sampling sketch follows the list below):
- MSM-WS (MS MARCO Web Search): Link to the dataset
- RQ (Researchy Questions): Link to the dataset
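For intuition, the construction described above boils down to sampling English MSM-WS queries and merging them with the RQ queries. The sketch below illustrates this under assumed file paths, sample size, and a qid&lt;TAB&gt;text TSV layout; the actual procedure is implemented in preproc_querysets.py.

```python
import csv
import random

# Minimal sketch of the query-set construction; the actual sampling is done by
# preproc_querysets.py. File paths, the sample size, and the qid<TAB>text layout
# below are illustrative assumptions, not the repository's real values.
def load_queries(path):
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

random.seed(42)
msmws = load_queries("data/queries/msmarco-ws/msmarco-ws-queries.tsv")
rq = load_queries("data/queries/rq/rq-queries.tsv")  # hypothetical RQ path

sampled = random.sample(msmws, k=min(1000, len(msmws)))  # illustrative sample size
with open("data/queries/combined-queries.tsv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(sampled + rq)
```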
To preprocess the msm-ws query set:
- make sure that queries are stored under "/data/queries/msmarco-ws/msmarco-ws-queries.tsv"
- make sure that qrels are stored under "./../data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"
Then, run:
python preproc_querysets.py
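If you want to sanity-check the inputs before running the script, the snippet below shows one optional consistency check: counting the queries whose qid appears in the cleaned qrels file. The check is an illustrative assumption, not a description of what preproc_querysets.py actually does; the paths mirror the ones listed above.

```python
import csv

# Optional, illustrative sanity check (not what preproc_querysets.py does):
# count how many queries have at least one judgement in the cleaned qrels.
queries_path = "data/queries/msmarco-ws/msmarco-ws-queries.tsv"
qrels_path = "data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"

with open(qrels_path, newline="", encoding="utf-8") as f:
    judged_qids = {row[0] for row in csv.reader(f, delimiter="\t")}

with open(queries_path, newline="", encoding="utf-8") as f:
    judged = [row for row in csv.reader(f, delimiter="\t") if row[0] in judged_qids]

print(f"{len(judged)} queries have at least one judgement")
```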
To preprocess ClueWeb22-B, run:
python preproc_cw22b.py
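For orientation, a typical preprocessing pass iterates over the collection's record files and extracts a document identifier and its clean text. The sketch below assumes json.gz record files with "ClueWeb22-ID" and "Clean-Text" fields and a hypothetical root directory; consult preproc_cw22b.py and the official ClueWeb22 documentation for the actual layout.

```python
import gzip
import json
from pathlib import Path

# Illustrative sketch only: the json.gz layout, the field names ("ClueWeb22-ID",
# "Clean-Text"), and the root directory are assumptions; preproc_cw22b.py and the
# official ClueWeb22 documentation describe the real record format.
def iter_cw22_docs(txt_root):
    for path in sorted(Path(txt_root).rglob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["ClueWeb22-ID"], record["Clean-Text"]

for doc_id, text in iter_cw22_docs("/data/clueweb22-b/txt"):  # hypothetical root
    print(doc_id, len(text))
    break  # just peek at the first record
```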
To crawl with BFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type bfs
To crawl with DFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type dfs
To crawl with QOracle, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type oracle-quality
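The --frontier_type flag controls how the next page to fetch is selected. As a rough illustration (not the repository's implementation), a BFS frontier behaves like a FIFO queue, a DFS frontier like a LIFO stack, and the quality oracle always pops the page with the highest quality score:

```python
import heapq
from collections import deque

# Rough illustration of the three frontier policies (not the repository's code):
# BFS pops the oldest discovered URL, DFS the newest, and the quality oracle
# pops the URL with the highest quality score.
class Frontier:
    def __init__(self, frontier_type):
        self.frontier_type = frontier_type
        self.items = deque() if frontier_type in ("bfs", "dfs") else []

    def push(self, url, quality=0.0):
        if self.frontier_type == "oracle-quality":
            heapq.heappush(self.items, (-quality, url))  # max-heap via negated score
        else:
            self.items.append(url)

    def pop(self):
        if self.frontier_type == "bfs":
            return self.items.popleft()      # FIFO: breadth-first
        if self.frontier_type == "dfs":
            return self.items.pop()          # LIFO: depth-first
        return heapq.heappop(self.items)[1]  # highest estimated quality first
```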
To index and rank with BM25 the set of documents crawled up to time t=limit by a crawler whose experiment name is 'crawler', use:
python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws
By default, this Python script automatically retrieves the top-k scoring documents for the queries of the specified query set and writes the results to the runs directory; this behaviour can be disabled by launching the script with --evaluate False.
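For reference, BM25 scores a document by summing per-term contributions that combine term frequency, inverse document frequency, and document-length normalisation. The self-contained sketch below uses illustrative parameters (k1=0.9, b=0.4) and a toy corpus; index.py relies on its own indexing backend and settings.

```python
import math
from collections import Counter

# Minimal self-contained BM25 scorer for illustration; index.py uses its own
# indexing backend, and the k1/b values here are arbitrary, not the paper's.
def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    doc_tfs = [Counter(doc) for doc in docs]
    doc_lens = [len(doc) for doc in docs]
    avgdl = sum(doc_lens) / len(docs)
    n_docs = len(docs)
    scores = []
    for tf, dl in zip(doc_tfs, doc_lens):
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in doc_tfs if term in d)  # document frequency of the term
            if df == 0 or term not in tf:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores

# Toy usage: score two tokenised documents for a two-term query.
docs = [["web", "crawling", "quality"], ["document", "ranking", "with", "bm25"]]
print(bm25_scores(["web", "crawling"], docs))
```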