This repository contains the source code used for the experiments presented in the paper "Document Quality Scoring for Web Crawling" by Francesca Pezzuti, Ariane Mueller, Sean MacAvaney and Nicola Tonellotto, accepted for publication at WOWS2025.
Please consider citing our paper if you use this code or a modified version of it.
You can install the requirements using pip:
pip install -r requirements.txt
Web collection:
- ClueWeb22-B (eng):
The query set is obtained by randomly sampling English queries from MSM-WS and combining them with queries from RQ (a sampling sketch follows the list below):
- MSM-WS (MS MARCO Web Search): Link to the dataset
- RQ (Researchy Questions): Link to the dataset
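For intuition, the construction described above boils down to sampling English MSM-WS queries and merging them with the RQ queries. The sketch below illustrates this under assumed file paths, sample size, and a qid&lt;TAB&gt;text TSV layout; the actual procedure is implemented in preproc_querysets.py.

```python
import csv
import random

# Minimal sketch of the query-set construction; the actual sampling is done by
# preproc_querysets.py. File paths, the sample size, and the qid<TAB>text layout
# below are illustrative assumptions, not the repository's real values.
def load_queries(path):
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

random.seed(42)
msmws = load_queries("data/queries/msmarco-ws/msmarco-ws-queries.tsv")
rq = load_queries("data/queries/rq/rq-queries.tsv")  # hypothetical RQ path

sampled = random.sample(msmws, k=min(1000, len(msmws)))  # illustrative sample size
with open("data/queries/combined-queries.tsv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(sampled + rq)
```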
To preprocess the msm-ws query set:
- make sure that queries are stored under "/data/queries/msmarco-ws/msmarco-ws-queries.tsv"
- make sure that qrels are stored under "./../data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"
Then, run:
python preproc_querysets.py
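If you want to sanity-check the inputs before running the script, the snippet below shows one optional consistency check: counting the queries whose qid appears in the cleaned qrels file. The check is an illustrative assumption, not a description of what preproc_querysets.py actually does; the paths mirror the ones listed above.

```python
import csv

# Optional, illustrative sanity check (not what preproc_querysets.py does):
# count how many queries have at least one judgement in the cleaned qrels.
queries_path = "data/queries/msmarco-ws/msmarco-ws-queries.tsv"
qrels_path = "data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"

with open(qrels_path, newline="", encoding="utf-8") as f:
    judged_qids = {row[0] for row in csv.reader(f, delimiter="\t")}

with open(queries_path, newline="", encoding="utf-8") as f:
    judged = [row for row in csv.reader(f, delimiter="\t") if row[0] in judged_qids]

print(f"{len(judged)} queries have at least one judgement")
```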
To preprocess ClueWeb22-B, run:
python preproc_cw22b.py
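For orientation, a typical preprocessing pass iterates over the collection's record files and extracts a document identifier and its clean text. The sketch below assumes json.gz record files with "ClueWeb22-ID" and "Clean-Text" fields and a hypothetical root directory; consult preproc_cw22b.py and the official ClueWeb22 documentation for the actual layout.

```python
import gzip
import json
from pathlib import Path

# Illustrative sketch only: the json.gz layout, the field names ("ClueWeb22-ID",
# "Clean-Text"), and the root directory are assumptions; preproc_cw22b.py and the
# official ClueWeb22 documentation describe the real record format.
def iter_cw22_docs(txt_root):
    for path in sorted(Path(txt_root).rglob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["ClueWeb22-ID"], record["Clean-Text"]

for doc_id, text in iter_cw22_docs("/data/clueweb22-b/txt"):  # hypothetical root
    print(doc_id, len(text))
    break  # just peek at the first record
```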
To crawl with BFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type bfs
To crawl with DFS, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type dfs
To crawl with QOracle, run the following command using the default config file:
python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type oracle-quality
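The --frontier_type flag controls how the next page to fetch is selected. As a rough illustration (not the repository's implementation), a BFS frontier behaves like a FIFO queue, a DFS frontier like a LIFO stack, and the quality oracle always pops the page with the highest quality score:

```python
import heapq
from collections import deque

# Rough illustration of the three frontier policies (not the repository's code):
# BFS pops the oldest discovered URL, DFS the newest, and the quality oracle
# pops the URL with the highest quality score.
class Frontier:
    def __init__(self, frontier_type):
        self.frontier_type = frontier_type
        self.items = deque() if frontier_type in ("bfs", "dfs") else []

    def push(self, url, quality=0.0):
        if self.frontier_type == "oracle-quality":
            heapq.heappush(self.items, (-quality, url))  # max-heap via negated score
        else:
            self.items.append(url)

    def pop(self):
        if self.frontier_type == "bfs":
            return self.items.popleft()      # FIFO: breadth-first
        if self.frontier_type == "dfs":
            return self.items.pop()          # LIFO: depth-first
        return heapq.heappop(self.items)[1]  # highest estimated quality first
```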
To index and rank with BM25 the set of documents crawled up to time t=limit by a crawler whose experiment name is 'crawler', use:
python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws
By default, this Python script automatically retrieves the top-k scoring documents for the queries of the specified query set and writes the results to the runs directory; this behaviour can be disabled by launching the script with --evaluate False.
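For reference, BM25 scores a document by summing per-term contributions that combine term frequency, inverse document frequency, and document-length normalisation. The self-contained sketch below uses illustrative parameters (k1=0.9, b=0.4) and a toy corpus; index.py relies on its own indexing backend and settings.

```python
import math
from collections import Counter

# Minimal self-contained BM25 scorer for illustration; index.py uses its own
# indexing backend, and the k1/b values here are arbitrary, not the paper's.
def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    doc_tfs = [Counter(doc) for doc in docs]
    doc_lens = [len(doc) for doc in docs]
    avgdl = sum(doc_lens) / len(docs)
    n_docs = len(docs)
    scores = []
    for tf, dl in zip(doc_tfs, doc_lens):
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in doc_tfs if term in d)  # document frequency of the term
            if df == 0 or term not in tf:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores

# Toy usage: score two tokenised documents for a two-term query.
docs = [["web", "crawling", "quality"], ["document", "ranking", "with", "bm25"]]
print(bm25_scores(["web", "crawling"], docs))
```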