PubCrawl - The scientific publication crawling suite.
PubCrawl is a Python package for crawling scientific publications. It provides a simple, uniform interface for crawling publications from different sources and is designed to be easily extensible with new sources.
The package is in a very early stage of development. The following sources are currently supported:
- arXiv
- pdfs - plain PDF files
To get started, clone the repository and install the dependencies from the requirements file:

```bash
git clone https://github.com/J0nasW/PubCrawl.git
cd PubCrawl
pip install -r requirements.txt
```
The package is designed to be used as a library. The following example shows how to crawl publications from arXiv: it crawls all publications from the category cs.AI, processes them, and saves the results to a JSON file. Note that you have to provide a valid JSON file containing the metadata of all arXiv publications; you can download this file from the arXiv metadata dataset on Kaggle.
```bash
python3 main.py -s arxiv -c cs.AI -f arxiv.json -p
```
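If you would rather drive a crawl from your own Python code, one simple option is to call the documented CLI via `subprocess`. This sketch uses only the flags listed in the argument table below; PubCrawl's internal Python API is not documented here, so no internal imports are assumed.

```python
import subprocess

# Run the PubCrawl CLI from Python, equivalent to:
#   python3 main.py -s arxiv -c cs.AI -f arxiv.json -p
subprocess.run(
    [
        "python3", "main.py",
        "-s", "arxiv",       # source to crawl from
        "-c", "cs.AI",       # arXiv category
        "-f", "arxiv.json",  # arXiv metadata file (Kaggle dataset)
        "-p",                # process the crawled publications
    ],
    check=True,  # raise CalledProcessError if the crawl fails
)
```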
The package can be called with the following arguments:

| Argument | Required | Description |
|---|---|---|
| `-s`, `--source` | required | The source to crawl publications from. |
| `-f`, `--file` | optional | The file containing the metadata of all publications. |
| `-l`, `--local_PDFS` | optional | Use local PDFs from a given PDF directory instead of downloading them from GCP. |
| `-c`, `--category` | optional | The category to crawl publications from. Delimit multiple categories with a pipe (`\|`). |
| `-p`, `--process` | optional | Whether to process the crawled publications. |
| `-r`, `--rows` | optional | Number of entries to process (useful in development mode). |
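For example, during development you can restrict a run to the first few entries with `-r`, or point PubCrawl at already-downloaded PDFs with `-l` (the directory path below is only a placeholder):

```bash
# Process only the first 100 cs.AI entries (handy while developing)
python3 main.py -s arxiv -c cs.AI -f arxiv.json -p -r 100

# Use local PDFs from ./pdfs instead of downloading them from GCP
python3 main.py -s arxiv -f arxiv.json -l ./pdfs -p
```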
This package can be used in combination with other packages to perform downstream tasks. The following downstream packages are currently available:

- pubgraph - builds a graph from the crawled publications (see the example below)
Once you have cloned the repository, you can use the extra flag `-d` to activate downstream tasks. PubCrawl will then automatically clone the downstream task repository and install it using its requirements file.
The following example shows how to crawl all publications from the category cs.AI and create a graph from them:

```bash
python3 main.py -s arxiv -c cs.AI -f arxiv.json -p -d pubgraph
```
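Once a crawl has finished, the resulting JSON file can be consumed by any downstream tool. A minimal sketch, assuming the results were written to a file named results.json (the actual output path depends on your run):

```python
import json

# Load the crawled publications; "results.json" is a placeholder name,
# the actual output file depends on your PubCrawl run.
with open("results.json", "r", encoding="utf-8") as f:
    publications = json.load(f)

print(f"Loaded {len(publications)} publications")
```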
tbd
Copyright (c) 2023, Jonas Wilinski. A lot of the code regarding the PDF2TXT conversion is taken from this repository, which is licensed under the MIT license. Thank you, Matt Bierbaum, Colin Clement, Kevin O'Keeffe, and Alex Alemi, for sharing your code!