This project is a Python-based search engine web crawler that integrates advanced information retrieval models, including BM25, TF-IDF, and language models (LM) for document retrieval and ranking. It leverages Elasticsearch for indexing and querying, and utilizes pseudo-relevance feedback to enhance search results. Additionally, it features training and testing phases for different models and explores various smoothing techniques.
data_parser.py
: A script that handles the extraction and parsing of data from crawled web pages. It processes HTML content, removes stop words, and prepares the data for indexing.
parser.py
: A second parsing utility, possibly covering other data formats or additional preprocessing steps.
es.py
: The main script for interacting with Elasticsearch. It sets up connections and handles basic indexing and querying operations.
es_index_data.py
: Responsible for indexing parsed documents into Elasticsearch. It structures the data into a format suitable for Elasticsearch indexing.
es_retreival_models.py
: Implements various retrieval models on top of Elasticsearch, such as BM25 and TF-IDF. It customizes Elasticsearch queries to apply these models and retrieves ranked lists of documents.
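To make the retrieval models named above concrete, here is a minimal sketch of per-term scoring for Okapi TF, TF-IDF, and BM25. It uses one common textbook formulation; the exact formulas and constants in es_retreival_models.py are not shown in this README, so the function names, the `stats` layout, and the defaults `k1=1.2` and `b=0.75` are assumptions.

```python
import math


def okapi_tf(tf, doc_len, avg_doc_len):
    """Okapi TF: term frequency damped by relative document length."""
    return tf / (tf + 0.5 + 1.5 * (doc_len / avg_doc_len))


def tf_idf(tf, doc_len, avg_doc_len, num_docs, doc_freq):
    """Okapi TF weighted by inverse document frequency."""
    return okapi_tf(tf, doc_len, avg_doc_len) * math.log(num_docs / doc_freq)


def bm25(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term (query term frequency taken as 1)."""
    idf = math.log((num_docs + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


def score_document(query_terms, doc_tf, doc_len, stats, term_score=bm25):
    """Sum per-term scores over query terms that occur in the document.

    `stats` is a hypothetical dict with keys num_docs, avg_doc_len, doc_freq.
    `term_score` can be bm25 or tf_idf (both share the same signature).
    """
    total = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = stats["doc_freq"].get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from the document or the collection
        total += term_score(tf, doc_len, stats["avg_doc_len"], stats["num_docs"], df)
    return total
```

In practice the term and document statistics would come from Elasticsearch (for example from term vectors), and documents would be ranked by descending `score_document` value.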
- BM25:
bm25_result.txt
: Stores the results of the BM25 retrieval model.
bm25_pseudo_rel_result.txt and bm25_pseudo_rel_result_es.txt
: Results incorporating pseudo-relevance feedback to improve retrieval effectiveness.
- Language Models (LM):
lmjm_result.txt
: Results from a language model with Jelinek-Mercer smoothing.
lml_result.txt
: Results from another variation of a language model.
- TF-IDF:
tfidf_result.txt
: Results from the TF-IDF retrieval model.
- Okapi TF:
okapitf_result.txt
: Stores results using the Okapi TF retrieval model.
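As an illustration of the Jelinek-Mercer smoothing behind lmjm_result.txt, the sketch below scores a document by the log-likelihood of the query under a mixture of the document model and the collection model. The function name, argument layout, and the default `lam=0.7` are assumptions, as is the convention that `lam` weights the document model; the repository's own implementation may differ.

```python
import math


def lm_jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.7):
    """Query log-likelihood with Jelinek-Mercer smoothing.

    p(w|d) = lam * tf(w,d)/|d| + (1 - lam) * cf(w)/|C|
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Assumption: terms unseen in the whole collection are skipped,
            # since they cannot distinguish between documents.
            continue
        score += math.log(p)
    return score
```

Because the collection model backs off unseen terms, a document that lacks a query term still receives a finite (lower) score instead of being eliminated.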
pseudo_rel_feedback.py
: A script implementing pseudo-relevance feedback, which refines search results based on an initial round of retrieval.
pseudo_rel_es.py
: Applies pseudo-relevance feedback directly within Elasticsearch.
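A minimal sketch of the pseudo-relevance feedback idea described above: treat the top-ranked documents from a first retrieval round as relevant, pick their most frequent terms, and add them to the query before retrieving again. The term-selection rule (raw frequency), the function name, and all parameters are assumptions; pseudo_rel_feedback.py may select expansion terms differently.

```python
from collections import Counter


def expand_query(query_terms, ranked_doc_ids, doc_terms,
                 top_k=5, num_new_terms=3, stop_words=frozenset()):
    """Return the query extended with frequent terms from the top-k documents.

    doc_terms maps a document id to its list of tokens (hypothetical layout).
    """
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_k]:
        counts.update(doc_terms.get(doc_id, []))

    # Exclude terms already in the query and any stop words.
    excluded = set(query_terms) | set(stop_words)
    candidates = [(term, n) for term, n in counts.items() if term not in excluded]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))

    return list(query_terms) + [term for term, _ in candidates[:num_new_terms]]
```

The expanded query is then run through the same retrieval model as the original one, and the second ranking is the one reported.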
qrels.adhoc.51-100.AP89.txt
: A file containing relevance judgments, used to evaluate the performance of the retrieval models.
query_desc.51-100.short.txt
: A set of queries used to test the retrieval models.
results.xlsx
: A spreadsheet containing detailed results and comparisons across different models and experiments.
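To show how relevance judgments can drive evaluation, the sketch below parses qrels lines and computes average precision for one ranked list. It assumes the standard TREC qrels layout of four whitespace-separated fields (query id, iteration, document id, relevance); the function names and the document ids in the usage note are illustrative, not taken from the repository.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query id to the set of document ids judged relevant."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:
            relevant[query_id].add(doc_id)
    return relevant


def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)
```

For example, with `DOC-A` and `DOC-C` judged relevant for a query, the ranking `["DOC-A", "DOC-B", "DOC-C"]` scores (1/1 + 2/3) / 2. Averaging this value over all queries gives mean average precision for a model.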
main.py
: The main entry point for running the project, coordinating the crawling, indexing, and retrieval processes.
stemming_ind.py
: Handles stemming of words in the documents, possibly using a predefined list of stemming rules (stem-classes.lst).
stoplist.txt
: A list of stop words to be removed during the parsing process.
term_vectors.json
: A JSON file that might store term vectors for documents, aiding in retrieval and ranking.
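The stop-word filtering mentioned above can be sketched as follows. This assumes stoplist.txt holds one word per line and that tokens are lowercase alphanumeric runs; both are assumptions, and the function names are hypothetical. Stemming is left out because the format of stem-classes.lst is not described here.

```python
import re


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def tokenize(text, stop_words):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

A file would be loaded with something like `load_stop_words(open("stoplist.txt", encoding="utf-8"))`, and the same `tokenize` function should be applied to both documents and queries so their vocabularies match.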
- Crawling and Parsing: The project starts by crawling the web (if integrated with a crawler) and parsing the HTML content to extract relevant text data.
- Indexing: Parsed data is indexed into Elasticsearch, where it is stored in a structured format suitable for retrieval.
- Retrieval Models: Various retrieval models, including BM25, TF-IDF, and language models, are applied to the indexed data to retrieve and rank documents based on relevance to user queries.
- Pseudo-Relevance Feedback: An initial round of retrieval is refined using pseudo-relevance feedback, improving the ranking of relevant documents.
- Evaluation: The effectiveness of different models is evaluated using standard relevance judgments, and results are stored for comparison.
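For the indexing step in the workflow above, one way to structure parsed documents is as a request body for the Elasticsearch Bulk API, which takes newline-delimited JSON with an action line followed by a source line for each document. The sketch below only builds that body; the index name `ap89`, the field name `text`, and the function name are assumptions, and es_index_data.py may index documents differently.

```python
import json


def bulk_payload(index_name, docs):
    """Build a Bulk API body from (doc_id, text) pairs.

    Each document contributes two lines: an "index" action and its source.
    The body must end with a newline, as the Bulk API requires.
    """
    lines = []
    for doc_id, text in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"text": text}))  # "text" is a hypothetical field
    return "\n".join(lines) + "\n" if lines else ""


payload = bulk_payload("ap89", [("doc-1", "oil spill"), ("doc-2", "tanker")])
```

The resulting string could be sent to the `_bulk` endpoint with the `application/x-ndjson` content type, or the same documents could be passed to the bulk helper of the Python client instead.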
- Python 3.x
- Elasticsearch
- Kibana (optional, for visualization)
- Python Libraries: elasticsearch, requests, BeautifulSoup, pandas, numpy, nltk
- Install Elasticsearch and Kibana:
  - Follow the official Elasticsearch and Kibana installation guides.
- Clone the Repository:
    git clone <repository-url>
    cd Search-Engine-Web-Crawler
- Install Python Dependencies:
    pip install -r requirements.txt
- Configure Elasticsearch:
  - Ensure Elasticsearch is running and update the connection settings in es.py or other relevant scripts.
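As a rough illustration of what the connection settings might look like, the sketch below builds the Elasticsearch URL from environment variables and checks that the cluster responds. The variable names `ES_SCHEME`, `ES_HOST`, and `ES_PORT` and both function names are hypothetical and do not come from es.py; the defaults match a local installation listening on port 9200.

```python
import os


def es_url(env=None):
    """Build the Elasticsearch URL from environment variables, with local defaults."""
    env = os.environ if env is None else env
    scheme = env.get("ES_SCHEME", "http")
    host = env.get("ES_HOST", "localhost")
    port = env.get("ES_PORT", "9200")
    return f"{scheme}://{host}:{port}"


def connect(url=None):
    """Create a client and fail early if the cluster does not respond."""
    # Imported here so the URL helper works without the client installed.
    from elasticsearch import Elasticsearch  # third-party: pip install elasticsearch

    url = url or es_url()
    client = Elasticsearch(url)
    if not client.ping():
        raise ConnectionError(f"Elasticsearch is not reachable at {url}")
    return client
```

Calling `connect()` before indexing or querying surfaces a stopped or misconfigured cluster immediately rather than partway through a run. Secured clusters also need credentials, which this sketch does not cover.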
- Index Data:
  - Use es_index_data.py to index the parsed data into Elasticsearch.
    python es_index_data.py
- Run Retrieval Models:
  - Execute es_retreival_models.py or other relevant scripts to perform document retrieval.
    python es_retreival_models.py
- Evaluate Results:
  - Compare the results stored in the .txt files or results.xlsx to evaluate the effectiveness of different retrieval models.
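Comparing models is easiest when every result file shares one layout. Evaluation tools such as trec_eval read six whitespace-separated fields per line: query id, the literal `Q0`, document id, rank, score, and a run tag. Whether the .txt files in this project use that layout is an assumption; the sketch below shows how scored documents could be written in it, with a hypothetical function name and run tag.

```python
def format_run(query_id, scored_docs, run_tag="Exp", limit=1000):
    """Turn {doc_id: score} into ranked TREC-style result lines.

    Documents are sorted by descending score; ties are broken by document id.
    """
    ranked = sorted(scored_docs.items(), key=lambda item: (-item[1], item[0]))
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked[:limit], start=1)
    ]


lines = format_run("51", {"doc-1": 1.5, "doc-2": 3.0})
```

Writing one such file per model (BM25, TF-IDF, and so on) lets the same evaluation script score every model against the relevance judgments.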
This project implements a search engine in Python on top of Elasticsearch, combining traditional IR models (BM25, TF-IDF, Okapi TF, and smoothed language models) with pseudo-relevance feedback and evaluation against relevance judgments. It can serve as a starting point for further experimentation with retrieval models.