Join Communication Channel for Help and Discussion
A Python web crawler that crawls nearly 25 billion pages to fetch the download speed, upload speed, latency, date, distance, country code, server ID, server name, sponsor name, sponsor URL, connection_mode, ISP name, ISP rating, test rank, test grade, test rating and path of different surveys, and stores them in a MySQL database.
Download or clone the repository and set up a virtual environment.
git clone
python -m venv env
Installing required packages
pip install -r requirements.txt
A MySQL database stores the results parsed by the main program. You can interact with the database as follows.
If the table already exists, a connection to that table is returned; otherwise a new table is created.
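from database import Database
table_name = 'student'
# Map each column name to its MySQL column type
fields = {'roll_no':'int(5)','first_name':'varchar(15)','last_name':'varchar(15)'}
# Connection settings are read from the 'mysql' section of config.ini
connection = Database(table_name=table_name,fields=fields,filename='config.ini',section='mysql')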
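The create-if-missing behaviour described above can be sketched without a running server. The helper below is hypothetical (`build_create_table_sql` is not part of this repository, and the `Database` class may build its statement differently); it only shows one way a `fields` dictionary could be turned into a `CREATE TABLE IF NOT EXISTS` statement.

```python
def build_create_table_sql(table_name, fields):
    """Return a CREATE TABLE IF NOT EXISTS statement for a {column: type} dict.

    Hypothetical helper for illustration; not the repository's own code.
    """
    # Table and column names cannot be sent as query parameters, so reject
    # anything that is not a plain identifier before formatting it into SQL.
    for name in (table_name, *fields):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    columns = ", ".join(f"`{name}` {sql_type}" for name, sql_type in fields.items())
    return f"CREATE TABLE IF NOT EXISTS `{table_name}` ({columns})"


fields = {'roll_no': 'int(5)', 'first_name': 'varchar(15)', 'last_name': 'varchar(15)'}
print(build_create_table_sql('student', fields))
```

The identifier check also catches stray characters in a column name, such as a trailing colon, before they reach the database. Note that the column types themselves are passed through unchecked in this sketch.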
The configuration of the database is taken from an external configuration file.
[mysql]
host = localhost
database = crawler
user = bob
password = bob
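As a rough illustration of how such a file could be read, here is a minimal sketch using the standard-library `configparser`. It assumes the settings sit under a `[mysql]` section header, matching the `section='mysql'` argument shown earlier; the function name `read_db_config` is hypothetical, and the repository may load its configuration differently.

```python
from configparser import ConfigParser


def read_db_config(filename='config.ini', section='mysql'):
    """Return the key/value pairs of one section of an INI file as a dict.

    Hypothetical helper for illustration; not the repository's own code.
    """
    parser = ConfigParser()
    # ConfigParser.read() silently skips missing files and returns the list
    # of files it actually parsed, so an empty list means nothing was read.
    if not parser.read(filename):
        raise FileNotFoundError(f"could not read {filename}")
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found in {filename}")
    return dict(parser.items(section))
```

With the file above, `read_db_config('config.ini', 'mysql')` would return a dictionary with the keys `host`, `database`, `user` and `password`, which can be unpacked into `mysql.connector.connect(**config)`.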
The data to be inserted is passed as a dictionary, where each key is a field name and each value is the value for that field.
data = {'roll_no':12,'first_name':'Rajesh','last_name':'Ingle'}
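One way such a dictionary could be turned into a query is sketched below. The helper `build_insert` is hypothetical and is not taken from this repository; it assumes the `%s` placeholder style that mysql-connector-python accepts, so that values are passed separately from the SQL text instead of being formatted into it.

```python
def build_insert(table_name, data):
    """Return (sql, params) for inserting one row described by a dict.

    Hypothetical helper for illustration; not the repository's own code.
    """
    columns = list(data)
    column_list = ", ".join(f"`{column}`" for column in columns)
    # One %s placeholder per value; the driver substitutes and escapes them.
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table_name}` ({column_list}) VALUES ({placeholders})"
    params = tuple(data[column] for column in columns)
    return sql, params


data = {'roll_no': 12, 'first_name': 'Rajesh', 'last_name': 'Ingle'}
sql, params = build_insert('student', data)
print(sql)
print(params)
```

The resulting pair can be handed to a mysql-connector-python cursor as `cursor.execute(sql, params)`, followed by a `commit()` on the connection. The column names are still formatted into the statement, so they should come from trusted code rather than user input.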
lxml: used for parsing page content
fake_useragent: up-to-date, simple user-agent faker backed by a real-world database
html5lib: pure-Python library for parsing HTML
bs4: for parsing HTML and XML documents
requests: for sending HTTP/1.1 requests
multiprocessing: standard-library module for running work in parallel processes
mysql-connector-python: lets Python programs access a MySQL database
Follow all rules.