Many times, we need datasets to perform data-related tasks like data analytics, data cleaning and machine learning. Most data science courses taught in college and online simply hand the dataset to you.
I wanted to explore the data-creation part so I could progress in my data science curriculum.
I could create a dataset using spreadsheet software, but that is boring and time-consuming. So I wanted to automate this task and also try out Puppeteer.
The site I am scraping is Naukri. It is a job aggregator that does not provide any public API for consumption. This makes it the perfect choice for web scraping.
This project was made as a part of the assessment for my Cloud Computing class.
- NodeJS Runtime
- npm (Node Package Manager)
- Chromium or Chrome browser (you don't have to install it separately if you don't have it already; installing the Puppeteer package automatically downloads a compatible Chromium)
- Azure Account
If you want to try this on your computer, clone the repository and install the dependencies using `npm` or `yarn`:
git clone https://github.com/VarunGuttikonda/WebScraper.git
cd WebScraper
npm install
- Puppeteer - A NodeJS library to run Chrome in headless mode and automate it using the `DevTools` protocol
- Objects-to-csv - A library to convert `JSON` objects into `CSV` strings and vice-versa
```
npm install puppeteer objects-to-csv
```
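If you want to confirm that both packages are wired up before touching the scraper, a minimal sanity check looks like the following; the URL and the sample objects are placeholders, not the scraper's real targets.

```javascript
// Sanity check for the two dependencies. The URL and the sample objects
// are placeholders, not the scraper's real targets.
const puppeteer = require('puppeteer');
const ObjectsToCsv = require('objects-to-csv');

(async () => {
  const browser = await puppeteer.launch({ headless: true }); // Chromium in headless mode
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();

  // objects-to-csv converts an array of plain objects into a CSV string
  const csv = await new ObjectsToCsv([{ role: 'Developer', city: 'Hyderabad' }]).toString();
  console.log(csv);
})();
```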
- `async/await` from JavaScript
- `ElementHandle` and `JSHandle` from Puppeteer (illustrated below)
- Connection Strings and Bindings from Azure Functions
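To make the Puppeteer concepts above concrete, here is a small self-contained example of `async/await` together with an `ElementHandle` and a `JSHandle`; the page and selector are stand-ins, not the actual scraping targets.

```javascript
// async/await with Puppeteer's ElementHandle and JSHandle.
// The page and the 'h1' selector are stand-ins, not the actual targets.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // page.$ resolves to an ElementHandle (or null) for the first matching element
  const heading = await page.$('h1');
  if (heading) {
    // evaluateHandle returns a JSHandle wrapping a value that lives inside the page
    const textHandle = await page.evaluateHandle(el => el.textContent, heading);
    console.log(await textHandle.jsonValue());
  }

  await browser.close();
})();
```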
Azure Functions is a Functions-as-a-Service offering from Microsoft Azure. It provides serverless compute for workloads that can be written as a single function.
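A minimal handler in the Node.js programming model looks roughly like this; the binding names `myTimer` and `outputBlob` are assumptions for illustration and have to match whatever `function.json` declares.

```javascript
// Minimal sketch of a timer-triggered Azure Function (Node.js model).
// The binding names "myTimer" and "outputBlob" are assumptions; they must
// match the names declared in function.json.
module.exports = async function (context, myTimer) {
  context.log('Scrape triggered at', new Date().toISOString());

  // Assigning to an output binding hands the value to the Functions host,
  // which persists it (e.g. to a blob) as configured in function.json.
  context.bindings.outputBlob = 'title,company\nDeveloper,Example Corp\n';
};
```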
The following files serve as the configuration of this project:
- `function.json` - Defines the properties of a function, like the names of its bindings, the types of the bindings and their connection strings, along with their direction, i.e. whether a binding is input or output. It should be defined for each function separately (a sample is shown after this list)
- `host.json` - Describes the properties of the host to which the function will be deployed. Details like packages to be installed and extensions to use are part of this file
- `local.settings.json` - Contains the metadata used by Azure Functions Core Tools to assist the developer in testing the app locally
- `package.json` - Contains the metadata of the project like name, author, GitHub link, packages used etc.
- `.gitignore` - Has a list of file names that the VCS (Git) shouldn't be tracking
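As an illustration only (not this project's actual configuration), a `function.json` for a timer-triggered function with a blob output binding could look like this:

```json
{
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 6 * * *"
    },
    {
      "name": "outputBlob",
      "type": "blob",
      "direction": "out",
      "path": "scrapes/jobs.csv",
      "connection": "AzureWebJobsStorage"
    }
  ]
}
```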
- `scrape.js` - Exports the main scrape function, named `scrape`. This function takes care of creating the `Browser` and `Page` objects and then scrapes all the jobs on the site (a rough sketch follows this list)
- `constants.js` - Contains all the configuration such as the HTML selectors, the config object for the `Browser` etc.
- `utils.js` - Has utilities for error handling and printing to the console
- `scrapeUtils.js` - Contains the code for navigating, clicking and scraping the website that is used in the `scrape` function
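For orientation, here is a rough sketch of how such a `scrape` function can be structured; the URL, the `.jobTuple` selector and the field selectors are assumptions for illustration, not the constants the repository actually uses.

```javascript
// Rough sketch in the spirit of scrape.js. The URL and the selectors are
// assumptions for illustration, not the repository's actual constants.
const puppeteer = require('puppeteer');

async function scrape() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.naukri.com/software-jobs', { waitUntil: 'networkidle2' });

  // Read every job card on the page in a single in-page evaluation
  const jobs = await page.$$eval('.jobTuple', cards =>
    cards.map(card => ({
      title: card.querySelector('a.title')?.innerText ?? '',
      company: card.querySelector('.companyName')?.innerText ?? '',
    }))
  );

  await browser.close();
  return jobs;
}

module.exports = scrape;
```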
This application was deployed to Azure Functions. If you want to deploy it to any other cloud platform, please use `scrape.js`, `constants.js`, `scrapeUtils.js` and `utils.js` as the base files. These export the scraping functionality.
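For example, assuming the exported function keeps the shape sketched earlier, a standalone runner on another platform can be as small as this (the output file name is arbitrary):

```javascript
// Hypothetical standalone runner that reuses the exported scrape() outside
// Azure Functions and writes the result to disk with objects-to-csv.
const scrape = require('./scrape');
const ObjectsToCsv = require('objects-to-csv');

scrape()
  .then(jobs => new ObjectsToCsv(jobs).toDisk('./jobs.csv'))
  .then(() => console.log('Saved jobs.csv'))
  .catch(err => console.error(err));
```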