Many times, we need datasets to perform data-related tasks like data analytics, data cleaning and machine learning. Most data science courses taught in college and online simply hand the dataset to you.
I wanted to explore the data-creation part so I could progress in my data science curriculum.
I could create a dataset using spreadsheet software, but that is boring and time-consuming. So I wanted to automate this task and also try out Puppeteer.
The site I am scraping is Naukri. It is a job aggregator that does not provide any public API for consumption. This makes it the perfect choice for web scraping.
This project was made as a part of the assessment for my Cloud Computing class.
- NodeJS Runtime
- npm (Node Package Manager)
- Chromium or Chrome browser (you don't have to install it separately if you don't have it already; installing the Puppeteer package automatically downloads a compatible Chromium)
- Azure Account
If you want to try this on your computer, clone the repository and install the dependencies using `npm` or `yarn`:
git clone https://github.com/VarunGuttikonda/WebScraper.git
cd WebScraper
npm install
- Puppeteer - A NodeJS library to run Chrome in headless mode and automate it using the `DevTools` protocol
- Objects-to-csv - A library to convert `JSON` objects into `CSV` strings and vice-versa
```
npm install puppeteer objects-to-csv
```
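If you want to confirm that both packages are wired up before touching the scraper, a minimal sanity check looks like the following; the URL and the sample objects are placeholders, not the scraper's real targets.

```javascript
// Sanity check for the two dependencies. The URL and the sample objects
// are placeholders, not the scraper's real targets.
const puppeteer = require('puppeteer');
const ObjectsToCsv = require('objects-to-csv');

(async () => {
  const browser = await puppeteer.launch({ headless: true }); // Chromium in headless mode
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();

  // objects-to-csv converts an array of plain objects into a CSV string
  const csv = await new ObjectsToCsv([{ role: 'Developer', city: 'Hyderabad' }]).toString();
  console.log(csv);
})();
```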
- `async/await` from JavaScript
- `ElementHandle` and `JSHandle` from Puppeteer (illustrated below)
- Connection Strings and Bindings from Azure Functions
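To make the Puppeteer concepts above concrete, here is a small self-contained example of `async/await` together with an `ElementHandle` and a `JSHandle`; the page and selector are stand-ins, not the actual scraping targets.

```javascript
// async/await with Puppeteer's ElementHandle and JSHandle.
// The page and the 'h1' selector are stand-ins, not the actual targets.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // page.$ resolves to an ElementHandle (or null) for the first matching element
  const heading = await page.$('h1');
  if (heading) {
    // evaluateHandle returns a JSHandle wrapping a value that lives inside the page
    const textHandle = await page.evaluateHandle(el => el.textContent, heading);
    console.log(await textHandle.jsonValue());
  }

  await browser.close();
})();
```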
Azure Functions is a Functions-as-a-Service offering from Microsoft Azure. It provides serverless compute for workloads that can be written as a single function.
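A minimal handler in the Node.js programming model looks roughly like this; the binding names `myTimer` and `outputBlob` are assumptions for illustration and have to match whatever `function.json` declares.

```javascript
// Minimal sketch of a timer-triggered Azure Function (Node.js model).
// The binding names "myTimer" and "outputBlob" are assumptions; they must
// match the names declared in function.json.
module.exports = async function (context, myTimer) {
  context.log('Scrape triggered at', new Date().toISOString());

  // Assigning to an output binding hands the value to the Functions host,
  // which persists it (e.g. to a blob) as configured in function.json.
  context.bindings.outputBlob = 'title,company\nDeveloper,Example Corp\n';
};
```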
The following files serve as the configuration of this project:
- `function.json` - Defines the properties of a function, like the names of its bindings, the types of the bindings and their connection strings, along with their direction, i.e. whether a binding is input or output. It should be defined for each function separately (a sample is shown after this list)
- `host.json` - Describes the properties of the host to which the function will be deployed. Details like packages to be installed and extensions to use are part of this file
- `local.settings.json` - Contains the metadata used by Azure Functions Core Tools to assist the developer in testing the app locally
- `package.json` - Contains the metadata of the project like name, author, GitHub link, packages used etc.
- `.gitignore` - Has a list of file names that the VCS (Git) shouldn't be tracking
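As an illustration only (not this project's actual configuration), a `function.json` for a timer-triggered function with a blob output binding could look like this:

```json
{
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 6 * * *"
    },
    {
      "name": "outputBlob",
      "type": "blob",
      "direction": "out",
      "path": "scrapes/jobs.csv",
      "connection": "AzureWebJobsStorage"
    }
  ]
}
```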
- `scrape.js` - Exports the main scrape function, named `scrape`. This function takes care of creating the `Browser` and `Page` objects and then scrapes all the jobs on the site (a rough sketch follows this list)
- `constants.js` - Contains all the configuration such as the HTML selectors, the config object for the `Browser` etc.
- `utils.js` - Has utilities for error handling and printing to the console
- `scrapeUtils.js` - Contains the code for navigating, clicking and scraping the website that is used in the `scrape` function
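For orientation, here is a rough sketch of how such a `scrape` function can be structured; the URL, the `.jobTuple` selector and the field selectors are assumptions for illustration, not the constants the repository actually uses.

```javascript
// Rough sketch in the spirit of scrape.js. The URL and the selectors are
// assumptions for illustration, not the repository's actual constants.
const puppeteer = require('puppeteer');

async function scrape() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.naukri.com/software-jobs', { waitUntil: 'networkidle2' });

  // Read every job card on the page in a single in-page evaluation
  const jobs = await page.$$eval('.jobTuple', cards =>
    cards.map(card => ({
      title: card.querySelector('a.title')?.innerText ?? '',
      company: card.querySelector('.companyName')?.innerText ?? '',
    }))
  );

  await browser.close();
  return jobs;
}

module.exports = scrape;
```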
This application was deployed to Azure Functions. If you want to deploy it to any other cloud platform, please use `scrape.js`, `constants.js`, `scrapeUtils.js` and `utils.js` as the base files. These export the scraping functionality.
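For example, assuming the exported function keeps the shape sketched earlier, a standalone runner on another platform can be as small as this (the output file name is arbitrary):

```javascript
// Hypothetical standalone runner that reuses the exported scrape() outside
// Azure Functions and writes the result to disk with objects-to-csv.
const scrape = require('./scrape');
const ObjectsToCsv = require('objects-to-csv');

scrape()
  .then(jobs => new ObjectsToCsv(jobs).toDisk('./jobs.csv'))
  .then(() => console.log('Saved jobs.csv'))
  .catch(err => console.error(err));
```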