This project analyzes YouTube trending video data to understand the characteristics of videos that trend on YouTube.
The data is too large to store on GitHub, so the files must be downloaded from Kaggle.
The project uses the following Python libraries:
- pandas
- matplotlib
- seaborn
- datetime
The project also uses the custom modules create_plots and clean from the src directory.
To run the analysis, run the main script, main.py.
The script analyzes YouTube video data read from youtube_data/USvideos.csv. The steps of the analysis are as follows:
- Parsing Dates: The publish_time and trending_date fields are parsed to datetime.
- Mapping Categories to Genres: The category_id field is mapped to a genre using a JSON file.
- Calculating Time Difference: The time difference between the publish time and the trending date is calculated.
- Removing Outliers: Outliers are removed from the data where the time difference is more than 25 days.
- Grouping by Genre: The data is grouped by genre and the total views and the number of videos per genre are calculated.
- Creating Box Plots: Box plots of the distribution of the time difference are created, both with and without outliers.
- Creating a Histogram: A histogram of the distribution of the time difference without outliers is created.
- Creating Bar Plots: Bar plots are created showing the total views per genre and the total views and the number of videos per genre.
- Checking Correlations: The correlations between genre, likes, dislikes, comment counts, and time difference are checked using a heatmap.
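The data preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py or the clean module. It uses a small made-up DataFrame in place of USvideos.csv, and it assumes the date formats shown in the sample, a category JSON with an "items" list of id and snippet title entries, and that grouping happens after outlier removal. The names time_diff_days, total_views, and video_count are hypothetical.

```python
import pandas as pd

# Made-up sample standing in for youtube_data/USvideos.csv (values are illustrative).
df = pd.DataFrame({
    "publish_time": [
        "2017-11-13T17:13:01.000Z",
        "2017-11-10T07:00:00.000Z",
        "2017-09-01T12:00:00.000Z",
    ],
    "trending_date": ["17.14.11", "17.14.11", "17.14.11"],  # assumed yy.dd.mm format
    "category_id": [10, 24, 10],
    "views": [1000, 2500, 400],
})

# Assumed shape of the category JSON file (normally loaded with json.load).
category_json = {"items": [
    {"id": "10", "snippet": {"title": "Music"}},
    {"id": "24", "snippet": {"title": "Entertainment"}},
]}

# Parsing dates: convert both fields to timezone-naive datetimes.
df["publish_time"] = pd.to_datetime(df["publish_time"], utc=True).dt.tz_localize(None)
df["trending_date"] = pd.to_datetime(df["trending_date"], format="%y.%d.%m")

# Mapping categories to genres: build an id -> title lookup from the JSON.
id_to_genre = {int(item["id"]): item["snippet"]["title"] for item in category_json["items"]}
df["genre"] = df["category_id"].map(id_to_genre)

# Calculating time difference: whole days between publishing and trending.
df["time_diff_days"] = (df["trending_date"] - df["publish_time"]).dt.days

# Removing outliers: keep rows where the difference is at most 25 days.
no_outliers = df[df["time_diff_days"] <= 25]

# Grouping by genre: total views and number of videos per genre.
per_genre = no_outliers.groupby("genre").agg(
    total_views=("views", "sum"),
    video_count=("views", "size"),
)
print(per_genre)
```

In this sample the third video trends more than 25 days after publication, so it is dropped before grouping.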
The script generates several plots and saves them in the current directory. The plots include box plots, a histogram, and bar plots. The plots are also displayed above in the description of the analysis.
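As a rough sketch of how such plots could be produced and saved with matplotlib and seaborn, consider the example below. It is not the code in create_plots. The sample data, the save_plots helper, the output file names, and the choice to encode genre as integer category codes for the heatmap are all assumptions made for illustration.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
df = pd.DataFrame({
    "genre": ["Music", "Music", "Entertainment", "Comedy", "Comedy"],
    "views": [1000, 400, 2500, 800, 1200],
    "likes": [100, 30, 260, 90, 110],
    "dislikes": [5, 2, 20, 4, 9],
    "comment_count": [12, 3, 40, 10, 15],
    "time_diff_days": [0, 3, 1, 5, 2],
})


def save_plots(data, out_dir):
    """Hypothetical helper: write four plots to out_dir and return their paths."""
    out_dir = Path(out_dir)
    paths = []

    # Box plot of the time difference distribution.
    fig, ax = plt.subplots()
    sns.boxplot(x=data["time_diff_days"], ax=ax)
    ax.set_xlabel("Days from publish to trending")
    paths.append(out_dir / "time_diff_boxplot.png")
    fig.savefig(paths[-1])
    plt.close(fig)

    # Histogram of the time difference distribution.
    fig, ax = plt.subplots()
    ax.hist(data["time_diff_days"], bins=10)
    ax.set_xlabel("Days from publish to trending")
    ax.set_ylabel("Number of videos")
    paths.append(out_dir / "time_diff_hist.png")
    fig.savefig(paths[-1])
    plt.close(fig)

    # Bar plot of total views per genre.
    fig, ax = plt.subplots()
    data.groupby("genre")["views"].sum().plot(kind="bar", ax=ax)
    ax.set_ylabel("Total views")
    fig.tight_layout()
    paths.append(out_dir / "views_per_genre.png")
    fig.savefig(paths[-1])
    plt.close(fig)

    # Correlation heatmap; genre is encoded as integer codes (an assumption).
    numeric = data.assign(genre=data["genre"].astype("category").cat.codes)
    fig, ax = plt.subplots()
    sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm", ax=ax)
    fig.tight_layout()
    paths.append(out_dir / "correlation_heatmap.png")
    fig.savefig(paths[-1])
    plt.close(fig)

    return paths


saved = save_plots(df, Path("."))  # current directory, as described above
print([p.name for p in saved])
```

Encoding genre as arbitrary integer codes makes its correlation values depend on the code ordering, so those cells of the heatmap should be read with caution.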