Skip to content

A centralized collection of datasets including data sources for Reddit, RSS feeds, and more for data analysis, machine learning, and NLP tasks.

License

Notifications You must be signed in to change notification settings

skyrisenexus/datasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 

Repository files navigation

Datasets Repository

This repository serves as a centralized collection of datasets used for various data analysis, machine learning, and natural language processing tasks. The datasets are organized into specific categories and maintained with strict versioning and documentation standards.

📁 Project Structure

datasets/
├── README.md                # Main documentation
├── data/                    # Data directory
│   ├── reddit/              # Reddit-related datasets
│   │   ├── README.md        # Reddit data documentation
│   │   └── subreddits.json  # Subreddit configurations
│   └── rss/                 # RSS feed datasets
│       ├── README.md        # RSS data documentation
│       └── rss_sources.json # RSS feed configurations

🎯 Overview

This project maintains a collection of datasets from various sources, primarily focusing on:

  1. Reddit Data: Curated content from specific subreddits

    • Post data
    • Comment threads
    • User interactions
    • Community metrics
  2. RSS Feeds: Structured content from various news and content sources

    • News articles
    • Blog posts
    • Updates and announcements
    • Multi-language content

🚀 Getting Started

Prerequisites

  • Git (2.x or higher)
  • Python 3.8+ (for data processing scripts)
  • JSON processor (e.g., jq for command line operations)

Installation

  1. Clone the repository:

    git clone https://github.com/skyrisenexus/datasets.git
    cd datasets
  2. Install required dependencies (if any):

    pip install -r requirements.txt

📊 Data Structure

Reddit Data

RSS Sources

🛠 Usage

Data Access

  1. Direct Access:

    • Clone the repository
    • Access JSON files directly
    • Use provided scripts (if any) for data processing
  2. API Integration:

    • Follow Reddit API guidelines for Reddit data
    • Use RSS feed standards for RSS data
    • Respect rate limits and terms of service

Data Updates

Data updates follow these principles:

  • Regular updates on a scheduled basis
  • Version control for all changes
  • Documented update procedures
  • Quality checks before commits

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Contribution Process

  1. Fork the repository
  2. Create a feature branch:
    git checkout -b feature/your-feature-name
  3. Commit your changes:
    git commit -m "Add: detailed description of your changes"
  4. Push to your fork
  5. Create a Pull Request

Contribution Guidelines

  1. Code Style

    • Follow existing JSON structure
    • Maintain consistent formatting
    • Include appropriate comments
  2. Documentation

    • Update relevant README files
    • Document any new features
    • Include examples where appropriate
  3. Quality Assurance

    • Validate JSON files
    • Test data integrity
    • Verify source reliability

Pull Request Standards

  • Title: Clear and descriptive
  • Description: Detailed explanation of changes
  • Labels: Add appropriate labels
  • References: Link related issues
  • Tests: Include/update tests if applicable

📝 Documentation Standards

All documentation should:

  • Be written in clear, professional English
  • Include examples and use cases
  • Maintain consistent formatting
  • Be updated with any changes

🔒 Security

  • No sensitive data should be committed
  • API keys and credentials must be kept private
  • Follow security best practices for data handling

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

Data Usage Rights

  • Reddit data is subject to Reddit's Terms of Service
  • RSS feed content is subject to respective source licenses
  • Verify usage rights before implementing in production

📞 Support

  • Create an issue for bugs or feature requests
  • Join our community discussions
  • Check existing documentation first

🔄 Version Control

We follow semantic versioning:

  • MAJOR.MINOR.PATCH
  • Document breaking changes
  • Maintain a changelog

Last updated: [Current Date]

For specific details about each data source, please refer to the README files in their respective directories.

About

A centralized collection of datasets including data sources for Reddit, RSS feeds, and more for data analysis, machine learning, and NLP tasks.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published