This repository serves as a centralized collection of datasets used for various data analysis, machine learning, and natural language processing tasks. The datasets are organized into specific categories and maintained with strict versioning and documentation standards.
datasets/
├── README.md                    # Main documentation
├── data/                        # Data directory
│   ├── reddit/                  # Reddit-related datasets
│   │   ├── README.md            # Reddit data documentation
│   │   └── subreddits.json      # Subreddit configurations
│   └── rss/                     # RSS feed datasets
│       ├── README.md            # RSS data documentation
│       └── rss_sources.json     # RSS feed configurations
This project maintains a collection of datasets from various sources, primarily focusing on:
- Reddit Data: Curated content from specific subreddits
  - Post data
  - Comment threads
  - User interactions
  - Community metrics
- RSS Feeds: Structured content from various news and content sources
  - News articles
  - Blog posts
  - Updates and announcements
  - Multi-language content
- Git (2.x or higher)
- Python 3.8+ (for data processing scripts)
- JSON processor (e.g., jq, for command-line operations)
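The prerequisites above can be checked before setup with a short script. This is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, `jq` stands in for whichever JSON processor you use, and the check only confirms that `git` is on the PATH, not that it is version 2.x or higher.

```python
import shutil
import sys
from typing import Dict

MIN_PYTHON = (3, 8)             # taken from the prerequisites list
REQUIRED_TOOLS = ("git", "jq")  # jq is one example of a JSON processor


def check_prerequisites() -> Dict[str, bool]:
    """Map each prerequisite name to whether it appears to be available."""
    results = {"python>=3.8": sys.version_info >= MIN_PYTHON}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```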
- Clone the repository:
  git clone https://github.com/skyrisenexus/datasets.git
  cd datasets
- Install required dependencies (if any):
  pip install -r requirements.txt
- Located in /data/reddit/
- Configured via subreddits.json
- Includes metadata about subreddits and their categories
- See Reddit README for detailed information
- Located in /data/rss/
- Configured via rss_sources.json
- Supports multiple languages and regions
- See RSS README for detailed information
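Both configuration files are plain JSON, so they can be read with the Python standard library. The sketch below assumes it is run from the repository root and deliberately makes no assumption about the schema of either file; the helper names `load_config` and `summarize` are illustrative only. Consult the per-directory README files for the actual field layout.

```python
import json
from pathlib import Path
from typing import Any

DATA_DIR = Path("data")  # relative to the repository root


def load_config(path: Path) -> Any:
    """Parse one JSON configuration file and return the decoded object."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(config: Any) -> str:
    """Describe the top-level shape without assuming a particular schema."""
    if isinstance(config, dict):
        return f"object with keys: {sorted(config)}"
    if isinstance(config, list):
        return f"array with {len(config)} items"
    return f"scalar of type {type(config).__name__}"


if __name__ == "__main__":
    for relative in ("reddit/subreddits.json", "rss/rss_sources.json"):
        target = DATA_DIR / relative
        if target.exists():
            print(relative, "->", summarize(load_config(target)))
        else:
            print(relative, "-> not found (run from the repository root)")
```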
- Direct Access:
  - Clone the repository
  - Access JSON files directly
  - Use provided scripts (if any) for data processing
- API Integration:
  - Follow Reddit API guidelines for Reddit data
  - Use RSS feed standards for RSS data
  - Respect rate limits and terms of service
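One simple way to respect rate limits is to enforce a minimum interval between consecutive requests. The class below is a hypothetical helper, not something this repository provides, and the interval you pass in is your own choice: check the terms of each API or feed for the limits that actually apply.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_call: Optional[float] = None  # no request made yet

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay
```

Call `limiter.wait()` immediately before each outgoing request. The clock and sleep functions are parameters only so that the behavior can be verified without real delays.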
Data updates follow these principles:
- Updates on a regular schedule
- Version control for all changes
- Documented update procedures
- Quality checks before commits
We welcome contributions from the community! Here's how you can help:
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature-name
- Commit your changes: git commit -m "Add: detailed description of your changes"
- Push to your fork
- Create a Pull Request
- Code Style
  - Follow existing JSON structure
  - Maintain consistent formatting
  - Include appropriate comments
- Documentation
  - Update relevant README files
  - Document any new features
  - Include examples where appropriate
- Quality Assurance
  - Validate JSON files
  - Test data integrity
  - Verify source reliability
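The "Validate JSON files" step can be automated before committing. The following is a sketch under the assumption that all JSON lives under `data/`; the function names are hypothetical, and the check covers syntax only, not data integrity or source reliability.

```python
import json
from pathlib import Path
from typing import List, Tuple


def find_invalid_json(root: Path) -> List[Tuple[Path, str]]:
    """Return (path, error message) for each *.json file under root that fails to parse."""
    failures = []
    for path in sorted(root.rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((path, str(exc)))
    return failures


def main(root: Path = Path("data")) -> int:
    """Print every failure and return a shell-style exit code."""
    if not root.is_dir():
        print(f"{root}: directory not found")
        return 2
    problems = find_invalid_json(root)
    for path, message in problems:
        print(f"{path}: {message}")
    return 1 if problems else 0
```

To use it as a pre-commit check, end the script with `raise SystemExit(main())` so that a non-zero exit code blocks the commit.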
- Title: Clear and descriptive
- Description: Detailed explanation of changes
- Labels: Add appropriate labels
- References: Link related issues
- Tests: Include/update tests if applicable
All documentation should:
- Be written in clear, professional English
- Include examples and use cases
- Maintain consistent formatting
- Be updated with any changes
- No sensitive data should be committed
- API keys and credentials must be kept private
- Follow security best practices for data handling
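As one possible safeguard against committing credentials, parsed JSON can be scanned for key names that look sensitive. This is a rough heuristic sketch: the pattern and the function name `suspicious_paths` are assumptions for illustration, it inspects key names only (not values), and it does not replace a dedicated secret scanner.

```python
import re
from typing import Any, Iterator

# Hypothetical pattern; extend it to match the credential names you use.
SUSPICIOUS_KEY = re.compile(r"(api[_-]?key|secret|token|password)", re.IGNORECASE)


def suspicious_paths(node: Any, prefix: str = "$") -> Iterator[str]:
    """Yield JSON paths whose key names look like they may hold credentials."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}"
            if SUSPICIOUS_KEY.search(key):
                yield path
            # Keep descending so nested objects are checked as well.
            yield from suspicious_paths(value, path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from suspicious_paths(item, f"{prefix}[{index}]")
```

Pass it the result of `json.load` for each file; any yielded path deserves a manual look before the file is committed.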
This project is licensed under the MIT License - see the LICENSE file for details.
- Reddit data is subject to Reddit's Terms of Service
- RSS feed content is subject to respective source licenses
- Verify usage rights before implementing in production
- Create an issue for bugs or feature requests
- Join our community discussions
- Check existing documentation first
We follow semantic versioning:
- MAJOR.MINOR.PATCH
- Document breaking changes
- Maintain a changelog
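To make the MAJOR.MINOR.PATCH scheme concrete, here is a small sketch of parsing and bumping a version string. The `Version` class is illustrative and not part of this repository, and it handles only the three numeric components: pre-release and build-metadata suffixes are out of scope.

```python
import re
from typing import NamedTuple

# Three dot-separated non-negative integers without leading zeros.
_PATTERN = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        match = _PATTERN.fullmatch(text)
        if match is None:
            raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
        return cls(*(int(part) for part in match.groups()))

    def bump(self, part: str) -> "Version":
        """Increment one component and reset the lower-order ones to zero."""
        if part == "major":  # breaking change
            return Version(self.major + 1, 0, 0)
        if part == "minor":  # backwards-compatible addition
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":  # backwards-compatible fix
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown version part: {part!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

Because `Version` is a tuple, versions compare component by component, so `Version.parse("1.10.0")` sorts after `Version.parse("1.9.3")`, which plain string comparison would get wrong.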
Last updated: [Current Date]
For specific details about each data source, please refer to the README files in their respective directories.