Hausa Language Models

Overview

This repository contains research on, and implementations of, language models for Hausa, one of Africa's major languages, spoken primarily in northern Nigeria, Niger, and other parts of West Africa. The project aims to advance natural language processing for Hausa by developing language models specialized for a range of tasks.

Objectives

Develop pre-trained language models for Hausa
Fine-tune models for specific NLP tasks
Improve accessibility of NLP tools for the Hausa-speaking community
Bridge the technological gap for low-resource languages

Tasks

The models in this repository are being developed for NLP and vision tasks, including but not limited to:

Text generation
Machine translation
Visual question answering

Motivation

Despite being spoken by over 70 million people, Hausa remains underrepresented in current NLP research and applications. This project aims to address this disparity by creating resources that can be used in practical applications and further research.

Contributing

Contributions are welcome! Whether you're a native Hausa speaker, ML practitioner, or NLP researcher, your input can help improve these resources.

Installation

To set up the development environment for this project:

Create a virtual environment using uv:
```
uv venv
```

Activate the virtual environment:

```
# On Unix/Linux/macOS
source .venv/bin/activate
# On Windows
.venv\Scripts\activate
```

Install dependencies:
```
uv pip install -e .
```
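
After installing, it can be useful to confirm that Python is actually running from the activated `.venv`. The snippet below is a minimal sketch that relies only on the standard library; it is not part of this repository, and the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints `False`, the environment was probably not activated in the current shell.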

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Scripts

The repository root contains scripts for training a tokenizer and for training or fine-tuning Hausa language models. Replace the `<...>` placeholders in the commands below with your own values.

To train a custom tokenizer:

```
python3 train_tokenizer.py \
    --base_tokenizer "<BASE_TOKENIZER_NAME>" \
    --dataset_url "<DATASET_NAME>" \
    --subset "<SUBSET_NAME>" \
    --split "<SPLIT_NAME>" \
    --text_column "<TEXT_COLUMN>" \
    --trust_remote_code \
    --push_to_hub \
    --model_id "<YOUR_MODEL_ID>" \
    --token "<YOUR_HF_TOKEN>"
```
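
The `--text_column` flag names the dataset column that holds the raw text. Tokenizer trainers typically consume that column in batches rather than one row at a time; in the `transformers` library, for example, fast tokenizers expose `train_new_from_iterator`, which accepts an iterator of text batches. The sketch below shows only the batching step, using the standard library. It is an assumption about how such a script could feed its trainer, not the code in `train_tokenizer.py`, and `batch_iterator` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def batch_iterator(
    rows: Iterable[dict], text_column: str, batch_size: int = 1000
) -> Iterator[list[str]]:
    """Yield lists of up to batch_size non-empty strings taken from text_column."""
    batch: list[str] = []
    for row in rows:
        text = row.get(text_column)
        if not text:  # skip missing or empty entries
            continue
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


if __name__ == "__main__":
    # Toy rows standing in for a dataset split; "text" plays the role of <TEXT_COLUMN>.
    rows = [
        {"text": "Sannu da zuwa"},
        {"text": ""},
        {"text": "Ina kwana"},
        {"text": "Na gode"},
    ]
    for batch in batch_iterator(rows, text_column="text", batch_size=2):
        print(batch)
```

Batching keeps memory use bounded on large corpora, because only one batch of strings is held at a time.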

To train an xLSTM causal language model:

```
python3 train_xlstm.py \
    --target_model_id "<YOUR_MODEL_ID>" \
    --tokenizer_id "<TOKENIZER_ID>" \
    --hf_token "<YOUR_HF_TOKEN>" \
    --features "<TEXT_COLUMN>" \
    --training_args_file "<PATH_TO_TRAINING_ARGS>" \
    --trust_remote_code
```
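
When launching several runs, it may be convenient to assemble the command from Python instead of editing a shell line by hand. The sketch below builds the argument list from keyword options, treating `True` values as bare switches such as `--trust_remote_code`. The helper `build_command` is hypothetical and not part of this repository; the flag names are the ones shown in the command above.

```python
import shlex
import sys


def build_command(script: str, **options) -> list[str]:
    """Turn keyword options into an argument list for one of the training scripts."""
    command = [sys.executable, script]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            command.append(flag)  # bare switch, e.g. --trust_remote_code
        elif value is False or value is None:
            continue  # disabled or unset options are left out
        else:
            command.extend([flag, str(value)])
    return command


if __name__ == "__main__":
    command = build_command(
        "train_xlstm.py",
        target_model_id="<YOUR_MODEL_ID>",
        tokenizer_id="<TOKENIZER_ID>",
        hf_token="<YOUR_HF_TOKEN>",
        features="<TEXT_COLUMN>",
        training_args_file="<PATH_TO_TRAINING_ARGS>",
        trust_remote_code=True,
    )
    # Only print the command here; pass the list to
    # subprocess.run(command, check=True) to actually launch training.
    print(shlex.join(command))
```

Passing a list to `subprocess.run` avoids shell quoting problems with model IDs or paths that contain spaces.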

To fine-tune a model that is already available on Hugging Face:

```
python3 train_hf.py \
    --source_model_id "<SOURCE_MODEL_ID>" \
    --target_model_id "<YOUR_MODEL_ID>" \
    --tokenizer_id "<TOKENIZER_ID>" \
    --hf_token "<YOUR_HF_TOKEN>" \
    --max_seq_length 512 \
    --features "<TEXT_COLUMN>" \
    --training_args_file "<PATH_TO_TRAINING_ARGS>" \
    --trust_remote_code
```
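
The `--max_seq_length 512` flag caps how many tokens each training example holds. One common way to apply such a cap in causal language model training is to concatenate the tokenized text and cut it into equal blocks, dropping the short remainder. The sketch below illustrates that idea with plain integers standing in for token ids. Whether `train_hf.py` packs, truncates, or pads sequences is not stated in this README, so treat this as an illustrative assumption; `group_into_blocks` is a hypothetical name.

```python
from typing import Iterable


def group_into_blocks(
    token_ids: Iterable[int], max_seq_length: int = 512
) -> list[list[int]]:
    """Split one long stream of token ids into blocks of exactly max_seq_length."""
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    ids = list(token_ids)
    # Drop the trailing remainder so every block has the same length.
    usable = (len(ids) // max_seq_length) * max_seq_length
    return [ids[i : i + max_seq_length] for i in range(0, usable, max_seq_length)]


if __name__ == "__main__":
    blocks = group_into_blocks(range(1300), max_seq_length=512)
    print(len(blocks), [len(block) for block in blocks])  # prints: 2 [512, 512]
```

With 1300 token ids and a block size of 512, two full blocks are kept and the last 276 ids are discarded.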

Citation

If you use these models in your research or applications, please cite this repository as:

@misc{thiombiano2024hausa_lm,
    author = {Thiombiano, Abdoul Majid O.},
    title = {Hausa Language Models},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/thiomajid/hausa_lm}}
}
