🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context
"It's not like finding a needle in a haystack, it is like creating new needles."
🏆 Leaderboard: http://liveideabench.com 💡
Based on invaluable feedback from reviewers, we have upgraded our benchmark to version 2. This update introduces a new dimension, Clarity, and improves the prompts and the evaluation process (including the rejection-handling mechanism), making the benchmark more comprehensive and objective. 🚀
This v2 version of the benchmark incorporates 41 state-of-the-art models, including claude-3.7-sonnet:thinking, o3-mini-high, gpt-4.5-preview, qwq-32b, deepseek-r1, and gemini-2.0-flash-thinking.
We are excited to announce that the latest dataset, including supplementary tests for models like deepseek-R1, deepseek-V3, minimax-01, phi-4, and Opus, has been uploaded to Hugging Face! 🚀
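If you just want to pull the data, the Hugging Face datasets library is the usual route. A minimal sketch, assuming a hypothetical dataset identifier (check our Hugging Face page for the actual ID):

```python
# Minimal sketch of loading the benchmark data from the Hugging Face Hub.
# NOTE: "your-org/LiveIdeaBench" is a placeholder, not the published ID.
from datasets import load_dataset

ds = load_dataset("your-org/LiveIdeaBench")
print(ds)  # inspect available splits and columns
```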
Create the conda environment from the provided configuration file:
conda env create -f environment.yaml
Run the Python script to initialize the database:
python -c "from utils.database import init_database; init_database()"
Before running the program, you need to configure at least one API key:
Create an apikey file and write your OpenRouter API key into it:
echo "your-openrouter-api-key" > apikey
Alternatively, set environment variables:
export OPENROUTER_API_KEY="your-openrouter-api-key"
export STEP_API_KEY="your-step-api-key"
export GEMINI_API_KEYS="key1,key2,key3"
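For context, a loader along these lines would honor both options, preferring the apikey file and falling back to the environment. This is an illustrative sketch, not the exact logic in run.py:

```python
# Illustrative key resolution: apikey file first, then environment variable.
import os
from pathlib import Path

def resolve_openrouter_key() -> str | None:
    key_file = Path("apikey")
    if key_file.exists():
        return key_file.read_text().strip()
    return os.environ.get("OPENROUTER_API_KEY")

print(resolve_openrouter_key())
```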
Generate and evaluate ideas using a specified model:
# Generate ideas using a specified model:
python run.py --idea_model "openai/gpt-4o-mini"
# Use a specific provider:
python run.py --idea_model "openai/gpt-4o-mini" --provider openrouter
# Use a single keyword:
python run.py --idea_model "openai/gpt-4o-mini" --keyword "relativity"
# Use multiple keywords:
python run.py --idea_model "openai/gpt-4o-mini" --keyword "relativity" "periodic table"
# Do not specify a keyword (use all keywords):
python run.py --idea_model "openai/gpt-4o-mini"
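To sweep several models in one session, run.py can be wrapped in a small driver script. A sketch with example model identifiers (substitute whichever models and keywords you want to test):

```python
# Illustrative driver that benchmarks several idea models in sequence.
import subprocess

MODELS = ["openai/gpt-4o-mini", "deepseek/deepseek-r1"]  # example IDs

for model in MODELS:
    subprocess.run(
        ["python", "run.py", "--idea_model", model, "--keyword", "relativity"],
        check=True,
    )
```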
This step extracts the generated ideas, scores, and metadata from the internal database. Run the script:
python view_database.py
Then run stats.ipynb to generate data/data.parquet, which serves as input for the subsequent analysis notebooks.
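Once data/data.parquet exists, it can be inspected with pandas; the exact schema is defined by stats.ipynb, so the printed columns are authoritative:

```python
# Load the consolidated benchmark data for ad-hoc analysis.
import pandas as pd

df = pd.read_parquet("data/data.parquet")
print(df.shape)
print(df.columns.tolist())  # actual columns are defined by stats.ipynb
```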
Fluency measures the diversity and uniqueness of the generated ideas. This script calculates the Fluency score based on the processed data.
Run the script:
python hash.py
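As a rough illustration of the idea (not the exact procedure in hash.py), a hash-based uniqueness ratio can be computed by normalizing each idea and counting distinct digests:

```python
# Illustrative hash-based uniqueness ratio over generated ideas.
# The real scoring, including normalization choices, lives in hash.py.
import hashlib

def uniqueness_ratio(ideas: list[str]) -> float:
    digests = {
        hashlib.sha256(" ".join(idea.lower().split()).encode()).hexdigest()
        for idea in ideas
    }
    return len(digests) / len(ideas) if ideas else 0.0

# The first two ideas normalize to the same text, so the ratio is 2/3.
print(uniqueness_ratio(["A new idea.", "a  new idea.", "Something else."]))
```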
Flexibility evaluates whether the model's generated ideas span diverse scientific disciplines given the input keyword(s). This notebook calculates the Flexibility score and creates visualizations of the benchmark results.
Run the Jupyter Notebook: stats_flexibility.ipynb
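Conceptually, Flexibility can be summarized as how evenly a model's ideas spread over discipline labels. A sketch using normalized entropy, which may differ from the notebook's actual formula:

```python
# Illustrative flexibility measure: normalized entropy over the
# disciplines assigned to a model's ideas (1.0 = evenly spread).
import math
from collections import Counter

def flexibility(disciplines: list[str]) -> float:
    counts = Counter(disciplines)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

print(flexibility(["physics", "chemistry", "physics", "biology"]))
```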
This repository provides an estimate of the CO2 footprint associated with running the idea-generation and evaluation pipeline.
Run the Jupyter Notebook: co2.ipynb
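The underlying arithmetic is simple: energy equals average power times runtime, and emissions equal energy times grid carbon intensity. The constants below are illustrative placeholders; co2.ipynb defines the actual assumptions:

```python
# Back-of-the-envelope CO2 estimate for a benchmark run.
# All constants are assumed values, not those used in co2.ipynb.
AVG_POWER_KW = 0.3       # assumed average draw of the serving hardware (kW)
RUNTIME_HOURS = 12.0     # assumed total wall-clock time of the pipeline
KG_CO2_PER_KWH = 0.4     # assumed grid carbon intensity

energy_kwh = AVG_POWER_KW * RUNTIME_HOURS
co2_kg = energy_kwh * KG_CO2_PER_KWH
print(f"~{energy_kwh:.1f} kWh, ~{co2_kg:.2f} kg CO2")
```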
@article{ruan2024liveideabench,
title={LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context},
author={Kai Ruan and Xuan Wang and Jixiang Hong and Peng Wang and Yang Liu and Hao Sun},
journal={arXiv preprint arXiv:2412.17596},
year={2024}
}