Releases: intel/auto-round
v0.4.7
Highlights
Support W4AFP8 for HPU; please refer to Intel Neural Compressor for guidance on running these models, by @yiliu30 in #467
Support immediate packing in the new quantization API to reduce RAM usage (see the sketch below) by @wenhuach21 in #466
20x AWQ and 4x GPTQ packing speedup on CUDA by @wenhuach21 in #459
Support auto-round-light to speed up the tuning process by @WeiweiZhang1 in #454
Fix a critical bug of mxfp4 in tuning by @wenhuach21 in #451
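For context, a minimal sketch of the quantize-and-export flow these changes touch. The model name is a placeholder and argument names may differ across versions, so treat this as illustrative rather than the exact API introduced by #466:

```python
# Minimal AutoRound flow (illustrative; argument names may differ by version).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
# Packing into the target kernel layout happens around export time; #466 moves
# packing earlier (right after each block is quantized) to reduce peak RAM.
autoround.save_quantized("./opt-125m-w4", format="auto_round")
```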
What's Changed
- step-1 support naive double quant in tuning by @wenhuach21 in #442
- fix critical bug of mxfp4 by @wenhuach21 in #451
- update readme by @wenhuach21 in #455
- update eval by @n1ck-guo in #450
- awq exporting bugfix by @WeiweiZhang1 in #456
- Support force loading into autoround Format by @WeiweiZhang1 in #453
- 20x for awq and 4x for gptq packing speedup by @wenhuach21 in #459
- fix eval bug by @n1ck-guo in #461
- [STEP-1]W4Afp8 export by @wenhuach21 in #378
- [HPU] Update W4A8 for HPU by @yiliu30 in #467
- support for gemma3 by @n1ck-guo in #468
- upload_auto-round-light results by @WeiweiZhang1 in #454
- GGUF support step2: add naive Q2_KS and Q4_KS by @n1ck-guo in #448
- fix incorrect recipe data by @WeiweiZhang1 in #471
- support for mistral3 by @n1ck-guo in #472
- support to export gemma3 gguf format by @n1ck-guo in #470
- Increase unit test timeout from 120 to 240 minutes by @XuehaoSun in #474
- support packing immediately in new quantization api to save ram usage by @wenhuach21 in #466
- rm redundant line break by @WeiweiZhang1 in #475
- Temporarily close qxk api for new release by @n1ck-guo in #478
- add restrict for exporting act-quant models by @n1ck-guo in #480
Full Changelog: v0.4.6...v0.4.7
v0.4.6
Highlights:
1. Set torch compile to False by default in #447 (see the sketch below)
2. Fix packing hang and force to FP16 at exporting in #430
3. Align auto_quantizer with Transformers 4.49 in #437
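If you still want compilation after this change, a minimal sketch follows. It assumes the constructor exposes an `enable_torch_compile` flag; the flag name is an assumption and may differ in this release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# torch.compile is now off by default (#447); opt back in explicitly.
# `enable_torch_compile` is an assumed flag name, not confirmed for this release.
autoround = AutoRound(model, tokenizer, bits=4, enable_torch_compile=True)
autoround.quantize()
```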
What's Changed
- Fix packing hang, torch compile and force to fp16 at exporting by @wenhuach21 in #430
- fix nblocks issues by @wenhuach21 in #432
- rm gc collect in packing by @wenhuach21 in #438
- align auto_quantizer with main branch in Transformers by @WeiweiZhang1 in #437
- [HPU]Fix compile bug when quant layer by @yiliu30 in #441
- remove tricky setting in mxfp4 by @wenhuach21 in #445
- fix bug of evaluate user model by @n1ck-guo in #444
- Refine funcs by @WeiweiZhang1 in #446
- set torch compile to false by default by @WeiweiZhang1 in #447
Full Changelog: v0.4.5...v0.4.6
v0.4.5
Highlights:
We have enhanced support for extremely large models with the following updates:
Multi-Card Tuning Support: added basic support for naive multi-GPU tuning (#415); see the sketch below.
Accelerated Packing Stage: improved packing speed (2x-4x) for the AutoGPTQ and AutoAWQ formats by leveraging CUDA (#407).
DeepSeek V3 GGUF Export: introduced support for exporting models to the DeepSeek V3 GGUF format (#416).
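A sketch of multi-card tuning, under the assumption that passing `device="auto"` lets AutoRound spread tuning across the visible GPUs; the exact knob added in #415 may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder large model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device="auto" is assumed here to trigger the naive multi-card tuning path (#415).
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, device="auto")
autoround.quantize()
autoround.save_quantized("./llama-3.1-70b-w4", format="auto_round")
```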
What's Changed
- update format readme by @wenhuach21 in #411
- fix log bug and device "auto" bug by @n1ck-guo in #409
- speedup packing stage for autogptq and autoawq format by @wenhuach21 in #407
- support naive multi-card tuning by @wenhuach21 in #415
- support bf16 inference for autoround format by @wenhuach21 in #420
- enable backup pile dataset loading by @WeiweiZhang1 in #417
- fix evaluation device bug, relate to issue 413 by @n1ck-guo in #419
- support to export deepseek v3 gguf format by @n1ck-guo in #416
- fix cuda UT torch_dtype by @WeiweiZhang1 in #423
- fix eval trust_remote_code by @n1ck-guo in #424
Full Changelog: v0.4.4...v0.4.5
v0.4.4 release
Highlights:
1. Fix install issue in #387
2. Support exporting the GGUF q4_0 and q4_1 formats in #393 (see the sketch below)
3. Fix LLM command-line seqlen issue in #399
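A sketch of the new GGUF export. The `"gguf:q4_0"` format string is an assumption about how the export added in #393 is selected:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GGUF q4_0 quantizes weights in 32-element blocks, hence group_size=32.
autoround = AutoRound(model, tokenizer, bits=4, group_size=32)
autoround.quantize()
# "gguf:q4_0" is an assumed format string for the q4_0 export added in #393.
autoround.save_quantized("./qwen2.5-0.5b-gguf", format="gguf:q4_0")
```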
What's Changed
- fix a critical bug of static activation quantization by @wenhuach21 in #392
- vlm 70B+ in single card by @n1ck-guo in #395
- enhance calibration dataset and add awq pre quantization warning by @wenhuach21 in #396
- support awq format for vlms by @WeiweiZhang1 in #398
- [critical bug] fix llm example seqlen issue by @WeiweiZhang1 in #399
- fix device auto issue by @wenhuach21 in #400
- Fix auto-round install & bump into 0.4.4 by @XuehaoSun in #387
- fix dtype converting issue by @wenhuach21 in #403
- support for deepseek vl2 by @n1ck-guo in #401
- llm_layer_config_bugfix by @WeiweiZhang1 in #406
- support awq with qbits, only support sym by @wenhuach21 in #402
- support to export gguf q4_0 and q4_1 format by @n1ck-guo in #393
Full Changelog: v0.4.3...v0.4.4
v0.4.3: bug fix release
Highlights:
fix incorrect device setting in autoround format inference by @WeiweiZhang1 in #383
remove the dependency on AutoGPTQ by @XuehaoSun in #380
What's Changed
- support_llava_hf_vlm_example by @WeiweiZhang1 in #381
- fix block_name_to_quantize by @WeiweiZhang1 in #382
- fix incorrect device setting in autoround format inference by @WeiweiZhang1 in #383
- refine homepage, update model links by @WeiweiZhang1 in #385
- update eval basic usage by @n1ck-guo in #384
- refine error msg and dump more log in the tuning by @wenhuach21 in #386
- remove the dependency on AutoGPTQ for CPU and bump to V0.4.3 by @XuehaoSun in #380
Full Changelog: v0.4.2...v0.4.3
v0.4.2: bug fix release
Highlights
1. Fix autoawq exporting issue
2. Remove bias exporting if possible in autogptq format
What's Changed
- bump version into v0.4.1 by @XuehaoSun in #350
- Update docker user and remove baseline UT by @XuehaoSun in #347
- delete llm example and refine readme by @wenhuach21 in #354
- Simulated W4Afp8 Quantization by @wenhuach21 in #331
- add QWQ-32B, VLM, Qwen2.5, Llama3.1 int4 models by @wenhuach21 in #356
- fix awq exporting by @wenhuach21 in #358
- Tensor reshape bugfix by @WeiweiZhang1 in #364
- fix awq backend and fp_layers issue by @wenhuach21 in #363
- fix awq exporting bugs by @wenhuach21 in #365
- fix bug of only_text_test check due to inference issue on cpu by @n1ck-guo in #362
- add gpu test by @wenhuach21 in #367
- using multicard when device set to "auto" by @n1ck-guo in #368
- quant_block_names enhancement by @WeiweiZhang1 in #369
- [HPU] Add lazy mode back by @yiliu30 in #371
- remove bias exporting if possible in autogptq format by @wenhuach21 in #375
- save processor automatically by @n1ck-guo in #372
- Add gpu ut by @wenhuach21 in #370
- fix gpu ut by @n1ck-guo in #376
- fix typos by @wenhuach21 in #377
Full Changelog: v0.4.1...v0.4.2
v0.4.1: bug fix release
Highlights:
- Fixed vllm calibration infinite loop issue
- Corrected the default value for the sym argument in the API configuration.
What's Changed
- fix typo by @wenhuach21 in #342
- vllm/llama-vision llava calibration infinite loop fix by @WeiweiZhang1 in #343
- [HPU] Enhance `numba` check by @yiliu30 in #345
- [VLM] fix bs and grad reset by @n1ck-guo in #344
- [HPU]Enhance installation check by @yiliu30 in #346
- [Critical Bug]API use sym as default by @wenhuach21 in #349
- triton backend requires < 3.0 by @wenhuach21 in #348
Full Changelog: v0.4...v0.4.1
v0.4
Highlights
[Experimental Feature] We provide API support for VLM models (see the sketch below)
[Kernel] We add ipex support for Intel CPU
[Bug fix] We fix a tuning bug for the glm4 model
[Enhancement] Better align `gradient_accumulate_steps` behavior for varied-length input
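A sketch of the experimental VLM API. The class name `AutoRoundMLLM` and the `processor` argument are assumptions based on the MLLM-related PRs (#276, #334), so the real signature may differ:

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from auto_round import AutoRoundMLLM  # assumed class name for the VLM API

model_name = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder VLM
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

autoround = AutoRoundMLLM(model, tokenizer, processor=processor, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qwen2-vl-2b-w4", format="auto_round")
```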
What's Changed
- refine AutoRound format and support marlin repacking by @wenhuach21 in #280
- update readme for v0.3.1 release by @wenhuach21 in #283
- update readme for cpu inference by @wenhuach21 in #284
- avoid deterministic algorithm warning in inference by @wenhuach21 in #285
- fix mx_fp issues by @wenhuach21 in #286
- update torch ao integration information by @wenhuach21 in #287
- Refine code by @wenhuach21 in #291
- Add ipex support for intel cpu by @wenhuach21 in #292
- fix ipex tqdm mismatch issue by @wenhuach21 in #293
- fix bug of backend by @wenhuach21 in #294
- [Experimental Feature] support for common hf multimodal models by @n1ck-guo in #276
- use torch.compile by default for PyTorch versions 2.6 and above by @wenhuach21 in #295
- refine forward hook by @WeiweiZhang1 in #290
- eval for MLLMs by @n1ck-guo in #296
- mllm eval bug fix by @n1ck-guo in #297
- Port Numba-based packing from INC by @yiliu30 in #301
- refine model config file for mixed precision quantization by @wenhuach21 in #300
- fix glm4-9b batch dim issue by @wenhuach21 in #304
- better align gradient_accumulate_steps for varied length input by @wenhuach21 in #309
- Enable torch.compile on HPU by @yiliu30 in #307
- Update autogptq exporting by @wenhuach21 in #310
- fix typo by @wenhuach21 in #311
- qwen2 vision quantization bugfix by @WeiweiZhang1 in #313
- multiple gpu evaluation/calibration refine by @wenhuach21 in #312
- HPU only release binary by @yiliu30 in #302
- patch 1 for mllm by @n1ck-guo in #298
- add torch compile arg by @wenhuach21 in #314
- fix merge error by @n1ck-guo in #316
- Update the check for HPU by @yiliu30 in #318
- fix eval device issue by @wenhuach21 in #319
- fix multiple device bug by @wenhuach21 in #321
- add warning for no gptq exllamav2 kernel by @wenhuach21 in #324
- add pile calib, rename quant_block_list to to_quant_block_names by @WeiweiZhang1 in #322
- fix autogptq version error by @wenhuach21 in #325
- new mllm eval by @n1ck-guo in #317
- Add cpu only version by @XuehaoSun in #315
- set default mllm dataset by @n1ck-guo in #327
- fix fp_layers issue and force to FP16 on cuda for autoround format inference by @wenhuach21 in #326
- fix the bug of test model support for test-only by @n1ck-guo in #328
- Increase unit test timeout to 120 minutes by @XuehaoSun in #330
- fix mllm dataset config bug and add gptq cuda backend by @wenhuach21 in #329
- add tips and tricks for llm&mllm quantization by @wenhuach21 in #333
- fix eval_bs in fake format and reset auto-gptq exporting max_shard_size by @wenhuach21 in #332
- fix model_dtype issue and reformat mllm code by @wenhuach21 in #335
- Exclude markdown files from unit test pipelines by @XuehaoSun in #337
- refine mllm docs by @WeiweiZhang1 in #336
- cogvlm doc by @n1ck-guo in #339
- add qwen2.5 recipe and refine readme by @WeiweiZhang1 in #338
- add cogvlm recipe and refine readme by @WeiweiZhang1 in #340
- refine mllm API and add help info by @n1ck-guo in #334
Full Changelog: v0.3.1...v0.4
Intel® auto-round v0.3.1 Release
Release Highlights:
New Features:
Full-Range Symmetric Quantization: We've introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths such as 2 bits (see the sketch after this list).
Command-Line Support: You can now quantize models from the command line, e.g. `auto-round --model xxx --format xxx`.
Default Exporting Format Change: The default export format has been updated to `auto_round` instead of `auto_gptq`.
Multi-thread Packing: up to 2x speedup in the packing phase.
Bug Fixes:
Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers version 4.45.2.
Mutable Default Values Issue: Addressed problems related to mutable default values.
Fixed a 3-bit packing bug for the AutoGPTQ format.
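A short sketch of the two API-visible changes above: full-range symmetric quantization via `sym=True` and the new `auto_round` default export format. Argument names are illustrative, not confirmed for this exact release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Full-range symmetric quantization is now the default (#278); sym=True is
# shown explicitly here, with 2-bit weights to match the highlight above.
autoround = AutoRound(model, tokenizer, bits=2, group_size=64, sym=True)
autoround.quantize()
# The default export format changed from auto_gptq to auto_round (#205).
autoround.save_quantized("./opt-125m-w2", format="auto_round")
```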
What's Changed
- Add setseed in autoround by @WeiweiZhang1 in #201
- support autoawq format by @yintong-lu in #115
- Remove UT coverage check by @XuehaoSun in #202
- set autoround format as default to unify CPU/HPU/CUDA by @wenhuach21 in #205
- add local file of pile-10k by @WeiweiZhang1 in #198
- modify setup.py by @n1ck-guo in #206
- limit the scale minimum value not to 0 by @WeiweiZhang1 in #211
- fix example dataset regression by @WeiweiZhang1 in #212
- remove local pile file by @WeiweiZhang1 in #213
- update xpu format exporting by @WeiweiZhang1 in #214
- fix a bug in autoround format inference by @wenhuach21 in #215
- avoid underflow and overflow for exllamav2 by @wenhuach21 in #218
- add qwen int4 model, refine example by @WeiweiZhang1 in #217
- [Experimental Feature]fast tuning norm/bias at 2 bits by @wenhuach21 in #208
- update readme by @wenhuach21 in #220
- refine eval_042 to enable parallelize evaluation by @WeiweiZhang1 in #221
- Enable phi3v tuning by @WeiweiZhang1 in #197
- Bump setuptools from 69.5.1 to 70.0.0 in /examples/multimodal-modeling/Phi-3-vision by @dependabot in #223
- refine example by @WeiweiZhang1 in #224
- change the scale thresh generally by @WeiweiZhang1 in #229
- add quantized models by 3rd party by @WeiweiZhang1 in #230
- add meta3.1-70B-instruct model, refine docs by @WeiweiZhang1 in #231
- fix model link by @WeiweiZhang1 in #232
- refine docs, add accuracy data, add receip and eval scripts by @WeiweiZhang1 in #226
- add brief formats introduction by @wenhuach21 in #236
- update readme and add itrex in the requirements.txt by @wenhuach21 in #238
- add tritonv2, improve packing and pbar by @wenhuach21 in #239
- refine the code and the speedup is notable by @wenhuach21 in #240
- move some settings from example to main by @wenhuach21 in #241
- add runable script for autoround by @n1ck-guo in #225
- update readme by @n1ck-guo in #242
- Add MANIFEST.in file to include requirements.txt by @XuehaoSun in #243
- fix example bug by @n1ck-guo in #245
- enable llava int4 inference with autoround format by @WeiweiZhang1 in #237
- remove autoawq requirement at packing stage by @n1ck-guo in #249
- remove unused log by @n1ck-guo in #252
- support INC API by @WeiweiZhang1 in #255
- avoid potential bug for auto-gptq 0.8 by @wenhuach21 in #250
- fix example by @n1ck-guo in #256
- fix preci by @n1ck-guo in #258
- enable_qwen2-vl_quantization by @WeiweiZhang1 in #248
- update eval and fix example by @n1ck-guo in #260
- refine autoawq exporting code by @wenhuach21 in #261
- better support quant_lm_head for larger models by @wenhuach21 in #263
- Fix 3bit packing for auto-gptq format by @wenhuach21 in #264
- Add a warning for improper export formats. by @wenhuach21 in #265
- Update readme for VLM support and integration by @wenhuach21 in #266
- remove g_idx in gptq format by @wenhuach21 in #267
- keep the dtype after qdq by @wenhuach21 in #268
- enable llama3.2-vision model quantization by @WeiweiZhang1 in #269
- fix mutable default value by @wenhuach21 in #272
- change to even rounding for mantissa of mx_fp by @wenhuach21 in #277
- adamround bugfix, refine import by @WeiweiZhang1 in #275
- [Important Change]set full range sym as the default by @wenhuach21 in #278
- refine eval by @wenhuach21 in #282
- qwen2_bugfix, add adamround vision UT by @WeiweiZhang1 in #281
New Contributors
- @dependabot made their first contribution in #223
Full Changelog: v0.3...v0.3.1
Intel® auto-round v0.3 Release
Highlights:
- Broader Device Support:
  - Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue.
- New Recipes and Model Releases:
  - Published numerous recipes on the Low Bit Open LLM Leaderboard, showcasing impressive results on LLaMa 3.1 and other leading models.
- Experimental Features:
  - Introduced several experimental features, including activation quantization and `mx_fp`, with promising outcomes with AutoRound.
- Multimodal Model Support:
  - Extended capabilities for tuning and inference across several multimodal models.
Lowlights:
- Implemented support for `low_cpu_mem_usage`, the `auto_awq` format, calibration dataset concatenation, and calibration datasets with chat templates.