Releases: IBM/unitxt
Unitxt 1.26.5
What's Changed
- For load_dataset, use_cache default value is taken from settings by @eladven in #1880
- Support watsonx.ai on-prem credentials by @pratapkishorevarma in #1883
- extend condition to also filter by whether a field exists or not by @dafnapension in #1879
- fix performance test by @dafnapension in #1884
- Add support for inline-defined templates in the UI by @Chemafiz in #1886
- Mitigate HTTP 403 errors in pandas by @bnayahu in #1888
- Biggen benchmark and pearson correlation metric by @martinscooper in #1887
- Update version to 1.26.5 by @elronbandel in #1889
New Contributors
- @pratapkishorevarma made their first contribution in #1883
- @Chemafiz made their first contribution in #1886
Full Changelog: 1.26.4...1.26.5
Unitxt 1.26.4
What's Changed
- Add more Judgebench benchmarks by @martinscooper in #1869
- Make sqlite3 not an optional dependency by @elronbandel in #1871
- Removed legacy topicality, idk, and groundness metrics that worked only on BAM by @yoavkatz in #1875
- Bench and models by @martinscooper in #1872
- Handle a case in ToolCallPostProcessor where prediction is an empty list of tools by @yoavkatz in #1874
- Update version to 1.26.4 by @elronbandel in #1876
Full Changelog: 1.26.3...1.26.4
Unitxt 1.26.3
What's Changed
- LLM Judge: Improve context/prediction fields parsing by @martinscooper in #1856
- Fixed bug in tool inference by @yoavkatz in #1868
- Added a new MetricBasedNer that allows calculating entity similarity using any Unitxt metric by @yoavkatz in #1860
- Update version to 1.26.3 by @elronbandel in #1870
Full Changelog: 1.26.2...1.26.3
Unitxt 1.26.2
What's Changed
- Add tot dataset by @elronbandel in #1865
- Add tokenizer_name to base huggingface inference engines by @elronbandel in #1862
- Add hf to cross provider inference engine by @yoavkatz in #1866
- Update version to 1.26.2 by @elronbandel in #1867
Full Changelog: 1.26.1...1.26.2
Unitxt 1.26.1
Lock datasets dependency to <4.0.0
The latest datasets v4.0.0 release removes support for loading datasets with trust_remote_code=True. This change breaks compatibility with many datasets currently in the Unitxt catalog, as several datasets require this feature to load properly.
This patch restricts the datasets version to below 4.0.0 until we can find or develop replacements for affected datasets.
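If you manage the dependency yourself, here is a minimal sanity check (assuming the packaging library is installed) that your environment satisfies the constraint:

from packaging import version
import datasets

# Unitxt 1.26.1 pins datasets below 4.0.0 so catalog datasets relying on trust_remote_code still load
assert version.parse(datasets.__version__) < version.parse("4.0.0")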
Unitxt 1.26.0 - Multi Threading
Main changes:
- Made Unitxt thread-safe so it can run in multi-threaded environments.
- Added an option to set the sampling seed for demos (in-context examples) via demos_sampling_seed, which allows running the same dataset with different demo examples (see the sketch after the example below).
- Improved printouts of instance scores with to_markdown() and summary. For example:
results = evaluate(predictions=predictions, data=dataset)
print(results.instance_scores.summary)
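
A minimal sketch combining both features; the card, template, parameter values, and placeholder predictions are illustrative, and demos_sampling_seed is passed like any other recipe argument:

from unitxt import load_dataset, evaluate

dataset = load_dataset(
    card="cards.wnli",  # illustrative catalog card
    template="templates.classification.multi_class.relation.default",
    num_demos=3,
    demos_pool_size=10,
    demos_sampling_seed=42,  # fixes which in-context demos are sampled
    loader_limit=100,
    split="test",
)

predictions = ["entailment"] * len(dataset)  # placeholder predictions
results = evaluate(predictions=predictions, data=dataset)
print(results.instance_scores.summary)        # compact per-instance table
print(results.instance_scores.to_markdown())  # markdown rendering of the same scores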

All changes:
- Add to_markdown() to InstanceScores to pretty print output by @yoavkatz in #1846
- Improved InstanceScores summary to be readable and in decent width by @yoavkatz in #1847
- Improve multi turn tool calling example by @elronbandel in #1848
- Add metrics documentation including range, directionality and references by @elronbandel in #1850
- Fix sacrebleu documentation by @elronbandel in #1851
- Add F1 score documentation to F1Fast metric class by @elronbandel in #1852
- Add more llmjudge benchmarks by @martinscooper in #1804
- Fix llama scout name and url on rits by @martinscooper in #1857
- Add demos_sampling_seed to recipe api by @elronbandel in #1858
- Add comprehensive multi threading support and tests by @elronbandel in #1853
- Update BlueBench to match the original implementation by @bnayahu in #1855
Full Changelog: 1.25.0...1.26.0
Unitxt 1.25.0 - Improved Error Messages
Main changes
- Error messages were simplified and improved. Each failure now produces a short stack trace, followed by the context in which the error occurred, a link to the relevant help documentation, and then the detailed error message:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 🦄 Unitxt Error Context │
│ -------------------------------------------------------------------------------------------------------------------- │
│ - Python: 3.10.17 │
│ - Unitxt: 1.25.0 │
│ - Stage: Metric Processing │
│ - Stream: all_data>> │
│ - Object: KeyValueExtraction (https://www.unitxt.ai/en/latest/unitxt.metrics.html#unitxt.metrics.KeyValueExtraction)│
│ - Help: https://www.unitxt.ai/en/latest/docs/adding_metric.html │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Each reference is expected to be of type 'Dict[str, str]' in metrics.key_value_extraction.accuracy metric. Received reference of type <class 'str'>: Austin
- Added Granite Thinking support, including an example.
- Added a flag in the format to determine whether to place the template instructions once in the system turn, or in the user turns (for each demo and for the final input). This is important because some models delete their default system prompt when they receive an external system prompt.
- Added an option to get the generated text in the metadata when calling infer_log_prob(). Previously only the individual tokens were returned. See the example code (a hedged sketch also follows this list).
- Added support for multi-turn dialog metrics. See the tool calling example.
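For reference, a heavily hedged sketch of retrieving the full generated text together with log probabilities; the engine class, model and provider names, the return_meta_data flag, and the generated_text field are assumptions inferred from the notes above rather than a confirmed API, so consult the inference documentation for the exact names:

from unitxt import load_dataset
from unitxt.inference import CrossProviderInferenceEngine  # assumed import path

dataset = load_dataset(
    card="cards.wnli",  # illustrative catalog card
    template="templates.classification.multi_class.relation.default",
    loader_limit=5,
    split="test",
)

engine = CrossProviderInferenceEngine(model="llama-3-3-70b-instruct", provider="watsonx")  # illustrative
outputs = engine.infer_log_probs(dataset, return_meta_data=True)  # assumed method and flag names
for output in outputs:
    print(output.generated_text)  # full generated text, returned with metadata as of 1.25.0 (assumed field)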
What's Changed
- Add Multi Turn Metrics Support by @elronbandel in #1579
- add a test for faithfulness with an external client and fetch artifact by @matanor in #1824
- Fix rits model names and the judges that use them by @martinscooper in #1825
- Add option to store template instruction in user role and not system role and added granite thinking example by @yoavkatz in #1667
- Bluebench fixes by @bnayahu in #1828
- Fix huggingface auto model log probs evaluation by @elronbandel in #1829
- Add support for tool calling in HFAutoModelInferenceEngine by @elronbandel in #1827
- Changed artifact.to_yaml() to use standard dict to yaml API + added example to create a yaml representation of a data card by @yoavkatz in #1831
- Fix CLI issues by @bnayahu in #1832
- Allow changing default ollama api_base by @martinscooper in #1830
- fix bug when WML does not return any content or tool call by @yoavkatz in #1835
- Arena hard fix by @bnayahu in #1836
- Add full generated text when running infer_log_prob() with meta data enabled. by @yoavkatz in #1834
- Improved parsing of MT bench style scores by @yoavkatz in #1839
- Use os.path.join to create infer cache dir path by @martinscooper in #1840
- Add multi turn tool calling task and support multiple tools per call by @elronbandel in #1811
- Results summarization utility for the CLI by @bnayahu in #1842
- Improved error messages by @elronbandel in #1838
- Improve Text2SQL Metrics: Refactoring, New Execution Metric, and Bug Fixes by @oktie in #1841
- Update coverage exclusions by @elronbandel in #1843
- Use full artifact representation as the cache key for the dataset by @elronbandel in #1644
- Update version to 1.25.0 by @elronbandel in #1844
Full Changelog: 1.24.0...1.25.0
Unitxt 1.24.0
What's Changed
- External client for wml infer engine by @matanor in #1817
- Improved JoinStream error messages by @yoavkatz in #1819
- Added param to control confidence interval calculation in evaluate api by @yoavkatz in #1815
- extend code coverage some by @dafnapension in #1814
- Make api_key_env_var optional in LoadFromAPI by @martinscooper in #1799
- Fix Issue with multi byte token decoding by @elronbandel in #1821
- Fix ruff format pre-commit by @elronbandel in #1822
- Test eval utils with external client by @matanor in #1820
- Improved and Optimized JaccardIndex, Spearman, StringContainment metrics and added MSE and RMSE metrics by @elronbandel in #1816
- Update version to 1.24.0 by @elronbandel in #1823
Full Changelog: 1.23.1...1.24.0
Unitxt 1.23.1
What's Changed
- Add more metrics for schema linking by @kurhula in #1788
- Fixed argument_value_precision by @yoavkatz in #1794
- Fix granite guardian agentic metric and align it with unitxt built in tool calling types by @elronbandel in #1786
- Allow running benchmarks and recipes in cli by @elronbandel in #1785
- Add ToRR Benchmark Readme file by @csrajmohan in #1793
- Add tool calling correctness metric by @elronbandel in #1796
- Remove IBM branding from opensource doc by @yoavkatz in #1802
- Add LoadJsonFile loader and tests by @elronbandel in #1801
- LLM judge judgebench benchmarks by @martinscooper in #1800
- Added granite tool calling system prompt by @Narayanan-V-Eswar in #1798
- Documentation updates by @yoavkatz in #1790
- Cards for the Real MM RAG datasets by @assaftibm in #1795
- Add more judges by @martinscooper in #1808
- Fixed problematic load of json with a single dictionary line. by @yoavkatz in #1806
- Add more cross provider models by @martinscooper in #1807
- Fix model name by @martinscooper in #1809
- watsonx.ai mistral small support by @LukaszCmielowski in #1810
- Fix: number of batches calculation is incorrect by @martinscooper in #1805
- Fix example dependencies installation by @elronbandel in #1812
- Update version to 1.23.1 by @elronbandel in #1818
New Contributors
- @kurhula made their first contribution in #1788
- @LukaszCmielowski made their first contribution in #1810
Full Changelog: 1.23.0...1.23.1
Unitxt 1.23.0
Main changes
- Revised the tool calling tasks and metrics introduced in 1.22.4 - a non-backward-compatible change. Existing datasets have been addressed.
- Fixed support for running HF models with HFAutoModelInferenceEngine (multi-GPU + tokenization issue)
- Added to_yaml() to create a yaml representation of the card that can be used for running custom datasets in Granite.build (see the sketch below)
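A minimal sketch of producing the YAML representation of a card; fetch_artifact and the catalog entry name are used here for illustration:

from unitxt.artifact import fetch_artifact

card, _ = fetch_artifact("cards.wnli")  # illustrative catalog card
print(card.to_yaml())                   # yaml representation added in this release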
What's Changed
- Fix batching support for hf Dataset in HFAutoModelInferenceEngine by @elronbandel in #1771
- Fix litellm inference without task_data by @elronbandel in #1772
- Added to_yaml shorthand function to artifact by @yoavkatz in #1768
- Simplify tool calling base types by @elronbandel in #1773
- Added tool calling to wml chat by @pawelknes in #1782
- Reverting to datasets=351 can solve problems in test catalog preparation by @dafnapension in #1784
- Update ibm wml engine #1775 by @MikolajCharchut in #1781
- Fix HF AutoModel tokenization issue with chat template + issue with multi GPU by @OfirArviv in #1779
- Performance to report accurate times based on end-to-end time() diffs, rather than accumulate cProfile numbers over methods whose names seem relevant by @dafnapension in #1783
- Add support to mix args and textual query in load_dataset by @elronbandel in #1778
- Add installation of spacy as a binary dependency for examples regression tests by @elronbandel in #1787
- Improvements to tool calling - NON BACKWARD COMPATIBLE CHANGES by @Narayanan-V-Eswar in #1770
- Added example for standalone metric evaluation by @yoavkatz in #1769
- Update version to 1.23.0 by @elronbandel in #1789
New Contributors
- @Narayanan-V-Eswar made their first contribution in #1770
Full Changelog: 1.22.4...1.23.0