Commit Graph

  • 76684141a5 ruby : fix dangling pointers, memory leak, and SEGV on parallel transcription (#3715) master KITAITI Makoto 2026-03-22 02:03:00 +09:00
  • 9386f23940 release : v1.8.4 v1.8.4 Georgi Gerganov 2026-03-19 10:40:13 +02:00
  • ef3463bb29 ci : update workflows Georgi Gerganov 2026-03-18 22:43:38 +02:00
  • 4bbce1e5b2 benches : update gg/benches-update Georgi Gerganov 2026-03-18 22:34:51 +02:00
  • f5b477ab09 sync : ggml Georgi Gerganov 2026-03-18 14:45:25 +02:00
  • b2be16208d ggml : bump version to 0.9.8 (ggml/1442) Georgi Gerganov 2026-03-16 20:15:14 +02:00
  • 945d3151d9 ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441) Georgi Gerganov 2026-03-16 20:09:25 +02:00
  • dc96116622 fix: VAD time mapping timestamp drift caused by overlap samples (#3711) lohopupa 2026-03-17 12:19:08 +06:00
  • 79218f51d0 go : handle EOF correctly in model download (#3671) Alan 2026-03-16 12:44:18 +01:00
  • 975b979834 py : replace deprecated openvino-dev with openvino>=2023.3.0 (#3678) Aiudadadadf 2026-03-16 12:41:54 +01:00
  • 21665eab4c examples : Allow max_len to be used for any output format (#3679) Gaël James 2026-03-16 12:33:56 +01:00
  • 136dc2eb12 server: return proper HTTP status codes for error responses (#3707) Igor Loskutov 2026-03-16 07:33:06 -04:00
  • 27fa20774a ggml : try fix arm build (#0) Georgi Gerganov 2026-03-16 09:11:13 +02:00
  • 2bc630f197 talk-llama : sync llama.cpp Georgi Gerganov 2026-03-16 07:16:46 +02:00
  • ab1252c19e sync : ggml Georgi Gerganov 2026-03-16 07:13:51 +02:00
  • d4bc312169 ggml : extend im2col f16 (ggml/1434) David366AI 2026-03-15 15:50:56 -04:00
  • 81ea958719 common : add nvfp4 (ggml/0) Georgi Gerganov 2026-03-15 19:56:19 +02:00
  • d7926e62d4 CUDA: limit number of FA stream-k CUDA blocks (llama/20586) Johannes Gäßler 2026-03-15 18:30:47 +01:00
  • 2fb6aea8ad ggml: avoid creating CUDA context during device init (llama/20595) Pascal 2026-03-15 17:42:56 +01:00
  • b327a321a2 ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (llama/20536) MoonShadow 2026-03-16 00:23:58 +08:00
  • 6770239830 ggml : guard against sumq2 being 0 in IQ4_NL (llama/20460) Bartowski 2026-03-15 04:47:28 -04:00
  • 55c66106af cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (llama/19478) PikaPikachu 2026-03-15 15:33:39 +08:00
  • cd02195b8f vulkan: use graphics queue on AMD (llama/20551) Ruben Ortlam 2026-03-15 08:18:54 +01:00
  • b312018435 metal : add FA specialization for HSK = 320, HSV = 256 (llama/20549) Georgi Gerganov 2026-03-14 23:15:47 +02:00
  • 55f8cfdaed hexagon: Q4_0 and MXFP4 repack fixes (llama/20527) Max Krasnyansky 2026-03-14 11:09:08 -07:00
  • c5f9a49b51 add op gated_delta_net (llama/20455) Neo Zhang 2026-03-14 22:01:57 +08:00
  • 93d09fdb23 ggml : add native AVX512-FP16 support for F16 operations (llama/20529) Adrien Gallouët 2026-03-14 10:06:14 +01:00
  • 8ad5cb1e9d Use fp32 in cuBLAS V100 to avoid overflows, env variables to override cuBLAS compute type (llama/19959) Wallentri 2026-03-14 10:43:13 +03:00
  • 96b163e874 ggml : add OpenVINO backend (llama/15307) Zijun Yu 2026-03-14 13:56:55 +08:00
  • 46aad766f5 Fix data race in CUDA's "cpy" kernel (influences GGML's DUP, CONT operations). (llama/20507) Rail Chabdarov 2026-03-14 06:19:44 +01:00
  • a31600d8e3 opencl: fix l2_norm (llama/20480) lhez 2026-03-13 22:18:52 -07:00
  • c7abcd577b graph : remove redundant GDN state transposes (llama/20443) Georgi Gerganov 2026-03-13 22:12:54 +02:00
  • 5905e8708f ggml-cpu: add RVV vec dot kernels for quantization types (llama/18859) rehan-10xengineer 2026-03-13 20:36:04 +05:00
  • 9bfa81d262 ggml : fix typo gmml (llama/20512) Adrien Gallouët 2026-03-13 14:36:13 +01:00
  • f1f5f43d69 metal : fix l2 norm scale (llama/20493) Georgi Gerganov 2026-03-13 11:43:20 +02:00
  • 2ed6dc0222 llama : disable graph reuse with pipeline parallelism (llama/20463) Georgi Gerganov 2026-03-12 21:04:13 +02:00
  • 2450919665 vulkan: add GATED_DELTA_NET op support (llama/20334) ProgenyAlpha 2026-03-12 06:32:04 -04:00
  • 44c12c642e vulkan: fix SSM_CONV PP scaling with large ubatch sizes (llama/20379) ProgenyAlpha 2026-03-12 05:03:18 -04:00
  • 7e816a99d2 sync : ggml Georgi Gerganov 2026-03-16 07:13:14 +02:00
  • b48ffe28fc metal : avoid divisions in bin kernel (llama/20426) Georgi Gerganov 2026-03-16 07:12:50 +02:00
  • 7ccebd5264 sync : ggml Georgi Gerganov 2026-03-16 07:12:37 +02:00
  • 86e312d61d vulkan: fix l2_norm epsilon handling (llama/20350) Jeff Bolz 2026-03-12 00:39:41 -05:00
  • 6c5e3aac3e vulkan: fix OOB check in flash_attn_mask_opt (llama/20296) Jeff Bolz 2026-03-12 00:35:49 -05:00
  • 26ee4f7362 vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (llama/20059) Masato Nakasaka 2026-03-11 22:30:16 -07:00
  • d5772cf7b2 opencl: use larger workgroup size for get_rows (llama/20316) lhez 2026-03-11 22:03:27 -07:00
  • 193781cf0e opencl: add cumsum op (llama/18981) shaofeiqi 2026-03-11 22:03:07 -07:00
  • f5ba865378 hip: compile debug builds with -O2 on hip to avoid a compiler bug (llama/20392) uvos 2026-03-12 03:37:10 +01:00
  • 5267523829 ggml-webgpu: Add support for GGML_OP_REPEAT (llama/20230) Masashi Yoshimura 2026-03-12 06:40:36 +09:00
  • d73fe25267 llama : enable chunked fused GDN path (llama/20340) Georgi Gerganov 2026-03-11 22:46:40 +02:00
  • e4021d4071 ggml : add NVFP4 quantization type support (llama/19769) Richard Davison 2026-03-11 21:02:54 +01:00
  • 5d3a5447c8 llama : add support for Nemotron 3 Super (llama/20411) Daniel Bevenius 2026-03-11 19:27:53 +01:00
  • e2aa5c73f3 metal : fix capture_compute counter logic (llama/20410) Georgi Gerganov 2026-03-11 18:38:22 +02:00
  • 0e1e76f93b metal : fix q5_k mul_mv register spill (llama/20399) Georgi Gerganov 2026-03-11 16:25:27 +02:00
  • c2e384f21e metal : add env var to trigger graph capture (llama/20398) Georgi Gerganov 2026-03-11 16:25:10 +02:00
  • 8b335550cf ggml-cuda: gdn use shared mem for HIP (llama/20366) uvos 2026-03-11 06:06:19 +01:00
  • 7c9a16c565 cuda/hip: fix loop unrolling in ssm-conv (llama/20369) uvos 2026-03-11 06:04:32 +01:00
  • 286387ef0a fix op rope, add rope_back (llama/20293) Neo Zhang 2026-03-11 09:53:34 +08:00
  • 72c7a2532d fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (llama/20283) Neo Zhang 2026-03-11 09:53:05 +08:00
  • 1e05b10d67 ggml : bump RPC version (llama/20330) Georgi Gerganov 2026-03-10 21:36:57 +02:00
  • fddedc5cbc ggml webgpu: faster normal quant and some k-quant matrix operations, better shader parameter handling (llama/20173) Reese Levine 2026-03-10 09:14:27 -07:00
  • dfa6858d02 kleidiai : support for concurrent sme and neon kernel execution (llama/20070) Charles Xu 2026-03-10 08:25:25 +01:00
  • bd64b8af4d ggml-cpu: add RVV repack GEMM and GEMV for quantization types (llama/19121) Taimur Ahmad 2026-03-10 11:49:52 +05:00
  • cabe3d95f4 metal: handle command buffer failures gracefully in synchronize (llama/20306) Julian Pscheid 2026-03-09 23:32:24 -07:00
  • ae21974f4f metal : extend mul_mv_ext to BF16, Q2_K, Q3_K (llama/20250) Paul Flynn 2026-03-09 10:48:12 -04:00
  • d19c65e9da metal : add upscale (llama/20284) Georgi Gerganov 2026-03-09 16:45:11 +02:00
  • 3984ae384d ggml-cuda: disable gdn for musa (llama/20278) Aman Gupta 2026-03-09 16:15:36 +08:00
  • 65dbf3c31a ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (llama/20219) Bertay Eren 2026-03-09 09:24:16 +03:00
  • 890c047e30 vulkan: skip zero size tensors in backend copies (llama/20233) Ruben Ortlam 2026-03-09 07:23:45 +01:00
  • f099ed27b8 cuda : display total and free VRAM capacity during device initialization (llama/20185) Michael Huang 2026-03-08 21:45:43 -07:00
  • 8d97f59639 ggml-vulkan: Add ELU op support (llama/20183) GiantPrince 2026-03-08 07:38:17 -04:00
  • 4b0653a792 vulkan: Fix data races in coopmat1 mul_mat(_id) (llama/20084) Jeff Bolz 2026-03-08 06:33:48 -05:00
  • 8a9b0ba1df support Flash Attention for fp32/fp16/Q4/Q5/Q8 (llama/20190) Neo Zhang 2026-03-08 12:00:07 +08:00
  • 49489bfbd1 ggml: add GATED_DELTA_NET op (llama/19504) Aman Gupta 2026-03-07 15:41:10 +08:00
  • 910034df28 opencl: add l2_norm (llama/20160) lhez 2026-03-06 18:03:05 -08:00
  • 6e063fae5a quants : Add memsets and other fixes for IQ quants (llama/19861) Bartowski 2026-03-06 16:06:56 -05:00
  • 78b3801d54 hexagon: add f32 ssm_conv op (llama/20122) Todor Boinovski 2026-03-06 09:59:26 -08:00
  • 247ec204d8 cpu: skip redundant ROPE cache updates (llama/20149) Max Krasnyansky 2026-03-06 08:32:40 -08:00
  • d658720fa5 ggml-cuda: add mem check for fusion (llama/19916) Aman Gupta 2026-03-07 00:05:43 +08:00
  • 5d9b73dc06 ggml: update comments for backends which have no memory to report (llama/20157) Aaron Teo 2026-03-06 23:24:38 +08:00
  • 548f2e5190 ggml-cpu: Fix gcc 15 ICE on ppc64le (ggml/20083) (llama/20130) shalinib-ibm 2026-03-06 20:52:39 +05:30
  • d2d235f467 CUDA: use shared mem for ssm_conv (llama/20128) Aman Gupta 2026-03-06 23:09:59 +08:00
  • 596b655dbd ggml-cpu: fix data race for debug asserts (llama/20148) Johannes Gäßler 2026-03-06 09:12:49 +01:00
  • 1d94b0be4f opencl: add neg, exp and diag (llama/20127) lhez 2026-03-05 21:16:39 -08:00
  • f56fb1be3b hexagon: add fp16 support for binary ops: add,sub,mul,div (llama/20139) YardenTal44 2026-03-06 04:29:13 +02:00
  • 51f397c1af CUDA: Improve performance via fewer synchronizations between tokens (llama/17795) Andreas Kieslinger 2026-03-05 12:53:21 +01:00
  • 67abc63e9d chore : correct typos [no ci] (llama/20041) Marcel Petrick 2026-03-05 08:50:21 +01:00
  • 2e79b85f66 hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (llama/20118) Max Krasnyansky 2026-03-04 21:55:29 -08:00
  • 2c50962528 opencl: add SET, support i32 for CPY, minor refactor for cpy (llama/20101) lhez 2026-03-04 21:32:26 -08:00
  • 4834971a4f Fix wait logic for inflight jobs (llama/20096) Nikhil Jain 2026-03-04 11:54:55 -08:00
  • 8d78d40946 Add concat op to webgpu. (llama/20068) Masashi Yoshimura 2026-03-05 04:19:00 +09:00
  • 5d25427e58 ggml: fix ggml_is_contiguous_n for ne == 1 (llama/20092) Johannes Gäßler 2026-03-04 12:04:31 +01:00
  • b1b018dfd1 ggml : use a simple std::thread in AMX without OpenMP (llama/20074) Adrien Gallouët 2026-03-04 11:57:09 +01:00
  • 169d723fa0 kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (llama/20043) Charles Xu 2026-03-03 10:40:26 +01:00
  • 3a96680718 opencl: add optimized q4_1 mm kernel for adreno (llama/19840) shaofeiqi 2026-03-02 19:49:41 -08:00
  • 3145384715 ggml webgpu: fix workgroup dispatch limit for large batch sizes (llama/19965) Abhijit Ramesh 2026-03-02 19:35:11 -08:00
  • 22034a5f6f ggml webgpu: Clean up per-thread parameter buffer pool and job submission logic (llama/19772) Nikhil Jain 2026-03-02 10:23:34 -08:00
  • de686fafad ggml-webgpu: Support non-contiguous src0 and overlapping src0/src1 in binary ops (llama/19850) Masashi Yoshimura 2026-03-03 00:59:53 +09:00
  • 923a292429 vulkan: tune MMVQ for Intel Windows (llama/19988) Ruben Ortlam 2026-03-02 15:58:25 +01:00
  • e2be9edd5a ggml-cpu: optimise s390x multiply extend instructions (llama/20032) Aaron Teo 2026-03-02 16:23:56 +08:00
  • 2a9649c420 vulkan: improve partial offloading performance on AMD (llama/19976) Ruben Ortlam 2026-03-01 17:32:14 +01:00