Commit Graph

  • cc0c103b5d ggml-sycl: remove unused syclcompat header (llama/19140) Patryk Kaminski 2026-01-28 16:33:54 +01:00
  • dda7d9cd1c vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058) Oleksandr Kuvshynov 2026-01-28 06:35:54 -05:00
  • 531d7b6781 ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718) Kevin Pouget 2026-01-28 10:49:40 +01:00
  • 3701413a71 ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108) Alberto Cabrera Pérez 2026-01-28 07:15:56 +00:00
  • 7fb0f823de cuda : fix "V is K view" check for non-unified KV cache (llama/19145) Georgi Gerganov 2026-01-28 09:15:27 +02:00
  • f28a733025 CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (llama/19142) Georgi Gerganov 2026-01-28 09:15:11 +02:00
  • dfdd2fee83 ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (llama/18976) Nikhil Jain 2026-01-27 20:53:36 -08:00
  • 9c75c793a6 ggml-zendnn : update ZenDNN git tag to main branch (llama/19133) Vishal Singh 2026-01-28 03:51:36 +05:30
  • 9d94d0f782 CUDA: tune GLM 4.7 Flash FA kernel selection logic (llama/19097) Johannes Gäßler 2026-01-27 14:28:56 +01:00
  • 00885e08e2 ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (llama/18888) Alberto Cabrera Pérez 2026-01-27 09:08:10 +00:00
  • 5fcbbdc0dd Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042) Gaurav Garg 2026-01-27 06:52:44 +00:00
  • b2e2032856 ggml-cpu: Enable FP16 MMA kernels on PPC (llama/19060) shalinib-ibm 2026-01-27 09:22:34 +05:30
  • 56f82a9f33 opencl: add flattened q6_K mv (llama/19054) lhez 2026-01-30 10:34:38 +02:00
  • 41d5d7bb0e CUDA: fix padding of GQA to power of 2 in FA (llama/19115) Johannes Gäßler 2026-01-26 23:24:58 +01:00
  • f63848eada CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092) Johannes Gäßler 2026-01-25 21:19:47 +01:00
  • 4372b87b8e metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (llama/19088) ccbinn 2026-01-26 02:07:19 +08:00
  • 1642a4fb60 ggml-cpu: Use tiled FA for prompt-processing (llama/19012) Aman Gupta 2026-01-25 23:25:58 +08:00
  • d2b51404e4 kv-cache : support V-less cache (llama/19067) Georgi Gerganov 2026-01-25 15:48:56 +02:00
  • f53eafd745 CUDA: re-use MLA K data for V in MMA FA (llama/19057) Johannes Gäßler 2026-01-24 10:09:36 +01:00
  • 13577a6ce4 ggml-cuda: enable cuda-graphs for n-cpu-moe (llama/18934) Aman Gupta 2026-01-24 14:25:20 +08:00
  • 79f1bb3d35 ggml-hexagon: flash-attn opt (llama/19025) nullname 2026-01-24 14:02:07 +08:00
  • 0d9dda5a99 use malloc to support both iGPU and dGPU at the same time (llama/18992) Neo Zhang 2026-01-23 20:54:10 +08:00
  • e090d91f5e ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (llama/18860) Alberto Cabrera Pérez 2026-01-23 07:55:08 +00:00
  • 3f96a1da0e mla : make the V tensor a view of K (llama/18986) Georgi Gerganov 2026-01-22 22:09:01 +02:00
  • f21d0cbb1a CUDA: fix alignment check for FA (llama/19023) Johannes Gäßler 2026-01-22 20:39:25 +01:00
  • 0e030b852a opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970) lhez 2026-01-22 10:29:25 -08:00
  • d4fafcfc6f CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953) Aman Gupta 2026-01-22 18:51:53 +08:00
  • 167fec69d5 opencl: add TRI op support (llama/18979) shaofeiqi 2026-01-21 22:05:54 -08:00
  • 55927d42ef ggml-zdnn : mark zDNN buffers as non-host (llama/18967) Aleksei Nikiforov 2026-01-22 01:16:21 +01:00
  • b7e323f40b vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945) Jeff Bolz 2026-01-21 11:01:40 -06:00
  • b2bc4d810b vulkan: support flash attention GQA/split_k with small batches (llama/18938) Jeff Bolz 2026-01-21 10:43:43 -06:00
  • 3bbf4ced47 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831) Masato Nakasaka 2026-01-22 01:13:43 +09:00
  • 660d943ff8 vulkan: Use mul_mat_vec_id for small values of n (llama/18918) Jeff Bolz 2026-01-21 09:22:02 -06:00
  • 924a9e292c CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964) Oliver Simons 2026-01-21 02:34:29 +01:00
  • fdc83ee3c0 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930) Oliver Simons 2026-01-20 13:11:01 +01:00
  • bf71ffa6b3 ggml : cleanup path_str() (llama/18928) Adrien Gallouët 2026-01-20 11:42:49 +01:00
  • b0517d6912 metal : enable FA for MLA heads (llama/18950) Georgi Gerganov 2026-01-20 12:21:28 +02:00
  • 47f3e3b927 ggml : add ggml_build_forward_select (llama/18550) Georgi Gerganov 2026-01-19 20:03:19 +02:00
  • 62a09b106d opencl: fix q6_K mv for m=1 (llama/18893) lhez 2026-01-17 13:50:32 -08:00
  • 389dafc7c2 ggml webgpu: support for backend sampling (llama/18880) Reese Levine 2026-01-30 10:32:34 +02:00
  • 511ca7a1f4 ggml : extend ggml_pool_1d + metal (llama/16429) Thore Koritzius 2026-01-16 15:59:56 +01:00
  • ecb4b80c35 ggml-blas: hide warnings from included BLAS headers (llama/18818) Perry Naseck 2026-01-16 06:38:25 -05:00
  • 42960b6073 CANN: Remove unused ggml_cann_get_device function (llama/18625) Raul Torres 2026-01-16 08:34:09 +00:00
  • 2fceb5a80f CANN: fix an issue where get_env was not fully renamed (llama/18796) Chenguang Li 2026-01-16 16:24:04 +08:00
  • 854274a297 CANN: support gated linear attn (llama/18653) hipudding 2026-01-16 16:18:49 +08:00
  • ed6004d051 OpenCL: add SOLVE_TRI op support (llama/18846) shaofeiqi 2026-01-15 11:17:17 -08:00
  • 290ff3d28d cuda : print less debug logs when disabling cuda graphs (llama/18868) Georgi Gerganov 2026-01-15 20:53:01 +02:00
  • f2f0ba0384 CUDA: fix alignment on register spill for FA (llama/18815) Johannes Gäßler 2026-01-15 15:14:50 +01:00
  • 78a23d4830 ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837) shalinib-ibm 2026-01-15 15:01:18 +05:30
  • 50b7ab3d46 hexagon: support for OP_CPY, host buffers now optional (llama/18822) Max Krasnyansky 2026-01-30 10:28:03 +02:00
  • bc09047405 CUDA: Factor out and re-use block_reduce function (llama/18785) Oliver Simons 2026-01-15 03:44:54 +01:00
  • 4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709) Jeff Bolz 2026-01-14 03:59:05 -06:00
  • 25aeb66a4a CUDA : fix typo in clang pragma comment [no ci] (llama/18830) Daniel Bevenius 2026-01-14 10:31:49 +01:00
  • 49762e8fb3 vulkan: work around Intel fp16 bug in mmq (llama/18814) Ruben Ortlam 2026-01-14 09:41:23 +01:00
  • 17656e56dc ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705) Perry Naseck 2026-01-14 02:22:25 -05:00
  • c6a495ae5d HIP: add fattn-mma-f16 for RDNA4 (llama/18481) yulo 2026-01-13 20:52:16 +08:00
  • 7aa8818647 examples : use -dev/--device and WHISPER_ARG_DEVICE (#3557) Bráulio Oliveira 2026-01-21 04:40:30 -03:00
  • f53dc74843 whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592) Yshtola 2026-01-16 20:16:05 +08:00
  • 2eeeba56e9 release : v1.8.3 v1.8.3 Georgi Gerganov 2026-01-15 11:54:31 +02:00
  • 21c1765fcb benches : update Georgi Gerganov 2026-01-15 11:53:09 +02:00
  • 47af2fb70f sync : ggml Georgi Gerganov 2026-01-13 19:11:04 +02:00
  • 6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) Georgi Gerganov 2026-01-13 12:25:53 +02:00
  • ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) Jeff Bolz 2026-01-12 06:32:55 -06:00
  • aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678) Jeff Bolz 2026-01-12 05:32:13 -06:00
  • 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763) Ruben Ortlam 2026-01-12 07:29:35 +01:00
  • c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749) Ruben Ortlam 2026-01-11 17:33:33 +01:00
  • ecfcc65fbf talk-llama : sync llama.cpp Georgi Gerganov 2026-01-12 14:48:26 +02:00
  • 13dc9a912b sync : ggml Georgi Gerganov 2026-01-12 14:44:38 +02:00
  • d4ce2e554f opencl: add SOFTPLUS op support (llama/18726) shaofeiqi 2026-01-10 21:57:44 -08:00
  • 3a1ea96373 HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666) Johannes Gäßler 2026-01-10 17:19:01 +01:00
  • 484b17053a cmake : update blas logic (llama/18205) Perry Naseck 2026-01-10 11:00:54 -05:00
  • 45be2cd27a Corrected: use s13 = src1->nb[3] instead of nb[2] (llama/18724) Michael Wand 2026-01-10 01:16:07 -08:00
  • 4af27bf2da opencl: add EXPM1 op (llama/18704) shaofeiqi 2026-01-09 10:13:13 -08:00
  • 4ac8c3b478 Updates to webgpu get_memory (llama/18707) Reese Levine 2026-01-09 08:17:18 -08:00
  • fff3ebd93d llama: use host memory if device reports 0 memory (llama/18587) Aaron Teo 2026-01-09 05:34:56 +08:00
  • a71127dfd8 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628) Masashi Yoshimura 2026-01-09 01:36:42 +09:00
  • 1bb903f599 ggml webgpu: initial flashattention implementation (llama/18610) Reese Levine 2026-01-08 08:23:39 -08:00
  • 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687) Jeff Bolz 2026-01-08 08:40:58 -06:00
  • 678c660e62 vulkan: optimize ssm_scan (llama/18630) Jeff Bolz 2026-01-08 08:16:54 -06:00
  • f2d8588229 metal : add MoE kernel specialization for ne20=5 (llama/18667) 도로로도로또 2026-01-08 19:37:45 +09:00
  • b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535) Doctor Shotgun 2026-01-08 01:03:21 -08:00
  • 85a329cb08 opencl: add FILL op support (llama/18682) shaofeiqi 2026-01-07 22:04:50 -08:00
  • 4f2ca7c163 cuda : fix build on cuda 12.8 (llama/18672) Oliver Walsh 2026-01-07 21:32:44 +00:00
  • a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) Jeff Bolz 2026-01-07 05:03:32 -06:00
  • 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178) virajwad 2026-01-07 02:59:47 -08:00
  • a576ed944a vulkan: more mul mat optimizations (llama/18533) Eve 2026-01-07 10:13:17 +00:00
  • 5c583f3c02 CANN: Fix rename for get_env (llama/18652) hipudding 2026-01-07 16:11:31 +08:00
  • 47671c81db CANN: Rename get_env to get_env_as_lowercase (llama/18624) Raul Torres 2026-01-07 02:01:25 +00:00
  • a5f51ac75b Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (llama/18611) Max Krasnyansky 2026-01-06 17:38:29 -08:00
  • 436f30d05f ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505) Aadeshveer Singh 2026-01-06 23:54:34 +05:30
  • dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467) Jeff Bolz 2026-01-06 10:37:07 -06:00
  • 575d894603 ggml-cuda: refactor cuda graph usage (llama/18637) Aman Gupta 2026-01-06 23:48:45 +08:00
  • ed674cfc10 mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537) Beinsezii 2026-01-06 07:26:07 -08:00
  • 5520f27363 ggml : fix avx512bf16 build (llama/18623) Adrien Gallouët 2026-01-06 07:54:10 +01:00
  • 9a1a6685ba CANN: Make valid_values variable static const (llama/18627) Raul Torres 2026-01-06 03:53:28 +00:00
  • e563e239a7 ggml webgpu: add CEIL operation support (llama/18605) nwyin 2026-01-05 13:38:57 -06:00
  • 9956333361 CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614) Johannes Gäßler 2026-01-05 19:51:13 +01:00
  • 804f545454 ggml-cuda: check for srcs outside the cgraph (llama/18583) Aman Gupta 2026-01-05 22:46:36 +08:00
  • 52ba45e2b8 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582) Jeff Bolz 2026-01-05 04:51:39 -06:00
  • 0a99b4c377 vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515) Jeff Bolz 2026-01-05 04:30:14 -06:00