# ⚡ RTX GPU Acceleration

Whether you're an AI enthusiast, researcher, or developer working with document processing, this guide will help you get the most out of your NVIDIA RTX GPU with Docling.

GPU acceleration can deliver up to a **6x speedup** compared to CPU-only processing. This is especially valuable when processing large batches of documents, running high-throughput document conversion workflows, or experimenting with advanced document understanding models.

## Prerequisites

Before setting up GPU acceleration, ensure you have:

- An NVIDIA RTX GPU (RTX 40/50 series)
- Windows 10/11 or Linux operating system

## Installation Steps

### 1. Install NVIDIA GPU Drivers

First, ensure you have the latest NVIDIA GPU drivers installed:

- **Windows**: Download from [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx)
- **Linux**: Use your distribution's package manager or download from NVIDIA

Verify the installation:

```bash
nvidia-smi
```

This command should display your GPU information and driver version.

### 2. Install CUDA Toolkit

CUDA is NVIDIA's parallel computing platform required for GPU acceleration.

Follow the official installation guide for your operating system at [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads). The installer will guide you through the process and automatically set up the required environment variables.

### 3. Install cuDNN

cuDNN provides optimized implementations for deep learning operations.

Follow the official installation guide at [NVIDIA cuDNN Downloads](https://developer.nvidia.com/cudnn). The guide provides detailed instructions for all supported platforms.

### 4. Install PyTorch with CUDA Support

To use GPU acceleration with Docling, install PyTorch with CUDA support from PyTorch's dedicated package index, selected with `--index-url`:

```bash
# For CUDA 12.8 (current default for PyTorch)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# For CUDA 13.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```

!!! note
    The `--index-url` parameter is crucial as it ensures you get the CUDA-enabled version of PyTorch instead of the CPU-only version.

For other CUDA versions and installation options, refer to the [PyTorch Installation Matrix](https://pytorch.org/get-started/locally/).

Verify PyTorch CUDA installation:

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
```

### 5. Install and Run Docling

Install Docling with all dependencies:

```bash
pip install docling
```

**That's it!** Docling will automatically detect and use your RTX GPU when available. No additional configuration is required for basic usage.

```python
from docling.document_converter import DocumentConverter

# Docling automatically uses GPU when available
converter = DocumentConverter()
result = converter.convert("document.pdf")
```
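
If a conversion seems to run on the CPU, it can help to check what the driver itself reports before looking at Docling. The sketch below is one possible way to do this from Python: it calls `nvidia-smi` with its CSV query flags and parses the output. The helper names `parse_gpu_rows` and `query_gpus` are illustrative only and are not part of Docling or PyTorch.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        # Skip rows that do not carry exactly the three requested fields
        # or whose memory value is not a plain number (e.g. "[N/A]").
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        name, driver, memory_mib = parts
        gpus.append(
            {
                "name": name,
                "driver": driver,
                "memory_gb": round(int(memory_mib) / 1024, 1),
            }
        )
    return gpus


def query_gpus():
    """Return GPUs reported by nvidia-smi, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No NVIDIA GPU reported by nvidia-smi.")
    for gpu in gpus:
        print(f"{gpu['name']}: driver {gpu['driver']}, {gpu['memory_gb']} GB")
```

An empty result points at the driver installation (step 1) rather than at PyTorch or Docling.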

#### Advanced: Tuning GPU Performance

For optimal GPU performance with large document batches, you can adjust batch sizes and explicitly configure the accelerator:

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Explicitly configure GPU acceleration
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,  # Use CUDA for NVIDIA GPUs
)

# Configure pipeline for optimal GPU performance
pipeline_options = ThreadedPdfPipelineOptions(
    accelerator_options=accelerator_options,
    ocr_batch_size=64,  # Increase batch size for GPU
    layout_batch_size=64,  # Increase batch size for GPU
    table_batch_size=4,
)

# Create converter with custom settings (pipeline options are set per input format)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)

# Convert documents
result = converter.convert("document.pdf")
```

Adjust batch sizes based on your GPU memory (see Performance Optimization Tips below).
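
One way to pick a starting value is to map the GPU's total memory onto the ranges listed under Performance Optimization Tips. The sketch below does this; `suggest_batch_size` and `detect_gpu_memory_gb` are hypothetical helpers, the three tiers come from that section, and the fallback range for cards below 12 GB is an assumption rather than a tested recommendation.

```python
# (minimum memory in GB, (low, high) batch-size range), taken from the
# Performance Optimization Tips section of this guide.
BATCH_RANGES = [
    (32, (64, 128)),  # e.g. RTX 5090
    (24, (32, 64)),   # e.g. RTX 4090
    (12, (16, 32)),   # e.g. RTX 5070
]
FALLBACK_RANGE = (4, 16)  # assumption for smaller cards, not from the guide


def suggest_batch_size(total_memory_gb):
    """Return a (low, high) batch-size range for the given GPU memory in GB."""
    for min_gb, batch_range in BATCH_RANGES:
        if total_memory_gb >= min_gb:
            return batch_range
    return FALLBACK_RANGE


def detect_gpu_memory_gb():
    """Return the memory of GPU 0 in whole GB, or None if CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Reported memory is usually slightly below the nominal size,
    # so round to the nearest GB before comparing against the tiers.
    return round(total_bytes / 1024**3)


if __name__ == "__main__":
    memory_gb = detect_gpu_memory_gb()
    if memory_gb is None:
        print("CUDA not available; no suggestion made.")
    else:
        low, high = suggest_batch_size(memory_gb)
        print(f"{memory_gb} GB detected: try batch sizes between {low} and {high}")
```

Starting at the low end of the range and increasing while watching memory usage keeps out-of-memory errors less likely.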

## GPU-Accelerated VLM Pipeline

For maximum performance with Vision Language Models (VLM), you can run a local inference server on your RTX GPU. This approach provides significantly better throughput than inline VLM processing.

### Linux: Using vLLM (Recommended)

vLLM provides the best performance for GPU-accelerated VLM inference. Start the vLLM server with optimized parameters:

```bash
vllm serve ibm-granite/granite-docling-258M \
  --host 127.0.0.1 --port 8000 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9
```

### Windows: Using llama-server

On Windows, you can use `llama-server` from llama.cpp for GPU-accelerated VLM inference:

#### Installation

1. Download the latest llama.cpp release from the [GitHub releases page](https://github.com/ggml-org/llama.cpp/releases)
2. Extract the archive and locate `llama-server.exe`

#### Launch Command

```powershell
llama-server.exe `
  --hf-repo ibm-granite/granite-docling-258M-GGUF `
  -cb `
  -ngl -1 `
  --port 8000 `
  --context-shift `
  -np 16 -c 131072
```

!!! note "Performance Comparison"
    vLLM delivers approximately **4x better performance** compared to llama-server. For Windows users seeking maximum performance, consider running vLLM via WSL2 (Windows Subsystem for Linux). See [vLLM on RTX 5090 via Docker](https://github.com/BoltzmannEntropy/vLLM-5090) for detailed WSL2 setup instructions.

### Configure Docling for VLM Server

Once your inference server is running, configure Docling to use it:

```python
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

BATCH_SIZE = 64

# Configure VLM options
vlm_options = vlm_model_specs.GRANITEDOCLING_VLLM_API
vlm_options.concurrency = BATCH_SIZE

# When running with llama.cpp (llama-server), use a different model name:
# vlm_options.params["model"] = "ibm-granite_granite-docling-258M-GGUF_granite-docling-258M-BF16.gguf"

# Wrap the VLM options in pipeline options; the inference server is
# accessed over HTTP, so remote services must be enabled
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    enable_remote_services=True,
)

# Set page batch size to match or exceed concurrency
settings.perf.page_batch_size = BATCH_SIZE

# Create converter with VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
```

For more details on VLM pipeline configuration, see the [GPU Support Guide](../usage/gpu.md).

## Performance Optimization Tips

### Batch Size Tuning

Adjust batch sizes based on your GPU memory:

- **RTX 5090 (32GB)**: Use batch sizes of 64-128
- **RTX 4090 (24GB)**: Use batch sizes of 32-64
- **RTX 5070 (12GB)**: Use batch sizes of 16-32

### Memory Management

Monitor GPU memory usage:

```python
import torch

# Check GPU memory
if torch.cuda.is_available():
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"GPU Memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
```

## Troubleshooting

### CUDA Out of Memory

If you encounter out-of-memory errors:

1. Reduce batch sizes in `pipeline_options`
2. Process fewer documents concurrently
3. Clear GPU cache between batches:

```python
import torch

torch.cuda.empty_cache()
```

### CUDA Not Available

If `torch.cuda.is_available()` returns `False`:

1. Verify NVIDIA drivers are installed: `nvidia-smi`
2. Check CUDA installation: `nvcc --version`
3. Reinstall PyTorch with the correct CUDA version
4. Ensure your GPU is CUDA-compatible

### Performance Not Improving

If GPU acceleration doesn't improve performance:

1. Increase batch sizes (if memory allows)
2. Ensure you're processing enough documents to benefit from GPU parallelization
3. Check GPU utilization: `nvidia-smi -l 1`
4. Verify PyTorch is using GPU: `torch.cuda.is_available()`

## Additional Resources

- [NVIDIA CUDA Documentation](https://docs.nvidia.com/cuda/)
- [PyTorch CUDA Installation Guide](https://pytorch.org/get-started/locally/)
- [Docling GPU Support Guide](../usage/gpu.md)
- [GPU Performance Examples](../examples/gpu_standard_pipeline.py)