LM Studio REST API

Load a model

Load an LLM or embedding model into memory with custom configuration for inference.

POST /api/v1/models/load

Request body

model : string

Unique identifier for the model to load. Can be an LLM or embedding model.

context_length (optional) : number

Maximum number of tokens that the model will consider.

eval_batch_size (optional) : number

Number of input tokens to process together in a single batch during evaluation. Will only have an effect on LLMs loaded by LM Studio's llama.cpp-based engine.

flash_attention (optional) : boolean

Whether to enable Flash Attention to optimize attention computation. Can decrease memory usage and improve generation speed. Will only have an effect on LLMs loaded by LM Studio's llama.cpp-based engine.

num_experts (optional) : number

Number of experts to use during inference for MoE (Mixture of Experts) models. Will only have an effect on MoE LLMs loaded by LM Studio's llama.cpp-based engine.

offload_kv_cache_to_gpu (optional) : boolean

Whether KV cache is offloaded to GPU memory. If false, KV cache is stored in CPU memory/RAM. Will only have an effect on LLMs loaded by LM Studio's llama.cpp-based engine.

echo_load_config (optional) : boolean

If true, echoes the final load configuration in the response under "load_config". Default false.

curl http://localhost:1234/api/v1/models/load \
  -H "Authorization: Bearer $LM_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "context_length": 16384,
    "flash_attention": true,
    "echo_load_config": true,
  }'
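
The same request can be issued from any HTTP client. Below is a minimal Python sketch using only the standard library; the base URL, model name, and LM_API_TOKEN environment variable mirror the curl example above and are assumptions, not part of the API contract.

import json
import os
import urllib.request

# Assumes the LM Studio server is listening on localhost:1234,
# as in the curl example above.
url = "http://localhost:1234/api/v1/models/load"
payload = {
    "model": "openai/gpt-oss-20b",
    "context_length": 16384,
    "flash_attention": True,
    "echo_load_config": True,
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ['LM_API_TOKEN']}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

print(result["status"], result["model_instance_id"])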

Response fields

type : "llm" | "embedding"

Type of the loaded model.

model_instance_id : string

Unique identifier for the loaded model instance.

load_time_seconds : number

Time taken to load the model in seconds.

status : "loaded"

Load status.

load_config (optional) : object

The final configuration applied to the loaded model. This may include settings that were not specified in the request. Included only when "echo_load_config" is true in the request.

LLM load config : object

Configuration parameters specific to LLM models. load_config will be this type when "type" is "llm". Only parameters that were applied during the load will be present.

context_length : number

Maximum number of tokens that the model will consider.

eval_batch_size (optional) : number

Number of input tokens to process together in a single batch during evaluation. Only present for models loaded with LM Studio's llama.cpp-based engine.

flash_attention (optional) : boolean

Whether Flash Attention is enabled for optimized attention computation. Only present for models loaded with LM Studio's llama.cpp-based engine.

num_experts (optional) : number

Number of experts for MoE (Mixture of Experts) models. Only present for MoE models loaded with LM Studio's llama.cpp-based engine.

offload_kv_cache_to_gpu (optional) : boolean

Whether KV cache is offloaded to GPU memory. Only present for models loaded with LM Studio's llama.cpp-based engine.

Embedding model load config : object

Configuration parameters specific to embedding models. load_config will be this type when "type" is "embedding". Only parameters that were applied during the load will be present.

context_length : number

Maximum number of tokens that the model will consider.

{
  "type": "llm",
  "model_instance_id": "openai/gpt-oss-20b",
  "load_time_seconds": 9.099,
  "status": "loaded",
  "load_config": {
    "context_length": 16384,
    "eval_batch_size": 512,
    "flash_attention": true,
    "offload_kv_cache_to_gpu": true,
    "num_experts": 4
  }
}
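
For an embedding model, "type" is "embedding" and "load_config" carries only "context_length". A hypothetical response of that shape (the model name and timing are illustrative only, not from the source):

{
  "type": "embedding",
  "model_instance_id": "text-embedding-nomic-embed-text-v1.5",
  "load_time_seconds": 1.42,
  "status": "loaded",
  "load_config": {
    "context_length": 2048
  }
}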
