
Run Kimi-Linear-48B-A3B-Instruct on Novita
What is Kimi-Linear?
Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to make better use of finite-state RNN memory. Kimi Linear delivers superior performance and hardware efficiency, especially for long-context tasks: it reduces KV cache requirements by up to 75% and boosts decoding throughput by up to 6× for contexts as long as 1M tokens.
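For intuition, here is a minimal sketch (in NumPy, not the actual optimized kernel) of what a delta-rule recurrence with fine-grained, per-channel gating looks like. The variable names and the exact placement of the gate are illustrative simplifications on our part, not Kimi Linear's real KDA implementation:

```python
import numpy as np

def gated_delta_rule_sketch(q, k, v, a, beta):
    """Toy single-head recurrence illustrating a fine-grained gated delta rule.

    q, k, v : (T, d) arrays of queries, keys, values
    a       : (T, d) per-channel decay gates in (0, 1) -- the "fine-grained" gating
    beta    : (T,)   per-token write strengths in (0, 1)

    Illustrative simplification only; the real KDA kernel is chunked and
    hardware-optimized rather than a plain Python loop.
    """
    T, d = q.shape
    S = np.zeros((d, d))          # fixed-size "fast weight" memory (does not grow with context)
    out = np.zeros((T, d))
    for t in range(T):
        S = np.diag(a[t]) @ S                       # forget selectively, channel by channel
        S -= beta[t] * np.outer(k[t], k[t] @ S)     # delta-rule erase of the stale association
        S += beta[t] * np.outer(k[t], v[t])         # write the new key-value association
        out[t] = S.T @ q[t]                         # read the memory with the query
    return out
```

Because the state S stays a fixed d×d matrix regardless of sequence length, memory does not grow with context the way a KV cache does, which is where the cache savings claimed above come from.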
Key Features
- Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with fine-grained gating.
- Hybrid Architecture: A 3:1 ratio of KDA layers to global MLA (Multi-head Latent Attention) layers reduces memory usage while maintaining or surpassing the quality of full attention; see the sketch after this list for what the interleaving means.
- Superior Performance: Outperforms full attention across a variety of tasks, including long-context and RL-style benchmarks, in fair comparisons on 1.4T-token training runs.
- High Throughput: Achieves up to 6× faster decoding and significantly reduces time per output token (TPOT). See the project's official page for details.
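To make the 3:1 ratio concrete, here is a hypothetical sketch of how such a layer layout could be generated; the actual ordering in the released model may differ, this only illustrates what three KDA layers per global MLA layer means:

```python
def hybrid_layer_pattern(num_layers: int, kda_per_mla: int = 3) -> list:
    """Hypothetical illustration of a 3:1 KDA-to-MLA layer layout.

    Not taken from the model's actual config; it just shows the ratio.
    """
    return [
        "MLA" if (i + 1) % (kda_per_mla + 1) == 0 else "KDA"
        for i in range(num_layers)
    ]

print(hybrid_layer_pattern(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```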
Run Kimi-Linear-48B-A3B-Instruct on Novita
Step 1: Console Entry
Launch the GPU interface and select Get Started to access deployment management.
Step 2: Package Selection
Locate Kimi-Linear-48B-A3B-Instruct in the template repository and begin the installation sequence.
Step 3: Infrastructure Setup
Configure computing parameters, including memory allocation, storage requirements, and network settings, then proceed to deployment.
Step 4: Review and Create
Double-check your configuration details and cost summary. When satisfied, click Deploy to start the creation process.
Step 5: Wait for Creation
After initiating deployment, the system will automatically redirect you to the instance management page. Your instance will be created in the background.
Step 6: Monitor Download Progress
Track the image download progress in real-time. Your instance status will change from Pulling to Running once deployment is complete. You can view detailed progress by clicking the arrow icon next to your instance name.
Step 7: Verify Instance Status
Click the Logs button to view instance logs and confirm that the model service has started properly.
Step 8: Environmental Access
Open the development environment through the Connect interface, then click Start Web Terminal.
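Once the Web Terminal is open, one quick way to confirm the service is reachable is to list the models it serves. This is a hedged sketch: it assumes the template exposes an OpenAI-compatible API on the endpoint used in the demo below, and that Python with the requests package is available in the instance; adjust the address if your deployment differs.

```python
import requests

# Assumed endpoint, matching the demo below; replace with your instance's address.
ENDPOINT = "http://127.0.0.1:8080"

# List available models on an OpenAI-compatible server.
# If your deployment requires an API key, add an Authorization header.
resp = requests.get(f"{ENDPOINT}/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # expect moonshotai/Kimi-Linear-48B-A3B-Instruct to appear
```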
Demo
To access your private model, copy the code below and replace "http://127.0.0.1:8080" with your actual endpoint address.
```bash
curl --request POST \
  --url http://127.0.0.1:8080/v1/chat/completions \
  --header "Authorization: Bearer " \
  --header "Content-Type: application/json" \
  --data '{
    "model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
    "messages": [
      {"role": "user", "content": "who are you?"}
    ],
    "max_tokens": 128
  }'
```

Example response:

```json
{
  "id": "chatcmpl-de7c4de865e94699b80eb1a0d0bc9f22",
  "object": "chat.completion",
  "created": 1761904682,
  "model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm Kimi, a large language model trained by Moonshot AI. I'm here to help you with any questions or tasks you have. How can I assist you today?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 163586,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 46,
    "completion_tokens": 35,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
```
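If you prefer calling the endpoint from Python instead of curl, the following sketch sends the same request with the requests library. The endpoint address and the empty API key simply mirror the curl example above; substitute your own values.

```python
import requests

# Replace with your instance's actual endpoint address.
ENDPOINT = "http://127.0.0.1:8080"
API_KEY = ""  # left empty to mirror the curl example; supply a key if your deployment requires one

response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
        "messages": [{"role": "user", "content": "who are you?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```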