Create Serverless Endpoint
This guide walks you through creating a Serverless Endpoint from scratch, using the deployment of a Llama 3.1 8B model as an example.
Step 1: Prepare Container Image
You need to package the runtime environment into a Docker image and upload it to an image repository in advance. Novita AI currently supports both public and private image repositories; for a private repository you also provide image repository access credentials.
We use the official vLLM image repository to serve the Llama 3.1 8B model: vllm/vllm-openai:latest. You can use this image address directly.
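If you want to sanity-check the image before deploying, a minimal local smoke test might look like the sketch below. It assumes a local NVIDIA GPU with the NVIDIA Container Toolkit installed and a Hugging Face token with read access to the gated Llama weights; the model and flags are the same ones used in the configuration later in this guide.
```bash
# Pull the public vLLM OpenAI-compatible serving image
docker pull vllm/vllm-openai:latest

# Optional local smoke test: the image's entrypoint starts the OpenAI-compatible
# API server on port 8000, and the trailing arguments select the model.
docker run --rm --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<your Hugging Face read token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096
```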
Step 2: Go to the Console
Go to the Serverless Console, select the appropriate GPU container instance specification, and click “Create Endpoint”.
Currently, only the RTX 4090 24GB instance specification is supported. More options will be added gradually; please stay tuned for official updates.
Step 3: Configure Serverless Endpoint
Here we use the public image vllm/vllm-openai:latest as an example. The example configuration is as follows:
Please note that the serverless endpoint costs shown in the example screenshot above are for reference only.
After confirming that the configuration is correct, click “Deploy” to complete the creation process.
Field Descriptions
- Endpoint Name: Used to uniquely identify your Endpoint. The system will generate a name for you by default, but you can also specify one yourself.
- App Name: The application name is part of the Endpoint URL. It is optional and customizable; only lowercase letters, numbers, and hyphens are allowed. If not specified, the system will generate one automatically.
- Worker Configuration
- Workers Min: The minimum number of Workers to retain. It can be set to 0, in which case no Workers are kept while your service receives no requests. Note that with a value of 0, the first requests after an idle period may not be answered quickly, since a Worker has to start up first; be cautious if your scenario requires fast response times.
- Workers Max: The maximum number of Workers allowed. When your service’s request volume increases and triggers the auto-scaling mechanism, the number of Workers grows; this setting caps that growth to help control costs.
- Idle Timeout (seconds): When the auto-scaling mechanism decides to take a Worker offline, the platform keeps the Worker alive for the specified time so it can quickly absorb a potential follow-up increase in requests. Note that the platform charges you for this time.
- Max Concurrency: The maximum number of concurrent requests each Worker can handle. When the concurrency exceeds the maximum value, requests will be dispatched to other Workers. If all Workers are fully occupied, requests will be queued for execution.
- GPUs/Worker: The number of GPUs each Worker occupies.
- CUDA Version: Supports specifying the CUDA version.
- Scale Policy:
- Queue Delay: Adjusts the number of Workers based on how long requests wait in the queue. You need to configure the “Queue Delay Time (seconds)”: when the queue waiting time exceeds this value, scale-up is triggered; when it falls below this value, scale-down is triggered (combined with the “Idle Timeout (seconds)” setting above to determine when a Worker actually goes offline).
- Request Count: Adjusts the number of Workers based on the number of requests in the queue. You need to configure the “Queue Max Request Count”: when the number of queued requests exceeds this value, scale-up is triggered; when it falls below this value, scale-down is triggered (again combined with the “Idle Timeout (seconds)” setting to determine when a Worker actually goes offline).
- Image Configuration:
- Container Image: The image repository address, such as vllm/vllm-openai:latest.
- Container Registry Credentials: When you specify a private image repository, you need to set the access credentials. Please configure them under Settings - Container Registry Auth.
- HTTP Port: The HTTP port exposed by the Workers. The load balancer will forward requests to this port of the Workers.
- Container Start Command: The command executed when the container starts. For example: --model meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096.
- Storage Configuration: Here you can configure the container disk, volume disk, and network volume. Currently, the container disk and volume disk sizes are fixed. More options will be added gradually; please stay tuned for official updates.
- Others:
- Health Check Path: The built-in load balancer performs health checks through this path, such as /health. It forwards requests to a Worker only if the returned status code is 200 (a sketch follows this list).
- Environment Variables: Here you can configure the environment variables that the service depends on. For example: HUGGING_FACE_HUB_TOKEN={Your Hugging Face Access Token (with read permission)}.
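As a concrete illustration of how the HTTP Port, Health Check Path, and Environment Variables fields fit together for the image used in this guide: the vLLM server listens on port 8000 by default and exposes a /health route that returns 200 once it is ready, so a matching configuration is HTTP Port 8000 and Health Check Path /health. Against the local smoke-test container from Step 1, the probe the load balancer performs is roughly equivalent to:
```bash
# vLLM answers /health with an empty 200 response once the model is loaded and ready
curl -i http://localhost:8000/health
# Expected status line: HTTP/1.1 200 OK
```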
Step 4: Access the Deployed Llama 3.1 8B Model Service
You can find the Serverless Endpoint you just created in the Serverless Console. When the Endpoint’s status changes to “Serving”, and at least one Worker is in the “Running” state, click the “Copy” button to copy the URL for accessing the model service.
You can then use the curl command to access the service, as sketched below.
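The exact request depends on the model and start command you configured; for the vllm/vllm-openai image used in this guide, a minimal call to its OpenAI-compatible chat completions API might look like the following, where <ENDPOINT_URL> stands for the address you copied from the console (any authentication the platform may require for your endpoint is omitted here):
```bash
curl -X POST "<ENDPOINT_URL>/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "user", "content": "Hello! Please introduce yourself briefly."}
        ],
        "max_tokens": 128
      }'
```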
Next Step
Next, please refer to the Manage Serverless Endpoint document to learn how to view and modify the serverless endpoint you just created.