Llama3.1-8B

Run Llama3.1-8B with SGLang on Novita AI

Model · Text Generation
Meta · 2025-03-07 · GitHub


What is Llama 3.1

Meta Llama 3.1 features multilingual language models in 8B, 70B, and 405B sizes with both pretrained and instruction-tuned variants. The instruction-tuned models excel in multilingual conversations, outperforming many open-source and proprietary alternatives on industry benchmarks.

Key Features of Llama 3.1

  • Technical Power: Llama 3.1 delivers unprecedented capabilities with its 405 billion parameter architecture trained on 16,000 Nvidia H100 GPUs, featuring "Imagine Me" for personalized images, support for five languages, and search engine API integration.

  • Open Innovation: Meta's open-source approach democratizes advanced AI access, creating a "Linux for AI" that allows developers to freely modify the model with minimal restrictions only on large-scale commercial deployments.

  • Cost Leadership: Operating at approximately half the cost of competitors like GPT-4, Llama 3.1 makes advanced AI economically viable for organizations and developers of all sizes.

  • Ecosystem Integration: Enhanced multilingual and multimedia capabilities integrate across Meta platforms while strategic partnerships and flexible licensing enable collaborative innovation throughout the AI community.

Learn More about Llama 3.1-8B

What is Llama 3.1-8B

Meta's Llama 3.1-8B is the smallest, fastest model in the Llama 3.1 family. This pretrained and instruction-tuned multilingual LLM outperforms many open and closed-source alternatives on key benchmarks despite its compact size. Optimized for speed and cost, it delivers exceptional text understanding and generation in an efficient, economical package.

Comparing Llama 3.1-8B with Llama 3-8B

  • Context Window: Llama 3.1-8B offers a dramatically larger 128K token context window compared to Llama 3-8B's 8,000 tokens, representing a 16x increase in context capacity.

  • Mathematical Performance: Llama 3.1-8B demonstrates substantial improvement on the MATH benchmark, scoring 51.9% (0-shot) versus Llama 3-8B's 29.1%, showing a 22.8 percentage point improvement.

  • GSM8K Benchmark: Llama 3.1-8B achieves 84.5% (8-shot) on grade-school math problems compared to Llama 3-8B's 80.6%, reflecting enhanced mathematical reasoning.

  • Knowledge Cutoff: Both models share the same knowledge cutoff date of December 2023 for training data, though Llama 3.1-8B offers more advanced capabilities within this knowledge base.

  • API Availability: Both models are available through the same major providers: Azure AI, AWS Bedrock, Google Cloud Vertex AI Model Garden, NVIDIA NIM, and IBM Watsonx.

Use Cases

The Llama 3.1-8B model is flexible and can handle a diverse range of tasks, including but not limited to:

  • Chatbots and virtual assistants: Creating interactive conversational AI systems for customer service or information retrieval.

  • Content generation: Generating creative text formats like poems, stories, code snippets, or marketing copy.

  • Text summarization: Condensing lengthy text into concise summaries.

  • Machine translation: Translating text between different languages.

  • Educational tools: Creating personalized learning experiences with adaptive question-answering capabilities.

How to Use Llama 3.1-8B with SGLang on Novita AI

Step 1: Access GPU Console

  • Navigate to "GPU" in the menu bar and click "Get Started" to enter the GPU Instance console.

Step 2: Select Template

  • Browse official templates and GPU options based on your performance and budget requirements. Choose the "SGLang: Llama3.1-8B-Instruct" template and click "Deploy" under the 4090 card option.

Step 3: Configure Instance Parameters

  • Adjust disk size settings on the left panel if needed. Review the pre-configured settings on the right panel and enter your Huggingface Token in the environment variables section. Click "Next" to proceed.

Step 4: Review and Deploy

  • Verify instance configuration and cost details. Click "Deploy" to initiate deployment.

Step 5: Wait for Deployment

  • Allow the system to complete the deployment process.

Step 6: Monitor Instance Status

  • View your instance in the management dashboard where it will initially show "Pulling" status. Click the arrow next to your instance to track image download progress. Wait for the status to change to "Running."

Step 7: Check Initialization Logs

  • Click the "Logs" button and select "Instance Logs" to monitor the Llama model loading process.

Step 8: Access Connection Options

  • Close the logs page and click "Connect" to view connection information.

Step 9: Connect to HTTP Service

  • Under "Connection Options," click "Connect to HTTP Service." Note: the page may display a "Not Found" error because the Llama model requires POST requests. Copy the URL for accessing the service on port 8000.

Step 10: Test Model Access

  • Use curl commands to test the Llama model API and verify proper functionality.
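The same test can be scripted instead of typed as a one-off curl command. The sketch below is a minimal, hedged example: it assumes the instance exposes SGLang's native `/generate` route on port 8000 and that the request body uses `"text"` and `"sampling_params"` fields — verify both against your deployed SGLang version, and substitute the URL you copied in Step 9.

```python
import json
import urllib.request

# Placeholder: replace with the URL copied from "Connect to HTTP Service".
SERVICE_URL = "http://<your-instance-host>:8000/generate"

def build_payload(prompt: str, max_new_tokens: int = 128,
                  temperature: float = 0.7) -> dict:
    # Field names ("text", "sampling_params") are assumptions based on
    # SGLang's native HTTP server; check your version if requests fail.
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

def query(prompt: str) -> str:
    """POST the prompt to the deployed instance and return the generated text."""
    req = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["text"]

if __name__ == "__main__":
    # Inspect the request body without hitting the network.
    print(json.dumps(build_payload("What is Llama 3.1?"), indent=2))
```

Because the model only accepts POST requests (see Step 9), opening the URL in a browser will still show "Not Found"; always send a JSON body as above.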

Step 11: Release Resources

  • Remember to terminate the instance when finished to avoid unnecessary charges.

Further Ethical Considerations and Solutions

When implementing Llama 3.1-8B, address these ethical guidelines:

Combat Bias Proactively: Audit outputs regularly, use diverse training data, implement feedback mechanisms.

Prioritize Data Protection: Use stringent handling protocols, anonymization, and privacy-preserving methods.

Prevent Harmful Content: Establish moderation systems, develop usage policies, educate users on limitations.

Ensure Transparency: Document model functions and decision-making processes clearly.

Implement Ethical Governance: Incorporate ethical principles from initial deployment stages.

Balance Innovation and Responsibility: Maintain equilibrium between AI advancement and ethical obligations.

FAQs

What hardware specifications are recommended for running Llama 3.1-8B?

For optimal performance, Novita AI's 4090 GPU instance is recommended. This configuration provides sufficient VRAM and processing power to run the model efficiently without throttling or excessive latency issues.

What is the typical deployment time for the Llama 3.1-8B instance?

Deployment typically takes 3-5 minutes, depending on network conditions and whether the image has been cached. The initial model loading can take an additional 1-3 minutes as the weights are loaded into GPU memory.

How do I interact with the model once it is deployed?

The model exposes an HTTP endpoint on port 8000. You can send POST requests to this endpoint with your prompts. The service accepts JSON payloads with a "prompt" field containing your instruction or query.
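SGLang servers commonly also serve an OpenAI-compatible route alongside the raw endpoint. A minimal sketch, assuming `/v1/chat/completions` is available on the same port — the route and the model name below are assumptions and may need adjusting to match what the template actually launched:

```python
import json
import urllib.request

# Placeholder base URL from the "Connect to HTTP Service" panel.
BASE_URL = "http://<your-instance-host>:8000"

def build_chat_body(user_message: str,
                    model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    # OpenAI-style chat payload; the model identifier is illustrative and
    # should match whatever name the SGLang server was started with.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }

def chat(user_message: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_chat_body(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Using the OpenAI-compatible route lets existing OpenAI SDK clients talk to the instance by pointing their base URL at it.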

What is the expected inference speed for Llama 3.1-8B on Novita AI?

With SGLang optimization, you can expect generation speeds of approximately 30-50 tokens per second on the 4090 GPU configuration, though this may vary based on prompt complexity and generated content.
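Those rates translate directly into wall-clock latency estimates. A small sketch using the 30-50 tokens/second range quoted above (prefill time for the prompt is ignored here):

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock estimate for generating num_tokens at a steady rate."""
    return num_tokens / tokens_per_second

# A 500-token answer at the low and high ends of the quoted range:
slow = generation_seconds(500, 30.0)  # ~16.7 s
fast = generation_seconds(500, 50.0)  # 10.0 s
```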

How do I monitor resource usage during model operation?

Novita AI provides real-time monitoring of GPU memory usage, CPU utilization, and other metrics through the instance dashboard. Access these by clicking on your instance and navigating to the "Metrics" tab.

Can I fine-tune the Llama 3.1-8B model on my own data?

Yes, though this requires additional setup. You'll need to use the SGLang fine-tuning capabilities or switch to a different framework. Consider using Novita AI's persistent storage option to maintain your fine-tuned weights.

How much does it cost to run this model on Novita AI?

Pricing depends on the selected GPU and usage duration. The 4090 GPU instance typically costs around $0.60-$0.80 per hour. Check Novita AI's current pricing page for the most accurate rates.
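Since billing is by usage duration, a back-of-the-envelope estimate is just hours times the hourly rate. The rates below are the range quoted above, not current pricing:

```python
def estimated_cost(hours: float, hourly_rate: float) -> float:
    """Simple usage cost: billed hours times the hourly GPU rate (USD)."""
    return hours * hourly_rate

# An 8-hour working day at the quoted $0.60-$0.80/hour range:
low = estimated_cost(8, 0.60)   # 4.80 USD
high = estimated_cost(8, 0.80)  # 6.40 USD
```

This is also why Step 11 matters: an idle instance accrues the same hourly charge as a busy one.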

How do I handle errors or unexpected model behavior?

Common issues can be diagnosed through the instance logs. Access these by clicking the "Logs" button in your instance dashboard. For model-specific errors, check your prompt formatting and ensure you're not exceeding context window limitations.

License

Llama 3.1 is distributed under a custom commercial license, the Llama 3.1 Community License.

Source Site

  1. Hugging Face

  2. Novita-CollabHub

  • TOP-LLM Integration Repo: Integrate the Novita AI API into popular software and platforms. Access Novita AI to get an API key.
  • Novita AI Template: Templates are Docker container images paired with a configuration. They are used to launch images as instances and define the required container disk size, volume, volume paths, and ports. You can also define environment variables within the template.
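As an illustration of what a template bundles together, the sketch below groups the settings the description mentions into one structure. All field names and values are hypothetical, not Novita AI's actual template schema:

```python
# Hypothetical sketch of the settings a template groups together;
# field names and values are illustrative, not Novita AI's real schema.
template = {
    "name": "SGLang: Llama3.1-8B-Instruct",
    "image": "example/sglang-llama3.1-8b:latest",    # Docker container image (placeholder tag)
    "container_disk_gb": 60,                         # required container disk size
    "volume_gb": 100,                                # persistent volume size
    "volume_mount_path": "/workspace",               # volume path inside the container
    "ports": [8000],                                 # HTTP service port used in the steps above
    "env": {"HF_TOKEN": "<your-huggingface-token>"}, # environment variables, e.g. the HF token
}
```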

Get in Touch

About Novita AI

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

Other Recommended Templates

Meta Llama 3.1 8B Instruct

Accelerate AI Innovation with Meta Llama 3.1 8B Instruct, Powered by Novita AI

MiniCPM-V-2_6

Empower Your Applications with MiniCPM-V 2.6 on Novita AI.

kohya-ss

Unleash the Power of Kohya-ss with Novita AI

stable-diffusion-3-medium

Transform Creativity with Stable Diffusion 3 Medium on Novita AI

Qwen2-Audio-7B-Instruct

Empower Your Audio with Qwen2 on Novita AI

Ready to build smarter? Start today.
Get started with Novita AI and unlock the power of affordable, reliable, and scalable AI inference for your applications.

Building an AI startup? Get up to $10,000 in credits!

Get up to $10,000 in credits and dedicated support to grow and scale your AI startup.

Apply Now