oneuptime/Llama/app.py

import threading
import time

import schedule
import torch
import transformers
from fastapi import FastAPI
from pydantic import BaseModel
# TODO: Store this in redis down the line.
items_pending = {}    # prompt messages waiting to be processed, keyed by id
queue = []            # FIFO queue of pending ids
items_processed = {}  # finished pipeline outputs, keyed by id
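# Note: this state is process-local. It is lost on restart and is not shared
# across workers, so run a single server process until the Redis TODO above
# is addressed.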
# Pydantic model for the /prompt/ request body.
class Prompt(BaseModel):
    prompt: str


# Pydantic model for the /prompt-result/ request body.
class PromptResult(BaseModel):
    id: str


model_path = "/app/Models/Meta-Llama-3-8B-Instruct"
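# The path assumes the model weights were already downloaded into the image
# (e.g. at build time); adjust it to wherever Meta-Llama-3-8B-Instruct lives
# in your environment.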
pipe = transformers.pipeline(
    "text-generation",
    model=model_path,
    # Use the GPU if one is available, otherwise fall back to the CPU.
    device="cuda" if torch.cuda.is_available() else "cpu",
)
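# With chat-style input like [{"role": "user", "content": ...}], recent
# transformers releases apply the model's chat template inside the pipeline
# and return a list of {"generated_text": ...} dicts; this service stores
# that raw output unmodified in items_processed.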
app = FastAPI()


@app.get("/")
async def root():
    return {"status": "ok"}
@app.post("/prompt/")
async def create_item(prompt: Prompt):
# Log prompt to console
print(prompt)
2023-10-16 10:45:15 +00:00
# If not prompt then return bad request error
if not prompt:
return {"error": "Prompt is required"}
messages = [
{"role": "user", "content": prompt.prompt},
]
# Generate random id
random_id = str(time.time())
# add to queue
items_pending[random_id] = messages
queue.append(random_id)
# Return response
return {
"id": random_id,
"status": "queued"
}
@app.post("/prompt-result/")
async def prompt_status(prompt_status: PromptResult):
# Log prompt status to console
print(prompt_status)
# If not prompt status then return bad request error
if not prompt_status:
return {"error": "Prompt status is required"}
# check if item is processed.
if prompt_status.id in items_processed:
return_value = {
"id": prompt_status.id,
"status": "processed",
"output": items_processed[prompt_status.id]
}
# delete from item_processed
del items_processed[prompt_status.id]
return return_value
else:
return {
"id": prompt_status.id,
"status": "pending"
}
# Note: the job must be a plain function; `schedule` would never await a coroutine.
def job():
    print("Processing queue...")
    while len(queue) > 0:
        # Process the oldest queued item first (FIFO).
        random_id = queue.pop(0)
        messages = items_pending.pop(random_id)
        outputs = pipe(messages)
        items_processed[random_id] = outputs


# Schedule the job to run every 5 seconds, then drive the scheduler from a
# daemon thread: registering the job alone never runs it.
schedule.every(5).seconds.do(job)


def run_scheduler():
    while True:
        schedule.run_pending()
        time.sleep(1)


threading.Thread(target=run_scheduler, daemon=True).start()
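# A minimal client sketch (illustration only, not part of the service). It
# assumes the app is served with uvicorn on localhost:8000 and that the
# `requests` package is installed; both are assumptions, not requirements
# of this file:
#
#   import requests, time
#
#   queued = requests.post("http://localhost:8000/prompt/",
#                          json={"prompt": "Say hello."}).json()
#   while True:
#       result = requests.post("http://localhost:8000/prompt-result/",
#                              json={"id": queued["id"]}).json()
#       if result["status"] == "processed":
#           print(result["output"])
#           break
#       time.sleep(1)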