Multimodal Review: Running qresearch's llama3-vision-alpha on Colab

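While searching Note for Japanese LLM trends as material for LLM R&D, I came across a post saying that a group called qresearch had built a vision model on top of llama3. Rather than simply claiming that their model performs well, it was a write-up that walks through the code they built. It looked more useful than I expected, so I followed along and ran it myself.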

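The code is simple to run. In this case the Hugging Face repository holds only llama-3-vision-alpha/mm_projector.bin; everything else comes from llama3 and SigLIP models that were not fine-tuned separately. I was surprised that a vision model could be implemented just by building a projection layer between the two.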

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because answer_question comes from
# custom code shipped in the model repository
model_id = "qresearch/llama-3-vision-alpha-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
)

# Path to an image uploaded to the Colab runtime
image = Image.open("/content/sdfsdfsdfdsf.jpg")

# answer_question returns token ids, so decode them back into text
print(
    tokenizer.decode(
        model.answer_question(image, "what is this?", tokenizer),
        skip_special_tokens=True,
    )
)

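The projector is nothing more than a very simple sequential stack of linear layers and an activation, and it is remarkable that the model gains vision ability just by concatenating the projector's output with the text embeddings at the embedding layer. So I also tried swapping the base llama3 for a different fine-tuned llama3. It behaved similarly, which showed me that an approach this simple-looking can still be powerful. It also made me want to read the training code or try implementing the training myself.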

import torch
import torch.nn as nn


class ProjectionModule(nn.Module):
    def __init__(self, mm_hidden_size, hidden_size):
        super().__init__()

        # Directly set up the sequential model
        self.model = nn.Sequential(
            nn.Linear(mm_hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, x):
        return self.model(x)
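
To get a feel for how small the trained part is, the parameter count of this two-layer projector can be worked out by hand. The sketch below is plain arithmetic, not code from the repository. The sizes 1152 (vision hidden size) and 4096 (LLM hidden size) are assumptions chosen to resemble a SigLIP so400m encoder and an 8B llama3; the post does not state them, so substitute whatever your own configs report.

```python
def projector_param_count(mm_hidden_size: int, hidden_size: int) -> int:
    """Parameters of Linear(mm, h) -> GELU -> Linear(h, h), biases included."""
    first = mm_hidden_size * hidden_size + hidden_size  # weight + bias
    second = hidden_size * hidden_size + hidden_size    # weight + bias
    return first + second  # GELU itself has no parameters


# Assumed sizes (hypothetical, check your own model configs).
ASSUMED_MM_HIDDEN = 1152
ASSUMED_LLM_HIDDEN = 4096

total = projector_param_count(ASSUMED_MM_HIDDEN, ASSUMED_LLM_HIDDEN)
print(f"{total:,} parameters")  # 21,504,000 under the assumed sizes
```

Under these assumed sizes the projector is tiny compared with a multi-billion-parameter language model, which fits the observation that only mm_projector.bin had to be shipped.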

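def process_tensors(input_ids, image_features, embedding_layer):
    """Splice image_features into the text embeddings at the -200 placeholder.

    Only the first placeholder found is used, and its column index is applied
    to every row, so this expects a single image per prompt.
    """
    # Find the column index of the first -200 in input_ids
    split_index = (input_ids == -200).nonzero(as_tuple=True)[1][0]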

    # Split the input_ids at the index found, excluding -200
    input_ids_1 = input_ids[:, :split_index]
    input_ids_2 = input_ids[:, split_index + 1 :]

    # Convert input_ids to embeddings
    embeddings_1 = embedding_layer(input_ids_1)
    embeddings_2 = embedding_layer(input_ids_2)

    device = image_features.device
    token_embeddings_part1 = embeddings_1.to(device)
    token_embeddings_part2 = embeddings_2.to(device)

    # Concatenate the token embeddings and image features
    concatenated_embeddings = torch.cat(
        [token_embeddings_part1, image_features, token_embeddings_part2], dim=1
    )

    # Create the corrected attention mask
    attention_mask = torch.ones(
        concatenated_embeddings.shape[:2], dtype=torch.long, device=device
    )
    return concatenated_embeddings, attention_mask
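
The splice is easier to follow with concrete shapes. Below is a minimal CPU-only sketch of the same idea with toy sizes: a random nn.Embedding stands in for the LLM's embedding layer and a random tensor stands in for the projector output. The function name splice_image_features and all sizes are made up for illustration and are not taken from the repository.

```python
import torch
import torch.nn as nn

IMAGE_TOKEN_INDEX = -200  # placeholder id used in the post's code


def splice_image_features(input_ids, image_features, embedding_layer):
    """Replace the single placeholder in input_ids with image_features."""
    split_index = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[1][0]
    before = embedding_layer(input_ids[:, :split_index])
    after = embedding_layer(input_ids[:, split_index + 1:])
    embeds = torch.cat([before, image_features, after], dim=1)
    mask = torch.ones(embeds.shape[:2], dtype=torch.long)
    return embeds, mask


torch.manual_seed(0)
vocab_size, hidden_size, num_patches = 100, 16, 9  # toy sizes, not real ones
embedding_layer = nn.Embedding(vocab_size, hidden_size)

# Batch of one prompt: 3 text tokens, the placeholder, then 2 text tokens.
input_ids = torch.tensor([[5, 6, 7, IMAGE_TOKEN_INDEX, 8, 9]])
# Stand-in for the projector output: (batch, patches, hidden_size).
image_features = torch.randn(1, num_patches, hidden_size)

embeds, mask = splice_image_features(input_ids, image_features, embedding_layer)
print(embeds.shape)  # torch.Size([1, 14, 16])
print(mask.shape)    # torch.Size([1, 14])
```

The placeholder is never looked up in the embedding table (a negative id would be out of range); it only marks where the 9 image vectors go, so the sequence grows from 6 ids to 3 + 9 + 2 = 14 positions.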

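# Excerpt from the inference path. image_forward_outs is presumably the vision
# encoder's output with hidden states, and input_ids already holds the -200
# placeholder.

# Use the second-to-last hidden layer of the vision encoder as patch features
image_features = image_forward_outs.hidden_states[-2]

# Project the patch features into the LLM's embedding space
projected_embeddings = projection_module(image_features).to("cuda")

# Splice the projected image features into the text embeddings
new_embeds, attn_mask = process_tensors(
    input_ids, projected_embeddings, embedding_layer
)
# Move both tensors to the LLM's device before they are passed to it
device = model.device
attn_mask = attn_mask.to(device)
new_embeds = new_embeds.to(device)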
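
The excerpt takes for granted that input_ids already contains the -200 placeholder, and the post does not show how it gets there. One common pattern in LLaVA-style code is to put a marker string in the prompt, tokenize the text on either side of it, and insert the placeholder id in between. The sketch below illustrates that pattern under stated assumptions: the "<image>" marker, the build_input_ids helper, and the toy_encode tokenizer are all hypothetical and not taken from the repository.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # placeholder id that process_tensors looks for
IMAGE_MARKER = "<image>"  # assumed marker string inside the prompt


def build_input_ids(prompt: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Tokenize the text around IMAGE_MARKER and put the placeholder between."""
    before, marker, after = prompt.partition(IMAGE_MARKER)
    if not marker:
        raise ValueError(f"prompt has no {IMAGE_MARKER} marker")
    return encode(before) + [IMAGE_TOKEN_INDEX] + encode(after)


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


ids = build_input_ids("user: <image> what is this?", toy_encode)
print(ids)  # [5, -200, 4, 2, 5]
```

With a real tokenizer, toy_encode would be replaced by a call that returns token ids without adding special tokens to each piece, so that a beginning-of-sequence token is not inserted twice.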

References

https://huggingface.co/qresearch/llama-3-vision-alpha/blob/main/__main__.py

https://note.com/astropomeai/n/n89124686697f
