LLaMA: Open and Efficient Foundation Language Models Paper Review


Sion1225 2024. 1. 23. 18:48

* This post is still a work in progress.

This is my second paper review.

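One of the reasons I decided to write paper reviews was that, when I was reading this paper, there was no properly written Korean-language review of it. After putting the review off for more than half a year for various reasons, however, there are now plenty of decent ones.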

That is why I am jumping straight from GPT to LLaMA, out of the order I had planned.

Although this model was announced in the first quarter of 2023, its improved successor, LLaMA 2, was already released in the third quarter.

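Since OpenAI abruptly started withholding its models, beginning with GPT-2, the trend has shifted toward private models. Against that background, LLaMA is significant as a semi-open-source model released by Meta, which has stayed in the open-source camp.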

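Given its focus on open source, what impresses me is that the model drastically reduces model size (LLaMA-13B is about 1/10 the size of GPT-3) while still outperforming GPT-3 on most benchmarks.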

In this paper, the Related Work section is placed right before the Conclusion.

There are three major changes relative to the original Transformer architecture, and each of them requires background knowledge from prior work. I think this is probably the hardest part of the paper.

In this post I only touch on them briefly, but I plan to review each of the underlying papers in detail later.

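Until now I have used blockquote formatting to mark my own comments and interpretation:

my own comments and interpretation

However, I found that blockquotes cannot be used inside bullet points, so I also use the * mark to indicate my own comments.
Content that is not particularly important is mentioned in a comment without quoting the original text.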

LLaMA: Open and Efficient Foundation Language Models (Hugo Touvron et al., 2023)

Link to the original paper

Introduction

Large Languages Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance.
So far, LLMs have evolved toward ever larger models, based on the assumption (a guess without solid evidence) that more parameters (a larger model) will lead to better performance.
recent work from Hoffmann et al. (2022) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) showed that, within a fixed compute budget, a smaller model trained on more data outperforms the largest model.
LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller.
LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10 times smaller.
we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented.
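We used only publicly available data, which keeps the work compatible with open-sourcing. (Among models trained only on open data so far, ours is the only one that performs as well as PaLM or Chinchilla.)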
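
The compute-budget argument from Hoffmann et al. (2022) quoted above can be made concrete with a little arithmetic. The sketch below assumes the common rule of thumb that training cost is roughly 6 x parameters x tokens FLOPs; that constant, the budget value, and the function name `affordable_tokens` are illustrative assumptions, not figures taken from this review.

```python
# Rule-of-thumb training cost: FLOPs ~= 6 * N (parameters) * D (tokens).
# The constant 6 is an assumed approximation, used here only for illustration.
FLOPS_PER_PARAM_TOKEN = 6


def affordable_tokens(compute_flops, n_params):
    """Tokens that a model with n_params parameters can see under a fixed budget."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)


if __name__ == "__main__":
    budget = 1e23  # hypothetical compute budget in FLOPs
    for name, n_params in [("175B model", 175e9), ("13B model", 13e9)]:
        tokens = affordable_tokens(budget, n_params)
        print(f"{name}: about {tokens / 1e9:.0f}B tokens for the same budget")
```

Under this approximation the number of affordable tokens scales inversely with model size, which is why a model about 10 times smaller can be trained on roughly 10 times more data for the same cost.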

Approach

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022).
Our training approach is similar to the methods in the prior work cited above, and is inspired by the Chinchilla scaling laws.
Pre-training Data
We reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing.
We reuse data sources that have already been used to train other LLMs, restricted to data that is publicly available and compatible with open-sourcing.
- English CommonCrawl [67%]
  * CommonCrawl is a non-profit organization that crawls the web every month and releases the collected data for free.
  - We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an ngram language model.
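    We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, removes non-English pages through language identification with a fastText linear classifier, and filters low-quality content with an n-gram language model.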
  - we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.
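    We trained a linear model to classify pages used as references in Wikipedia versus randomly sampled pages, and discarded pages not classified as references.
    * I found it noteworthy that even building the dataset for training the model involves training and applying several other models.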
- C4 [15%]
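  * This dataset was built and released to train the T5 model {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2020)}. It was produced by cleaning the CommonCrawl data mentioned above.
  I wondered whether it would simply duplicate the CommonCrawl data, but its inclusion is reportedly the result of exploratory experiments.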
  - During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data.
    In exploratory experiments, we found that using diverse pre-processed CommonCrawl datasets improves performance. We therefore included the C4 dataset.
  - The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.
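    The preprocessing of C4 also includes deduplication and language identification. The main difference from CCNet, which was used to preprocess CommonCrawl, is the quality filtering, which mostly relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage.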
- Github [4.5%]
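  * This is the GitHub everyone knows. Only code released under open-source licenses was reportedly used for training. It was of course preprocessed, and identical files were removed.
  It is notable that programming languages were included in training to improve the model's coding ability.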
- Wikipedia [4.5%]
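  * Wikipedia is the textbook-like dataset that is now taken for granted in LLM pre-training.
  The authors report training on the following 20 languages, which use either the Latin or the Cyrillic script: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. Although this is a small share of the total data, it is interesting that they tried to give the model at least some multilingual ability.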
- Gutenberg and Books3 [4.5%]
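  * Gutenberg and Books3 are separate datasets. Both are collections of books: Gutenberg is a project that contains public-domain books, and Books3 is taken from the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.
  Because the two sources are similar, deduplication is performed at the book level: books with more than 90% content overlap are removed.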
- ArXiv [2.5%]
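  * To add scientific data, arXiv LaTeX files were preprocessed following the method of Lewkowycz et al. (2022) and added to the training data.
  I expect this improves the model's ability to handle academic text as well as its scientific reasoning.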
- Stack Exchange [2%]
  * In addition, Stack Exchange, which has high-quality questions and answers in fields ranging from computer science to chemistry, was also included in the training data.
- Tokenizer
  - We tokenize the data with the bytepair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018).
    We tokenize the data with the BPE algorithm, using the SentencePiece implementation.
    * This is the byte-pair encoding approach that OpenAI's GPT family also uses.
  - our entire training dataset contains roughly 1.4T tokens after tokenization.
    After tokenization, our entire training dataset contains roughly 1.4T tokens.
  - For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.
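    For most of our training data, each token is used only once during training. The exceptions are the Wikipedia and Books domains, over which we perform approximately two epochs.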
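
To get a feel for the data mixture listed above, the bracketed percentages can be turned into approximate token counts. This is a rough sketch that assumes the percentages are sampling proportions of the roughly 1.4T-token total; the names `MIXTURE` and `tokens_per_source` are made up for illustration.

```python
# Percentages as listed in the bullets above; treated here as sampling proportions.
MIXTURE = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "Github": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}
TOTAL_TOKENS = 1.4e12  # "roughly 1.4T tokens" after tokenization


def tokens_per_source(mixture, total_tokens):
    """Approximate number of training tokens drawn from each source."""
    if abs(sum(mixture.values()) - 100.0) > 1e-9:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total_tokens * pct / 100.0 for name, pct in mixture.items()}


if __name__ == "__main__":
    for name, count in tokens_per_source(MIXTURE, TOTAL_TOKENS).items():
        print(f"{name:>13}: about {count / 1e9:.0f}B tokens")
```

Since Wikipedia and Books are seen for approximately two epochs, the amount of unique text behind those two entries would be roughly half of the sampled token count.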

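As already mentioned in the Introduction, the authors used publicly available data that is compatible with open-sourcing.
They seem to emphasize this point in particular, and I think it is a selling point that Meta, with its pro-open-source AI policy, considers important.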
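
Going back to the preprocessing described for CommonCrawl and C4, line-level deduplication and heuristic quality filtering can be sketched as follows. This is a toy illustration, not the CCNet or C4 code: the normalization (strip and lowercase), the SHA-1 hashing, the `min_words` threshold, and both function names are assumptions chosen for the example.

```python
import hashlib


def dedup_lines(documents):
    """Keep only the first occurrence of each line across all documents."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            normalized = line.strip().lower()  # assumed normalization
            if not normalized:
                continue  # drop empty lines
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if key in seen:
                continue  # this line already appeared somewhere earlier
            seen.add(key)
            kept.append(line)
        result.append("\n".join(kept))
    return result


def looks_reasonable(text, min_words=5):
    """Heuristic filter: enough words and at least one sentence-ending mark."""
    has_sentence_end = any(ch in ".!?" for ch in text)
    return len(text.split()) >= min_words and has_sentence_end


if __name__ == "__main__":
    pages = [
        "Home | Login\nLLaMA is a language model.",
        "Home | Login\nIt was trained on public data.",
    ]
    for page in dedup_lines(pages):
        print(repr(page), looks_reasonable(page))
```

Repeated navigation lines such as menus and footers are what line-level deduplication is meant to catch, while the word-count and punctuation checks mimic the kind of heuristics the C4 filtering relies on.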

Architecture
Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017).
We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main difference with the original architecture, and where we were found the inspiration for this change (in bracket)
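Following recent LLMs, our network is based on the transformer architecture (decoder).
We improved the model with various improvements that were proposed afterwards. (They are listed below; the model that inspired each change is given in [].)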
- Pre-normalization [GPT3]
  [Figure: structure of GPT-3]
  
  To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019).
  To improve training stability, we normalize the input of each transformer sub-layer instead of its output, and we use the RMSNorm function.
  - What is RMSNorm? A normalization method that drops the mean-centering step of Layer Normalization and rescales by the root mean square only. (https://github.com/bzhangGo/rmsnorm)
- SwiGLU activation function [PaLM]
- Rotary Embeddings [GPTNeo]
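
The pre-normalization and RMSNorm items above can be sketched in plain Python on a single vector. This is a minimal illustration under simplifying assumptions: `sublayer` stands in for an attention or feed-forward sub-layer, the function names are hypothetical, and a real model would apply the same idea to tensors with a learned `gain`.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root mean square, with no mean subtraction."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm for comparison: subtract the mean (centering), then rescale."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]


def pre_norm_residual(x, sublayer, gain):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x, gain)))]


def post_norm_residual(x, sublayer, gain):
    """Post-normalization ordering: add the residual first, then normalize the output.
    (RMSNorm is reused here only to isolate the ordering difference.)"""
    return rms_norm([a + b for a, b in zip(x, sublayer(x))], gain)


if __name__ == "__main__":
    x = [3.0, 4.0]
    ones, zeros = [1.0, 1.0], [0.0, 0.0]
    print("rms_norm  :", rms_norm(x, ones))
    print("layer_norm:", layer_norm(x, ones, zeros))
    print("pre-norm  :", pre_norm_residual(x, lambda h: h, ones))
    print("post-norm :", post_norm_residual(x, lambda h: h, ones))
```

In the pre-norm variant the residual path carries `x` through unnormalized, which is the property usually credited with the improved training stability mentioned in the quoted passage.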