FareedKhan-dev / train-llm-from-scratch

Train LLM From Scratch

I am looking for a PhD position in AI. I implemented a transformer model from scratch in PyTorch, based on the paper Attention Is All You Need. You can use my scripts to train your own million- or billion-parameter LLM on a single GPU.

Below is the output of the trained 13 million parameter LLM:

In 1978, The park was returned to the factory-plate that the public share to the lower of the electronic fence that follow from the Station’s cities. The Canal of ancient Western nations were confined to the city spot. The villages were directly linked to cities in China that revolt that the US budget and in Odambinais is uncertain and fortune established in rural areas.

Table of Contents

Training Data Info, Prerequisites and Training Time, Code Structure, Usage, Step by Step Code Explanation (Importing Libraries, Preparing the Training Data, Transformer Overview, Multi Layer Perceptron (MLP), Single Head Attention, Multi Head Attention, Transformer Block, The Final Model, Batch Processing, Training Parameters, Training the Model, Saving the Trained Model, Training Loss, Generating Text, What’s Next).

Training Data Info

The training data comes from the Pile, a large, open-source dataset for training language models. The Pile is a collection of 22 diverse datasets, including text from books, articles, websites, and more. Its total size is 825 GB.

Below is a sample of the training data:

    Line: 0
    { "text": "Effect of sleep quality … epilepsy.", "meta": { "pile_set_name": "PubMed Abstracts" } }
    Line: 1
    { "text": "LLMops a new GitHub Repository …", "meta": { "pile_set_name": "Github" } }
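Each record in the sample is a JSON object with a "text" field and a "meta.pile_set_name" field. The sketch below shows one way to read such records, assuming they are stored one per line (JSON Lines). The names SAMPLE_JSONL and iter_records are illustrative and are not taken from the repository's scripts.

```python
import json
from collections import Counter

# Two records shaped like the sample above; the text values are elided as in the sample.
SAMPLE_JSONL = """\
{"text": "Effect of sleep quality … epilepsy.", "meta": {"pile_set_name": "PubMed Abstracts"}}
{"text": "LLMops a new GitHub Repository …", "meta": {"pile_set_name": "Github"}}
"""


def iter_records(lines):
    """Yield (text, source_name) pairs from JSON Lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"], record["meta"]["pile_set_name"]


records = list(iter_records(SAMPLE_JSONL.splitlines()))

# Count how many records come from each Pile subset.
source_counts = Counter(source for _, source in records)

for text, source in records:
    print(f"{source}: {text}")
```

For a real file, the same generator can be fed an open file handle instead of SAMPLE_JSONL.splitlines(), so the data is processed one line at a time rather than loaded in full.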

Prerequisites and Training Time

To follow the code, you should have a basic understanding of object-oriented programming (OOP), neural networks (NN), and PyTorch. You will need a GPU to train your model. A Colab or Kaggle T4 is enough to train a 13+ million-parameter model, but not a billion-parameter one.

(Table omitted for brevity; please refer to the original text for GPU specifications.)

13M LLM training refers to training a 13+ million-parameter model, and 2B LLM training refers to training a 2+ billion-parameter model. Data size is categorized as small (around 1 GB), medium (around 5 GB), or large (around 10 GB).
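A back-of-the-envelope estimate helps explain why a T4 copes with the 13M model but not the 2B one. The sketch below assumes fp32 weights and gradients plus two fp32 Adam moment buffers per parameter (16 bytes per parameter in total) and ignores activations. The repository's actual precision and optimizer settings may differ, so treat the numbers as a rough lower bound rather than a measurement.

```python
# Rough training-memory estimate. Assumptions (not taken from the repository):
# fp32 weights, fp32 gradients, and two fp32 Adam moment buffers per parameter,
# i.e. 4 copies x 4 bytes = 16 bytes per parameter. Activation memory is ignored.
BYTES_PER_PARAM = 4 * 4
T4_MEMORY_GB = 16  # nominal memory of an NVIDIA T4


def training_state_gb(num_params: int) -> float:
    """Return the estimated size of weights + gradients + Adam state in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, num_params in [("13M", 13_000_000), ("2B", 2_000_000_000)]:
    gb = training_state_gb(num_params)
    verdict = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{name}: ~{gb:.2f} GB of training state, {verdict} in a {T4_MEMORY_GB} GB T4")
```

Under these assumptions, the 13M model needs about 0.2 GB of training state, while the 2B model needs about 32 GB before any activations are counted.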

Code Structure

The codebase is organized as follows:

    train-llm-from-scratch/
    ├── src/ (models, attention, transformer_block, transformer)
    ├── config/ (default configurations)
    ├── data_loader/ (data loaders/iterators)
    ├── scripts/ (training, downloading, preprocessing, generating)
    ├── data/ (train/val datasets)
    └── models/ (saved models)

Usage

Clone the repository and navigate to the directory:

    git clone https://github.com/FareedKhan-dev/train-llm-from-scratch.git
    cd train-llm-from-scratch

If you run into import errors, add the project root to PYTHONPATH (run this from the repository root):

    export PYTHONPATH="$PYTHONPATH:."

Install the required dependencies:

    pip install -r requirements.txt
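In a notebook environment such as Colab or Kaggle, where exporting a shell variable may be inconvenient, the import fix can be approximated from Python itself. This is a minimal sketch, not part of the repository: add_project_root is a hypothetical helper, and it assumes the current working directory is the cloned repository root. Unlike the export, which appends to the search path, it puts the root at the front.

```python
import sys
from pathlib import Path


def add_project_root(root) -> str:
    """Put `root` at the front of sys.path if it is not already listed, and return it."""
    root_str = str(Path(root).resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Assumes the current working directory is the cloned repository root.
project_root = add_project_root(Path.cwd())
```

Run this before importing any of the project's modules in the same session.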

You can modify the transformer architecture in src/models/transformer.py and the training configuration in config/config.py.

To download the training data, run:

    python scripts/data_download.py