Car-GPT: Could LLMs finally make self-driving cars happen?
In 1928, London was in the middle of a terrible health crisis, devastated by bacterial diseases like pneumonia, tuberculosis, and meningitis. Confined in sterile laboratories, scientists and doctors were stuck in a relentless cycle of trial and error, using traditional medical approaches to solve complex problems.
Then, in September 1928, an accidental event changed the course of the world. A Scottish doctor named Alexander Fleming forgot to close a Petri dish (the transparent circular box you used in science class), which got contaminated by mold. Fleming noticed something peculiar: all the bacteria close to the mold's moisture were dead, while the others survived.
"What was that moisture made of?" Fleming wondered. He found that penicillin, a substance produced by the mold, was a powerful bacteria killer. This became the groundbreaking discovery of penicillin, leading to the antibiotics we use today. In a world where doctors were relying on existing, well-studied approaches, penicillin was the unexpected answer.
Self-driving cars may be following a similar path. Back in the 2010s, most of them were built using what we call a "modular" approach. The "autonomous" part of the software is split into several modules, such as Perception (the task of seeing the world), Localization (the task of accurately localizing yourself in the world), and Planning (the task of creating a trajectory for the car to follow, implementing the "brain" of the car). Finally, all of these feed into the last module: Control, which generates commands such as "steer 20° right", etc. This was the well-known approach.
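To make the modular idea concrete, here is a minimal sketch of how the four modules chain together. Every function here is a hypothetical stub with made-up names and values (real systems use neural networks and sensor fusion); the point is only the data flow from Perception through Control.

```python
# Hypothetical sketch of the modular self-driving stack described above.
# Each module is a stub; the names and values are illustrative only.

def perceive(camera_frame):
    """Perception: detect objects in the scene (stubbed)."""
    return [{"type": "car", "distance_m": 12.0}]

def localize(gps, lidar_scan):
    """Localization: estimate the ego vehicle's pose (stubbed)."""
    return {"x": 104.2, "y": 33.7, "heading_deg": 90.0}

def plan(objects, pose):
    """Planning: produce a short trajectory for the car to follow (stubbed)."""
    return [(pose["x"], pose["y"] + i) for i in range(1, 4)]

def control(trajectory, pose):
    """Control: turn the trajectory into actuator commands (stubbed)."""
    return {"steer_deg": 0.0, "throttle": 0.3}

# The pipeline: Perception -> Localization -> Planning -> Control
objects = perceive(camera_frame=None)
pose = localize(gps=None, lidar_scan=None)
trajectory = plan(objects, pose)
command = control(trajectory, pose)
print(command)  # {'steer_deg': 0.0, 'throttle': 0.3}
```

The appeal of this design is that each module can be built, tested, and debugged independently; the drawback is that errors compound as data flows down the chain.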
But a decade later, companies started to take another discipline very seriously: End-to-End learning. The core idea is to replace every module with a single neural network that directly predicts steering and acceleration. As you can imagine, this introduces a black-box problem.
The 4 Pillars of Self-Driving Cars are Perception, Localization, Planning, and Control. Could a Large Language Model replicate them?
These approaches are known, but they don't solve the self-driving problem yet. So we might wonder: "What if LLMs (Large Language Models), currently revolutionizing the world, were the unexpected answer to autonomous driving?" This is what we're going to see in this article, beginning with a simple explanation of what LLMs are and then diving into how they could benefit autonomous driving.
Preamble: LLMs-what?
Before you read this article, you must know something: I'm not an LLM pro, at all. This means I know all too well the struggle of learning it. I understand what it's like to google "learn LLM", then see 3 sponsored posts asking you to download e-books (in which nothing concrete appears)... then see 20 ultimate roadmaps and GitHub repos, where step 1/54 is to watch a 2-hour-long video (and no one knows what step 54 is because it's so looooooooong).
So, instead of putting you through this pain myself, let's just break down what LLMs are in 3 key ideas: Tokenization, Transformers, Processing Language.
Tokenization
In ChatGPT, you input a piece of text, and it returns text, right? Well, what's actually happening is that your text is first converted into tokens. But what's a token, you might ask? A token can correspond to a word, a character, or anything we want. Think about it: if you want to send a sentence to a neural network, you can't send actual words, can you? The input of a neural network is always a number, so you need to convert your text into numbers; this is tokenization.
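Here is a toy illustration of that idea: a tokenizer that assigns one integer per word. This is a deliberate simplification; production LLMs use subword schemes such as byte-pair encoding, but the principle is identical: text in, numbers out.

```python
# Toy tokenization: map each distinct word to an integer ID.
# Real LLMs use subword vocabularies (e.g., byte-pair encoding),
# but the principle is the same: text in, numbers out.

def build_vocab(corpus):
    """Assign a unique integer ID to each distinct word in the corpus."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert a sentence into a list of token IDs."""
    return [vocab[word] for word in text.split()]

vocab = build_vocab("the car sees the road")
print(vocab)                        # {'the': 0, 'car': 1, 'sees': 2, 'road': 3}
print(tokenize("the road", vocab))  # [0, 3]
```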
Transformers
Now that we understand how to convert a sentence into a series of numbers, we can send that series into our neural network! At a high level, we have the following structure: a Transformer is an Encoder-Decoder architecture that takes a sequence of tokens as input and outputs another sequence of tokens.
If you start looking around, you will see that some models are based on an encoder-decoder architecture, others are purely encoder-based, and others, like GPT, are purely decoder-based. Whatever the case, they all share the core Transformer blocks: multi-head attention, layer normalization, residual addition and concatenation, cross-attention, etc.
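The operation at the heart of all those blocks is scaled dot-product attention. Below is a minimal NumPy sketch of it, stripped of everything a real Transformer adds on top (learned projections, multiple heads, masking, layer norm): each query scores itself against every key, and the output is a probability-weighted sum of the values.

```python
# Minimal sketch of scaled dot-product attention, the core operation
# inside every Transformer block. Deliberately simplified: no learned
# projections, no multiple heads, no masking.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each query attends over all keys; output is a weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query/key similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, embedding dimension 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per input token
```

Stacking this operation (with learned projections around it) many times is, at a high level, what all the Transformer variants above have in common.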
The Output / Next-Word Prediction
In our case, the decoder is trying to generate a series of words; we call this task "next-word prediction". Of course, it does this the same way: by predicting tokens, one at a time, which are then converted back into words.
Chat-GPT for Self-Driving Cars
The thing is, you've already been through the tough part. The rest is simply: "How do I adapt this to autonomous driving?" Think about it; we have a few modifications to make: our input now becomes either images, sensor data (LiDAR point clouds, RADAR point clouds, etc.), or even a…