Financial Market Applications of LLMs
The AI revolution drove frenzied investment in both private and public companies and captured the public’s imagination in 2023. Transformational consumer products like ChatGPT are powered by Large Language Models (LLMs) that excel at modeling sequences of tokens that represent words or parts of words [2]. Amazingly, structural understanding emerges from learning next-token prediction, and agents are able to complete tasks such as translation, question answering and generating human-like prose from simple user prompts.
Not surprisingly, quantitative traders have asked: can we turn these models to predicting the next price or trade [1,9,10]? That is, rather than modeling sequences of words, can we model sequences of prices or trades? This turns out to be an interesting line of inquiry that reveals much about both generative AI and financial time series modeling. Be warned: this will get wonky.
LLMs are known as autoregressive learners: they use the previous tokens or elements in a sequence to predict the next one. In quantitative trading, for example in strategies like statistical arbitrage in stocks, most research is likewise concerned with identifying autoregressive structure. That means finding the sequences of news, orders, or fundamental changes that best predict future prices.
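To make the parallel concrete, here is a minimal sketch, on synthetic data, of the simplest kind of autoregressive structure a researcher might look for: an AR(p) model of returns fit by least squares. The lag order, coefficients, and noise level are all made up for illustration.

```python
# Fit an AR(p) model to a return series by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns with a small autoregressive signal.
n, p = 2_000, 3
true_coefs = np.array([0.05, -0.02, 0.01])  # lag-1, lag-2, lag-3
returns = np.zeros(n)
for t in range(p, n):
    returns[t] = true_coefs @ returns[t - p:t][::-1] + 0.01 * rng.standard_normal()

# Design matrix: row i holds the p returns preceding returns[p + i].
X = np.column_stack([returns[p - k - 1:n - k - 1] for k in range(p)])
y = returns[p:]

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted lag coefficients:", coefs.round(3))
print("one-step forecast:", coefs @ returns[-1:-p - 1:-1])
```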
Where things break down is in the quantity and information content of the data available to train the models. At the 2023 NeurIPS conference, Hudson River Trading (HRT), a high-frequency trading firm, presented a comparison of the number of input tokens used to train GPT-3 with the number of trainable tokens available in stock market data per year. HRT estimated that, with 3,000 tradable stocks, 10 data points per stock per second, 252 trading days per year, and 23,400 seconds in a trading day, there are 177 billion stock market tokens per year available as market data. GPT-3 was trained on 500 billion tokens, so not far off [6].
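The arithmetic behind that estimate is easy to verify:

```python
# Back-of-the-envelope check of HRT's market-data token count versus GPT-3.
stocks = 3_000            # tradable stocks
points_per_second = 10    # data points per stock per second
seconds_per_day = 23_400  # a 6.5-hour trading session
trading_days = 252        # trading days per year

market_tokens_per_year = stocks * points_per_second * seconds_per_day * trading_days
gpt3_training_tokens = 500e9

print(f"{market_tokens_per_year:.3g}")                                  # ~1.77e11
print(f"{market_tokens_per_year / gpt3_training_tokens:.0%} of GPT-3")  # ~35%
```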
But in the trading context the tokens will be prices or returns or trades rather than syllables or words, and the former are much more difficult to predict. Language has an underlying linguistic structure (e.g., grammar) [7]. It is not hard to imagine a human predicting the next word in a sentence; that same human, however, would find it extremely challenging to predict the next return given a sequence of previous trades, hence the scarcity of billionaire day traders. The challenge is that there are very smart people competing away any signal in the market, making it almost efficient ("efficiently inefficient," in the words of economist Lasse Pedersen) and hence unpredictable. No adversary actively tries to make sentences more difficult to predict; if anything, authors usually seek to make their sentences easy to understand and hence more predictable.
Looked at from another angle, there is much more noise than signal in financial data. Individuals and institutions are trading for reasons that might not be rational or tied to any fundamental change in a business. The GameStop episode in 2021 is one such example. Financial time series are also constantly changing with new fundamental information, regulatory changes, and occasional large macroeconomic shifts such as currency devaluations. Language evolves at a much slower pace and over longer time horizons.
On the other hand, there are reasons to believe that ideas from AI will work well in financial markets. One emerging area of AI research with promising applications to finance is multimodal learning [5], which aims to use different modalities of data, for example both images and textual inputs, to build a unified model. With OpenAI's DALL-E 2 model, a user can enter text and the model will generate an image. In finance, multimodal efforts could be useful for combining information from classical sources such as technical time series data (prices, trades, volumes, etc.) with alternative data in different modes: sentiment or graphical interactions on Twitter, natural language news articles and corporate reports, or satellite images of shipping activity in a commodity-centric port. Here, leveraging multimodal AI, one could potentially incorporate all of these types of non-price information to improve predictions.
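As a concrete, if simplified, illustration of what such fusion might look like, here is a sketch; everything in it, from the GRU time series encoder to the 768-dimensional text embedding, is an illustrative assumption rather than a description of any production system.

```python
# A toy multimodal model: encode a price/volume time series with a GRU,
# project a precomputed text embedding (e.g., of recent news), and fuse
# the two to predict the next return. All shapes and sizes are made up.
import torch
import torch.nn as nn

class MultimodalReturnModel(nn.Module):
    def __init__(self, n_series_features=3, text_dim=768, hidden=64):
        super().__init__()
        self.series_encoder = nn.GRU(n_series_features, hidden, batch_first=True)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, series, text_emb):
        # series: (batch, time, features); text_emb: (batch, text_dim)
        _, h = self.series_encoder(series)   # h: (num_layers, batch, hidden)
        fused = torch.cat([h[-1], self.text_proj(text_emb)], dim=-1)
        return self.head(fused).squeeze(-1)  # one predicted return per sample

model = MultimodalReturnModel()
series = torch.randn(8, 100, 3)   # e.g., returns, volume, spread over 100 bars
text_emb = torch.randn(8, 768)    # e.g., sentence embedding of a news story
print(model(series, text_emb).shape)  # torch.Size([8])
```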
Another strategy, called "residualization," is prominent in both finance and AI, though it assumes different roles in the two domains. In finance, structural "factor" models break down the contemporaneous observations of returns across different assets into a shared component (the market return, or more generally the returns of common, market-wide factors) and an idiosyncratic component unique to each underlying asset. Market and factor returns are difficult to predict and create interdependence, so it is often helpful to remove the common element when making predictions at the individual asset level, and to maximize the number of independent observations in the data. In residual network architectures such as transformers, there is a similar idea: we want to learn a function h(X) of an input X, but it might be easier to learn the residual of h(X) to the identity map, i.e., h(X) − X. If the function h(X) is close to the identity, its residual will be close to zero, so there is less to learn and learning can be done more efficiently. In both cases the goal is to exploit structure to refine predictions: in finance, the idea is to focus on predicting innovations beyond what is implied by the overall market; for residual networks, the focus is on predicting innovations to the identity map.
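A small sketch may help pin down both senses of the word; the one-factor market model and all the numbers below are illustrative assumptions, not anyone's actual strategy.

```python
# Residualization in both senses. Finance: regress each asset's returns on
# the market and keep the residual. Deep learning: a residual block learns
# f(X) = h(X) - X, so the layer outputs X + f(X).
import numpy as np

rng = np.random.default_rng(1)

# Simulate a one-factor world: returns = beta * market + idiosyncratic noise.
n_days, n_assets = 1_000, 5
market = 0.01 * rng.standard_normal(n_days)
betas = rng.uniform(0.5, 1.5, n_assets)
returns = np.outer(market, betas) + 0.005 * rng.standard_normal((n_days, n_assets))

# Estimate betas by no-intercept OLS against the market, then residualize.
est_betas = returns.T @ market / (market @ market)
residual_returns = returns - np.outer(market, est_betas)
print("estimated betas:", est_betas.round(2))

# Residual connection: if h is near the identity, f has little left to learn.
def residual_block(x, f):
    return x + f(x)

print(residual_block(rng.standard_normal(4), lambda v: 0.1 * np.tanh(v)))
```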
A key ingredient in the impressive performance of LLMs is their ability to discern affinities, or strengths of association, between tokens over long stretches of input known as context windows. In financial markets, the ability to focus attention across long horizons enables analysis of multi-scale phenomena, with some aspects of market changes explained across very different time horizons. For example, at one extreme, fundamental info…