Let the AI Do the Experimenting


Agentic AI: Let the AI Do the Experimenting

Using autoresearch to optimize marketing campaigns under budget constraints

Mariya Mansurova | Apr 28, 2026 | 14 min read

Image generated by author with DALL-E 3

Have you ever had plenty of ideas for improving your product but no time to test them all? I bet you have. What if I told you that you no longer have to do it all on your own and can delegate it to AI? It can run dozens (or even hundreds) of experiments for you, discard ideas that don’t work, and iterate on the ones that actually move the needle.

Sounds amazing. That’s exactly the idea behind autoresearch, where an LLM operates in a loop, continuously experimenting, measuring impact, and iterating from there. The approach sounded compelling, and many of my colleagues have already seen benefits from it, so I decided to try it out myself. For this, I picked a practical analytical task: marketing budget optimisation with a bunch of constraints. Let’s see whether an autonomous loop can match the results we reached ourselves.

Background


Let’s start with some background. Autoresearch was developed by Andrej Karpathy. As he wrote in his repository:

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of “group meeting”. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. - @karpathy, March 2026.

The idea behind autoresearch is to let an LLM operate on its own in an environment where it can continuously run experiments. It changes the code, trains the model, evaluates whether performance improves, and then either keeps or discards each change before repeating the loop. Eventually, you come back and (hopefully) find a better model than you started with. Using this approach, Andrej was able to significantly improve nanochat.

Image by Andrej Karpathy | source

The original implementation was focused on optimising an ML model. However, a similar approach can be applied to any task with a clear objective (from reducing website load time to minimising errors when scraping with Playwright). Shopify later open-sourced an extension of the original autoresearch, pi-autoresearch. It builds on pi, a minimal open-source terminal coding harness, and follows a similar loop to the original autoresearch, with a few key steps:

Define the metric you want to improve, along with any constraints.
Measure the baseline.
Hypothesis testing: in each iteration, the agent proposes an idea, writes it down, and tests it. There are three possible outcomes: it doesn’t work (discard), it worsens the metric (discard), or it improves the target (keep it and iterate from there).
Repeat: the loop continues until you stop it, improvements plateau, or it reaches a predefined iteration limit.
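To make the loop concrete, here is a minimal Python sketch of the keep-or-discard logic in the steps above. It is an illustration only, not pi-autoresearch's actual code: the function `autoresearch_loop`, its `propose` and `evaluate` callbacks, and the `patience` stopping rule are hypothetical names, and the toy objective stands in for a real experiment (in practice the agent proposes the changes and a benchmark script produces the metric).

```python
from typing import Callable, List, Optional, Tuple, TypeVar

S = TypeVar("S")  # whatever represents a candidate solution


def autoresearch_loop(
    baseline: S,
    propose: Callable[[S, int], S],
    evaluate: Callable[[S], Optional[float]],
    max_iterations: int = 50,
    patience: int = 10,
) -> Tuple[S, float, List[str]]:
    """Keep-or-discard loop; a higher metric is better.

    `evaluate` returns None when a candidate fails (crash or broken constraint).
    """
    best = baseline
    best_score = evaluate(baseline)  # step 2: measure the baseline
    if best_score is None:
        raise ValueError("the baseline must produce a valid metric")

    log: List[str] = []
    stale = 0  # consecutive iterations without improvement
    for i in range(max_iterations):
        candidate = propose(best, i)  # step 3: propose an idea and test it
        score = evaluate(candidate)
        if score is None:
            log.append("discard: failed")
            stale += 1
        elif score <= best_score:
            log.append("discard: no improvement")
            stale += 1
        else:
            best, best_score = candidate, score  # keep and iterate from here
            log.append("keep")
            stale = 0
        if stale >= patience:  # step 4: stop once improvements plateau
            break
    return best, best_score, log


# Toy problem (an assumption for illustration): maximise -(x - 3)^2 with x <= 4.
def toy_evaluate(x: float) -> Optional[float]:
    if x > 4:  # constraint violated, treated as a failed run
        return None
    return -((x - 3) ** 2)


def toy_propose(x: float, i: int) -> float:
    steps = [2.0, -1.0, 0.5]  # fixed schedule standing in for the agent's ideas
    return x + steps[i % len(steps)]


best, score, log = autoresearch_loop(
    0.0, toy_propose, toy_evaluate, max_iterations=12, patience=3
)
print(best, score, log)
```

The sketch always proposes from the best solution found so far, which mirrors the "keep it and iterate from there" rule: a discarded idea never becomes the starting point for the next one.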

So the core idea is to define a clear objective and let the agent try bold ideas and learn from them. This approach can uncover potential improvements to your KPIs by testing ideas your team simply never had the time to explore. It definitely sounds interesting, so let’s try it out.

Task


I wanted to test this approach on an analytical task, since day-to-day analytical work often comes with clear objectives and requires several iterations to reach an optimal solution. So, I went through all the posts I’ve written for Towards Data Science over the years and found a task about optimising marketing campaigns, which we discussed in the article “Linear Optimisations in Product Analytics”.

The task is quite common. Imagine you work as a marketing analyst and need to plan marketing activities for the next month. Your goal is to maximise revenue within a limited marketing budget ($30M). You have a set of potential marketing campaigns, along with projections for each of them. For each campaign, we know the country and marketing channel, marketing_spending (the investment required for this activity), and revenue (the expected revenue from acquired customers over the next 12 months, which is our target metric). We also have some additional information, such as the number of acquired users and the number of customer support contacts. We will use these to iterate on the initial task and make it progressively more challenging by adding extra constraints.
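The basic version of the task can be written down compactly. Assuming each campaign is either funded in full or skipped (an assumption made here for illustration, as is the notation), let r_i be the expected revenue of campaign i, c_i its marketing_spending, x_i a 0/1 selection variable, and B the budget:

```latex
\max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} r_i x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} c_i x_i \le B, \qquad B = 30{,}000{,}000
```

This is the shape of a 0/1 knapsack problem. Any extra constraint of the "total across selected campaigns must stay within a limit" kind would add one more inequality of the same form.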

Image by author

It is useful to give the agent a baseline approach so it has something to start from. So, let’s put it together. One simple solution for this optimisation is to focus on the top-performing segments by revenue per dollar spent. We can sort all campaigns by this metric and select the ones that fit within the budget. Of course, this approach is quite naive and can definitely be improved, but it provides a good starting point.

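import pandas as pd

BUDGET = 30_000_000  # total marketing budget in USD

# Load the campaign projections (tab-separated, despite the .csv extension)
df = pd.read_csv('marketing_campaign_estimations.csv', sep='\t')

# --- Baseline: greedy by revenue-per-dollar ---
# Rank campaigns by expected revenue per dollar of spend, best first
df['revenue_per_spend'] = df.revenue / df.marketing_spending
df = df.sort_values('revenue_per_spend', ascending=False)
# Keep the top-ranked campaigns whose running spend total stays within budget
df['spend_cumulative'] = df.marketing_spending.cumsum()
selected_df = df[df.spend_cumulative <= BUDGET]

total_spend = selected_df.marketing_spending.sum()
revenue_millions = selected_df.revenue.sum() / 1_000_000

# Sanity-check the budget constraint, then report the metric and a short summary
assert total_spend <= BUDGET, f"Budget violated: {total_spend}"
print(f"METRIC revenue_millions={revenue_millions:.4f}")
print(f"Segments={len(selected_df)} spend={total_spend/1e6:.2f}M")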

I put this code in optimise.py in the repository. If we run the baseline, we see that the resulting revenue is 107.9M USD, while the total spend is 29.2M.

python3 optimise.py # METRIC r