Effective use-cases for LLMs

Written by Mark in Blog

There’s a lot of talk about the shortcomings of LLMs. They don’t actually reason. They’re expensive, especially when running in a loop. They’re slow. But there’s a narrow category of use cases where LLMs excel, and one of them is “sifting through the noise”. The noise is everything we have to process to get to what we actually want. Here are some use cases I haven’t heard much about, but that I’ve enjoyed as a software engineer.

Searching through Customer Conversations

A PM colleague uploaded the transcript of every call with our top customers into an embedding database. Now their product proposals are backed by real evidence, such as knowing that 40% of our top customers have mentioned the pain point in question. The PM also identified a list of eager private-beta customers to try out our new feature. This works especially well when the customer’s problem is abstract. Often these issues don’t have clear solutions, or the solutions don’t have clear names. That makes filing feature requests hard, and organizing and deduplicating them even harder. Before LLMs, your best bet was that someone on your team had enough tenure to have seen the issue come up many times, and remembered how to find all the links and connections. Now there’s RAG (retrieval-augmented generation).
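The retrieval half of this setup is easy to sketch. This is a minimal illustration, not the colleague’s actual pipeline: it assumes a toy bag-of-words vector in place of a real embedding model and an in-memory list in place of an embedding database. The transcript chunks, customer names, and the 0.3 similarity threshold are all hypothetical. In a full RAG setup, the retrieved chunks would then be handed to an LLM as context.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(transcripts: dict[str, list[str]]) -> list[tuple[str, str, Counter]]:
    # One entry per transcript chunk: (customer, chunk text, vector).
    return [(customer, chunk, embed(chunk))
            for customer, chunks in transcripts.items()
            for chunk in chunks]


def search(index, query: str, top_k: int = 3):
    """Retrieval step: the top_k (score, customer, chunk) triples most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), customer, chunk) for customer, chunk, vec in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


def customers_mentioning(index, query: str, threshold: float = 0.3):
    """Customers with at least one chunk above threshold, plus their share of all customers."""
    q = embed(query)
    everyone = {customer for customer, _, _ in index}
    hits = sorted({customer for customer, _, vec in index if cosine(q, vec) >= threshold})
    return hits, len(hits) / len(everyone)
```

Counting distinct customers rather than matching chunks is what turns search hits into a statistic like the share of top customers who raised a pain point, and the `hits` list doubles as a shortlist of beta candidates.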

Going from endpoint alert -> log analysis

“Any large system is going to be operating most of the time in failure mode.” (John Gall, via Lorin Hochstein, Netflix)

When I’m on-call, one of my responsibilities is to triage failures on the API endpoints our team owns. These failures are reported as a “high rate of HTTP 4XX/5XX”. Sometimes it’s noise, like a brief DB connection hiccup on one pod. Other times it signals a real bug, like customers suddenly being unable to delete something.

Triaging is tedious. The first step is searching for the canonical log lines that mark the specific endpoint with the specific HTTP failure, filtered by time. Once I find the request that triggered the alert, I search by its request ID to see the request from start to end. Based on the logs and the source code, I can usually guess what went wrong. Sometimes the stack trace points at compiled JavaScript rather than TypeScript, so the line numbers don’t line up and I have to guess based on the name of the next function call. Then I double-check that I’m looking at a representative request by quickly looking at two or three more request IDs to make sure they share the same root cause. For harder issues like DB connection timeouts, I check whether the canonical log lines cluster by timestamp, host machine, or customer ID. Maybe it’s not my route specifically, but an infra issue. All in all, there’s a lot to sift through and a lot of judgment required, and at that point I haven’t even found the problem, let alone thought about a solution.

Yet an agent harness is almost perfect for this. Given an alert and a timestamp, it points me in the right direction: logs, source code, or clustering. This has cut my triaging time from 15+ minutes to 1-2 minutes per issue. You don’t even need the SotA ($$$$) models. Save your money and use a faster model. I published this workflow as a skill for my teammates, with the intention of sharing the actual human skill involved. The output names all the queries it tried, categorized as informative or non-informative, with links to dig deeper. I don’t want it to be magical, because I want my teammates to learn how to think about triaging. I also want it to be a ramp to independent discovery.
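The manual loop described above can be sketched as code. This is a hypothetical illustration, not the skill the author published: it runs the four steps (find failing canonical lines near the alert, trace one request ID, spot-check a few more for a shared root cause, look for clustering) over an in-memory list of structured log lines. The `LogLine` schema, its field names, the 10-minute window, and the 80% clustering threshold are all assumptions; in practice each function would be a query against your log store, which is the part an agent harness issues for you.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogLine:
    # Hypothetical structured-log schema; field names are assumptions.
    ts: datetime
    request_id: str
    route: str
    status: int              # HTTP status; 0 on non-canonical lines in this sketch
    host: str
    customer_id: str
    message: str
    level: str               # "info" or "error"
    canonical: bool = False  # True for the single summary line logged per request


def failing_canonical_lines(logs, route, alert_time, status_class=5,
                            window=timedelta(minutes=10)):
    """Step 1: canonical lines for this route with a 4XX/5XX status near the alert."""
    return [line for line in logs
            if line.canonical and line.route == route
            and line.status // 100 == status_class
            and abs(line.ts - alert_time) <= window]


def trace_request(logs, request_id):
    """Step 2: every line for one request, from start to end."""
    return sorted((line for line in logs if line.request_id == request_id),
                  key=lambda line: line.ts)


def first_error(logs, request_id):
    """First error-level message of a request, used as a crude root-cause signature."""
    for line in trace_request(logs, request_id):
        if line.level == "error":
            return line.message
    return None


def same_root_cause(logs, request_ids):
    """Step 3: do a few sampled requests share the same first error?"""
    return len({first_error(logs, rid) for rid in request_ids}) == 1


def dominant_value(lines, field, min_share=0.8):
    """Step 4: does one host, customer, etc. account for most failures? None if not."""
    if not lines:
        return None
    value, count = Counter(getattr(line, field) for line in lines).most_common(1)[0]
    return value if count / len(lines) >= min_share else None
```

Returning `None` from `dominant_value` when nothing dominates fits the informative/non-informative split described above: a check that finds no clustering is still worth reporting, so a teammate knows it was tried.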

Shortening Content

I specifically didn’t call this summarizing, because: “ChatGPT doesn’t summarise. When I asked ChatGPT to summarise this text, it instead shortened the text.” (https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actually-does-nothing-of-the-kind/)

But despite all that, I still find incredible value in shortening texts! Sometimes I get recommended a podcast or video that’s over an hour long. Occasionally I’m hooked within the first 5 minutes, but for technical content, the part that interests me is often buried deep in the video, maybe 30 minutes in for recorded talks. I don’t want to spend that much time figuring out whether something is interesting to me, and LLMs help greatly with that. In my experience, if the shortened version has enough interesting content, the full version has plenty. One video casually mentioned east-coast vs. west-coast programming in the US; without the shortened version pointing me to it, I would have lost interest and stopped watching 19 seconds before it came up.

Transcribing

Okay, shortening is really useful to me, but how do I get it to work on videos and podcasts? I made myself a little automation that, given a link, checks: if there are subtitles, download them; if it’s a video, download the audio and transcribe it; if it’s audio, transcribe it. Once I’ve turned slow video or audio formats into text, I can shorten them!

I say all this with the caveat that maybe this is a coping skill for my ADHD. Attention is hard for me to maintain, especially for audio. I literally have a test result saying I’m in the bottom 1% for auditory focus, consistency, AND stamina. These are three separate skills, and I’m statistically awful at all of them. So maybe what matters most is the ability to transform audio into text, since I process text much better than audio (though I’m still only statistically average at it).
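The link-to-text decision above is a small dispatch. This is a minimal sketch, assuming the link has already been probed for its kind and for whether subtitles exist; `MediaInfo`, the step names, and the handler functions are hypothetical. Real handlers might wrap a downloader such as yt-dlp and a speech-to-text model such as Whisper, but the post doesn’t say which tools the automation uses.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    VIDEO = "video"
    AUDIO = "audio"


@dataclass
class MediaInfo:
    # Hypothetical metadata, assumed to come from probing the link beforehand.
    url: str
    kind: Kind
    has_subtitles: bool


def plan(info: MediaInfo) -> list[str]:
    """Ordered steps to turn a link into text that is ready for shortening."""
    if info.has_subtitles:
        return ["download_subtitles"]   # cheapest path: text already exists
    steps = []
    if info.kind is Kind.VIDEO:
        steps.append("extract_audio")   # only the audio track is needed
    steps.append("transcribe_audio")
    return steps


def to_text(info: MediaInfo, handlers) -> str:
    """Run each planned step; each handler takes the previous payload (URL, path, or text)."""
    payload = info.url
    for step in plan(info):
        payload = handlers[step](payload)
    return payload
```

Keeping the planning separate from the handlers means the branching logic can be checked without downloading anything.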

Filtering Recruiter Spam

I get one to four emails per day from “recruiters” or their agents. It’s quite annoying, and they arrive on weekends, overnight, and on holidays. So I set up an automation that reads my email every 15 minutes and classifies each message as recruiter or not. If it’s from a recruiter, the automation marks it as read and labels it. Something this small has significantly cut down on my email maintenance. Sometimes I go through the labeled emails and try to unsubscribe, but often they’re spam sent directly to my personal address rather than a true mailing-list subscription. In the past, I’d reply saying I’m not looking for work right now, but the sheer volume makes me feel like no human will ever actually read my reply.
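The recruiter filter is a simple polling loop. This is a sketch under stated assumptions: the post uses an LLM to classify each email, but here `is_recruiter` is a keyword placeholder so the example runs offline, and `fetch_unread` and `mark_recruiter` are hypothetical callables you would wire to your mail provider’s API (the latter marking the message as read and applying the label).

```python
import re
import time
from dataclasses import dataclass


@dataclass
class Email:
    id: str
    sender: str
    subject: str
    body: str


def is_recruiter(email: Email) -> bool:
    # Placeholder for the LLM classification step: a crude keyword match.
    # Swap in a call to your model that returns True for recruiter outreach.
    text = f"{email.subject} {email.body}".lower()
    return re.search(r"\b(opportunity|hiring|recruit\w*|role|position)\b", text) is not None


def poll_once(fetch_unread, mark_recruiter, classify=is_recruiter):
    """One pass: classify unread mail and hand recruiter emails to mark_recruiter."""
    hits = [email.id for email in fetch_unread() if classify(email)]
    for email_id in hits:
        mark_recruiter(email_id)  # hypothetical: mark as read and add a label
    return hits


def run_forever(fetch_unread, mark_recruiter, classify=is_recruiter, interval_s=15 * 60):
    """Poll every 15 minutes, as in the post."""
    while True:
        poll_once(fetch_unread, mark_recruiter, classify)
        time.sleep(interval_s)
```

Passing `classify` in as a parameter lets the keyword placeholder be replaced by an LLM call without touching the polling logic.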