I analysed 20 years of my chats

Am I a Bad Friend?

I analysed 20 years of my chats and turned 1.2M messages into a structured vault of my life - to win friends and influence people. Instead, I learnt things about my emotional bandwidth, endearment cycles, and friendship half-lives.

27 May 2026 · MLX, data analysis, LLMs, 2nd brain

In 2014, Tim Urban of WaitButWhy published Your Life in Weeks - a grid where each square is one week of one’s life, and most of the grid is already filled. The image bothered me for years. I started tracking things partly because of it - I wanted the grid to mean something, not just count down. But the biometric data I collected is an odd representation of how fulfilling my life has been. The grid suggests it’s the events that matter - jobs, trips, schools, marriages - and those are easy to mark. But they hardly tell me how I felt during those weeks, or what I was like to the people around me. That was what I wanted to measure.

So I tried journaling. Paper first, then text files, then daily notes in Obsidian. The journal captured what I thought was important on the day I wrote it. It missed the conversations I forgot to jot down and the slow-moving patterns I couldn’t see at the time. My notes and their connections kept growing over the years. Tired of being bad at maintaining relationships and wanting the data to compensate, I set off on a quest to build a personal CRM of sorts, based on the record rather than on memory - thanks to the trail left by my prolific time-wasting on the Internet over the past few decades.

My digital history

My online presence breaks into roughly three eras:

ICQ, IRC, DC++ in the 2000s: midnight channels for script kiddies and banter - all gone, and probably for the best. The ten-year-old I was in those chats doesn’t need a structured archive.

VK, Twitter, Facebook in the 2010s: school, university, early career - evenly spread.

Instagram and Telegram in the 2010s-2020s: surprisingly, even though I don’t post much on Instagram, it’s often easier to catch up with people in DMs, and more and more people are swapping WhatsApp for Telegram too.

Armed with GDPR and data access laws, I got myself archives with all my messages, reactions, and social graphs.

Data archives

Parsing a bunch of JSONs and HTMLs wasn’t hard but wasn’t fun either. Instagram double-encodes Cyrillic through latin-1. Telegram assigns different internal message IDs between exports taken at different dates. Facebook introduced E2E encryption at some point, so the same messages show up in three different folders. Telegram lets you export group chats or just your own messages. VK exports everything without asking. Instagram doesn’t differentiate between broadcasts and personal chats at all. Once parsed into a uniform tab-separated format, the five exports produce different kinds of signal.
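
Two of these quirks are easy to illustrate. The sketch below is not my actual parser; it assumes the garbled Cyrillic is UTF-8 that was mis-decoded as latin-1, and that duplicates can be spotted by content rather than by platform IDs. The function names and the `platform`, `sender`, `timestamp`, `text` fields are hypothetical.

```python
import hashlib

def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as latin-1 (assumed failure mode)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was already clean, or damaged in some other way: leave it alone.
        return text

def message_key(msg: dict) -> str:
    """Content-based key, independent of any platform-assigned message ID."""
    raw = "\x1f".join(
        [msg["platform"], msg["sender"], str(msg["timestamp"]), msg["text"]]
    )
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def dedupe(messages: list) -> list:
    """Keep the first copy of each message, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

def to_tsv_row(msg: dict) -> str:
    """One message per line; tabs and newlines inside the text are flattened."""
    clean = msg["text"].replace("\t", " ").replace("\n", " ")
    return "\t".join([msg["platform"], msg["sender"], str(msg["timestamp"]), clean])
```

A content-based key also collapses genuine repeats sent by the same person within the same second, which seems an acceptable loss for this kind of analysis.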

Drowning in noise

Before worrying about classification, you have to deal with the fact that most of the data is noise. In my longest thread - 486,000+ messages with my partner across ten years - the content breaks down into 2.4% links, 9.1% media, 1.5% emoji-only messages, 28.4% short fillers, and 58.7% substantive text. That means about 41% is noise for the purpose of this exercise. Emojis, links, and media were easy to filter, but catching conversational filler - short words that look like content until you see them hundreds of times per month - is harder.
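
A rough sketch of that split, under stated assumptions: fillers are approximated as very short messages that repeat often, and the thresholds (`max_words`, `min_count`), the category names, and both function names are illustrative rather than the values I actually used.

```python
import re
from collections import Counter

URL_ONLY = re.compile(r"^https?://\S+$")
HAS_WORD_CHAR = re.compile(r"\w")

def find_fillers(texts, max_words=2, min_count=100):
    """Treat short messages that recur at least min_count times as filler."""
    counts = Counter(
        t.strip().lower()
        for t in texts
        if t.strip() and len(t.split()) <= max_words
    )
    return {t for t, n in counts.items() if n >= min_count}

def classify(text, has_media, fillers):
    """Assign one of: media, empty, link, emoji_only, filler, substantive."""
    stripped = text.strip()
    if not stripped:
        return "media" if has_media else "empty"
    if URL_ONLY.match(stripped):
        return "link"
    if not HAS_WORD_CHAR.search(stripped):
        # No letters or digits at all: emoji (or punctuation) only.
        return "emoji_only"
    if stripped.lower() in fillers:
        return "filler"
    return "substantive"
```

The frequency threshold is the part that needs tuning per thread: a word that is filler with one person can be meaningful with another.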

Across all platforms and years, the cleaned corpus contains roughly 52,000 unique lemmas. The novelty rate - the share of words I hadn’t used before in any chat - has been declining since 2008 and plateaued at 6% six years ago. Most of my vocabulary was locked in during my early 20s.
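
For concreteness, here is one way the novelty rate could be computed. It assumes lemmatisation has already happened upstream and that the rate is counted over distinct lemmas per year rather than over every token; `novelty_by_year` and its input shape are hypothetical.

```python
from typing import Dict, List

def novelty_by_year(lemmas_by_year: Dict[int, List[str]]) -> Dict[int, float]:
    """Share of each year's distinct lemmas never seen in any earlier year."""
    seen = set()
    rates = {}
    for year in sorted(lemmas_by_year):
        used = set(lemmas_by_year[year])
        if used:
            rates[year] = len(used - seen) / len(used)
        seen |= used
    return rates

rates = novelty_by_year({
    2008: ["hi", "cat"],
    2009: ["hi", "dog", "cat", "sun"],
    2010: ["hi", "dog"],
})
# rates == {2008: 1.0, 2009: 0.5, 2010: 0.0}
```

The first year always scores 1.0 by construction, so the curve only becomes meaningful a few years in.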

Which Sasha

Most people I interact with use more than one platform, and often don’t share usernames across them. If I were to maintain a profile for each known person, I’d need to map them (and mentions of them) across all chats. Cue diminutives and nicknames: the same Alexander might turn into Al, Alex, Xander, Sandy, and Alec(k). It can also be Sasha, if they’re from Eastern Europe - and in Slavic languages Sasha is gender-neutral.
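
A minimal sketch of the lookup this implies, assuming a hand-maintained diminutive table and a list of known people. The `DIMINUTIVES` table, the `Person` record, and both functions are illustrative; ambiguous mentions return several candidates, optionally narrowed by who is actually in the chat.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical, hand-maintained table: given name -> known short forms.
DIMINUTIVES = {
    "alexander": {"al", "alex", "xander", "sandy", "alec", "aleck", "sasha"},
    "alexandra": {"alex", "sandy", "sasha"},
}

@dataclass(frozen=True)
class Person:
    person_id: str
    given_name: str

def build_alias_index(people):
    """Map every name and diminutive to the set of people it could refer to."""
    index = defaultdict(set)
    for person in people:
        name = person.given_name.lower()
        index[name].add(person.person_id)
        for alias in DIMINUTIVES.get(name, ()):
            index[alias].add(person.person_id)
    return index

def resolve(mention, index, context_ids=None):
    """Return candidate person IDs, preferring those present in the chat."""
    matches = index.get(mention.strip().lower(), set())
    if context_ids is not None:
        narrowed = matches & set(context_ids)
        if narrowed:
            matches = narrowed
    return sorted(matches)

people = [Person("p1", "Alexander"), Person("p2", "Alexandra"), Person("p3", "Maria")]
index = build_alias_index(people)
```

With this index, `resolve("Sasha", index)` stays ambiguous between `p1` and `p2` until the chat context is passed in, which is roughly the point: the name alone is not enough.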
