I Spent an Hour on a Data Preprocessing Task Before Asking Gemini
How Gemini solved my Pandas problem in seconds, and why data science fundamentals still matter for spotting suboptimal solutions.
Soner Yıldırım | Jun 23, 2026 | 7 min read
As data scientists, we spend a significant amount of time preparing data for downstream tasks, whether that means data cleaning, handling missing values, feature engineering, preprocessing, or post-processing.
I was working on a post-processing task where I needed to create a new column in a Pandas DataFrame by extracting values from an existing column, based on the data in two other columns. I could have asked an LLM to write the code directly (which I usually do), but this time I wanted to do it myself. It was early in the morning and my mind was fresh, so I was in the mood for some complex data operations.
Here is what I had to do. I had a DataFrame with predicted_categories, pred_category_id, and text_predicted_probs columns. The values in the predicted_categories column are lists of five categories in “category_id” – “category_description” format.
['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
The text_predicted_probs column has the predicted probabilities of these five categories in order.
[0.943, 0.018, 0.008, 0.006, 0.004]
Hence, the first value in text_predicted_probs is the probability of the first category in predicted_categories, and so on. The pred_category_id column holds the category id predicted by another model. What I need is the predicted probability of the category in the pred_category_id column. To get it, I need to find the position of pred_category_id in the predicted_categories list and then take the value at that position from text_predicted_probs.
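To make the goal concrete, here is a minimal sketch of the lookup on a single hand-written row, using plain Python lists. The ids and probabilities are taken from the example above; the variable names and the chosen pred_category_id are assumptions made only for illustration.

```python
# One row written out by hand (ids and probabilities from the example above)
category_ids = ["80814001", "13003106", "80805004", "13003144", "13003195"]
text_predicted_probs = [0.943, 0.018, 0.008, 0.006, 0.004]
pred_category_id = 80805004  # hypothetical prediction from the other model

# Find where the id sits among the five predicted categories
position = category_ids.index(str(pred_category_id))

# The probability at the same position is the value we are after
pred_category_prob = text_predicted_probs[position]
print(position, pred_category_prob)  # 2 0.008
```

The rest of the work is getting the real columns into this shape, since both lists are stored as strings in the dataset.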
If we ask Gemini, or another advanced model, we'll probably get the answer in seconds. But I wanted to do it on my own first and then ask Gemini. Let's start by reading the dataset into a Pandas DataFrame.
import pandas as pd
results = pd.read_csv("prediction_results.csv")
The values in the predicted_categories column are lists of strings with category ids and category names. Each value is a list that was saved as a string, so we first convert it back to a list object with the literal_eval function from Python's built-in ast module.
import ast
ast.literal_eval(results.loc[0, "predicted_categories"])
# output: ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
To extract the category ids, we can split each string in this list at the “-” character and keep the first part. Since the list holds five categories, we do this in a list comprehension as follows:
[category.split("-")[0].strip() for category in ast.literal_eval(results.loc[0, "predicted_categories"])]
# output: ['80814001', '13003106', '80805004', '13003144', '13003195']
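Note that split("-") cuts at every hyphen, which is harmless here because only the first piece is kept. As a sketch of an equivalent approach, str.partition stops at the first hyphen and makes that intent explicit. The helper name extract_category_id and the second sample string are hypothetical, not part of the original dataset.

```python
def extract_category_id(category: str) -> str:
    """Return the text before the first hyphen, without surrounding spaces."""
    category_id, _, _ = category.partition("-")
    return category_id.strip()


print(extract_category_id("80814001 - Freze Uçları"))  # 80814001
# Invented description that itself contains a hyphen
print(extract_category_id("13003106 - Freze - Uç"))  # 13003106
```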
We’ve done it for a single value (i.e. one row). In order to do the same operation to the entire predicted_categories column, we can use a list comprehension. It will be a list comprehension inside another list comprehension (i.e. nested list comprehension):
results.loc[:, "predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(predicted_categories)]
    for predicted_categories in results["predicted_categories"]
]
The next step is to find the position of a category in the list of predicted category ids. We will then use this position to extract the predicted probability of the category. A Python list has an index method, which returns the zero-based index (i.e. position) of an item in the list.
results.loc[0, "predicted_category_ids"].index("13003106")
# output: 1
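One detail of list.index is worth keeping in mind before applying it to every row: positions are zero-based, and an item that is not in the list raises a ValueError rather than returning a sentinel such as -1. A minimal sketch, where the id "99999999" is made up for illustration:

```python
category_ids = ["80814001", "13003106", "80805004", "13003144", "13003195"]

# Positions are zero-based, so the first id is at index 0
print(category_ids.index("80814001"))  # 0

# An id that is not in the list raises ValueError
try:
    category_ids.index("99999999")  # made-up id
except ValueError as error:
    print(f"not found: {error}")
```

This is why a membership check is needed whenever pred_category_id may fall outside the five predicted categories.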
Once I find the index of a predicted category id, I can use it to get the probability of this category id from the text_predicted_probs column. What we need to do:
- Get the index of pred_category_id in predicted_category_ids.
- Use this index to extract the relevant value from text_predicted_probs.
These steps can be done in a single operation by zipping these three columns. Let’s test it on the first row:
for i, j, k in zip(results["pred_category_id"][:1], results["predicted_category_ids"][:1], results["text_predicted_probs"][:1]):
    print(j.index(str(i)))  # get the index of pred_category_id in predicted_category_ids
    print(ast.literal_eval(k)[j.index(str(i))])  # get the value at this index in text_predicted_probs
# output: 0
# 0.943
We can convert the for loop in the previous code block into a list comprehension. The only addition is the check “if str(i) in j”, which handles rows where pred_category_id is not among the predicted category ids; those rows get a probability of 0.
results.loc[:, "pred_category_prob"] = [
    float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
    for i, j, k in zip(results["pred_category_id"], results["predicted_category_ids"], results["text_predicted_probs"])
]
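To check the whole pipeline without the original file, the steps can be run on a tiny stand-in DataFrame. This is only a sketch: the three rows are invented (with two categories per row instead of five), the list columns are stored as strings to mimic the CSV, and the last row uses an id that is not in its list to exercise the fallback to 0.

```python
import ast

import pandas as pd

# Invented stand-in for prediction_results.csv; list columns are stored as strings
results = pd.DataFrame(
    {
        "pred_category_id": [80814001, 13003144, 55555555],
        "predicted_categories": [
            "['80814001 - Freze Uçları', '13003106 - Freze']",
            "['13003106 - Freze', '13003144 - Torna Makinesi']",
            "['80814001 - Freze Uçları', '13003195 - Kumpas']",
        ],
        "text_predicted_probs": ["[0.943, 0.018]", "[0.7, 0.2]", "[0.6, 0.3]"],
    }
)

# Step 1: parse each stringified list and keep only the category ids
results["predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(categories)]
    for categories in results["predicted_categories"]
]

# Step 2: look up the probability at the position of pred_category_id (0 if absent)
results["pred_category_prob"] = [
    float(ast.literal_eval(probs)[ids.index(str(pred_id))]) if str(pred_id) in ids else 0
    for pred_id, ids, probs in zip(
        results["pred_category_id"],
        results["predicted_category_ids"],
        results["text_predicted_probs"],
    )
]

print(results["pred_category_prob"].tolist())  # [0.943, 0.2, 0.0]
```

The descriptive loop variable names (pred_id, ids, probs) are a readability choice in this sketch; the logic is the same as with i, j, k above.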