Pandas Isn’t Going Anywhere: Why It’s Still My Go-To for Data Wrangling
Billions of rows might be the exception, but for everything else, Pandas is still a highly reliable tool.
When I first started learning data science in 2020, Pandas was one of the most popular tools. Although newer tools target Pandas’ weaknesses in handling very large datasets, I still use Pandas for many data cleaning, processing, and analysis tasks. Yes, Pandas gives me a hard time when working with billions of rows, but it is more than enough for anything below that. I see Pandas used not only for EDA and in notebooks but also in production systems.
In this article, I’ll go over some data cleaning and processing operations to demonstrate how capable Pandas is. Let’s start with the dataset, which contains stock keeping units (SKUs) and the search API responses for these SKUs.
import pandas as pd
search_results = pd.read_csv("search_results.csv")
search_results.head()
Each search result is a list of dictionaries and looks like this:
search_results.loc[0, "search_result"]
# "[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, ..., {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}] ... and 5 entities remaining"
As the output shows, the value is not a valid list of dictionaries because of the trailing part (“... and 5 entities remaining”). It is also stored as a single string. To make better use of it, we need to convert it to a proper list of dictionaries. The following line of code removes the trailing part by splitting the string at “...” and keeping the first piece.
search_results.loc[0, "search_result"].split("...")[0].strip()
However, the output is still a single string. We can use the built-in ast module of Python to convert it to a list:
import ast
res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())
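To try these single-row steps without the CSV file, here is a minimal sketch on a hand-made string. The sample value and its IDs are made up for illustration and are not rows from the dataset; only the shape mirrors the output shown above.

```python
import ast

# Hypothetical sample that mimics the shape of one search_result value
raw = (
    "[{'my_id': 'SKU_A', 'distance': 1.0, 'entity': {}}, "
    "{'my_id': 'SKU_B', 'distance': 0.85, 'entity': {}}] "
    "... and 5 entities remaining"
)

# Keep everything before the first "..." and drop surrounding whitespace
cleaned = raw.split("...")[0].strip()

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = ast.literal_eval(cleaned)

print(type(parsed).__name__, len(parsed))
print(parsed[0]["my_id"])
```

Using literal_eval rather than eval is a reasonable default here, since the string comes from an external API response.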
We now have the search results as a proper list of dictionaries, but only for a single row. We need to apply the same operation to every SKU, that is, to the entire search_result column. One option is to go over all the rows in a for loop and repeat the operation. However, this is not the best option: we should prefer vectorized operations when we can. A vectorized operation basically means running the code on all rows at once. On a single row, I used splitting to get rid of the last part of the string, but it did not work as a vectorized operation. A more robust option seems to be a regex.
search_results.loc[:, 'search_result'] = search_results['search_result'].str.replace(r"\.\.\..*", "", regex=True).str.strip()
This code matches “...” and everything that comes after it and replaces the match with an empty string. In other words, it removes the “... and 5 entities remaining” part. Every row in the search_result column is now a string that can be parsed into a proper list of dictionaries.
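One plausible reason the plain split misbehaves in vectorized form, which the article itself does not confirm, is that Series.str.split treats a multi-character pattern as a regular expression by default, so "..." matches any three characters. The sketch below uses a made-up two-row Series (the values are assumptions) and assumes pandas 1.4 or later, where str.split accepts regex=False. It shows the regex replacement and a literal split producing the same result.

```python
import pandas as pd

# Hypothetical strings shaped like the search_result column
s = pd.Series([
    "[{'my_id': 'SKU_A', 'distance': 1.0}] ... and 5 entities remaining",
    "[{'my_id': 'SKU_B', 'distance': 0.9}] ... and 3 entities remaining",
])

# Regex approach from the article: drop the first "..." and everything after it
via_regex = s.str.replace(r"\.\.\..*", "", regex=True).str.strip()

# Literal split: regex=False makes "..." mean three dots, not "any three characters"
via_split = s.str.split("...", regex=False).str[0].str.strip()

print(via_regex.tolist())
print(via_regex.equals(via_split))
```

Either form works on this toy data; the regex version has the small advantage of not depending on the regex=False parameter being available.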
What I’m interested in is the SKUs returned in the search results. I’ll create a new column by extracting the SKUs from the dictionaries, which I can access using the “my_id” key. This operation has three parts:
- Convert the search result string to a list using the literal_eval function
- Extract the SKU from the my_id key of each dictionary
- Do this in a list comprehension to get the SKUs from all the dictionaries in the list
We can do all these operations by applying a lambda function to all rows as follows:
search_results.loc[:, "result_skus"] = \
search_results["search_result"].apply(lambda x: [item['my_id'] for item in ast.literal_eval(x)])
search_results.head()
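For readers without the original file, the same extraction can be sketched on a small hand-made frame. The column names follow the article, but the rows are invented and the helper name extract_skus is hypothetical; it simply replaces the lambda with a named function so the three parts are easier to read.

```python
import ast
import pandas as pd

# Hypothetical two-row frame; column names follow the article, values are made up
df = pd.DataFrame({
    "sku": ["SKU_1", "SKU_2"],
    "search_result": [
        "[{'my_id': 'SKU_A', 'distance': 1.0, 'entity': {}}, "
        "{'my_id': 'SKU_B', 'distance': 0.8, 'entity': {}}]",
        "[{'my_id': 'SKU_C', 'distance': 1.0, 'entity': {}}]",
    ],
})

def extract_skus(raw: str) -> list:
    """Parse one cleaned search_result string and return its my_id values."""
    return [item["my_id"] for item in ast.literal_eval(raw)]

# apply calls the helper once per row (a Python-level loop under the hood)
df["result_skus"] = df["search_result"].apply(extract_skus)

print(df[["sku", "result_skus"]])
```

A named function like this is also easier to extend later, for example to skip rows whose string fails to parse.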
Each row in the result_skus column contains a list of 10 SKUs. Let’s say I need these 10 SKUs in separate rows: for each row in the sku column, 10 rows will be created from the list in result_skus.