Data Visualization from the Comfort of your Terminal

This document is a showcase and guide to data visualization in the terminal using the xan command line tool. This aspect of the tool is often overlooked because xan is first and foremost a very performant tabular data processing utility, but it can also render a large variety of typical data visualizations directly in your terminal.

This ultimately means you never have to leave the terminal to explore the data you mangle. I say “comfort” and I mean it ;). xan will have processed and rendered your data in the terminal long before you are able to spin up your Jupyter instance and import pandas & matplotlib. No cruft. No distraction. Just raw insights, like it’s still 1970 and all you have is ASCII art, now with (true) ✨colors✨ and Unicode support (braille characters are a godsend).

Downloading the datasets used in this guide

You can download all datasets used throughout this guide as a single tarball:

curl -LO https://github.com/medialab/xan/raw/refs/heads/master/docs/cookbook/resources/dataviz.tar.gz
tar -xvzf dataviz.tar.gz

Here is the list of files you will find inside the tarball (~10MB):

  • clusters.csv: x and y positions of nodes in a graph containing 5 well-defined clusters, as inferred by the ForceAtlas2 layout algorithm.
  • iris.csv: the famous “Iris” dataset, used in a lot of machine learning examples.
  • layout.csv: x and y positions of a sample of accounts from a defunct French social network, as inferred by the ForceAtlas2 layout algorithm.
  • les-miserables.csv: edges from a graph of characters from the novel “Les Misérables” by Victor Hugo.
  • medias.csv: a curated corpus of French online media.
  • pulsar.csv: data from the pulsar plot from the article “Radio Observations of the Pulse Profiles and Dispersion Measures of Twelve Pulsars” by Harold D. Craft, Jr., 1970 (original data).
  • series.csv: time series from the RIAA about music distribution formats over time and their associated gross revenues.
  • sotu.csv: transcripts of U.S. State of the Union speeches across time (1790 to 2018) (original data).

xan view to display tables

xan view is usually one of the first commands people learn and one of the most used, since it lets you take a glance at your CSV files directly in the terminal, using a very familiar tabular representation. You can forego using LibreOffice or (god forbid!) Excel and never ever have to leave the terminal again!

Here is how to use it:

xan view series.csv

See how different data types are colored differently, like in a code editor, to help you figure things out? xan view knows how to recognize numbers, strings, time-related information, URLs, null values and booleans. If you fancy rainbows and are not much of a data type kind of person, you can also use the -R/--rainbow flag to use alternating colors per column instead:

xan view --rainbow series.csv

Fitting the screen

In series.csv, the data is quite concise, so it is easy to print all columns losslessly in the terminal. But see what happens when we use the command, in a small terminal, on sotu.csv, which contains URLs and the full text of whole speeches:

xan view sotu.csv

First, see how some values get truncated to fit on screen? Then the command tells you it could only display 3 out of 5 columns, which is why there is a dummy column in the middle full of ellipsis characters, lest we forget the hidden ones. When space is tight, the view command will always try to print a mix of columns from the beginning and from the end.

Then, see how the first cell of the transcript column contains a highlighted leading newline character? The view command highlights a lot of those patterns so you can easily spot irregularities in your data, such as empty cells (displayed as a greyed out <empty>), leading/trailing whitespace etc.

Finally, see how the last row is also a dummy one full of ellipsis characters? That’s because xan view, like most xan commands, follows a streaming approach and only displays the first rows of your data by default (my screenshots show only 10, but the command’s default is 100).

The command works this way because you usually don’t need to consume all rows of a file to be able to preview it efficiently and because, as a human, you won’t be able to read more than a few hundred rows by yourself anyway ;). What’s more, xan view is usually the last step of a complex xan pipeline yielding a stream. You should not need to consume it entirely to make sure it spits out the required data, which is the reason why you used xan view in the first place instead of piping the result to a file.
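
As a rough illustration of xan view sitting at the end of a pipeline, here is a minimal sketch. It builds a small throwaway CSV with made-up rows and column names (sample_sotu.csv is hypothetical and not part of the tarball), then assumes the `xan select` command is available to narrow the columns before previewing the stream. The guard simply skips the preview when xan is not installed.

```shell
# Throwaway CSV with made-up rows (hypothetical data, not from the guide's tarball).
printf 'year,president,transcript\n' > sample_sotu.csv
printf '1790,Washington,first speech\n' >> sample_sotu.csv
printf '1791,Washington,second speech\n' >> sample_sotu.csv

# Keep two columns, then preview the resulting stream in the terminal.
# Assumption: `xan select` exists in your xan version and xan view reads from stdin.
if command -v xan >/dev/null 2>&1; then
  xan select year,president sample_sotu.csv | xan view
fi
```

Nothing is written to disk by the pipeline itself; the preview is only there to check that the stream looks the way you expect.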

Printing more rows

If you want more or fewer rows on screen, you can always use the -l/--limit flag.
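
For instance, a minimal sketch of the flag in both its short and long forms, run against a small throwaway file (sample_rows.csv and its contents are made up for illustration; any CSV from the tarball would do just as well). The guard skips the xan calls when the tool is not installed.

```shell
# Throwaway CSV with 12 made-up rows (hypothetical data).
printf 'id,label\n' > sample_rows.csv
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  printf '%s,row-%s\n' "$i" "$i" >> sample_rows.csv
done

# Show only the first 5 rows instead of the default 100.
if command -v xan >/dev/null 2>&1; then
  xan view -l 5 sample_rows.csv        # short form
  xan view --limit 5 sample_rows.csv   # long form, same effect
fi
```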