Elixir的数据框架库Explorer快速入门

2024-01-01 | 阅读：次

Elixir的数据框架库Explorer快速入门
- 简介
- 读写数据
  - 读文件
  - 写文件
- 序列操作
  - 数据类型：
  - 填写缺失值：
  - 切片切块
  - 比较
  - 计算支持
  - 兼容转换
  - 翻转
  - 排序
  - 累积
  - 去重
  - 条件运算
- 数据帧操作
  - 构建
  - 选择
  - 过滤
  - 修改
  - 排序
  - 去重
  - 重命名
  - 虚拟
  - 样本
  - 切片和切块
  - 转置
  - 连接
  - 分组
  - summarise使用的聚合操作

Elixir的数据框架库Explorer快速入门

简介

Explorer是Elixir的数据框架库。数据帧是数据分析中常用的数据结构。它是一个由列和行组成的二维表，类似于SQL表或电子表格。

Explorer的目标是提供一个简单而强大的API来操作数据帧。

读写数据

读文件

可以从分隔文件（如CSV）、NDJSON、Parquet和Arrow IPC（feather）格式读取数据。可以使用DataFrame.new/1从Map或Key List中加载数据。

CSV常见选项：

delimiter 用于分隔记录中字段的单个字符。（默认值：“，”）
dtypes [column_name:dtype]的关键字列表。如果未为列指定类型，则从前1000行开始估算。（默认值：[]）
header 文件的第一行是否有列名标题（默认值：true）
max_rows 要读取的最大行数。（默认值：nil）
nil_values 应被解释为nil值的字符串列表。（默认值：[]）
skip_rows 文件开头要跳过的行数。（默认值：0）
skip_rows_after_head 在标题行之后要跳过的行数。（默认值：0）
columns 要保留的列名列表。如果存在，则只有这些列被读入数据帧。（默认值：nil）

内置示例数据集，可从Explorer.Datasets加载。

写文件

header 列名是否应写在文件的第一行（默认值：true）
delimiter 用于分隔记录中字段的单个字符。（默认值：“，”）

alias Explorer.DataFrame
alias Explorer.Series

df = Explorer.Datasets.fossil_fuels()

input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)

序列操作

数据类型：

:null
Null
:binary
二进制（字节序列）
:boolean
布尔
:category
内部表示为整数的字符串
:date
Date类型，Elixir.Date的展开
带毫秒/微秒/纳秒的DateTime类型，Elixir.NaiveDateTime的展开。
带毫秒/微秒/纳秒时间间隔类型指定precision的Elixir.NaiveDateTime，展开到Explorer.Duration
32位或64位浮点数
8位、16位、32位或64位有符号整数
8位、16位、32位或64位无符号整数
:string
UTF-8编码二进制
:time
Time类型，Elixir.Time的展开
:list
类似Elixir.List的嵌套数据类型。
:struct
类似Elixir.Map的嵌套数据类型。具有相同的keys和匹配的数据类型。

s1 = Series.from_list([1, 2, 3])

#Explorer.Series<
  Polars[3]
  s64 [1, 2, 3]
>

s2 = Series.from_list(["a", "b", "c"])

#Explorer.Series<
  Polars[3]
  string ["a", "b", "c"]
>

s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])

Series.dtype(s3)
# :date

Series.size(s3)
# 2

1..100 |> Enum.to_list() |> Series.from_list()

#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
   26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
   50, ...]
>

s = Series.from_list([1.0, 2.0, nil, nil, 5.0])

#Explorer.Series<
  Polars[5]
  f64 [1.0, 2.0, nil, nil, 5.0]
>

填写缺失值：

:forward 将nil替换为前值
:backward 将nil替换为后值
:max 将nil替换为最大值
:min 将nil替换为最小值
:mean 将nil替换为平均值

Series.fill_missing(s, :forward)
#Explorer.Series<
  Polars[2]
  f64 [1.0, 2.0]
>

# 在所有其他情况下，Series必须都是相同的数据类型，否则您将收到ArgumentError。

Series.from_list([1, 2, 3, "a"])

# the value "a" does not match the inferred dtype {:s, 64}

切片切块

s = 1..10 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[10]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>
s[1]
# 2
s[-1]
# 10
s[0..4]
#Explorer.Series<
  Polars[3]
  s64 [1, 5, 5]
>
s[[0, 4, 4]]
#Explorer.Series<
  Polars[3]
  s64 [1, 5, 5]
>
Series.to_list(s)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

比较

s = 1..11 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[11]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>
s1 = 11..1//-1 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[11]
  s64 [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>
Series.equal(s, s1)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, false, false, true, false, false, false, false,
   false]
>
Series.equal(s, 5)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, false, true, false, false, false, false, false,
   false]
>
Series.not_equal(s, 10)
#Explorer.Series<
  Polars[11]
  boolean [true, true, true, true, true, true, true, true, true, false, true]
>
Series.greater_equal(s, 4)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, true, true, true, true, true, true, true, true]
>

计算支持

Series.add(s, s1)
#Explorer.Series<
  Polars[11]
  s64 [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
>
Series.subtract(s, 4)
#Explorer.Series<
  Polars[11]
  s64 [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
>
Series.multiply(s, s1)
#Explorer.Series<
  Polars[11]
  s64 [11, 20, 27, 32, 35, 36, 35, 32, 27, 20, 11]
>

兼容转换

s = Series.from_list([1, 2, 3])
#Explorer.Series<
  Polars[3]
  s64 [1, 2, 3]
>

s1 = Series.from_list([1.0, 2.0, 3.0])
#Explorer.Series<
  Polars[3]
  f64 [1.0, 2.0, 3.0]
>

Series.add(s, s1)
#Explorer.Series<
  Polars[3]
  f64 [2.0, 4.0, 6.0]
>

s2 = Series.from_list(["a", "b", "c"])
Series.add(s, s2)

** (ArgumentError) cannot invoke Explorer.Series.add/2 with mismatched dtypes: {:s, 64} and :string
    (explorer 0.10.0) lib/explorer/series.ex:6814: Explorer.Series.dtype_mismatch_error/3
    iex:23: (file)

翻转

s = Series.from_list([1, 2, 3, 4])
#Explorer.Series<
  Polars[4]
  s64 [1, 2, 3, 4]
>
Series.reverse(s)
#Explorer.Series<
  Polars[4]
  s64 [4, 3, 2, 1]
>

排序

 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>

s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
#Explorer.Series<
  Polars[100]
  s64 [56, 23, 37, 43, 61, 87, 81, 92, 91, 67, 18, 60, 9, 76, 65, 75, 94, 3, 90,
   21, 10, 99, 59, 30, 27, 19, 38, 22, 12, 86, 100, 96, 68, 50, 53, 25, 88, 34,
   6, 44, 95, 48, 71, 35, 62, 20, 97, 51, 40, 33, ...]
>
ids = Series.argsort(s) |> Series.to_list()
[70, 56, 17, 80, 66, 38, 73, 88, 12, 20, 57, 28, 65, 54, 67, 64, 86, 10, 25, 45,
 19, 27, 1, 83, 35, 79, 24, 98, 95, 23, 60, 78, 49, 37, 43, 91, 2, 26, 51, 48,
 52, 96, 3, 39, 97, 75, 99, 41, 76, 33, ...]

Series.slice(s, ids)
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>

累积

s = 1..100 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>
Series.cumulative_sum(s)
#Explorer.Series<
  Polars[100]
  s64 [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171,
   190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561,
   595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128,
   1176, 1225, 1275, ...]
>
Series.window_sum(s, 4)
#Explorer.Series<
  Polars[100]
  s64 [1, 3, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70,
   74, 78, 82, 86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138,
   142, 146, 150, 154, 158, 162, 166, 170, 174, 178, 182, 186, 190, 194, ...]
>

去重

s = Series.from_list(["a", "b", "b", "c", "c", "c"])
#Explorer.Series<
  Polars[6]
  string ["a", "b", "b", "c", "c", "c"]
>
Series.distinct(s)
#Explorer.Series<
  Polars[3]
  string ["a", "b", "c"]
>
Series.n_distinct(s)
3
Series.frequencies(s)
#Explorer.DataFrame<
  Polars[3 x 2]
  values string ["c", "b", "a"]
  counts u32 [3, 2, 1]
>

条件运算

s1 = Series.from_list(["It", "was", "the", "best", "of", "times"])
#Explorer.Series<
  Polars[6]
  string ["It", "was", "the", "best", "of", "times"]
>
s1 |> Series.equal("best") |> Series.select("worst", s1)
#Explorer.Series<
  Polars[6]
  string ["It", "was", "the", "worst", "of", "times"]
>

数据帧操作

构建

DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [1, 2, 3]
  b string ["a", "b", "c"]
>

df
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DataFrame.names(df)
["year", "country", "total", "solid_fuel", "liquid_fuel", "gas_fuel", "cement",
 "gas_flaring", "per_capita", "bunker_fuels"]

 DataFrame.dtypes(df)
%{
  "bunker_fuels" => {:s, 64},
  "cement" => {:s, 64},
  "country" => :string,
  "gas_flaring" => {:s, 64},
  "gas_fuel" => {:s, 64},
  "liquid_fuel" => {:s, 64},
  "per_capita" => {:f, 64},
  "solid_fuel" => {:s, 64},
  "total" => {:s, 64},
  "year" => {:s, 64}
}

DataFrame.shape(df)
{1094, 10}

DataFrame.head(df)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
  total s64 [2308, 1254, 32500, 141, 7924]
  solid_fuel s64 [627, 117, 332, 0, 0]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649]
  gas_fuel s64 [74, 7, 14565, 0, 374]
  cement s64 [5, 177, 2598, 0, 204]
  gas_flaring s64 [0, 0, 2623, 0, 3697]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
  bunker_fuels s64 [9, 7, 663, 0, 321]
>
 DataFrame.tail(df, 10)
#Explorer.DataFrame<
  Polars[10 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014, ...]
  country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
   "VANUATU", "VENEZUELA", ...]
  total s64 [1432855, 1840, 28692, 42, 50510, ...]
  solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
  liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
  gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
  cement s64 [11314, 112, 1000, 0, 1088, ...]
  gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
  per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
  bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>

选择

require DataFrame, as: DF
Explorer.DataFrame
DF.select(df, ["year", "country"])
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>

DF.select(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
  Polars[1094 x 3]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
>

DF.discard(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
  Polars[1094 x 7]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

过滤

DF.filter(df, country == "BRAZIL")
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2010, 2011, 2012, 2013, 2014]
  country string ["BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL"]
  total s64 [114468, 119829, 128178, 137354, 144480]
  solid_fuel s64 [15965, 17498, 17165, 18773, 20089]
  liquid_fuel s64 [74689, 78849, 84409, 88898, 92454]
  gas_fuel s64 [14372, 13778, 16328, 19399, 21297]
  cement s64 [8040, 8717, 9428, 9517, 9691]
  gas_flaring s64 [1402, 987, 848, 767, 949]
  per_capita f64 [0.58, 0.6, 0.63, 0.67, 0.7]
  bunker_fuels s64 [5101, 5516, 5168, 4895, 4895]
>

DF.filter(df, country == "ALGERIA" and year > 2012)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2013, 2014]
  country string ["ALGERIA", "ALGERIA"]
  total s64 [36669, 39651]
  solid_fuel s64 [198, 149]
  liquid_fuel s64 [14170, 14422]
  gas_fuel s64 [17863, 20151]
  cement s64 [2516, 2856]
  gas_flaring s64 [1922, 2073]
  per_capita f64 [0.96, 1.02]
  bunker_fuels s64 [687, 581]
>

DF.filter_with(df, fn ldf ->
  ldf["country"]
  |> Series.equal("ALGERIA")
  |> Series.and(Series.greater(ldf["year"], 2012))
end)

#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2013, 2014]
  country string ["ALGERIA", "ALGERIA"]
  total s64 [36669, 39651]
  solid_fuel s64 [198, 149]
  liquid_fuel s64 [14170, 14422]
  gas_fuel s64 [17863, 20151]
  cement s64 [2516, 2856]
  gas_flaring s64 [1922, 2073]
  per_capita f64 [0.96, 1.02]
  bunker_fuels s64 [687, 581]
>

修改

DF.mutate(df, new_column: solid_fuel + cement)
#Explorer.DataFrame<
  Polars[1094 x 11]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  new_column s64 [632, 294, 2930, 0, 204, ...]
>

DF.mutate(df,
  gas_fuel: Series.cast(gas_fuel, :float),
  gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
#Explorer.DataFrame<
  Polars[1094 x 11]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel f64 [74.0, 7.0, 14565.0, 0.0, 374.0, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  gas_and_liquid_fuel s64 [1675, 960, 26946, 141, 4023, ...]
>

DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [64, -3, 14555, -10, 364, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

排序

DF.sort_by(df, year)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DF.sort_by(df, asc: total, desc: year)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2013, 2012, 2011, 2011, ...]
  country string ["NIUE", "NIUE", "NIUE", "NIUE", "TUVALU", ...]
  total s64 [1, 2, 2, 2, 2, ...]
  solid_fuel s64 [0, 0, 0, 0, 0, ...]
  liquid_fuel s64 [1, 2, 2, 2, 2, ...]
  gas_fuel s64 [0, 0, 0, 0, 0, ...]
  cement s64 [0, 0, 0, 0, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [0.52, 1.04, 1.04, 1.04, 0.0, ...]
  bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
DF.sort_by(df, asc: window_sum(total, 2))
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2011, 2012, 2010, 2011, ...]
  country string ["FEDERATED STATES OF MICRONESIA",
   "FEDERATED STATES OF MICRONESIA", "FEDERATED STATES OF MICRONESIA", "TUVALU",
   "TUVALU", ...]
  total s64 [31, 33, 37, 2, 2, ...]
  solid_fuel s64 [0, 0, 0, 0, 0, ...]
  liquid_fuel s64 [31, 33, 37, 2, 2, ...]
  gas_fuel s64 [0, 0, 0, 0, 0, ...]
  cement s64 [0, 0, 0, 0, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [0.3, 0.32, 0.36, 0.0, 0.0, ...]
  bunker_fuels s64 [1, 1, 1, 0, 0, ...]
>

去重

DF.distinct(df, ["year", "country"])
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.distinct(df, ["country"], keep_all: true)
#Explorer.DataFrame<
  Polars[222 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

重命名

DF.rename(df, year: "year_test")
#Explorer.DataFrame<
  Polars[1094 x 10]
  year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.rename_with(df, &(&1 <> "_test"))
#Explorer.DataFrame<
  Polars[1094 x 10]
  year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
  country_test string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA",
   ...]
  total_test s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel_test s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel_test s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel_test s64 [74, 7, 14565, 0, 374, ...]
  cement_test s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring_test s64 [0, 0, 2623, 0, 3697, ...]
  per_capita_test f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels_test s64 [9, 7, 663, 0, 321, ...]
>

虚拟

DF.dummies(df, ["year"])
#Explorer.DataFrame<
  Polars[1094 x 5]
  year_2010 u8 [1, 1, 1, 1, 1, ...]
  year_2011 u8 [0, 0, 0, 0, 0, ...]
  year_2012 u8 [0, 0, 0, 0, 0, ...]
  year_2013 u8 [0, 0, 0, 0, 0, ...]
  year_2014 u8 [0, 0, 0, 0, 0, ...]
>

样本

DF.sample(df, 10)
#Explorer.DataFrame<
  Polars[10 x 10]
  year s64 [2014, 2010, 2012, 2014, 2013, ...]
  country string ["MALDIVES", "SOLOMON ISLANDS", "UZBEKISTAN",
   "SYRIAN ARAB REPUBLIC", "AUSTRIA", ...]
  total s64 [364, 54, 31583, 8373, 17040, ...]
  solid_fuel s64 [0, 0, 1493, 1, 3402, ...]
  liquid_fuel s64 [364, 54, 2352, 5198, 8632, ...]
  gas_fuel s64 [0, 0, 26813, 2480, 4410, ...]
  cement s64 [0, 0, 925, 517, 596, ...]
  gas_flaring s64 [0, 0, 0, 178, 0, ...]
  per_capita f64 [1.02, 0.1, 1.1, 0.45, 2.01, ...]
  bunker_fuels s64 [204, 4, 0, 219, 555, ...]
>
DF.sample(df, 0.4)
#Explorer.DataFrame<
  Polars[437 x 10]
  year s64 [2014, 2010, 2014, 2014, 2011, ...]
  country string ["REPUBLIC OF MOLDOVA", "MONTENEGRO",
   "UNITED STATES OF AMERICA", "NORWAY", "PANAMA", ...]
  total s64 [1345, 704, 1432855, 12988, 2754, ...]
  solid_fuel s64 [94, 443, 450047, 879, 223, ...]
  liquid_fuel s64 [595, 261, 576531, 8542, 2291, ...]
  gas_fuel s64 [478, 0, 390719, 3085, 0, ...]
  cement s64 [177, 0, 11314, 231, 240, ...]
  gas_flaring s64 [0, 0, 4244, 251, 0, ...]
  per_capita f64 [0.33, 1.13, 4.43, 2.52, 0.75, ...]
  bunker_fuels s64 [21, 2, 30722, 545, 3267, ...]
>

DF.sample(df, 10000)
** (ArgumentError) in order to sample more rows than are in the dataframe (1094), sampling `replace` must be true
    (explorer 0.10.0) lib/explorer/data_frame.ex:4364: Explorer.DataFrame.sample/3
    iex:69: (file)

DF.sample(df, 10000, replace: true)
#Explorer.DataFrame<
  Polars[10000 x 10]
  year s64 [2013, 2011, 2012, 2014, 2013, ...]
  country string ["NEW CALEDONIA", "MAURITANIA", "ETHIOPIA", "UZBEKISTAN",
   "CURACAO", ...]
  total s64 [1157, 653, 2335, 28692, 1422, ...]
  solid_fuel s64 [366, 0, 162, 1677, 0, ...]
  liquid_fuel s64 [775, 576, 1697, 2086, 1422, ...]
  gas_fuel s64 [0, 0, 0, 23929, 0, ...]
  cement s64 [16, 77, 476, 1000, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [4.51, 0.18, 0.03, 0.97, 9.21, ...]
  bunker_fuels s64 [11, 15, 272, 0, 1421, ...]
>

切片和切块

 df["year"]
#Explorer.Series<
  Polars[1094]
  s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>

 DF.pull(df, "year")
#Explorer.Series<
  Polars[1094]
  s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>

df[["year", "country"]]
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>

 DF.slice(df, [1, 20, 50])
#Explorer.DataFrame<
  Polars[3 x 10]
  year s64 [2010, 2010, 2010]
  country string ["ALBANIA", "BENIN",
   "DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE)"]
  total s64 [1254, 1388, 551]
  solid_fuel s64 [117, 0, 0]
  liquid_fuel s64 [953, 1211, 471]
  gas_fuel s64 [7, 0, 12]
  cement s64 [177, 177, 67]
  gas_flaring s64 [0, 0, 0]
  per_capita f64 [0.43, 0.15, 0.01]
  bunker_fuels s64 [7, 127, 126]
>
DF.slice(df, -10, 5)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014]
  country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
   "VANUATU", "VENEZUELA"]
  total s64 [1432855, 1840, 28692, 42, 50510]
  solid_fuel s64 [450047, 2, 1677, 0, 204]
  liquid_fuel s64 [576531, 1700, 2086, 42, 28445]
  gas_fuel s64 [390719, 25, 23929, 0, 12731]
  cement s64 [11314, 112, 1000, 0, 1088]
  gas_flaring s64 [4244, 0, 0, 0, 8042]
  per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65]
  bunker_fuels s64 [30722, 251, 0, 10, 1256]
>
 DF.slice(df, 12..42)
#Explorer.DataFrame<
  Polars[31 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AZERBAIJAN", "BAHAMAS", "BAHRAIN", "BANGLADESH", "BARBADOS",
   ...]
  total s64 [8366, 451, 7981, 16345, 403, ...]
  solid_fuel s64 [6, 1, 0, 839, 0, ...]
  liquid_fuel s64 [2373, 450, 1123, 2881, 363, ...]
  gas_fuel s64 [4904, 0, 6696, 10753, 8, ...]
  cement s64 [174, 0, 163, 1873, 31, ...]
  gas_flaring s64 [909, 0, 0, 0, 1, ...]
  per_capita f64 [0.92, 1.25, 6.33, 0.11, 1.44, ...]
  bunker_fuels s64 [398, 179, 545, 313, 108, ...]
>

转置

DF.pivot_longer(df, ["year", "country"], select: &String.ends_with?(&1, "fuel"))
** (ArgumentError) columns to pivot must include columns with the same dtype, but found multiple dtypes: :string and {:s, 64}
    (explorer 0.10.0) lib/explorer/data_frame.ex:4744: Explorer.DataFrame.pivot_longer/3
    iex:76: (file)

连接

df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])

DF.join(df1, df2)
DF.join(df1, df2)
#Explorer.DataFrame<
  Polars[1094 x 4]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
>

df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)

DF.join(df1, df3, how: :left)
#Explorer.DataFrame<
  Polars[109138 x 4]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN",
   "AFGHANISTAN", ...]
  total s64 [2308, 2308, 2308, 2308, 2308, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
>

分组

grouped = DF.group_by(df, ["country"])
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DF.groups(grouped)
["country"]

grouped |> DF.summarise(max_per_capita: max(per_capita)) |> DF.sort_by(desc: max_per_capita)

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
   "NETHERLAND ANTILLES", ...]
  max_per_capita f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>

summarise使用的聚合操作

min/1 取组内的最小值。
max/1 取组内的最大值。
sum/1 取组内序列的总和。
mean/1 取组内系列的平均值。
median/1 取组内系列的中位数
first/1 取组内的第一个值。
last/1 取组内的最后一个值。
count/1 计算每组的行数。
n_unique/1 计算每组的唯一行数。

DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))
#Explorer.DataFrame<
  Polars[222 x 3]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  min_per_capita f64 [0.08, 0.43, 0.9, 1.63, 0.37, ...]
  min_total s64 [2308, 1254, 32500, 126, 7924, ...]
>

DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))
#Explorer.DataFrame<
  Polars[1094 x 12]
  Groups: ["country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  total_window_sum s64 [2308, 1254, 32500, 141, 7924, ...]
  rows_in_group u32 [5, 5, 5, 5, 5, ...]
>

grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.sort_by(desc: per_capita_max)

#Explorer.DataFrame<
  Polars[222 x 3]
  country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
   "NETHERLAND ANTILLES", ...]
  greater_than_9 boolean [true, true, true, false, false, ...]
  per_capita_max f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>

banshan monkey

搬山行者

Elixir的数据框架库Explorer快速入门

目录

Elixir的数据框架库Explorer快速入门

简介

读写数据

读文件

写文件

序列操作

数据类型：

填写缺失值：

切片切块

比较

计算支持

兼容转换

翻转

排序

累积

去重

条件运算

数据帧操作

构建

选择

过滤

修改

排序

去重

重命名

虚拟

样本

切片和切块

转置

连接

分组

summarise使用的聚合操作