搬山行者

无志愁压头,有志能搬山

业余程序员的学习笔记~


Elixir的数据框架库Explorer快速入门

目录

Elixir的数据框架库Explorer快速入门

简介

Explorer是Elixir的数据框架库。数据帧是数据分析中常用的数据结构。 它是一个由列和行组成的二维表,类似于SQL表或电子表格。

Explorer的目标是提供一个简单而强大的API来操作数据帧。

读写数据

读文件

可以从分隔文件(如CSV)、NDJSON、Parquet和Arrow IPC(feather)格式读取数据。 可以使用DataFrame.new/1从Map或Key List中加载数据。

CSV常见选项:

  • delimiter 用于分隔记录中字段的单个字符。(默认值:“,”)

  • dtypes [column_name:dtype]的关键字列表。 如果未为列指定类型,则从前1000行开始估算。(默认值:[])

  • header 文件的第一行是否有列名标题(默认值:true)
  • max_rows 要读取的最大行数。(默认值:nil)
  • nil_values 应被解释为nil值的字符串列表。(默认值:[])
  • skip_rows 文件开头要跳过的行数。(默认值:0)
  • skip_rows_after_head 在标题行之后要跳过的行数。(默认值:0)
  • columns 要保留的列名列表。如果存在,则只有这些列被读入数据帧。(默认值:nil)

内置示例数据集,可从Explorer.Datasets加载。

写文件

  • header 列名是否应写在文件的第一行(默认值:true)
  • delimiter 用于分隔记录中字段的单个字符。(默认值:“,”)
alias Explorer.DataFrame
alias Explorer.Series

df = Explorer.Datasets.fossil_fuels()

input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)

序列操作

数据类型:

  • :null
    Null

  • :binary
    二进制(字节序列)
  • :boolean
    布尔
  • :category
    内部表示为整数的字符串

  • :date
    Date类型,Elixir.Date的展开

  • 带毫秒/微秒/纳秒的DateTime类型,Elixir.NaiveDateTime的展开。

  • 带毫秒/微秒/纳秒时间间隔类型 指定precision的Elixir.NaiveDateTime,展开到Explorer.Duration

  • 32位或64位浮点数

  • 8位、16位、32位或64位有符号整数

  • 8位、16位、32位或64位无符号整数

  • :string
    UTF-8编码二进制

  • :time
    Time类型,Elixir.Time的展开

  • :list
    类似Elixir.List的嵌套数据类型。

  • :struct
    类似Elixir.Map的嵌套数据类型。 具有相同的keys和匹配的数据类型。
s1 = Series.from_list([1, 2, 3])

#Explorer.Series<
  Polars[3]
  s64 [1, 2, 3]
>

s2 = Series.from_list(["a", "b", "c"])

#Explorer.Series<
  Polars[3]
  string ["a", "b", "c"]
>

s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])

Series.dtype(s3)
# :date

Series.size(s3)
# 2

1..100 |> Enum.to_list() |> Series.from_list()

#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
   26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
   50, ...]
>

s = Series.from_list([1.0, 2.0, nil, nil, 5.0])

#Explorer.Series<
  Polars[5]
  f64 [1.0, 2.0, nil, nil, 5.0]
>

填写缺失值:

  • :forward 将nil替换为前值
  • :backward 将nil替换为后值
  • :max 将nil替换为最大值
  • :min 将nil替换为最小值
  • :mean 将nil替换为平均值
Series.fill_missing(s, :forward)
#Explorer.Series<
  Polars[2]
  f64 [1.0, 2.0]
>

# 在所有其他情况下,Series必须都是相同的数据类型,否则您将收到ArgumentError。

Series.from_list([1, 2, 3, "a"])

# the value "a" does not match the inferred dtype {:s, 64}

切片切块


s = 1..10 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[10]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>
s[1]
# 2
s[-1]
# 10
s[0..4]
#Explorer.Series<
  Polars[3]
  s64 [1, 5, 5]
>
s[[0, 4, 4]]
#Explorer.Series<
  Polars[3]
  s64 [1, 5, 5]
>
Series.to_list(s)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

比较

s = 1..11 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[11]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>
s1 = 11..1//-1 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[11]
  s64 [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>
Series.equal(s, s1)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, false, false, true, false, false, false, false,
   false]
>
Series.equal(s, 5)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, false, true, false, false, false, false, false,
   false]
>
Series.not_equal(s, 10)
#Explorer.Series<
  Polars[11]
  boolean [true, true, true, true, true, true, true, true, true, false, true]
>
Series.greater_equal(s, 4)
#Explorer.Series<
  Polars[11]
  boolean [false, false, false, true, true, true, true, true, true, true, true]
>

计算支持

Series.add(s, s1)
#Explorer.Series<
  Polars[11]
  s64 [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
>
Series.subtract(s, 4)
#Explorer.Series<
  Polars[11]
  s64 [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
>
Series.multiply(s, s1)
#Explorer.Series<
  Polars[11]
  s64 [11, 20, 27, 32, 35, 36, 35, 32, 27, 20, 11]
>

兼容转换

s = Series.from_list([1, 2, 3])
#Explorer.Series<
  Polars[3]
  s64 [1, 2, 3]
>

s1 = Series.from_list([1.0, 2.0, 3.0])
#Explorer.Series<
  Polars[3]
  f64 [1.0, 2.0, 3.0]
>

Series.add(s, s1)
#Explorer.Series<
  Polars[3]
  f64 [2.0, 4.0, 6.0]
>

s2 = Series.from_list(["a", "b", "c"])
Series.add(s, s2)

** (ArgumentError) cannot invoke Explorer.Series.add/2 with mismatched dtypes: {:s, 64} and :string
    (explorer 0.10.0) lib/explorer/series.ex:6814: Explorer.Series.dtype_mismatch_error/3
    iex:23: (file)

翻转

s = Series.from_list([1, 2, 3, 4])
#Explorer.Series<
  Polars[4]
  s64 [1, 2, 3, 4]
>
Series.reverse(s)
#Explorer.Series<
  Polars[4]
  s64 [4, 3, 2, 1]
>

排序

 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>

s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
#Explorer.Series<
  Polars[100]
  s64 [56, 23, 37, 43, 61, 87, 81, 92, 91, 67, 18, 60, 9, 76, 65, 75, 94, 3, 90,
   21, 10, 99, 59, 30, 27, 19, 38, 22, 12, 86, 100, 96, 68, 50, 53, 25, 88, 34,
   6, 44, 95, 48, 71, 35, 62, 20, 97, 51, 40, 33, ...]
>
ids = Series.argsort(s) |> Series.to_list()
[70, 56, 17, 80, 66, 38, 73, 88, 12, 20, 57, 28, 65, 54, 67, 64, 86, 10, 25, 45,
 19, 27, 1, 83, 35, 79, 24, 98, 95, 23, 60, 78, 49, 37, 43, 91, 2, 26, 51, 48,
 52, 96, 3, 39, 97, 75, 99, 41, 76, 33, ...]

Series.slice(s, ids)
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>

累积

s = 1..100 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
  Polars[100]
  s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
   40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>
Series.cumulative_sum(s)
#Explorer.Series<
  Polars[100]
  s64 [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171,
   190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561,
   595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128,
   1176, 1225, 1275, ...]
>
Series.window_sum(s, 4)
#Explorer.Series<
  Polars[100]
  s64 [1, 3, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70,
   74, 78, 82, 86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138,
   142, 146, 150, 154, 158, 162, 166, 170, 174, 178, 182, 186, 190, 194, ...]
>

去重

s = Series.from_list(["a", "b", "b", "c", "c", "c"])
#Explorer.Series<
  Polars[6]
  string ["a", "b", "b", "c", "c", "c"]
>
Series.distinct(s)
#Explorer.Series<
  Polars[3]
  string ["a", "b", "c"]
>
Series.n_distinct(s)
3
Series.frequencies(s)
#Explorer.DataFrame<
  Polars[3 x 2]
  values string ["c", "b", "a"]
  counts u32 [3, 2, 1]
>

条件运算

s1 = Series.from_list(["It", "was", "the", "best", "of", "times"])
#Explorer.Series<
  Polars[6]
  string ["It", "was", "the", "best", "of", "times"]
>
s1 |> Series.equal("best") |> Series.select("worst", s1)
#Explorer.Series<
  Polars[6]
  string ["It", "was", "the", "worst", "of", "times"]
>

数据帧操作

构建

DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [1, 2, 3]
  b string ["a", "b", "c"]
>

df
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DataFrame.names(df)
["year", "country", "total", "solid_fuel", "liquid_fuel", "gas_fuel", "cement",
 "gas_flaring", "per_capita", "bunker_fuels"]

 DataFrame.dtypes(df)
%{
  "bunker_fuels" => {:s, 64},
  "cement" => {:s, 64},
  "country" => :string,
  "gas_flaring" => {:s, 64},
  "gas_fuel" => {:s, 64},
  "liquid_fuel" => {:s, 64},
  "per_capita" => {:f, 64},
  "solid_fuel" => {:s, 64},
  "total" => {:s, 64},
  "year" => {:s, 64}
}

DataFrame.shape(df)
{1094, 10}

DataFrame.head(df)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
  total s64 [2308, 1254, 32500, 141, 7924]
  solid_fuel s64 [627, 117, 332, 0, 0]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649]
  gas_fuel s64 [74, 7, 14565, 0, 374]
  cement s64 [5, 177, 2598, 0, 204]
  gas_flaring s64 [0, 0, 2623, 0, 3697]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
  bunker_fuels s64 [9, 7, 663, 0, 321]
>
 DataFrame.tail(df, 10)
#Explorer.DataFrame<
  Polars[10 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014, ...]
  country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
   "VANUATU", "VENEZUELA", ...]
  total s64 [1432855, 1840, 28692, 42, 50510, ...]
  solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
  liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
  gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
  cement s64 [11314, 112, 1000, 0, 1088, ...]
  gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
  per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
  bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>

选择


require DataFrame, as: DF
Explorer.DataFrame
DF.select(df, ["year", "country"])
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>

DF.select(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
  Polars[1094 x 3]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
>

DF.discard(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
  Polars[1094 x 7]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

过滤


DF.filter(df, country == "BRAZIL")
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2010, 2011, 2012, 2013, 2014]
  country string ["BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL"]
  total s64 [114468, 119829, 128178, 137354, 144480]
  solid_fuel s64 [15965, 17498, 17165, 18773, 20089]
  liquid_fuel s64 [74689, 78849, 84409, 88898, 92454]
  gas_fuel s64 [14372, 13778, 16328, 19399, 21297]
  cement s64 [8040, 8717, 9428, 9517, 9691]
  gas_flaring s64 [1402, 987, 848, 767, 949]
  per_capita f64 [0.58, 0.6, 0.63, 0.67, 0.7]
  bunker_fuels s64 [5101, 5516, 5168, 4895, 4895]
>

DF.filter(df, country == "ALGERIA" and year > 2012)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2013, 2014]
  country string ["ALGERIA", "ALGERIA"]
  total s64 [36669, 39651]
  solid_fuel s64 [198, 149]
  liquid_fuel s64 [14170, 14422]
  gas_fuel s64 [17863, 20151]
  cement s64 [2516, 2856]
  gas_flaring s64 [1922, 2073]
  per_capita f64 [0.96, 1.02]
  bunker_fuels s64 [687, 581]
>

DF.filter_with(df, fn ldf ->
  ldf["country"]
  |> Series.equal("ALGERIA")
  |> Series.and(Series.greater(ldf["year"], 2012))
end)

#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2013, 2014]
  country string ["ALGERIA", "ALGERIA"]
  total s64 [36669, 39651]
  solid_fuel s64 [198, 149]
  liquid_fuel s64 [14170, 14422]
  gas_fuel s64 [17863, 20151]
  cement s64 [2516, 2856]
  gas_flaring s64 [1922, 2073]
  per_capita f64 [0.96, 1.02]
  bunker_fuels s64 [687, 581]
>

修改


DF.mutate(df, new_column: solid_fuel + cement)
#Explorer.DataFrame<
  Polars[1094 x 11]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  new_column s64 [632, 294, 2930, 0, 204, ...]
>

DF.mutate(df,
  gas_fuel: Series.cast(gas_fuel, :float),
  gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
#Explorer.DataFrame<
  Polars[1094 x 11]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel f64 [74.0, 7.0, 14565.0, 0.0, 374.0, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  gas_and_liquid_fuel s64 [1675, 960, 26946, 141, 4023, ...]
>

DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [64, -3, 14555, -10, 364, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

排序


DF.sort_by(df, year)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DF.sort_by(df, asc: total, desc: year)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2013, 2012, 2011, 2011, ...]
  country string ["NIUE", "NIUE", "NIUE", "NIUE", "TUVALU", ...]
  total s64 [1, 2, 2, 2, 2, ...]
  solid_fuel s64 [0, 0, 0, 0, 0, ...]
  liquid_fuel s64 [1, 2, 2, 2, 2, ...]
  gas_fuel s64 [0, 0, 0, 0, 0, ...]
  cement s64 [0, 0, 0, 0, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [0.52, 1.04, 1.04, 1.04, 0.0, ...]
  bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
DF.sort_by(df, asc: window_sum(total, 2))
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2011, 2012, 2010, 2011, ...]
  country string ["FEDERATED STATES OF MICRONESIA",
   "FEDERATED STATES OF MICRONESIA", "FEDERATED STATES OF MICRONESIA", "TUVALU",
   "TUVALU", ...]
  total s64 [31, 33, 37, 2, 2, ...]
  solid_fuel s64 [0, 0, 0, 0, 0, ...]
  liquid_fuel s64 [31, 33, 37, 2, 2, ...]
  gas_fuel s64 [0, 0, 0, 0, 0, ...]
  cement s64 [0, 0, 0, 0, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [0.3, 0.32, 0.36, 0.0, 0.0, ...]
  bunker_fuels s64 [1, 1, 1, 0, 0, ...]
>

去重


DF.distinct(df, ["year", "country"])
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.distinct(df, ["country"], keep_all: true)
#Explorer.DataFrame<
  Polars[222 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

重命名


DF.rename(df, year: "year_test")
#Explorer.DataFrame<
  Polars[1094 x 10]
  year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.rename_with(df, &(&1 <> "_test"))
#Explorer.DataFrame<
  Polars[1094 x 10]
  year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
  country_test string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA",
   ...]
  total_test s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel_test s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel_test s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel_test s64 [74, 7, 14565, 0, 374, ...]
  cement_test s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring_test s64 [0, 0, 2623, 0, 3697, ...]
  per_capita_test f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels_test s64 [9, 7, 663, 0, 321, ...]
>

虚拟


DF.dummies(df, ["year"])
#Explorer.DataFrame<
  Polars[1094 x 5]
  year_2010 u8 [1, 1, 1, 1, 1, ...]
  year_2011 u8 [0, 0, 0, 0, 0, ...]
  year_2012 u8 [0, 0, 0, 0, 0, ...]
  year_2013 u8 [0, 0, 0, 0, 0, ...]
  year_2014 u8 [0, 0, 0, 0, 0, ...]
>

样本


DF.sample(df, 10)
#Explorer.DataFrame<
  Polars[10 x 10]
  year s64 [2014, 2010, 2012, 2014, 2013, ...]
  country string ["MALDIVES", "SOLOMON ISLANDS", "UZBEKISTAN",
   "SYRIAN ARAB REPUBLIC", "AUSTRIA", ...]
  total s64 [364, 54, 31583, 8373, 17040, ...]
  solid_fuel s64 [0, 0, 1493, 1, 3402, ...]
  liquid_fuel s64 [364, 54, 2352, 5198, 8632, ...]
  gas_fuel s64 [0, 0, 26813, 2480, 4410, ...]
  cement s64 [0, 0, 925, 517, 596, ...]
  gas_flaring s64 [0, 0, 0, 178, 0, ...]
  per_capita f64 [1.02, 0.1, 1.1, 0.45, 2.01, ...]
  bunker_fuels s64 [204, 4, 0, 219, 555, ...]
>
DF.sample(df, 0.4)
#Explorer.DataFrame<
  Polars[437 x 10]
  year s64 [2014, 2010, 2014, 2014, 2011, ...]
  country string ["REPUBLIC OF MOLDOVA", "MONTENEGRO",
   "UNITED STATES OF AMERICA", "NORWAY", "PANAMA", ...]
  total s64 [1345, 704, 1432855, 12988, 2754, ...]
  solid_fuel s64 [94, 443, 450047, 879, 223, ...]
  liquid_fuel s64 [595, 261, 576531, 8542, 2291, ...]
  gas_fuel s64 [478, 0, 390719, 3085, 0, ...]
  cement s64 [177, 0, 11314, 231, 240, ...]
  gas_flaring s64 [0, 0, 4244, 251, 0, ...]
  per_capita f64 [0.33, 1.13, 4.43, 2.52, 0.75, ...]
  bunker_fuels s64 [21, 2, 30722, 545, 3267, ...]
>

DF.sample(df, 10000)
** (ArgumentError) in order to sample more rows than are in the dataframe (1094), sampling `replace` must be true
    (explorer 0.10.0) lib/explorer/data_frame.ex:4364: Explorer.DataFrame.sample/3
    iex:69: (file)

DF.sample(df, 10000, replace: true)
#Explorer.DataFrame<
  Polars[10000 x 10]
  year s64 [2013, 2011, 2012, 2014, 2013, ...]
  country string ["NEW CALEDONIA", "MAURITANIA", "ETHIOPIA", "UZBEKISTAN",
   "CURACAO", ...]
  total s64 [1157, 653, 2335, 28692, 1422, ...]
  solid_fuel s64 [366, 0, 162, 1677, 0, ...]
  liquid_fuel s64 [775, 576, 1697, 2086, 1422, ...]
  gas_fuel s64 [0, 0, 0, 23929, 0, ...]
  cement s64 [16, 77, 476, 1000, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [4.51, 0.18, 0.03, 0.97, 9.21, ...]
  bunker_fuels s64 [11, 15, 272, 0, 1421, ...]
>

切片和切块


 df["year"]
#Explorer.Series<
  Polars[1094]
  s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>

 DF.pull(df, "year")
#Explorer.Series<
  Polars[1094]
  s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
   2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>

df[["year", "country"]]
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>

 DF.slice(df, [1, 20, 50])
#Explorer.DataFrame<
  Polars[3 x 10]
  year s64 [2010, 2010, 2010]
  country string ["ALBANIA", "BENIN",
   "DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE)"]
  total s64 [1254, 1388, 551]
  solid_fuel s64 [117, 0, 0]
  liquid_fuel s64 [953, 1211, 471]
  gas_fuel s64 [7, 0, 12]
  cement s64 [177, 177, 67]
  gas_flaring s64 [0, 0, 0]
  per_capita f64 [0.43, 0.15, 0.01]
  bunker_fuels s64 [7, 127, 126]
>
DF.slice(df, -10, 5)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014]
  country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
   "VANUATU", "VENEZUELA"]
  total s64 [1432855, 1840, 28692, 42, 50510]
  solid_fuel s64 [450047, 2, 1677, 0, 204]
  liquid_fuel s64 [576531, 1700, 2086, 42, 28445]
  gas_fuel s64 [390719, 25, 23929, 0, 12731]
  cement s64 [11314, 112, 1000, 0, 1088]
  gas_flaring s64 [4244, 0, 0, 0, 8042]
  per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65]
  bunker_fuels s64 [30722, 251, 0, 10, 1256]
>
 DF.slice(df, 12..42)
#Explorer.DataFrame<
  Polars[31 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AZERBAIJAN", "BAHAMAS", "BAHRAIN", "BANGLADESH", "BARBADOS",
   ...]
  total s64 [8366, 451, 7981, 16345, 403, ...]
  solid_fuel s64 [6, 1, 0, 839, 0, ...]
  liquid_fuel s64 [2373, 450, 1123, 2881, 363, ...]
  gas_fuel s64 [4904, 0, 6696, 10753, 8, ...]
  cement s64 [174, 0, 163, 1873, 31, ...]
  gas_flaring s64 [909, 0, 0, 0, 1, ...]
  per_capita f64 [0.92, 1.25, 6.33, 0.11, 1.44, ...]
  bunker_fuels s64 [398, 179, 545, 313, 108, ...]
>

转置


DF.pivot_longer(df, ["year", "country"], select: &String.ends_with?(&1, "fuel"))
** (ArgumentError) columns to pivot must include columns with the same dtype, but found multiple dtypes: :string and {:s, 64}
    (explorer 0.10.0) lib/explorer/data_frame.ex:4744: Explorer.DataFrame.pivot_longer/3
    iex:76: (file)

连接


df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])

DF.join(df1, df2)
DF.join(df1, df2)
#Explorer.DataFrame<
  Polars[1094 x 4]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
>

df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)

DF.join(df1, df3, how: :left)
#Explorer.DataFrame<
  Polars[109138 x 4]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN",
   "AFGHANISTAN", ...]
  total s64 [2308, 2308, 2308, 2308, 2308, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
>

分组


grouped = DF.group_by(df, ["country"])
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

DF.groups(grouped)
["country"]

grouped |> DF.summarise(max_per_capita: max(per_capita)) |> DF.sort_by(desc: max_per_capita)

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
   "NETHERLAND ANTILLES", ...]
  max_per_capita f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>


summarise使用的聚合操作

  • min/1 取组内的最小值。

  • max/1 取组内的最大值。

  • sum/1 取组内序列的总和。

  • mean/1 取组内系列的平均值。

  • median/1 取组内系列的中位数

  • first/1 取组内的第一个值。

  • last/1 取组内的最后一个值。

  • count/1 计算每组的行数。

  • n_unique/1 计算每组的唯一行数。

DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))
#Explorer.DataFrame<
  Polars[222 x 3]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  min_per_capita f64 [0.08, 0.43, 0.9, 1.63, 0.37, ...]
  min_total s64 [2308, 1254, 32500, 126, 7924, ...]
>

DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))
#Explorer.DataFrame<
  Polars[1094 x 12]
  Groups: ["country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  total_window_sum s64 [2308, 1254, 32500, 141, 7924, ...]
  rows_in_group u32 [5, 5, 5, 5, 5, ...]
>

grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.sort_by(desc: per_capita_max)

#Explorer.DataFrame<
  Polars[222 x 3]
  country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
   "NETHERLAND ANTILLES", ...]
  greater_than_9 boolean [true, true, true, false, false, ...]
  per_capita_max f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>