Elixir的数据框架库Explorer快速入门
简介
Explorer是Elixir的数据框架库。数据帧是数据分析中常用的数据结构。 它是一个由列和行组成的二维表,类似于SQL表或电子表格。
Explorer的目标是提供一个简单而强大的API来操作数据帧。
读写数据
读文件
可以从分隔文件(如CSV)、NDJSON、Parquet和Arrow IPC(feather)格式读取数据。 可以使用DataFrame.new/1从Map或Key List中加载数据。
CSV常见选项:
-
delimiter 用于分隔记录中字段的单个字符。(默认值:“,”)
-
dtypes [column_name:dtype]的关键字列表。 如果未为列指定类型,则从前1000行开始估算。(默认值:[])
- header 文件的第一行是否有列名标题(默认值:true)
- max_rows 要读取的最大行数。(默认值:nil)
- nil_values 应被解释为nil值的字符串列表。(默认值:[])
- skip_rows 文件开头要跳过的行数。(默认值:0)
- skip_rows_after_head 在标题行之后要跳过的行数。(默认值:0)
- columns 要保留的列名列表。如果存在,则只有这些列被读入数据帧。(默认值:nil)
内置示例数据集,可从Explorer.Datasets加载。
写文件
- header 列名是否应写在文件的第一行(默认值:true)
- delimiter 用于分隔记录中字段的单个字符。(默认值:“,”)
alias Explorer.DataFrame
alias Explorer.Series
df = Explorer.Datasets.fossil_fuels()
input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)
序列操作
数据类型:
-
:null
Null - :binary
二进制(字节序列) - :boolean
布尔 -
:category
内部表示为整数的字符串 -
:date
Date类型,Elixir.Date的展开 -
带毫秒/微秒/纳秒的DateTime类型,Elixir.NaiveDateTime的展开。
-
带毫秒/微秒/纳秒时间间隔类型 指定precision的Elixir.NaiveDateTime,展开到Explorer.Duration
-
32位或64位浮点数
-
8位、16位、32位或64位有符号整数
-
8位、16位、32位或64位无符号整数
-
:string
UTF-8编码二进制 -
:time
Time类型,Elixir.Time的展开 -
:list
类似Elixir.List的嵌套数据类型。 - :struct
类似Elixir.Map的嵌套数据类型。 具有相同的keys和匹配的数据类型。
s1 = Series.from_list([1, 2, 3])
#Explorer.Series<
Polars[3]
s64 [1, 2, 3]
>
s2 = Series.from_list(["a", "b", "c"])
#Explorer.Series<
Polars[3]
string ["a", "b", "c"]
>
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])
Series.dtype(s3)
# :date
Series.size(s3)
# 2
1..100 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, ...]
>
s = Series.from_list([1.0, 2.0, nil, nil, 5.0])
#Explorer.Series<
Polars[5]
f64 [1.0, 2.0, nil, nil, 5.0]
>
填写缺失值:
- :forward 将nil替换为前值
- :backward 将nil替换为后值
- :max 将nil替换为最大值
- :min 将nil替换为最小值
- :mean 将nil替换为平均值
Series.fill_missing(s, :forward)
#Explorer.Series<
Polars[2]
f64 [1.0, 2.0]
>
# 在所有其他情况下,Series必须都是相同的数据类型,否则您将收到ArgumentError。
Series.from_list([1, 2, 3, "a"])
# the value "a" does not match the inferred dtype {:s, 64}
切片切块
s = 1..10 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[10]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>
s[1]
# 2
s[-1]
# 10
s[0..4]
#Explorer.Series<
Polars[3]
s64 [1, 5, 5]
>
s[[0, 4, 4]]
#Explorer.Series<
Polars[3]
s64 [1, 5, 5]
>
Series.to_list(s)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
比较
s = 1..11 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[11]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>
s1 = 11..1//-1 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[11]
s64 [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>
Series.equal(s, s1)
#Explorer.Series<
Polars[11]
boolean [false, false, false, false, false, true, false, false, false, false,
false]
>
Series.equal(s, 5)
#Explorer.Series<
Polars[11]
boolean [false, false, false, false, true, false, false, false, false, false,
false]
>
Series.not_equal(s, 10)
#Explorer.Series<
Polars[11]
boolean [true, true, true, true, true, true, true, true, true, false, true]
>
Series.greater_equal(s, 4)
#Explorer.Series<
Polars[11]
boolean [false, false, false, true, true, true, true, true, true, true, true]
>
计算支持
Series.add(s, s1)
#Explorer.Series<
Polars[11]
s64 [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
>
Series.subtract(s, 4)
#Explorer.Series<
Polars[11]
s64 [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
>
Series.multiply(s, s1)
#Explorer.Series<
Polars[11]
s64 [11, 20, 27, 32, 35, 36, 35, 32, 27, 20, 11]
>
兼容转换
s = Series.from_list([1, 2, 3])
#Explorer.Series<
Polars[3]
s64 [1, 2, 3]
>
s1 = Series.from_list([1.0, 2.0, 3.0])
#Explorer.Series<
Polars[3]
f64 [1.0, 2.0, 3.0]
>
Series.add(s, s1)
#Explorer.Series<
Polars[3]
f64 [2.0, 4.0, 6.0]
>
s2 = Series.from_list(["a", "b", "c"])
Series.add(s, s2)
** (ArgumentError) cannot invoke Explorer.Series.add/2 with mismatched dtypes: {:s, 64} and :string
(explorer 0.10.0) lib/explorer/series.ex:6814: Explorer.Series.dtype_mismatch_error/3
iex:23: (file)
翻转
s = Series.from_list([1, 2, 3, 4])
#Explorer.Series<
Polars[4]
s64 [1, 2, 3, 4]
>
Series.reverse(s)
#Explorer.Series<
Polars[4]
s64 [4, 3, 2, 1]
>
排序
1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>
s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
#Explorer.Series<
Polars[100]
s64 [56, 23, 37, 43, 61, 87, 81, 92, 91, 67, 18, 60, 9, 76, 65, 75, 94, 3, 90,
21, 10, 99, 59, 30, 27, 19, 38, 22, 12, 86, 100, 96, 68, 50, 53, 25, 88, 34,
6, 44, 95, 48, 71, 35, 62, 20, 97, 51, 40, 33, ...]
>
ids = Series.argsort(s) |> Series.to_list()
[70, 56, 17, 80, 66, 38, 73, 88, 12, 20, 57, 28, 65, 54, 67, 64, 86, 10, 25, 45,
19, 27, 1, 83, 35, 79, 24, 98, 95, 23, 60, 78, 49, 37, 43, 91, 2, 26, 51, 48,
52, 96, 3, 39, 97, 75, 99, 41, 76, 33, ...]
Series.slice(s, ids)
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>
累积
s = 1..100 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...]
>
Series.cumulative_sum(s)
#Explorer.Series<
Polars[100]
s64 [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171,
190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561,
595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128,
1176, 1225, 1275, ...]
>
Series.window_sum(s, 4)
#Explorer.Series<
Polars[100]
s64 [1, 3, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70,
74, 78, 82, 86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138,
142, 146, 150, 154, 158, 162, 166, 170, 174, 178, 182, 186, 190, 194, ...]
>
去重
s = Series.from_list(["a", "b", "b", "c", "c", "c"])
#Explorer.Series<
Polars[6]
string ["a", "b", "b", "c", "c", "c"]
>
Series.distinct(s)
#Explorer.Series<
Polars[3]
string ["a", "b", "c"]
>
Series.n_distinct(s)
3
Series.frequencies(s)
#Explorer.DataFrame<
Polars[3 x 2]
values string ["c", "b", "a"]
counts u32 [3, 2, 1]
>
条件运算
s1 = Series.from_list(["It", "was", "the", "best", "of", "times"])
#Explorer.Series<
Polars[6]
string ["It", "was", "the", "best", "of", "times"]
>
s1 |> Series.equal("best") |> Series.select("worst", s1)
#Explorer.Series<
Polars[6]
string ["It", "was", "the", "worst", "of", "times"]
>
数据帧操作
构建
DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b string ["a", "b", "c"]
>
df
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DataFrame.names(df)
["year", "country", "total", "solid_fuel", "liquid_fuel", "gas_fuel", "cement",
"gas_flaring", "per_capita", "bunker_fuels"]
DataFrame.dtypes(df)
%{
"bunker_fuels" => {:s, 64},
"cement" => {:s, 64},
"country" => :string,
"gas_flaring" => {:s, 64},
"gas_fuel" => {:s, 64},
"liquid_fuel" => {:s, 64},
"per_capita" => {:f, 64},
"solid_fuel" => {:s, 64},
"total" => {:s, 64},
"year" => {:s, 64}
}
DataFrame.shape(df)
{1094, 10}
DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total s64 [2308, 1254, 32500, 141, 7924]
solid_fuel s64 [627, 117, 332, 0, 0]
liquid_fuel s64 [1601, 953, 12381, 141, 3649]
gas_fuel s64 [74, 7, 14565, 0, 374]
cement s64 [5, 177, 2598, 0, 204]
gas_flaring s64 [0, 0, 2623, 0, 3697]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels s64 [9, 7, 663, 0, 321]
>
DataFrame.tail(df, 10)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
"VANUATU", "VENEZUELA", ...]
total s64 [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
cement s64 [11314, 112, 1000, 0, 1088, ...]
gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>
选择
require DataFrame, as: DF
Explorer.DataFrame
DF.select(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.select(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[1094 x 3]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
>
DF.discard(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[1094 x 7]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
过滤
DF.filter(df, country == "BRAZIL")
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2011, 2012, 2013, 2014]
country string ["BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL"]
total s64 [114468, 119829, 128178, 137354, 144480]
solid_fuel s64 [15965, 17498, 17165, 18773, 20089]
liquid_fuel s64 [74689, 78849, 84409, 88898, 92454]
gas_fuel s64 [14372, 13778, 16328, 19399, 21297]
cement s64 [8040, 8717, 9428, 9517, 9691]
gas_flaring s64 [1402, 987, 848, 767, 949]
per_capita f64 [0.58, 0.6, 0.63, 0.67, 0.7]
bunker_fuels s64 [5101, 5516, 5168, 4895, 4895]
>
DF.filter(df, country == "ALGERIA" and year > 2012)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2013, 2014]
country string ["ALGERIA", "ALGERIA"]
total s64 [36669, 39651]
solid_fuel s64 [198, 149]
liquid_fuel s64 [14170, 14422]
gas_fuel s64 [17863, 20151]
cement s64 [2516, 2856]
gas_flaring s64 [1922, 2073]
per_capita f64 [0.96, 1.02]
bunker_fuels s64 [687, 581]
>
DF.filter_with(df, fn ldf ->
ldf["country"]
|> Series.equal("ALGERIA")
|> Series.and(Series.greater(ldf["year"], 2012))
end)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2013, 2014]
country string ["ALGERIA", "ALGERIA"]
total s64 [36669, 39651]
solid_fuel s64 [198, 149]
liquid_fuel s64 [14170, 14422]
gas_fuel s64 [17863, 20151]
cement s64 [2516, 2856]
gas_flaring s64 [1922, 2073]
per_capita f64 [0.96, 1.02]
bunker_fuels s64 [687, 581]
>
修改
DF.mutate(df, new_column: solid_fuel + cement)
#Explorer.DataFrame<
Polars[1094 x 11]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
new_column s64 [632, 294, 2930, 0, 204, ...]
>
DF.mutate(df,
gas_fuel: Series.cast(gas_fuel, :float),
gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
#Explorer.DataFrame<
Polars[1094 x 11]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel f64 [74.0, 7.0, 14565.0, 0.0, 374.0, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
gas_and_liquid_fuel s64 [1675, 960, 26946, 141, 4023, ...]
>
DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [64, -3, 14555, -10, 364, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
排序
DF.sort_by(df, year)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.sort_by(df, asc: total, desc: year)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2013, 2012, 2011, 2011, ...]
country string ["NIUE", "NIUE", "NIUE", "NIUE", "TUVALU", ...]
total s64 [1, 2, 2, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [1, 2, 2, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.52, 1.04, 1.04, 1.04, 0.0, ...]
bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
DF.sort_by(df, asc: window_sum(total, 2))
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2011, 2012, 2010, 2011, ...]
country string ["FEDERATED STATES OF MICRONESIA",
"FEDERATED STATES OF MICRONESIA", "FEDERATED STATES OF MICRONESIA", "TUVALU",
"TUVALU", ...]
total s64 [31, 33, 37, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [31, 33, 37, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.3, 0.32, 0.36, 0.0, 0.0, ...]
bunker_fuels s64 [1, 1, 1, 0, 0, ...]
>
去重
DF.distinct(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.distinct(df, ["country"], keep_all: true)
#Explorer.DataFrame<
Polars[222 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
重命名
DF.rename(df, year: "year_test")
#Explorer.DataFrame<
Polars[1094 x 10]
year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.rename_with(df, &(&1 <> "_test"))
#Explorer.DataFrame<
Polars[1094 x 10]
year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
country_test string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA",
...]
total_test s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel_test s64 [627, 117, 332, 0, 0, ...]
liquid_fuel_test s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel_test s64 [74, 7, 14565, 0, 374, ...]
cement_test s64 [5, 177, 2598, 0, 204, ...]
gas_flaring_test s64 [0, 0, 2623, 0, 3697, ...]
per_capita_test f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels_test s64 [9, 7, 663, 0, 321, ...]
>
虚拟
DF.dummies(df, ["year"])
#Explorer.DataFrame<
Polars[1094 x 5]
year_2010 u8 [1, 1, 1, 1, 1, ...]
year_2011 u8 [0, 0, 0, 0, 0, ...]
year_2012 u8 [0, 0, 0, 0, 0, ...]
year_2013 u8 [0, 0, 0, 0, 0, ...]
year_2014 u8 [0, 0, 0, 0, 0, ...]
>
样本
DF.sample(df, 10)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2014, 2010, 2012, 2014, 2013, ...]
country string ["MALDIVES", "SOLOMON ISLANDS", "UZBEKISTAN",
"SYRIAN ARAB REPUBLIC", "AUSTRIA", ...]
total s64 [364, 54, 31583, 8373, 17040, ...]
solid_fuel s64 [0, 0, 1493, 1, 3402, ...]
liquid_fuel s64 [364, 54, 2352, 5198, 8632, ...]
gas_fuel s64 [0, 0, 26813, 2480, 4410, ...]
cement s64 [0, 0, 925, 517, 596, ...]
gas_flaring s64 [0, 0, 0, 178, 0, ...]
per_capita f64 [1.02, 0.1, 1.1, 0.45, 2.01, ...]
bunker_fuels s64 [204, 4, 0, 219, 555, ...]
>
DF.sample(df, 0.4)
#Explorer.DataFrame<
Polars[437 x 10]
year s64 [2014, 2010, 2014, 2014, 2011, ...]
country string ["REPUBLIC OF MOLDOVA", "MONTENEGRO",
"UNITED STATES OF AMERICA", "NORWAY", "PANAMA", ...]
total s64 [1345, 704, 1432855, 12988, 2754, ...]
solid_fuel s64 [94, 443, 450047, 879, 223, ...]
liquid_fuel s64 [595, 261, 576531, 8542, 2291, ...]
gas_fuel s64 [478, 0, 390719, 3085, 0, ...]
cement s64 [177, 0, 11314, 231, 240, ...]
gas_flaring s64 [0, 0, 4244, 251, 0, ...]
per_capita f64 [0.33, 1.13, 4.43, 2.52, 0.75, ...]
bunker_fuels s64 [21, 2, 30722, 545, 3267, ...]
>
DF.sample(df, 10000)
** (ArgumentError) in order to sample more rows than are in the dataframe (1094), sampling `replace` must be true
(explorer 0.10.0) lib/explorer/data_frame.ex:4364: Explorer.DataFrame.sample/3
iex:69: (file)
DF.sample(df, 10000, replace: true)
#Explorer.DataFrame<
Polars[10000 x 10]
year s64 [2013, 2011, 2012, 2014, 2013, ...]
country string ["NEW CALEDONIA", "MAURITANIA", "ETHIOPIA", "UZBEKISTAN",
"CURACAO", ...]
total s64 [1157, 653, 2335, 28692, 1422, ...]
solid_fuel s64 [366, 0, 162, 1677, 0, ...]
liquid_fuel s64 [775, 576, 1697, 2086, 1422, ...]
gas_fuel s64 [0, 0, 0, 23929, 0, ...]
cement s64 [16, 77, 476, 1000, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [4.51, 0.18, 0.03, 0.97, 9.21, ...]
bunker_fuels s64 [11, 15, 272, 0, 1421, ...]
>
切片和切块
df["year"]
#Explorer.Series<
Polars[1094]
s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>
DF.pull(df, "year")
#Explorer.Series<
Polars[1094]
s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...]
>
df[["year", "country"]]
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.slice(df, [1, 20, 50])
#Explorer.DataFrame<
Polars[3 x 10]
year s64 [2010, 2010, 2010]
country string ["ALBANIA", "BENIN",
"DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE)"]
total s64 [1254, 1388, 551]
solid_fuel s64 [117, 0, 0]
liquid_fuel s64 [953, 1211, 471]
gas_fuel s64 [7, 0, 12]
cement s64 [177, 177, 67]
gas_flaring s64 [0, 0, 0]
per_capita f64 [0.43, 0.15, 0.01]
bunker_fuels s64 [7, 127, 126]
>
DF.slice(df, -10, 5)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2014, 2014, 2014, 2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN",
"VANUATU", "VENEZUELA"]
total s64 [1432855, 1840, 28692, 42, 50510]
solid_fuel s64 [450047, 2, 1677, 0, 204]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445]
gas_fuel s64 [390719, 25, 23929, 0, 12731]
cement s64 [11314, 112, 1000, 0, 1088]
gas_flaring s64 [4244, 0, 0, 0, 8042]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65]
bunker_fuels s64 [30722, 251, 0, 10, 1256]
>
DF.slice(df, 12..42)
#Explorer.DataFrame<
Polars[31 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AZERBAIJAN", "BAHAMAS", "BAHRAIN", "BANGLADESH", "BARBADOS",
...]
total s64 [8366, 451, 7981, 16345, 403, ...]
solid_fuel s64 [6, 1, 0, 839, 0, ...]
liquid_fuel s64 [2373, 450, 1123, 2881, 363, ...]
gas_fuel s64 [4904, 0, 6696, 10753, 8, ...]
cement s64 [174, 0, 163, 1873, 31, ...]
gas_flaring s64 [909, 0, 0, 0, 1, ...]
per_capita f64 [0.92, 1.25, 6.33, 0.11, 1.44, ...]
bunker_fuels s64 [398, 179, 545, 313, 108, ...]
>
转置
DF.pivot_longer(df, ["year", "country"], select: &String.ends_with?(&1, "fuel"))
** (ArgumentError) columns to pivot must include columns with the same dtype, but found multiple dtypes: :string and {:s, 64}
(explorer 0.10.0) lib/explorer/data_frame.ex:4744: Explorer.DataFrame.pivot_longer/3
iex:76: (file)
连接
df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])
DF.join(df1, df2)
DF.join(df1, df2)
#Explorer.DataFrame<
Polars[1094 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
>
df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)
DF.join(df1, df3, how: :left)
#Explorer.DataFrame<
Polars[109138 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN",
"AFGHANISTAN", ...]
total s64 [2308, 2308, 2308, 2308, 2308, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
>
分组
grouped = DF.group_by(df, ["country"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.groups(grouped)
["country"]
grouped |> DF.summarise(max_per_capita: max(per_capita)) |> DF.sort_by(desc: max_per_capita)
#Explorer.DataFrame<
Polars[222 x 2]
country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
"NETHERLAND ANTILLES", ...]
max_per_capita f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>
summarise使用的聚合操作
-
min/1 取组内的最小值。
-
max/1 取组内的最大值。
-
sum/1 取组内序列的总和。
-
mean/1 取组内系列的平均值。
-
median/1 取组内系列的中位数
-
first/1 取组内的第一个值。
-
last/1 取组内的最后一个值。
-
count/1 计算每组的行数。
-
n_unique/1 计算每组的唯一行数。
DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))
#Explorer.DataFrame<
Polars[222 x 3]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
min_per_capita f64 [0.08, 0.43, 0.9, 1.63, 0.37, ...]
min_total s64 [2308, 1254, 32500, 126, 7924, ...]
>
DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))
#Explorer.DataFrame<
Polars[1094 x 12]
Groups: ["country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
total_window_sum s64 [2308, 1254, 32500, 141, 7924, ...]
rows_in_group u32 [5, 5, 5, 5, 5, ...]
>
grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.sort_by(desc: per_capita_max)
#Explorer.DataFrame<
Polars[222 x 3]
country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT",
"NETHERLAND ANTILLES", ...]
greater_than_9 boolean [true, true, true, false, false, ...]
per_capita_max f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>