常见大数据面试SQL-max_by处理缺失值
一、题目
现有用户账户表,包含日期、用户id、用户余额,其中用户余额发生了缺失,需要进行补全。补全规则:如果余额为空则取之前最近不为空值进行填补。如果截止到最早日期都为空则补0;
样例数据
+-------------+----------+---------+
| c_date | user_id | amount |
+-------------+----------+---------+
| 2024-06-01 | 1 | NULL |
| 2024-06-02 | 1 | 100 |
| 2024-06-03 | 1 | 80 |
| 2024-06-04 | 1 | NULL |
| 2024-06-05 | 1 | 50 |
| 2024-06-06 | 1 | 30 |
| 2024-06-01 | 2 | 80 |
| 2024-06-02 | 2 | NULL |
| 2024-06-03 | 2 | NULL |
| 2024-06-04 | 2 | NULL |
| 2024-06-05 | 2 | 50 |
| 2024-06-06 | 2 | 30 |
+-------------+----------+---------+
二、分析
本题类似之前的面试题目,之前题目用的较为常规解法,相对较为麻烦。今天换一种解法。使用max_by(x,y)函数进行处理。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.增加一列排序列
增加一列order_date的日期,如果amount有值则order_date为c_date,否则给一个较小的时间(1970-01-01)。
执行SQL
select c_date,
user_id,
amount,
if(amount is not null, c_date, '1970-01-01') as order_date
from t16_user_amount
查询结果
+-------------+----------+---------+-------------+
| c_date | user_id | amount | order_date |
+-------------+----------+---------+-------------+
| 2024-06-01 | 1 | NULL | 1970-01-01 |
| 2024-06-02 | 1 | 100 | 2024-06-02 |
| 2024-06-03 | 1 | 80 | 2024-06-03 |
| 2024-06-04 | 1 | NULL | 1970-01-01 |
| 2024-06-05 | 1 | 50 | 2024-06-05 |
| 2024-06-06 | 1 | 30 | 2024-06-06 |
| 2024-06-01 | 2 | 80 | 2024-06-01 |
| 2024-06-02 | 2 | NULL | 1970-01-01 |
| 2024-06-03 | 2 | NULL | 1970-01-01 |
| 2024-06-04 | 2 | NULL | 1970-01-01 |
| 2024-06-05 | 2 | 50 | 2024-06-05 |
| 2024-06-06 | 2 | 30 | 2024-06-06 |
+-------------+----------+---------+-------------+
2.使用max_by()函数开窗得到填充值
max_by() 函数是spark3.0.0之后支持的函数,max_by(x,y) 根据 y 的最大值返回与之关联的 x 的值。
我们使用max_by函数开窗,按照user_id分组,按照c_date进行排序,注意是c_date取当前用户从开始行到当前行的前一行数据。 然后找到最大的order_date取出对应的amount值new_amount。该值即为填充值。
我在最后增加了order by user_id, c_date 排序,以方便查看排序结果
执行SQL
select c_date,
user_id,
amount,
order_date,
max_by(amount,order_date) over (partition by user_id order by c_date asc rows between unbounded preceding and 1 preceding) as new_amount
from (select c_date,
user_id,
amount,
if(amount is not null, c_date, '1970-01-01') as order_date
from t16_user_amount) t1
order by user_id, c_date
查询结果
+-------------+----------+---------+-------------+-------------+
| c_date | user_id | amount | order_date | new_amount |
+-------------+----------+---------+-------------+-------------+
| 2024-06-01 | 1 | NULL | 1970-01-01 | NULL |
| 2024-06-02 | 1 | 100 | 2024-06-02 | NULL |
| 2024-06-03 | 1 | 80 | 2024-06-03 | 100 |
| 2024-06-04 | 1 | NULL | 1970-01-01 | 80 |
| 2024-06-05 | 1 | 50 | 2024-06-05 | 80 |
| 2024-06-06 | 1 | 30 | 2024-06-06 | 50 |
| 2024-06-01 | 2 | 80 | 2024-06-01 | NULL |
| 2024-06-02 | 2 | NULL | 1970-01-01 | 80 |
| 2024-06-03 | 2 | NULL | 1970-01-01 | 80 |
| 2024-06-04 | 2 | NULL | 1970-01-01 | 80 |
| 2024-06-05 | 2 | 50 | 2024-06-05 | 80 |
| 2024-06-06 | 2 | 30 | 2024-06-06 | 50 |
+-------------+----------+---------+-------------+-------------+
3.使用填充值进行填充,得到最终结果
优先取自己的amount,如果amount为空则取new_amount进行填充,如果new_amount为空,则填充0。依旧为了方便对比查看结果,我保留原值amount 和结果值amount_result,amount_result为目标值。
执行SQL
select c_date,
user_id,
amount,
coalesce(amount,
max_by(amount,order_date) over (partition by user_id order by c_date asc rows between unbounded preceding and 1 preceding),
0) as amount_result
from (select c_date,
user_id,
amount,
if(amount is not null, c_date, '1970-01-01') as order_date
from t16_user_amount) t1
order by user_id, c_date
查询结果
+-------------+----------+---------+----------------+
| c_date | user_id | amount | amount_result |
+-------------+----------+---------+----------------+
| 2024-06-01 | 1 | NULL | 0 |
| 2024-06-02 | 1 | 100 | 100 |
| 2024-06-03 | 1 | 80 | 80 |
| 2024-06-04 | 1 | NULL | 80 |
| 2024-06-05 | 1 | 50 | 50 |
| 2024-06-06 | 1 | 30 | 30 |
| 2024-06-01 | 2 | 80 | 80 |
| 2024-06-02 | 2 | NULL | 80 |
| 2024-06-03 | 2 | NULL | 80 |
| 2024-06-04 | 2 | NULL | 80 |
| 2024-06-05 | 2 | 50 | 50 |
| 2024-06-06 | 2 | 30 | 30 |
+-------------+----------+---------+----------------+
四、建表语句和数据插入
--建表语句
create table t16_user_amount
(
c_date string COMMENT '日期',
user_id bigint COMMENT '用户ID',
amount bigint COMMENT '用户'
) COMMENT '用户账户表';
-- 数据插入
insert into t16_user_amount(c_date,user_id,amount)
values
('2024-06-01',001,null),
('2024-06-02',001,100),
('2024-06-03',001,80),
('2024-06-04',001,null),
('2024-06-05',001,50),
('2024-06-06',001,30),
('2024-06-01',002,80),
('2024-06-02',002,null),
('2024-06-03',002,null),
('2024-06-04',002,null),
('2024-06-05',002,50),
('2024-06-06',002,30)
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab) 发表;