面试真题
常见题目
16.max_by处理缺失值

常见大数据面试SQL-max_by处理缺失值

一、题目

现有用户账户表,包含日期、用户id、用户余额,其中用户余额发生了缺失,需要进行补全。补全规则:如果余额为空则取之前最近不为空值进行填补。如果截止到最早日期都为空则补0;

样例数据

+-------------+----------+---------+
|   c_date    | user_id  | amount  |
+-------------+----------+---------+
| 2024-06-01  | 1        | NULL    |
| 2024-06-02  | 1        | 100     |
| 2024-06-03  | 1        | 80      |
| 2024-06-04  | 1        | NULL    |
| 2024-06-05  | 1        | 50      |
| 2024-06-06  | 1        | 30      |
| 2024-06-01  | 2        | 80      |
| 2024-06-02  | 2        | NULL    |
| 2024-06-03  | 2        | NULL    |
| 2024-06-04  | 2        | NULL    |
| 2024-06-05  | 2        | 50      |
| 2024-06-06  | 2        | 30      |
+-------------+----------+---------+

二、分析

本题类似之前的面试题目,之前题目用的较为常规解法,相对较为麻烦。今天换一种解法。使用max_by(x,y)函数进行处理。

维度评分
题目难度⭐️⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️

三、SQL

1.增加一列排序列

增加一列order_date的日期,如果amount有值则order_date为c_date,否则给一个较小的时间(1970-01-01)。

执行SQL

select c_date,
       user_id,
       amount,
       if(amount is not null, c_date, '1970-01-01') as order_date
from t16_user_amount

查询结果

+-------------+----------+---------+-------------+
|   c_date    | user_id  | amount  | order_date  |
+-------------+----------+---------+-------------+
| 2024-06-01  | 1        | NULL    | 1970-01-01  |
| 2024-06-02  | 1        | 100     | 2024-06-02  |
| 2024-06-03  | 1        | 80      | 2024-06-03  |
| 2024-06-04  | 1        | NULL    | 1970-01-01  |
| 2024-06-05  | 1        | 50      | 2024-06-05  |
| 2024-06-06  | 1        | 30      | 2024-06-06  |
| 2024-06-01  | 2        | 80      | 2024-06-01  |
| 2024-06-02  | 2        | NULL    | 1970-01-01  |
| 2024-06-03  | 2        | NULL    | 1970-01-01  |
| 2024-06-04  | 2        | NULL    | 1970-01-01  |
| 2024-06-05  | 2        | 50      | 2024-06-05  |
| 2024-06-06  | 2        | 30      | 2024-06-06  |
+-------------+----------+---------+-------------+

2.使用max_by()函数开窗得到填充值

max_by() 函数是spark3.0.0之后支持的函数,max_by(x,y) 根据 y 的最大值返回与之关联的 x 的值。

我们使用max_by函数开窗,按照user_id分组,按照c_date进行排序,注意是c_date取当前用户从开始行到当前行的前一行数据。 然后找到最大的order_date取出对应的amount值new_amount。该值即为填充值。

我在最后增加了order by user_id, c_date 排序,以方便查看排序结果

执行SQL

select c_date,
       user_id,
       amount,
       order_date,
       max_by(amount,order_date) over (partition by user_id order by c_date asc rows between unbounded preceding and 1 preceding) as new_amount
from (select c_date,
             user_id,
             amount,
             if(amount is not null, c_date, '1970-01-01') as order_date
      from t16_user_amount) t1
order by user_id, c_date

查询结果

+-------------+----------+---------+-------------+-------------+
|   c_date    | user_id  | amount  | order_date  | new_amount  |
+-------------+----------+---------+-------------+-------------+
| 2024-06-01  | 1        | NULL    | 1970-01-01  | NULL        |
| 2024-06-02  | 1        | 100     | 2024-06-02  | NULL        |
| 2024-06-03  | 1        | 80      | 2024-06-03  | 100         |
| 2024-06-04  | 1        | NULL    | 1970-01-01  | 80          |
| 2024-06-05  | 1        | 50      | 2024-06-05  | 80          |
| 2024-06-06  | 1        | 30      | 2024-06-06  | 50          |
| 2024-06-01  | 2        | 80      | 2024-06-01  | NULL        |
| 2024-06-02  | 2        | NULL    | 1970-01-01  | 80          |
| 2024-06-03  | 2        | NULL    | 1970-01-01  | 80          |
| 2024-06-04  | 2        | NULL    | 1970-01-01  | 80          |
| 2024-06-05  | 2        | 50      | 2024-06-05  | 80          |
| 2024-06-06  | 2        | 30      | 2024-06-06  | 50          |
+-------------+----------+---------+-------------+-------------+

3.使用填充值进行填充,得到最终结果

优先取自己的amount,如果amount为空则取new_amount进行填充,如果new_amount为空,则填充0。依旧为了方便对比查看结果,我保留原值amount 和结果值amount_result,amount_result为目标值。

执行SQL

select c_date,
       user_id,
       amount,
       coalesce(amount,
       max_by(amount,order_date) over (partition by user_id order by c_date asc rows between unbounded preceding and 1 preceding),
           0) as amount_result
from (select c_date,
             user_id,
             amount,
             if(amount is not null, c_date, '1970-01-01') as order_date
      from t16_user_amount) t1
order by user_id, c_date

查询结果

+-------------+----------+---------+----------------+
|   c_date    | user_id  | amount  | amount_result  |
+-------------+----------+---------+----------------+
| 2024-06-01  | 1        | NULL    | 0              |
| 2024-06-02  | 1        | 100     | 100            |
| 2024-06-03  | 1        | 80      | 80             |
| 2024-06-04  | 1        | NULL    | 80             |
| 2024-06-05  | 1        | 50      | 50             |
| 2024-06-06  | 1        | 30      | 30             |
| 2024-06-01  | 2        | 80      | 80             |
| 2024-06-02  | 2        | NULL    | 80             |
| 2024-06-03  | 2        | NULL    | 80             |
| 2024-06-04  | 2        | NULL    | 80             |
| 2024-06-05  | 2        | 50      | 50             |
| 2024-06-06  | 2        | 30      | 30             |
+-------------+----------+---------+----------------+

四、建表语句和数据插入

--建表语句
create table t16_user_amount
(
c_date string COMMENT '日期',
user_id bigint COMMENT '用户ID',
amount bigint COMMENT '用户'
) COMMENT '用户账户表';
-- 数据插入
insert into t16_user_amount(c_date,user_id,amount)
values
('2024-06-01',001,null),
('2024-06-02',001,100),
('2024-06-03',001,80),
('2024-06-04',001,null),
('2024-06-05',001,50),
('2024-06-06',001,30),
('2024-06-01',002,80),
('2024-06-02',002,null),
('2024-06-03',002,null),
('2024-06-04',002,null),
('2024-06-05',002,50),
('2024-06-06',002,30)

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab) 发表;