常见大数据面试SQL-近30天连续登陆3天以上次数-非开窗
一、题目
有用户登录记录,已经按照日期去重。求近三十天,用户连续登录超过3天的次数,一直连续登录算一次,有间隔然后重新计算次数, 要求不能用开窗
样例数据
+----------+-------------+
| user_id | login_date |
+----------+-------------+
| 001 | 2024-07-03 |
| 001 | 2024-07-04 |
| 001 | 2024-07-05 |
| 001 | 2024-07-06 |
| 001 | 2024-07-12 |
| 001 | 2024-07-13 |
| 001 | 2024-07-17 |
| 001 | 2024-07-18 |
| 001 | 2024-07-19 |
| 001 | 2024-07-20 |
| 001 | 2024-07-21 |
+----------+-------------+
期望结果
+----------+-----------+
| user_id | count(1) |
+----------+-----------+
| 001 | 2 |
+----------+-----------+
二、分析
- 这里要求求近30天,由于样例数据固定,所以我们改为求'2024-08-01'的近30天,如果在实际应用中可以使用current_date函数,实际传入参数等方式。
- 题目属于连续问题,连续问题我们做过总结,解决方案参考一文搞懂连续问题 (opens in a new tab)
- 题目中明确要求不能使用开窗函数,实际工作中确实存在不支持开窗的场景。开窗函数是一个很有帮助的技巧,但是不使用开窗,直接sql常规处理,很锻炼数据思维能力。
数据给出了用户登陆的日期,求连续登陆天数,比较难统计,如果我们把未登陆的日期找到,未登录的日期之间的日期就是登陆日期,未登录日期的差值-1则是连续登陆的天数.
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️ |
三、SQL
1.找出未登录日期
题目中没有给出日期维表,这里可以假设有,也可以直接生成函数做一个,这里我们生成一下2024-07-01至2024-08-02的日期维表。然后使用维表和用户登陆记录表进行关联。日期能关联上的是登陆记录,关联不上的是未登陆记录,找到未登陆的记录。因为要计算日期差,我们把每个用户第31天前的数据记录为未登录。
执行SQL
with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
un_login as (select t1.user_id, t2.dim_date, t3.login_date
from (
--近30天登陆用户去重
select user_id
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id) t1
full outer join
(
-- 近30天日期维度数据
select dim_date
from t_dim_date
where dim_date > date_add('2024-08-01', -30)
and dim_date <= to_date('2024-08-01')) t2
left join
(
--近30天登陆记录
select user_id, login_date
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')) t3
on t1.user_id = t3.user_id
and t2.dim_date = t3.login_date
where t3.login_date is null
union all
--用户第31天前设置为未登录
select user_id,date_add('2024-08-01', -30) as dim_date,null
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id
)
select *
from un_login
查询结果
+----------+-------------+-------------+
| user_id | dim_date | login_date |
+----------+-------------+-------------+
| 001 | 2024-07-07 | NULL |
| 001 | 2024-07-08 | NULL |
| 001 | 2024-07-09 | NULL |
| 001 | 2024-07-10 | NULL |
| 001 | 2024-07-11 | NULL |
| 001 | 2024-07-14 | NULL |
| 001 | 2024-07-15 | NULL |
| 001 | 2024-07-16 | NULL |
| 001 | 2024-07-22 | NULL |
| 001 | 2024-07-23 | NULL |
| 001 | 2024-07-24 | NULL |
| 001 | 2024-07-25 | NULL |
| 001 | 2024-07-26 | NULL |
| 001 | 2024-07-27 | NULL |
| 001 | 2024-07-28 | NULL |
| 001 | 2024-07-29 | NULL |
| 001 | 2024-07-30 | NULL |
| 001 | 2024-07-31 | NULL |
| 001 | 2024-08-01 | NULL |
| 001 | 2024-07-02 | NULL |
+----------+-------------+-------------+
2.计算未登陆日期和前一未登陆日期
由于不能使用开窗函数,我们使用未登录记录表根据use_id自关联,限制t1的dim_date小于t2的dim_date,按照t2的dim_date分组,找到t1的最大日期--即用户t2的未登录日期的前一个未登陆日期。
执行SQL
with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
un_login as (select t1.user_id, t2.dim_date, t3.login_date
from (
--近30天登陆用户去重
select user_id
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id) t1
full outer join
(
-- 近30天日期维度数据
select dim_date
from t_dim_date
where dim_date > date_add('2024-08-01', -30)
and dim_date <= to_date('2024-08-01')) t2
left join
(
--近30天登陆记录
select user_id, login_date
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')) t3
on t1.user_id = t3.user_id
and t2.dim_date = t3.login_date
where t3.login_date is null
union all
--用户第31天前设置为未登录
select user_id, date_add('2024-08-01', -30) as dim_date, null
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id)
select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
from un_login t1
join un_login t2
on t1.user_id = t2.user_id
where t1.dim_date < t2.dim_date
group by t2.user_id, t2.dim_date
order by t2.user_id, t2.dim_date
SQL结果
+----------+---------------+--------------------+
| user_id | unlogin_date | last_unlogin_date |
+----------+---------------+--------------------+
| 001 | 2024-07-07 | 2024-07-02 |
| 001 | 2024-07-08 | 2024-07-07 |
| 001 | 2024-07-09 | 2024-07-08 |
| 001 | 2024-07-10 | 2024-07-09 |
| 001 | 2024-07-11 | 2024-07-10 |
| 001 | 2024-07-14 | 2024-07-11 |
| 001 | 2024-07-15 | 2024-07-14 |
| 001 | 2024-07-16 | 2024-07-15 |
| 001 | 2024-07-22 | 2024-07-16 |
| 001 | 2024-07-23 | 2024-07-22 |
| 001 | 2024-07-24 | 2024-07-23 |
| 001 | 2024-07-25 | 2024-07-24 |
| 001 | 2024-07-26 | 2024-07-25 |
| 001 | 2024-07-27 | 2024-07-26 |
| 001 | 2024-07-28 | 2024-07-27 |
| 001 | 2024-07-29 | 2024-07-28 |
| 001 | 2024-07-30 | 2024-07-29 |
| 001 | 2024-07-31 | 2024-07-30 |
| 001 | 2024-08-01 | 2024-07-31 |
+----------+---------------+--------------------+
3.计算日期差,得出连续登陆天数
因为连续两天的日期差为1,日期差-1则为连续登陆天数
执行SQL
with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
un_login as (select t1.user_id, t2.dim_date, t3.login_date
from (
--近30天登陆用户去重
select user_id
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id) t1
full outer join
(
-- 近30天日期维度数据
select dim_date
from t_dim_date
where dim_date > date_add('2024-08-01', -30)
and dim_date <= to_date('2024-08-01')) t2
left join
(
--近30天登陆记录
select user_id, login_date
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')) t3
on t1.user_id = t3.user_id
and t2.dim_date = t3.login_date
where t3.login_date is null
union all
--用户第31天前设置为未登录
select user_id, date_add('2024-08-01', -30) as dim_date, null
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id)
select user_id,unlogin_date,last_unlogin_date,datediff(unlogin_date,last_unlogin_date)-1 as con_days
from(
select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
from un_login t1
join un_login t2
on t1.user_id = t2.user_id
where t1.dim_date < t2.dim_date
group by t2.user_id, t2.dim_date) t
order by user_id,unlogin_date
SQL结果
+----------+---------------+--------------------+-----------+
| user_id | unlogin_date | last_unlogin_date | con_days |
+----------+---------------+--------------------+-----------+
| 001 | 2024-07-07 | 2024-07-02 | 4 |
| 001 | 2024-07-08 | 2024-07-07 | 0 |
| 001 | 2024-07-09 | 2024-07-08 | 0 |
| 001 | 2024-07-10 | 2024-07-09 | 0 |
| 001 | 2024-07-11 | 2024-07-10 | 0 |
| 001 | 2024-07-14 | 2024-07-11 | 2 |
| 001 | 2024-07-15 | 2024-07-14 | 0 |
| 001 | 2024-07-16 | 2024-07-15 | 0 |
| 001 | 2024-07-22 | 2024-07-16 | 5 |
| 001 | 2024-07-23 | 2024-07-22 | 0 |
| 001 | 2024-07-24 | 2024-07-23 | 0 |
| 001 | 2024-07-25 | 2024-07-24 | 0 |
| 001 | 2024-07-26 | 2024-07-25 | 0 |
| 001 | 2024-07-27 | 2024-07-26 | 0 |
| 001 | 2024-07-28 | 2024-07-27 | 0 |
| 001 | 2024-07-29 | 2024-07-28 | 0 |
| 001 | 2024-07-30 | 2024-07-29 | 0 |
| 001 | 2024-07-31 | 2024-07-30 | 0 |
| 001 | 2024-08-01 | 2024-07-31 | 0 |
+----------+---------------+--------------------+-----------+
可以看到用户的连续登陆天数数据
4.限制con_days >=3 得到最终结果
执行SQL
with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
un_login as (select t1.user_id, t2.dim_date, t3.login_date
from (
--近30天登陆用户去重
select user_id
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id) t1
full outer join
(
-- 近30天日期维度数据
select dim_date
from t_dim_date
where dim_date > date_add('2024-08-01', -30)
and dim_date <= to_date('2024-08-01')) t2
left join
(
--近30天登陆记录
select user_id, login_date
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')) t3
on t1.user_id = t3.user_id
and t2.dim_date = t3.login_date
where t3.login_date is null
union all
--用户第31天前设置为未登录
select user_id, date_add('2024-08-01', -30) as dim_date, null
from t18_use_login
where login_date > date_add('2024-08-01', -30)
and login_date <= to_date('2024-08-01')
group by user_id)
select user_id, count(1)
from (select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
from un_login t1
join un_login t2
on t1.user_id = t2.user_id
where t1.dim_date < t2.dim_date
group by t2.user_id, t2.dim_date) t
where datediff(unlogin_date, last_unlogin_date) - 1 >= 3
group by user_id
查询结果
+----------+-----------+
| user_id | count(1) |
+----------+-----------+
| 001 | 2 |
+----------+-----------+
四、建表语句和数据插入
--建表语句
CREATE TABLE IF NOT EXISTS t18_use_login
(
user_id string, -- 用户id
login_date string -- 登陆日期
)
COMMENT '用户登录记录表';
--插入数据
INSERT INTO t18_use_login VALUES
('001','2024-07-03'),
('001','2024-07-04'),
('001','2024-07-05'),
('001','2024-07-06'),
('001','2024-07-12'),
('001','2024-07-13'),
('001','2024-07-17'),
('001','2024-07-18'),
('001','2024-07-19'),
('001','2024-07-20'),
('001','2024-07-21');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab) 发表;