409_SQL窗口函数实战¶
一句话说明¶
窗口函数就是在不合并行的前提下,对每一行计算一个基于"窗口范围"的聚合值,像滑动的计算尺。
核心知识点¶
窗口函数语法结构¶
函数名() OVER (
PARTITION BY 分组列 -- 按哪列分组(可省略,不分组=全表)
ORDER BY 排序列 -- 窗口内如何排序
ROWS/RANGE BETWEEN ... -- 窗口范围(可省略)
)
常用窗口函数分类¶
| 类别 | 函数 | 说明 |
|---|---|---|
| 排名函数 | ROW_NUMBER() | 唯一序号,不并列 |
| 排名函数 | RANK() | 有并列,跳过名次 |
| 排名函数 | DENSE_RANK() | 有并列,不跳名次 |
| 位移函数 | LAG(col, n) | 取前n行的值 |
| 位移函数 | LEAD(col, n) | 取后n行的值 |
| 聚合函数 | SUM/AVG/MAX/MIN | 窗口内聚合 |
| 分布函数 | PERCENT_RANK() | 百分位排名 |
| 分布函数 | NTILE(n) | 分成n个桶 |
| 首末函数 | FIRST_VALUE() | 窗口第一行 |
| 首末函数 | LAST_VALUE() | 窗口最后一行 |
实战SQL示例¶
-- ========== 场景:基因表达分析数据库 ==========
-- 表结构:gene_expression(sample_id, gene_id, expression, condition, timepoint)
-- ========== 1. ROW_NUMBER:找每个样本中表达量最高的前3个基因 ==========
SELECT *
FROM (
SELECT
sample_id,
gene_id,
expression,
-- 在每个sample_id分组内,按expression降序排名
ROW_NUMBER() OVER (
PARTITION BY sample_id -- 每个样本单独排名
ORDER BY expression DESC -- 表达量从高到低
) AS rank_in_sample
FROM gene_expression
) ranked
WHERE rank_in_sample <= 3; -- 只保留每个样本排名前3的基因
-- 注意:ROW_NUMBER不会并列,表达量完全相同时随机排序
-- ========== 2. RANK vs DENSE_RANK 对比 ==========
SELECT
gene_id,
expression,
RANK() OVER (ORDER BY expression DESC) AS rank_with_gaps,
-- RANK: 1,2,2,4(并列第2后跳到4)
DENSE_RANK() OVER (ORDER BY expression DESC) AS rank_no_gaps
-- DENSE_RANK: 1,2,2,3(并列第2后是3)
FROM gene_expression
WHERE sample_id = 'S001';
-- ========== 3. LAG/LEAD:计算时序表达量变化 ==========
SELECT
sample_id,
gene_id,
timepoint,
expression,
-- 取同一样本同一基因的上一个时间点表达量
LAG(expression, 1) OVER (
PARTITION BY sample_id, gene_id -- 同一个样本的同一个基因
ORDER BY timepoint -- 按时间点排序
) AS prev_timepoint_expr,
-- 计算相比上一时间点的变化量
expression - LAG(expression, 1) OVER (
PARTITION BY sample_id, gene_id
ORDER BY timepoint
) AS expr_change,
-- 计算变化倍数(需防止除零)
CASE
WHEN LAG(expression, 1) OVER (
PARTITION BY sample_id, gene_id ORDER BY timepoint
) > 0
THEN expression / LAG(expression, 1) OVER (
PARTITION BY sample_id, gene_id ORDER BY timepoint
)
ELSE NULL
END AS fold_change
FROM gene_expression
ORDER BY sample_id, gene_id, timepoint;
-- ========== 4. 滑动窗口聚合:3个时间点滑动平均 ==========
SELECT
sample_id,
gene_id,
timepoint,
expression,
-- 计算当前时间点及前2个时间点的滑动平均(去噪)
AVG(expression) OVER (
PARTITION BY sample_id, gene_id
ORDER BY timepoint
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 向前2行到当前行
) AS moving_avg_3pt,
-- 计算累积最大值(当前时间点前的最高表达量)
MAX(expression) OVER (
PARTITION BY sample_id, gene_id
ORDER BY timepoint
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 从第一行到当前行
) AS cumulative_max
FROM gene_expression;
-- ========== 5. 百分位排名:找基因在全体样本中的相对位置 ==========
SELECT
gene_id,
sample_id,
expression,
-- PERCENT_RANK: 0到1之间,0=最低,1=最高
ROUND(PERCENT_RANK() OVER (
PARTITION BY gene_id -- 在同一基因的所有样本中排名
ORDER BY expression
) * 100, 1) AS percentile_rank,
-- NTILE(4): 分成4个四分位数(1=Q1最低,4=Q4最高)
NTILE(4) OVER (
PARTITION BY gene_id
ORDER BY expression
) AS quartile
FROM gene_expression;
-- ========== 6. 组内占比:每个基因在各样本中的表达量占该样本总量的比例 ==========
SELECT
sample_id,
gene_id,
expression,
-- 该样本的总表达量(所有基因求和)
SUM(expression) OVER (PARTITION BY sample_id) AS sample_total,
-- 该基因占该样本总量的百分比
ROUND(
expression * 100.0 / SUM(expression) OVER (PARTITION BY sample_id),
2
) AS pct_of_sample
FROM gene_expression
ORDER BY sample_id, pct_of_sample DESC;
-- ========== 7. 复合需求:找各条件下表达量排名前10%的基因 ==========
SELECT gene_id, sample_id, condition, expression, pct_rank
FROM (
SELECT
gene_id,
sample_id,
condition,
expression,
PERCENT_RANK() OVER (
PARTITION BY condition -- 在每个实验条件下排名
ORDER BY expression DESC -- 表达量从高到低
) AS pct_rank
FROM gene_expression
) t
WHERE pct_rank <= 0.1; -- 前10%(pct_rank越小=排名越靠前)
面试常问点¶
Q: ROW_NUMBER、RANK、DENSE_RANK 有什么区别?举例说明。 A: 假设分数:90, 90, 85。ROW_NUMBER: 1,2,3;RANK: 1,1,3(跳过2);DENSE_RANK: 1,1,2(不跳)。
Q: 窗口函数和GROUP BY有什么本质区别? A: GROUP BY合并行,每组输出一行;窗口函数不合并行,每行都输出,但附带了窗口内的计算结果。
Q: ROWS BETWEEN和RANGE BETWEEN有什么区别? A: ROWS按物理行数计算(精确);RANGE按值范围计算(相同值的行作为一组)。推荐优先用ROWS更直观。
Q: 如何找出每组中"最新"的一条记录? A:
ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY timestamp DESC) = 1
速查表¶
-- 排名(每组内)
ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val DESC)
-- 组内求和(不折叠行)
SUM(val) OVER (PARTITION BY grp)
-- 取前一行的值
LAG(val, 1, 0) OVER (PARTITION BY grp ORDER BY time) -- 默认值0
-- 滚动3行平均
AVG(val) OVER (PARTITION BY grp ORDER BY time ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
-- 累积求和
SUM(val) OVER (PARTITION BY grp ORDER BY time ROWS UNBOUNDED PRECEDING)
-- 分百分位
NTILE(100) OVER (ORDER BY val) -- 百分位号(1-100)