跳转至

409_SQL窗口函数实战


一句话说明

窗口函数就是在不合并行的前提下,对每一行计算一个基于"窗口范围"的聚合值,像滑动的计算尺。


核心知识点

窗口函数语法结构

函数名() OVER (
    PARTITION BY 分组列   -- 按哪列分组(可省略,不分组=全表)
    ORDER BY 排序列       -- 窗口内如何排序
    ROWS/RANGE BETWEEN ... -- 窗口范围(可省略)
)

常用窗口函数分类

类别函数说明
排名函数ROW_NUMBER()唯一序号,不并列
排名函数RANK()有并列,跳过名次
排名函数DENSE_RANK()有并列,不跳名次
位移函数LAG(col, n)取前n行的值
位移函数LEAD(col, n)取后n行的值
聚合函数SUM/AVG/MAX/MIN窗口内聚合
分布函数PERCENT_RANK()百分位排名
分布函数NTILE(n)分成n个桶
首末函数FIRST_VALUE()窗口第一行
首末函数LAST_VALUE()窗口最后一行

实战SQL示例

-- ========== 场景:基因表达分析数据库 ==========
-- 表结构:gene_expression(sample_id, gene_id, expression, condition, timepoint)

-- ========== 1. ROW_NUMBER:找每个样本中表达量最高的前3个基因 ==========
SELECT *
FROM (
    SELECT
        sample_id,
        gene_id,
        expression,
        -- 在每个sample_id分组内,按expression降序排名
        ROW_NUMBER() OVER (
            PARTITION BY sample_id      -- 每个样本单独排名
            ORDER BY expression DESC    -- 表达量从高到低
        ) AS rank_in_sample
    FROM gene_expression
) ranked
WHERE rank_in_sample <= 3;  -- 只保留每个样本排名前3的基因
-- 注意:ROW_NUMBER不会并列,表达量完全相同时随机排序

-- ========== 2. RANK vs DENSE_RANK 对比 ==========
SELECT
    gene_id,
    expression,
    RANK() OVER (ORDER BY expression DESC) AS rank_with_gaps,
    -- RANK: 1,2,2,4(并列第2后跳到4)
    DENSE_RANK() OVER (ORDER BY expression DESC) AS rank_no_gaps
    -- DENSE_RANK: 1,2,2,3(并列第2后是3)
FROM gene_expression
WHERE sample_id = 'S001';

-- ========== 3. LAG/LEAD:计算时序表达量变化 ==========
SELECT
    sample_id,
    gene_id,
    timepoint,
    expression,
    -- 取同一样本同一基因的上一个时间点表达量
    LAG(expression, 1) OVER (
        PARTITION BY sample_id, gene_id  -- 同一个样本的同一个基因
        ORDER BY timepoint               -- 按时间点排序
    ) AS prev_timepoint_expr,
    -- 计算相比上一时间点的变化量
    expression - LAG(expression, 1) OVER (
        PARTITION BY sample_id, gene_id
        ORDER BY timepoint
    ) AS expr_change,
    -- 计算变化倍数(需防止除零)
    CASE 
        WHEN LAG(expression, 1) OVER (
            PARTITION BY sample_id, gene_id ORDER BY timepoint
        ) > 0 
        THEN expression / LAG(expression, 1) OVER (
            PARTITION BY sample_id, gene_id ORDER BY timepoint
        )
        ELSE NULL 
    END AS fold_change
FROM gene_expression
ORDER BY sample_id, gene_id, timepoint;

-- ========== 4. 滑动窗口聚合:3个时间点滑动平均 ==========
SELECT
    sample_id,
    gene_id,
    timepoint,
    expression,
    -- 计算当前时间点及前2个时间点的滑动平均(去噪)
    AVG(expression) OVER (
        PARTITION BY sample_id, gene_id
        ORDER BY timepoint
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- 向前2行到当前行
    ) AS moving_avg_3pt,
    -- 计算累积最大值(当前时间点前的最高表达量)
    MAX(expression) OVER (
        PARTITION BY sample_id, gene_id
        ORDER BY timepoint
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- 从第一行到当前行
    ) AS cumulative_max
FROM gene_expression;

-- ========== 5. 百分位排名:找基因在全体样本中的相对位置 ==========
SELECT
    gene_id,
    sample_id,
    expression,
    -- PERCENT_RANK: 0到1之间,0=最低,1=最高
    ROUND(PERCENT_RANK() OVER (
        PARTITION BY gene_id      -- 在同一基因的所有样本中排名
        ORDER BY expression
    ) * 100, 1) AS percentile_rank,
    -- NTILE(4): 分成4个四分位数(1=Q1最低,4=Q4最高)
    NTILE(4) OVER (
        PARTITION BY gene_id
        ORDER BY expression
    ) AS quartile
FROM gene_expression;

-- ========== 6. 组内占比:每个基因在各样本中的表达量占该样本总量的比例 ==========
SELECT
    sample_id,
    gene_id,
    expression,
    -- 该样本的总表达量(所有基因求和)
    SUM(expression) OVER (PARTITION BY sample_id) AS sample_total,
    -- 该基因占该样本总量的百分比
    ROUND(
        expression * 100.0 / SUM(expression) OVER (PARTITION BY sample_id),
        2
    ) AS pct_of_sample
FROM gene_expression
ORDER BY sample_id, pct_of_sample DESC;

-- ========== 7. 复合需求:找各条件下表达量排名前10%的基因 ==========
SELECT gene_id, sample_id, condition, expression, pct_rank
FROM (
    SELECT
        gene_id,
        sample_id,
        condition,
        expression,
        PERCENT_RANK() OVER (
            PARTITION BY condition   -- 在每个实验条件下排名
            ORDER BY expression DESC -- 表达量从高到低
        ) AS pct_rank
    FROM gene_expression
) t
WHERE pct_rank <= 0.1;  -- 前10%(pct_rank越小=排名越靠前)

面试常问点

  1. Q: ROW_NUMBER、RANK、DENSE_RANK 有什么区别?举例说明。 A: 假设分数:90, 90, 85。ROW_NUMBER: 1,2,3;RANK: 1,1,3(跳过2);DENSE_RANK: 1,1,2(不跳)。

  2. Q: 窗口函数和GROUP BY有什么本质区别? A: GROUP BY合并行,每组输出一行;窗口函数不合并行,每行都输出,但附带了窗口内的计算结果。

  3. Q: ROWS BETWEEN和RANGE BETWEEN有什么区别? A: ROWS按物理行数计算(精确);RANGE按值范围计算(相同值的行作为一组)。推荐优先用ROWS更直观。

  4. Q: 如何找出每组中"最新"的一条记录? A: ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY timestamp DESC) = 1


速查表

-- 排名(每组内)
ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val DESC)

-- 组内求和(不折叠行)
SUM(val) OVER (PARTITION BY grp)

-- 取前一行的值
LAG(val, 1, 0) OVER (PARTITION BY grp ORDER BY time)  -- 默认值0

-- 滚动3行平均
AVG(val) OVER (PARTITION BY grp ORDER BY time ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

-- 累积求和
SUM(val) OVER (PARTITION BY grp ORDER BY time ROWS UNBOUNDED PRECEDING)

-- 分百分位
NTILE(100) OVER (ORDER BY val)  -- 百分位号(1-100)