汽車工廠視頻網(wǎng)站建設(shè)怎么樣把廣告做在百度上
hive開窗函數(shù)
文章目錄
- hive開窗函數(shù)
- 1. 開窗函數(shù)概述
- 1.1 窗口函數(shù)分類
- 1.2 窗口函數(shù)和普通聚合函數(shù)的區(qū)別
- 2. 窗口函數(shù)的基本用法
- 2.1 基本用法
- 2.2 設(shè)置窗口的方法
- 2.2.1 window_name
- 2.2.2 partition by
- 2.2.3 order by 子句
- 2.2.4 rows指定窗口大小
- 窗口框架
- 2.3 開窗函數(shù)中加 order by 和不加 order by 的區(qū)別
- 3. 窗口函數(shù)用法舉例
- 3.1 序號函數(shù): row_number() / rank() / dese_rank()
- 3.2 分布函數(shù): percent_rank() / cume_dist()
- 3.2.1 percent_rank()
- 3.2.2 cume_dist()
- 3.2.3 前后函數(shù)lag(expr, n, defval) 、 lead(expr, n, defval)
- 3.2.4 頭尾函數(shù):first_value(expr) 、 last_value(expr)
- 4 聚合函數(shù)+窗口函數(shù)
1. 開窗函數(shù)概述
窗口函數(shù)也稱OLAP函數(shù),對數(shù)據(jù)庫進行實時分析處理
1.1 窗口函數(shù)分類
- 序號函數(shù):row_number() / rank() / dense_rank()
- 分布函數(shù):percent_rank() / cume_dist()
- 前后函數(shù):lag() / lead()
- 頭尾函數(shù):first_val() / last_val()
- 聚合函數(shù)+窗口函數(shù):sum() over()、 max()/min() over() 、avg() over()
- 其他函數(shù):nth_value() / nfile()
1.2 窗口函數(shù)和普通聚合函數(shù)的區(qū)別
聚合函數(shù)是將多條記錄聚合成一條,窗口函數(shù)是每條記錄都會執(zhí)行,有幾條記錄執(zhí)行完還是幾條
窗口函數(shù)兼具group by子句的分組功能和order by子句的排序功能,但是partition by 子句不具備group by的匯總功能
2. 窗口函數(shù)的基本用法
準備基礎(chǔ)數(shù)據(jù)
CREATE TABLE exam_record (uid int COMMENT '用戶ID',exam_id int COMMENT '試卷ID',start_time timestamp COMMENT '開始時間',submit_time timestamp COMMENT '提交時間',score tinyint COMMENT '得分'
)
COMMENT '考試記錄表'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1006, 9003, '2021-09-07 10:01:01', '2021-09-07 10:21:02', 84),
(1006, 9001, '2021-09-01 12:11:01', '2021-09-01 12:31:01', 89),
(1006, 9002, '2021-09-06 10:01:01', '2021-09-06 10:21:01', 81),
(1005, 9002, '2021-09-05 10:01:01', '2021-09-05 10:21:01', 81),
(1005, 9001, '2021-09-05 10:31:01', '2021-09-05 10:51:01', 81),
(1004, 9002, '2021-09-05 10:01:01', '2021-09-05 10:21:01', 71),
(1004, 9001, '2021-09-05 10:31:01', '2021-09-05 10:51:01', 91),
(1004, 9002, '2021-09-05 10:01:01', '2021-09-05 10:21:01', 80),
(1004, 9001, '2021-09-05 10:31:01', '2021-09-05 10:51:01', 80);select * from exam_record;
exam_record.uid exam_record.exam_id exam_record.start_time exam_record.submit_time exam_record.score
1006 9001 2021-09-01 12:11:01 2021-09-01 12:31:01 89
1006 9002 2021-09-06 10:01:01 2021-09-06 10:21:01 81
1005 9002 2021-09-05 10:01:01 2021-09-05 10:21:01 81
1005 9001 2021-09-05 10:31:01 2021-09-05 10:51:01 81
1004 9002 2021-09-05 10:01:01 2021-09-05 10:21:01 71
1004 9001 2021-09-05 10:31:01 2021-09-05 10:51:01 91
1004 9002 2021-09-05 10:01:01 2021-09-05 10:21:01 80
1004 9001 2021-09-05 10:31:01 2021-09-05 10:51:01 80
2.1 基本用法
窗口函數(shù)語法
<窗口函數(shù)> over[(partition by <列表清單>)] order by <排序列表清單> [rows between 開始位置 and 結(jié)束位置]
窗口函數(shù):指要使用的分析函數(shù),
over(): 用來指定窗口函數(shù)的范圍,如果括號中什么都不寫,則窗口包含where的所有行
select uidscore,sum(score) over() as sum_score
from exam_record;
運行結(jié)果
uid score sum_score
1006 89 654
1006 81 654
1005 81 654
1005 81 654
1004 71 654
1004 91 654
1004 80 654
1004 80 654
2.2 設(shè)置窗口的方法
2.2.1 window_name
給窗口指定一個別名
select uid,score,rank() over my_window_name as rk_num,row_number() over my_window_name as row_num
from exam_record
window my_window_name as (partition by uid order by score);
2.2.2 partition by
select uid,score,sum(score) over(partition by uid) as sum_score
from exam_record;
按照uid進行分組,分別求和
使用row_number()序號函數(shù),表明序號
selectuid,score,row_number() over(partition by uid) as row_num
from exam_record;
2.2.3 order by 子句
按照哪些字段進行排序,窗口函數(shù)將按照排序后的記錄進行編號
selectuid,score,row_number() over (partition by uid order by score desc) as row_num
from exam_record
單獨使用order by uid
selectuid,score,sum(score) over (order by uid desc) as row_num
from exam_record;
單獨使用partition by uid
selectuid,score,sum(score) over (partition by uid) as row_num
from exam_record;
partition by進行分組內(nèi)的求和,分區(qū)間獨立
order by 對序號相同的進行求和,對序號不同的進行累加求和
單獨使用order by score
selectuid,score,sum(score) over (order by score desc) as row_num
from exam_record;
2.2.4 rows指定窗口大小
查看score的平均值
selectuid,score,avg(score) over(order by score desc) as avg_num
from exam_record
按照score降序排列,每一行計算前一行到當前行的score的平均值
selectuid,score,avg(score) over(order by row_score) as avg_num
from(selectuid,score,row_number() over(order by score desc) as row_scorefrom exam_record)res
窗口框架
指定窗口大小,框架是對窗口的進一步分區(qū),框架有兩種限定方式:
使用rows語句,通過指定當前行之前或之后的固定數(shù)目的行來限制分區(qū)中的行數(shù)
使用range語句,按照排列序列的當前值,根據(jù)相同值來確定分區(qū)中的行數(shù)
order by 字段名 range|rows 邊界規(guī)則0 | [between 邊界規(guī)則1] and 邊界規(guī)則2
range和rows的區(qū)別
range按照值的范圍進行范圍的定義,rows按照行的范圍進行范圍的定義
- 使用框架時,必須要有order by子句,如果僅指定了order by子句未指定框架,則默認框架會使用range unbounded preceding and current row (從第一行到當前行的數(shù)據(jù))
- 如果窗口函數(shù)沒有指定order by子句,就不存在 rows|range 窗口的計算
- range 只支持使用unbounded 和 current row
查詢我與前兩名的平均值
selectuid,score,avg(score) over(order by score desc rows 2 preceding) as avg_score
from exam_record;
查詢當前行及前后一行的平均值
selectuid,score,avg(score) over(order by score desc rows between 1 preceding and 1 following) as avg_score
from exam_record;
2.3 開窗函數(shù)中加 order by 和不加 order by 的區(qū)別
當開窗函數(shù)為排序函數(shù)時,如row_number()、rank()等,over中的order by 只起到窗口內(nèi)排序的作用
當開窗函數(shù)為聚合函數(shù)時,如max、min、count等,over中的order by不僅對窗口內(nèi)排序,還起到窗口內(nèi)從當前行到之前所有行的聚合
selectuid,exam_id,start_time,sum(score) over(partition by uid) as one,sum(score) over(partition by uid order by start_time) as two
from exam_record
3. 窗口函數(shù)用法舉例
3.1 序號函數(shù): row_number() / rank() / dese_rank()
區(qū)別:rank() : 并列排序,跳過重復序號------1、1、3
? row_number() : 順序排序——1、2、3
? dese_rank() : 并列排序,不跳過重復序號——1、1、2
selectuid,score,rank() over my_window as rk_num,row_number() over my_window as row_num
from exam_record
window my_window as (partition by uid order by score);
不使用窗口函數(shù)實現(xiàn)分數(shù)排序
SELECTP1.uid,P1.score,(SELECTCOUNT(P2.score)FROM exam_record P2WHERE P2.score > P1.score) + 1 AS rank_1
FROM exam_record P1
ORDER BY rank_1;
3.2 分布函數(shù): percent_rank() / cume_dist()
3.2.1 percent_rank()
percent_rank() 函數(shù)將某個數(shù)據(jù)在數(shù)據(jù)集的排位作為數(shù)據(jù)集的百分比值返回,范圍0到1,
按照(rank - 1) / (rows - 1)進行計算,rank為rank()函數(shù)產(chǎn)生的序號,rows為當前窗口的記錄總行數(shù)
selectuid,score,rank() over my_window as rank_num,percent_rank() over my_window as prk
from exam_record
window my_window as (order by score desc)
3.2.2 cume_dist()
如果升序排列,則統(tǒng)計:小于等于當前值的行數(shù) / 總行數(shù)
如果降序排列,則統(tǒng)計:大于等于當前值的行數(shù) / 總行數(shù)
查詢小于等于當前score的比例
selectuid,score,rank() over my_window as rank_num,cume_dist() over my_window as cume
from exam_record
window my_window as (order by score asc);
3.2.3 前后函數(shù)lag(expr, n, defval) 、 lead(expr, n, defval)
lag()和lead()函數(shù)可以在同一次查詢中取出同一字段前 n 行的數(shù)據(jù)和后 n 行的數(shù)據(jù)作為獨立列
lag( exp_str,offset,defval) over(partition by .. order by …)lead(exp_str,offset,defval) over(partition by .. order by …)
- exp_str 是字段名
- offset是偏移量,即 n 的值
- defval默認值,如何當前行向前或向后 n 的位置超出表的范圍,則會將defval的值作為返回值,默認為NULL
查詢前1名同學和后一名同學的成績和當前同學成績的差值
- 先將前一名、后一名以及當前行的分數(shù)放在一起
selectuid,score,lag(score, 1, 0) over my_window as `before`,lead(score, 1, 0) over my_window as `next`
from exam_record
window my_window as (order by score desc);
- 然后做差值
selectuid,score,score - before as before,score - next as next
from (selectuid,score,lag(score, 1, 0) over my_window as before,lead(score, 1, 0) over my_window as next
from exam_record
window my_window as (order by score desc))res
3.2.4 頭尾函數(shù):first_value(expr) 、 last_value(expr)
- 返回第一個expr:first_value(expr)
- 返回第二個expr:last_value(expr)
查詢第一個和最后一個分數(shù)
selectuid,score,first_value(score) over my_window as first,last_value(score) over my_window as last
from exam_record
window my_window as (order by score desc);
4 聚合函數(shù)+窗口函數(shù)
窗口函數(shù)在where之后執(zhí)行,所以where需要用窗口函數(shù)作為條件
SELECTuid,score,sum(score) OVER my_window_name AS sum_score,max(score) OVER my_window_name AS max_score,min(score) OVER my_window_name AS min_score,avg(score) OVER my_window_name AS avg_scoreFROM exam_recordWINDOW my_window_name AS (ORDER BY score desc)